What do you do with your missing data?

Discussion in 'Strategy Building' started by nijshar28, Jun 25, 2020.

  1. 931

    Like an earlier commenter, I first tried an algo to fix gaps and flaws in historical data,
    and even considered NN-based data fixing that learns from artificially created missing/bad data.

    But the trading algos started to find too many predictable advantages in those fixes, so I dropped the idea.

    Now I only mark data as bad.
    It is possible to fit 8 booleans in 1 byte using bit packing.
    That is enough for faulty-data info and more.
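
    A minimal sketch of that bit-packing idea, one flag byte per bar. The flag names (`MISSING`, `BAD_PRICE`, and so on) are made up for illustration and are not from any particular data vendor:

```python
from enum import IntFlag


class BarFlag(IntFlag):
    """Hypothetical per-bar quality flags; each one occupies a single bit."""
    MISSING = 1 << 0      # no bar was received at all
    BAD_PRICE = 1 << 1    # price failed a sanity check
    ZERO_VOLUME = 1 << 2  # bar exists but nothing traded
    HALTED = 1 << 3       # trading halt
    # bits 4-7 are still free for other markers


# One byte per bar: 5 bars, all initially clean (0).
flags = bytearray(5)
flags[2] |= int(BarFlag.MISSING)
flags[3] |= int(BarFlag.BAD_PRICE | BarFlag.ZERO_VOLUME)


def is_usable(flag_byte: int) -> bool:
    """A bar is usable only if no quality bit is set."""
    return flag_byte == 0


usable_indices = [i for i, f in enumerate(flags) if is_usable(f)]
```

    The price series itself stays untouched; the algo reads `flags` alongside it and decides per bar whether to skip.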

    Even so, I have had many problems with too many high-probability patterns in the data, so the algos that decide whether data is bad had to get more complex.

    At the moment I am considering working on an algo that takes multiple sources of data and adjusts the additional sources to a similar spread level as the main data.
    The purpose is to fill gaps or correct faulty data.
     
    Last edited: Jun 26, 2020
    #11     Jun 26, 2020
    nijshar28 likes this.
  2. Hey. Thanks for your input.

    I think these approaches may work great for some applications. However, for backtesting specifically, I think there are some issues with them, though I am not sure a better alternative exists that is also easy to implement.

    For backtesting, I think forward filling makes sense in cases other than delisting. In the event that the company was delisted, however, forward filling assumes that you are still able to buy or sell it at any time in the future at the last observed price. This is obviously incorrect, although I think forward filling may still work here with some additional rules and controls in place.
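
    One possible "additional rule", sketched under the assumption that the last real observation marks the delisting: forward fill interior gaps only, and leave everything after the final observation as NaN. The function name `ffill_until_last_obs` is hypothetical:

```python
import math


def ffill_until_last_obs(prices):
    """Forward fill gaps between observations; never fill past the last real one.

    Leading NaNs are left alone too, so nothing is ever backfilled.
    """
    # Index of the final non-NaN value (-1 if the series is entirely NaN).
    last_idx = max(
        (i for i, p in enumerate(prices) if not math.isnan(p)), default=-1
    )
    out = []
    last = math.nan
    for i, p in enumerate(prices):
        if not math.isnan(p):
            last = p
        out.append(last if i <= last_idx else math.nan)
    return out


nan = math.nan
closes = [nan, 10.0, nan, 11.0, nan, nan]  # the trailing NaNs stand in for a delisting
filled = ffill_until_last_obs(closes)
```

    Here only the gap between 10.0 and 11.0 is filled; the bars before the first price and after the last one stay NaN.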

    Backfilling, to me, doesn't make sense for backtesting even for unlisted equities. It is a time machine, any way you slice it.

    Incidentally, I am starting to suspect that having time-series of equal duration may not be important for backtesting. So maybe I am overthinking it -- trying to solve a problem that doesn't exist, or is relatively minor, which I think others have alluded to in this thread (recommending to just leave them as NaNs).
     
    #12     Jun 26, 2020
  3. %%
    NOT trading for a period of time?? Don't trade that low-liquidity stuff;
    but it may work well for an investment.
    Some do use zeros for a non-trading day, but I would not use zero; what would you do if DAL goes to zero again/bankrupt again?? :D
     
    #13     Jun 26, 2020
    nijshar28 likes this.
  4. 931

    All those aspects matter; it is best to think everything through as much as possible.
    In the end, complex things are a bunch of simple ones. And details matter.

    I don't use filling; I just label missing data as bad.
    Those filled parts would give useless info to the algos in my case.
     
    Last edited: Jun 26, 2020
    #14     Jun 26, 2020
    murray t turtle likes this.
  5. dholliday

    Plug holes:
    If before your data, fill with NaN. Either there was no data for this symbol or you don't have it (maybe newly listed).
    Once you have data, fill ohlc with the last close you have data for. The volume is 0. You can write your algos to either use the data with the volume set to zero or check and not use it.
    After your data, fill with NaN. Either there was no data for this symbol or you don't have it (maybe delisted).
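
    The three rules above could be sketched roughly like this, assuming each bar is an `(open, high, low, close, volume)` tuple and a missing bar is `None`. The name `plug_holes` and that layout are illustrative choices only:

```python
import math

NAN = math.nan


def plug_holes(bars):
    """Fill missing bars: NaN outside the data range, last close with volume 0 inside it."""
    have = [i for i, b in enumerate(bars) if b is not None]
    out = []
    last_close = None
    for i, bar in enumerate(bars):
        if bar is not None:
            last_close = bar[3]          # remember the most recent real close
            out.append(bar)
        elif not have or i < have[0] or i > have[-1]:
            # Before the first or after the last real bar: no data, keep it NaN.
            out.append((NAN, NAN, NAN, NAN, NAN))
        else:
            # Interior gap: flat bar at the last close, zero volume.
            out.append((last_close, last_close, last_close, last_close, 0))
    return out


bars = [None, (10.0, 11.0, 9.0, 10.5, 1000), None, (10.6, 10.8, 10.2, 10.4, 500), None]
plugged = plug_holes(bars)
```

    Because the filled bars carry volume 0, a strategy can still tell them apart from real bars and skip them if it wants to.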

    There are a great number of companies that have no trading volume on some days. It's not bad data. If you are working with intraday bars, even very large liquid companies have many bars during the day with no volume.

    Your goal when working with data is to maintain all the information available in the original data and maintain flexibility.
     
    #15     Jun 26, 2020
    nijshar28 likes this.
  6. rkr

    No, don't forward fill delistings.

    In practice the ticker would just continue trading OTC and if you carry a position into the close your position still has price risk. Forward fill will artificially assume the price stays the same.

    Forward filling halts is a bad idea, too. Think about a realistic edge case where you have a position in LK carrying into Apr 7, 2020. In this case, it's true that the actual price is unchanged at 4.xx; however, you'll see that the calls are getting exercised around 2.5-3.x during the halt despite the underlying not trading, so the implied price is still changing.

    For representing NaNs, if your language of choice has a library or first-class object supporting NaNs, I would use them. If not and you have to use a sentinel value to represent NaN, I would recommend a maximum or minimum value like 0xffffffff, and possibly writing your own lightweight class to represent NaN based on that value. If that's not possible, I would keep a separate bitmap vector to flag the fictitious indices. If that's not possible, I would recommend using a negative value for equities (this is what the CRSP database does, and it does a fairly good job of handling all edge cases and corporate actions on daily data back to the 1960s). 0 is probably one of the worst choices for sentinel values because a stock can actually go to 0.
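
    A small illustration of why 0 is a poor sentinel, next to the NaN and bitmap alternatives. The numbers are made up; the last bar is meant to be a real price of 0, while the second bar is a gap:

```python
import math
from array import array

# Option A: native NaN (Python floats are IEEE 754 doubles, so NaN is available).
closes_nan = array("d", [12.5, math.nan, 12.7, 0.0])
usable_a = [c for c in closes_nan if not math.isnan(c)]

# Option B: raw values plus a separate validity bitmap (False = fictitious index).
closes_raw = array("d", [12.5, 0.0, 12.7, 0.0])
valid = [True, False, True, True]
usable_b = [c for c, ok in zip(closes_raw, valid) if ok]

# What to avoid: treating 0 itself as "missing" also throws away the real zero.
usable_c = [c for c in closes_raw if c != 0.0]
```

    Options A and B both keep the genuine 0.0 close, while the zero-sentinel version cannot distinguish it from the gap.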
     
    Last edited: Jun 27, 2020
    #16     Jun 27, 2020
    jtrader33 and nijshar28 like this.
  7. Hey. I feel this is a really informative response.

    I did not realize such stocks continue trading OTC after delisting.

    Do you know what happens if I am holding a position through a broker (e.g. IB) and the company gets delisted? Does its market value go to zero? Do both my broker and I just forget about it and move on? Or is there a liquidation process of some sort? Might there be an OTC counterparty that I can still trade with? What if it is a short? Do I have to buy it OTC to close that position?

    The reason I started the thread is I am trying to figure out what to do about delistings in my backtest. Right now I see 2 options:

    1) Assume the price goes to 0 the day of the delisting and force the closing of my positions in delisted stocks at the 0 price.

    2) Assume I closed the position at the last observed price, i.e. I close out my positions in delisted stocks at the last available EOD close.
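
    To see how far apart the two options land, here is a rough sketch of the forced-close P&L under each assumption. The function name, the `assumption` labels, and the example numbers are all made up for illustration:

```python
def delisting_pnl(shares, entry_price, last_close, assumption):
    """P&L of a position force-closed at delisting.

    shares > 0 is a long, shares < 0 is a short.
    assumption: "zero" (option 1) or "last_close" (option 2).
    """
    if assumption == "zero":
        exit_price = 0.0
    elif assumption == "last_close":
        exit_price = last_close
    else:
        raise ValueError(f"unknown assumption: {assumption!r}")
    return shares * (exit_price - entry_price)


# Long 100 shares bought at 5.0; the last EOD close before delisting was 2.0.
long_zero = delisting_pnl(100, 5.0, 2.0, "zero")
long_last = delisting_pnl(100, 5.0, 2.0, "last_close")

# The same trade as a short of 100 shares.
short_zero = delisting_pnl(-100, 5.0, 2.0, "zero")
short_last = delisting_pnl(-100, 5.0, 2.0, "last_close")
```

    Option 1 is the harsher of the two for longs but the more flattering one for shorts, so the choice can bias a long/short backtest in either direction.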

    Thank you.
     
    Last edited: Jun 27, 2020
    #17     Jun 27, 2020
  8. kriskon

    Trying to analyze why this happened, how to avoid it, and what to do right now.
     
    #18     Jul 10, 2020
  9. Can you be more specific? Did you encounter a new issue today? If it is a new issue, can you identify your data source?
     
    #19     Jul 10, 2020