cleaning intraday data

Discussion in 'Data Sets and Feeds' started by ej420, Mar 11, 2009.

  1. ej420


    I am wondering what people do to clean their intraday data. In particular, I use a lot of intraday (1-minute) data from IB, which is full of bad prints, especially on SPY. Bad prints are especially problematic because they severely bias any strategy that relies on highs/lows rather than simply on VWAP.
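
    To make the highs/lows versus VWAP point concrete, here is a minimal sketch with invented numbers. The prints, sizes, and function names are all hypothetical, not real IB data.

```python
# Hypothetical prints inside one 1-minute SPY bar: (price, size).
# The last print is an invented bad print, two dollars below the market.
prints = [(75.10, 500), (75.12, 300), (75.11, 700), (75.13, 400), (73.10, 100)]


def bar_low(trades):
    """Lowest traded price in the bar."""
    return min(price for price, _ in trades)


def bar_vwap(trades):
    """Volume-weighted average price of the bar."""
    volume = sum(size for _, size in trades)
    return sum(price * size for price, size in trades) / volume


clean = prints[:-1]  # the same bar without the bad print

# How far each statistic moves because of the single bad print.
low_shift = bar_low(clean) - bar_low(prints)
vwap_shift = bar_vwap(clean) - bar_vwap(prints)

print(f"low moves by {low_shift:.2f}, VWAP moves by {vwap_shift:.2f}")
```

    With these made-up numbers the bar low moves by the full two dollars, while VWAP moves by only about ten cents because the bad print carries little volume.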
    I have developed my own code to filter outliers, which I think does a reasonable job of removing bad prints (at least it passes visual inspection). Has anyone else run into similar problems, and what are your thoughts?
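
    For anyone wondering what such an outlier filter can look like, below is one common approach, sketched under assumptions. It is not the code mentioned above. It flags a 1-minute bar when its high or low sits far from the median close of the neighbouring bars, measured in median absolute deviations (MAD). The bar layout, the function name, and the defaults for `window`, `k`, and `min_band` are all hypothetical and would need tuning per instrument.

```python
from statistics import median


def flag_bad_extremes(bars, window=5, k=5.0, min_band=0.05):
    """Return indices of bars whose high or low looks like a bad print.

    bars: list of dicts with 'high', 'low', 'close' keys (assumed layout).
    A bar is flagged when its high or low is further than
    max(k * 1.4826 * MAD, min_band) from the median close of its neighbours.
    """
    flagged = []
    for i, bar in enumerate(bars):
        lo = max(0, i - window)
        hi = min(len(bars), i + window + 1)
        # Closes of surrounding bars, excluding the bar under test.
        neighbours = [bars[j]["close"] for j in range(lo, hi) if j != i]
        if len(neighbours) < 3:
            continue  # not enough context to judge this bar
        center = median(neighbours)
        mad = median([abs(c - center) for c in neighbours])
        # 1.4826 scales MAD to a standard deviation for normal data;
        # min_band keeps the band from collapsing in very quiet periods.
        band = max(k * 1.4826 * mad, min_band)
        if bar["high"] - center > band or center - bar["low"] > band:
            flagged.append(i)
    return flagged


# Invented example: eight quiet bars, one with a low two dollars off.
closes = [75.10, 75.12, 75.11, 75.13, 75.12, 75.14, 75.13, 75.15]
bars = [{"high": c + 0.02, "low": c - 0.02, "close": c} for c in closes]
bars[4]["low"] = 73.10  # simulated bad print

print(flag_bad_extremes(bars))
```

    Whether a flagged bar should be dropped, or its high/low clipped back to the band, depends on the strategy using the data.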
  2. (1) Any strategy that "relies" on lows/highs...
    Is so simplistic that it's certainly worthless.

    (2) Free data will always be sub-optimal...
    People who expect quality for free need to grow up.

    (3) A lot of those "bad prints" and "outliers"...
    Are actual trades, usually a small trade a dollar or two or three away.
    Most low volume stocks see freakish trades every day...
    Probably an order error fully exploited by a sleazy market maker.
    These obviously need to be parsed out.

    (4) In general, market data MUST be error checked and "cleaned"...
    How you do it depends entirely on your trading strategy.
  3. There are a number of research papers on the internet about cleaning data. Academically it's usually filed under time series data, so that's the key term to search for.

    I wrote some software to do that, which I could maybe share, but I'd still rather not discuss it publicly. PM me.

    The main challenge of good tick filtering is that there will naturally be gaps in the data whether it's because of a daily or weekend close or due to temporary outage.

    So you can't simply look for ticks that are "out of band" with respect to price; you have to use other parameters as well, including time.

    Of course, it takes a lot of time and experimentation to get it working properly. And some humanly obvious bad ticks can get through because they come in slightly "under the radar" but at least you can screen out the total insanity.
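
    As a rough illustration of a price band that takes time into account, here is a minimal sketch. It is not the software mentioned in this post. It assumes the allowed move away from the last accepted tick grows with the square root of the elapsed seconds, so a jump across an overnight gap or an outage is tolerated while the same jump one second later is rejected. The function name and every parameter value are hypothetical.

```python
import math


def filter_ticks(ticks, base_band=0.25, band_per_sqrt_sec=0.02, max_band=5.0):
    """Split ticks into (accepted, rejected) using a time-aware price band.

    ticks: iterable of (timestamp_in_seconds, price), sorted by time.
    The first tick is accepted unconditionally, which is a simplification.
    """
    accepted, rejected = [], []
    last = None  # last accepted (timestamp, price)
    for ts, price in ticks:
        if last is None:
            accepted.append((ts, price))
            last = (ts, price)
            continue
        elapsed = max(ts - last[0], 0.0)
        # The band widens with the time since the last accepted tick.
        band = min(base_band + band_per_sqrt_sec * math.sqrt(elapsed), max_band)
        if abs(price - last[1]) <= band:
            accepted.append((ts, price))
            last = (ts, price)
        else:
            rejected.append((ts, price))
    return accepted, rejected


# Invented example: one bad tick, then an 18-hour gap with a higher reopen.
ticks = [
    (0, 75.10), (1, 75.12), (2, 73.10), (3, 75.11), (4, 75.13),
    (4 + 18 * 3600, 76.40), (5 + 18 * 3600, 76.42),
]
accepted, rejected = filter_ticks(ticks)
print(rejected)
```

    A bad tick that lands inside the band still gets through, and a genuine sudden move is rejected until enough time has passed for the band to widen, so the parameters take experimentation.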

    As I said, I might be willing to share my work. Reluctantly.
  4. taotree


    I'm in the process of developing some custom software that will work with tick/quote futures data (currently considering zen-fire, but haven't decided yet) and would be interested in finding any information about algorithms, techniques, etc. regarding "cleaning" the data.