Bogus data?

Discussion in 'Data Sets and Feeds' started by trader225, Jun 29, 2007.

  1. Here is some data collected today as the ES went to 1505.25 for the last time, then began going up. It was captured from IB's data feed, and it is a sequence of price and size events. This is the order in which they occurred.

    Does anyone know why:

    1. "last size" occurs twice

    2. Sometimes one "last size" added to the previous "volume" will equal the next "volume," but not always.

    Oh! And one very important question: What price did the "last size" occur at?

    last price: 1505.500000, 0
    last size: 198
    last size: 198
    volume: 1425299
    bid price: 1505.250000, 1
    bid size: 824
    ask price: 1505.500000, 1
    ask size: 52
    bid size: 824
    ask size: 52
    last size: 103
    volume: 1425402
    bid size: 748
    ask size: 20
    last size: 1
    volume: 1425416
    bid size: 715
    ask size: 513
    last price: 1505.250000, 0
    last size: 101
    last size: 101
    volume: 1425517
    bid size: 606
    ask size: 487
    bid size: 573
    last price: 1505.500000, 0
    last size: 347
    last size: 347
    volume: 1425904
    bid size: 561
    ask size: 101
  2. Whoops! Lowest price 1504.50.
    Data still questionable.
  3. From what I understand, the IB data feed does not provide every single tick, which can result in some strange trade sequences.

    I know this to be true under the API toolkit, so I would assume it is also true under their Trader Workstation.
  4. I got the data from the API.
  5. Figured out what to do: I'll watch the IB matrix when the market is slow, take notes, then compare them with the data captured from the API.
    I can tell what is happening on the matrix when it is slow, so I should be able to figure out the sequence in the API data by comparing it with my matrix notes. I'll report back on the results. May try Sunday night.
  6. The duplicate size problem is discussed here (IB login required):

    Here is an excerpt from Richard King:

    The IB datafeed is optimised to ensure that it keeps up with the market no matter how busy the market is.

    To accomplish this, it effectively sends a price snapshot for each instrument at regular intervals. This interval seems to be about 300 milliseconds. For each of bid, ask, and last it compares the current price and size with the values at the last snapshot. If the price is different it sends both price and size. If the price is the same, but the size is different, it sends only the size. If both price and size are the same, it doesn't send either. If there have been any trades since the last snapshot, it sends the (accumulated) volume (so where the price and size haven't changed but there have been one or more trades, this can be detected from the increased volume).
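    The diffing rule King describes can be sketched roughly like this (the Quote class and emit callback are illustrative names, not part of the IB API):

```python
# Hypothetical sketch of the snapshot-diff rule described above.
# Quote and emit are illustrative, not part of the IB API.
from dataclasses import dataclass

@dataclass
class Quote:
    price: float
    size: int

def diff_and_emit(kind, prev, curr, emit):
    """Send only what changed since the previous snapshot."""
    if curr.price != prev.price:
        emit(f"{kind} price", curr.price)  # price changed: send both
        emit(f"{kind} size", curr.size)
    elif curr.size != prev.size:
        emit(f"{kind} size", curr.size)    # same price, new size: size only
    # same price and size: send nothing

events = []
diff_and_emit("last", Quote(1505.50, 198), Quote(1505.50, 103),
              lambda k, v: events.append((k, v)))
print(events)  # price unchanged, so only a "last size" event goes out
```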

    A word of caution though: this is not an exact science. It would be nice if what I said in my post was an exact description of how it works, but you'll find odd things happening occasionally, such as a volume update without a prior size message where the increase in volume is not an exact multiple of the most recent size message, or multiple last price/size messages sent at the same time, or volume messages with a smaller volume than the previous one! But most of the time my description is accurate.

    By the way, one gotcha is that when both price and size messages are sent (in a single TICK_PRICE socket message), TWS also sends the size again in a separate TICK_SIZE message, but the volume is correctly updated only once. I think the reason for this duplication is that before the version 2 TICK_PRICE message was introduced, it didn't contain a size field, so prices and sizes were always sent separately: if TWS didn't send the duplicate size, then programs that relied on the separate TICK_SIZE message would no longer work properly unless they were amended and recompiled.
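    Under that description, a client that wants each trade counted once could filter out the redundant TICK_SIZE, roughly along these lines (the event tuples and the dedupe helper are made up for illustration):

```python
# Hypothetical de-duplication filter for the gotcha described above: a
# price event carrying a size is followed by a redundant size event.
def dedupe(events):
    """Drop a size event that merely repeats the size bundled with the
    immediately preceding price event for the same field."""
    out, pending = [], {}              # pending maps field -> size just sent
    for kind, field, value in events:
        if kind == "price":
            price, size = value
            out.append((kind, field, value))
            pending[field] = size      # expect a duplicate size next
        else:                          # kind == "size"
            if pending.get(field) == value:
                pending.pop(field)     # swallow the duplicate
            else:
                out.append((kind, field, value))
                pending.pop(field, None)
    return out

raw = [("price", "last", (1505.50, 198)),
       ("size", "last", 198),          # duplicate of the bundled size
       ("size", "last", 103)]          # genuine update, kept
print(dedupe(raw))
```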

    This mechanism enables IB to know the maximum bandwidth required for each ticker, and hence for each customer (since the number of tickers is limited), and so it can size its servers to be able to cope with that load. If a market becomes very busy, it makes no difference because it will still only send an update three times a second or thereabouts, even if there have been 100 trades during that second. This avoids the problem that every other data feed seems to have, where the data will sometimes lag way behind the market at busy times (with every other vendor I've used, I've had occasions where the data could be anything up to two or three minutes behind the market).

    There is an irritating side effect of this technique, which is that price movements between snapshots may not be reported at all: for example, if the last price at snapshot 1 is 100, and the price then moves up to 102 and back to 101 by snapshot 2, the price reported at snapshot 2 will be 101, and the 102 price will not be reported at all. This leads to occasional incorrect highs and lows of bars, but rarely by more than one tick; whether that is significant depends very much on the trading strategy used.
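    A toy illustration of the sampling effect King describes, using his 100 -> 102 -> 101 example:

```python
# Illustrative only: a feed that samples prices misses moves between
# snapshots, as in the 100 -> 102 -> 101 example above.
ticks = [100, 102, 101]            # actual trade prices between snapshots
snapshots = [ticks[0], ticks[-1]]  # the feed reports only the endpoints
print(max(ticks), max(snapshots))  # true high is 102; reported high is 101
```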

    The above isn't a complete description, but it covers the basic mechanism.
  7. Thanks for that information. I registered there.
    In spite of the problems, I hope to mine the data I am collecting. I am working with gnuplot -- I heard of it in Bollinger's book -- which includes a finance.dem script from Bollinger himself!
  8. Here is some data from this evening's session, right after connecting via the API. It appears there may be two "problems":

    1. A volume increase without a "last size" event.

    2. A volume increase with two "last size" events, one of which should be ignored, if the "volume" event is accurate.

    The order of the fields is:

    Timestamp Event Ticker Value AutoExecute (if applicable)
    18:45:32:213    ask size        8       70
    18:45:32:215    last price      8       1516.500000     0
    18:45:32:216    last size       8       5
    18:45:32:218    bid size        8       71
    18:45:32:219    ask size        8       70
    18:45:32:222    last size       8       5
    18:45:32:226    volume          8       3548
    18:45:32:228    high price      8       1516.750000     0
    18:45:32:229    low price       8       1514.750000     0
    18:45:32:231    close price     8       1515.500000     0
    18:45:46:330    ask size        8       75
    18:45:56:856    last size       8       1                   <- one last size
    18:45:56:858    volume          8       3549                
    18:45:56:859    bid size        8       70
    18:46:07:855    ask size        8       95
    18:46:19:606    bid size        8       71
    18:46:40:330    bid size        8       70
    18:47:01:116    volume          8       3550                <- no last size
    18:47:01:118    bid size        8       69
    18:49:08:351    last size       8       6                   <- one last size
    18:49:08:387    volume          8       3556
    18:49:08:428    bid size        8       63
    18:49:56:142    last price      8       1516.750000     0
    18:49:56:144    last size       8       2                   <- 1st last size
    18:49:56:145    last size       8       2                   <- 2nd last size
    18:49:56:146    volume          8       3558
    18:49:56:148    ask size        8       93
    18:50:19:893    ask size        8       98
    18:50:26:893    bid size        8       68
    18:50:33:395    last size       8       5
    18:50:33:397    volume          8       3563
    18:50:33:398    ask size        8       93
    18:50:33:895    last size       8       10
    18:50:33:896    volume          8       3573
    18:50:33:898    ask size        8       73
    18:50:38:398    bid size        8       69
    18:50:39:647    ask size        8       84
    18:50:40:147    bid size        8       74
    18:50:46:370    bid size        8       75
    18:50:58:401    ask size        8       85
    18:50:58:650    last size       8       1                  <- one last size
  9. If you just ignore the LAST_SIZE events and work from the VOLUME events, things will be fine. The cumulative volume is accurate.
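    A minimal sketch of that volume-delta approach, using the VOLUME readings from the capture above (trades_from_volume is a made-up helper name):

```python
# Rough sketch of the "work from volume" approach: treat the cumulative
# volume as authoritative and derive trade sizes from its increases,
# ignoring last-size events entirely.
def trades_from_volume(volumes):
    """Return the size of each trade implied by successive cumulative
    volume readings; skip non-increases, which the feed occasionally emits."""
    trades, prev = [], None
    for v in volumes:
        if prev is not None and v > prev:
            trades.append(v - prev)
        prev = v
    return trades

# The VOLUME readings from the evening-session capture in this thread:
vols = [3548, 3549, 3550, 3556, 3558, 3563, 3573]
print(trades_from_volume(vols))  # [1, 1, 6, 2, 5, 10]
```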
  10. Yeah, I think that is the only way.
    #10     Jul 1, 2007