Dealing with data feed diversity...

Discussion in 'Data Sets and Feeds' started by NetTecture, Feb 1, 2012.

  1. Hello,

    I have a question. I am integrating a couple of different feeds into one infrastructure and am a little at a loss about how to deal with the different data. With a strong programming background I went all "structs & objects", and now I have problems handling updates properly, because different feeds deliver different data, possibly at different times (i.e. feed A delivers instrument information in 1 callback, feed B in 2).

    The main issue is that it makes a lot of objects more complex, because I need to merge them and track which fields carry real values as opposed to the "default values" that I ignore during an update.

    I am considering moving the downstream part of the infrastructure (i.e. the one for incoming/outgoing data) to a field-list type of approach, with a number of defined messages that are not objects but basically contain a list of "field updates". An "InstrumentUpdate" would contain the fields that determine the key (symbol, exchange, plus which feed the data comes from) and then a list of data fields. Every field has an identity (saying what data it carries) and the data itself. The engine can then identify the current state (of the instrument, in this example) and merge the fields into the object. Advantages: fields no longer have to distinguish between "no data" and "no update" (no update = the field is just missing from the list), and merging partially filled structures becomes straightforward.
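
    To make that concrete, here is a minimal sketch of what I mean, in C++. The field IDs and types are just illustrative, not my actual code:

    ```cpp
    // Minimal sketch of the field-list idea (field IDs and types are illustrative).
    #include <string>
    #include <variant>
    #include <vector>

    enum class FieldId { Description, Strike, ProductType };

    struct FieldUpdate {
        FieldId id;                                    // which field this entry carries
        std::variant<std::string, double, int> value;  // the actual data
    };

    struct InstrumentUpdate {
        // Key part: identifies the instrument the update belongs to.
        std::string symbol;
        std::string exchange;
        std::string feed;                 // which feed the data comes from
        // Data part: only the fields the feed actually delivered.
        std::vector<FieldUpdate> fields;  // a missing field simply means "no update"
    };
    ```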

    The application would then use objects internally - they may still have missing (never delivered) data, but at that point it is a lot easier to deal with than in the context of an update stream.

    Anyone have any ideas about this? Looking for comments here ;)
     
  2. Is this realtime or historical data? Are you trying to build a database or write decision logic based on two different feeds?

    Is your order logic dependent on both feeds? Can you fork and then treat them as separate systems? Do you have a position management API that both systems could then reference (for risk, hedging, etc)?

    I would imagine the overhead of combining the two feeds in real time is going to be high. I don't have any experience with this, so hopefully someone more knowledgeable will chime in. I have only combined multiple feeds into historical databases - and that's a big, brutal undertaking (albeit one I've only done once).
     
    It is both, and trade data makes no difference for now. The real-time data is stored in a fully replayable file containing all updates. Trade information is simple in itself (positions, orders, fills). I am more concerned about fundamental data.

    Example:

    I have a struct containing instrument info:
    * Description
    * Strike (for options)
    * Product type

    One feed gives me everything in one event, the other does not. So how do I deal with a missing product type in the struct? The server can either say "OK, that is an update to 'unknown'" or "there is no product type in this update".

    A struct makes the second approach hard - it is nice in the system and in the database, but I tend to go towards a field update stream for the connection/server interface.

    In this case, the following would happen:
    * InstrumentUpdate event
    * Key: instrument, exchange (and feed)
    * Data: Description (text)

    The server then knows the event does NOT contain ANY information for the strike (no change) and no information for the product type (no change).

    More flexible, fewer problems.
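
    A rough sketch of that merge step, pairing with the InstrumentUpdate message from my first post (using std::optional so "never received" stays distinguishable from real values; again just illustrative):

    ```cpp
    // Sketch of the merge step (types illustrative; pairs with the
    // InstrumentUpdate sketch from the first post).
    #include <optional>
    #include <string>
    #include <variant>

    struct Instrument {
        std::optional<std::string> description;
        std::optional<double>      strike;       // only meaningful for options
        std::optional<int>         productType;
    };

    // Fields missing from the message are simply not touched, so "no update"
    // never has to be encoded as some magic default value.
    void apply(Instrument& target, const InstrumentUpdate& update) {
        for (const auto& f : update.fields) {
            switch (f.id) {
                case FieldId::Description: target.description = std::get<std::string>(f.value); break;
                case FieldId::Strike:      target.strike      = std::get<double>(f.value);      break;
                case FieldId::ProductType: target.productType = std::get<int>(f.value);         break;
            }
        }
    }
    ```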
     
  4. byteme

    Hi, I'm not sure I completely followed your approach but I'll tell you how I tackled it and how I know many other platforms tackle it.

    1) Internally, the platform uses feed/vendor/broker agnostic objects/structs and these get passed around to whoever needs them. E.g. Instrument, Trade, Quote etc.

    2) At the boundaries with the different feeds an adapter is responsible for converting from the external feed data to the normalized object. Once the object is normalized and put on the bus, it becomes impossible to tell where the data came from and it all looks the same (unless you want to tag with origin somehow). If two different feeds provide the same data in different ways then the adapter acts as the coordinator/normalizer.

    3) The adapters are also responsible for converting/mapping the normalized Instrument to the broker/feed specific symbol.
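
    As a rough sketch in C++ (all the names here are made up for illustration, not from any particular platform):

    ```cpp
    // Rough sketch of the adapter idea (all names are made up).
    #include <functional>
    #include <string>
    #include <utility>

    struct NormalizedInstrument {   // the internal, vendor-agnostic form
        std::string symbol;         // internal symbol
        std::string description;
    };

    using Publish = std::function<void(const NormalizedInstrument&)>;

    class FeedAAdapter {
    public:
        explicit FeedAAdapter(Publish publish) : publish_(std::move(publish)) {}

        // Feed A happens to deliver everything in a single callback.
        void onFeedAEvent(const std::string& vendorSymbol, const std::string& desc) {
            publish_({ toInternalSymbol(vendorSymbol), desc });
        }

    private:
        // Point 3: the adapter also owns the vendor-symbol mapping.
        std::string toInternalSymbol(const std::string& vendorSymbol) {
            return vendorSymbol;    // identity mapping here, a lookup table in reality
        }
        Publish publish_;
    };
    ```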
     
  5. byteme

    OK, perhaps I'm still missing the point but it still sounds like a normalization issue to me.

    One feed adapter gets the data all in one event and re-broadcasts it to your system in some internal standard format.

    The other feed adapter gets the data in separate events - it has to hold on to the partial data and wait till the remaining data arrives and then combines the two before re-broadcasting it to your system in the same internal standard format as the first feed adapter.

    Once the data is in the internal standard format (Object/struct) you can't tell where it came from (unless you want to).
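
    Something along these lines - a sketch reusing the Publish/NormalizedInstrument types from my previous post:

    ```cpp
    // Sketch of the second adapter: buffer the partial event, publish only
    // once both halves have arrived.
    #include <map>
    #include <optional>
    #include <string>
    #include <utility>

    class FeedBAdapter {
    public:
        explicit FeedBAdapter(Publish publish) : publish_(std::move(publish)) {}

        void onDescription(const std::string& symbol, const std::string& desc) {
            pending_[symbol].description = desc;
            tryPublish(symbol);
        }

        void onProductType(const std::string& symbol, int type) {
            pending_[symbol].productType = type;
            tryPublish(symbol);
        }

    private:
        struct Partial {
            std::optional<std::string> description;
            std::optional<int>         productType;
        };

        void tryPublish(const std::string& symbol) {
            const auto& p = pending_[symbol];
            if (p.description && p.productType) {    // complete: emit and forget
                publish_({ symbol, *p.description });
                pending_.erase(symbol);
            }
        }

        std::map<std::string, Partial> pending_;
        Publish publish_;
    };
    ```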

    If I'm way off, just ignore my comments - finding it hard to decipher what you're getting at.
     
  6. Yes I agree, it's VERY hard to understand what the original poster's questions are or what the organization of the program is.

    Essentially what you need to do is break down both feeds into primitives so that they are both giving you the same information. This means normalizing the data into some lowest common denominator, as byteme said. It sounds like one data feed is giving big chunks of data and the other is giving small bits of data? The description from the OP is pretty bad.

    You need to break down both data streams into some common form, so that your engine doesn't care what your data source is, it's seeing a single data format. Then it doesn't matter if you have 2 or 10 data streams, the engine will behave the same way. As it gets information about a particular stock, it can adjust the value, either in memory or in a database. This way your data model and your engine aren't tied to a particular data source.
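
    In code terms, something like this (a sketch reusing the hypothetical adapter types from the posts above) - the engine registers one handler and neither knows nor cares how many adapters feed it:

    ```cpp
    // The engine sees only the normalized form, so two feeds or ten feeds
    // is the same code path (sketch, reusing the adapter types above).
    int main() {
        auto onInstrument = [](const NormalizedInstrument& inst) {
            // decision logic / persistence goes here; it cannot tell feeds apart
        };

        FeedAAdapter a(onInstrument);
        FeedBAdapter b(onInstrument);

        a.onFeedAEvent("MSFT", "Microsoft Corp.");   // arrives complete
        b.onDescription("AAPL", "Apple Inc.");       // arrives in two parts...
        b.onProductType("AAPL", 1);                  // ...engine still sees one event
    }
    ```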
     
    For quite some time now I have been using real-time data for trading and further strategy development.

    There is one major lesson I learned on the way:
    Every bit of data that can be saved should be saved.
    (Because at some point you will develop an idea of how to make use of this additional information.)

    Therefore my main idea would be NOT to merge data but to keep it separate.

    When receiving data from different feeds you might at some point discover that one of them is not so reliable, or reliable only in part.
    The software should then be able to recognize this (the source of quotes) and act accordingly.

    On the other hand trying to merge streams from different sources might create error conditions that are very hard to find.
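
    In other words, something like this (a sketch; the field names are illustrative):

    ```cpp
    // Sketch: keep the source on every stored item instead of merging,
    // so per-feed reliability can be judged later (field names illustrative).
    #include <cstdint>
    #include <string>

    struct QuoteRecord {
        std::string  feedId;           // which feed delivered this value
        std::string  symbol;
        double       bid = 0.0;
        double       ask = 0.0;
        std::int64_t timestampNs = 0;  // the feed's own timestamp, kept verbatim
    };
    ```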
     
    Agreed. I actually do not really merge it either - every data item is tied to its feed. The main question was about the API. I do not like the connection code keeping copies of the current states around - mostly because I sometimes isolate those into separate processes, and I track around 200,000 objects from one feed. Initializing this is painful and redundant.

    What I will do is move to field-delta feeds - not publish the (partially filled out) object, but publish update notes stating which fields have data. Then the central processing code can look at its copy, update it, and store/publish it. This means that if one feed gives me a different granularity, I do not send around partially filled objects with additional tracking code for which fields actually contain data - it simplifies the logic significantly ;)

    The core processing code then, as I said, updates the objects and notifies consumers that something has changed.
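
    Roughly like this - a sketch reusing the Instrument, InstrumentUpdate and apply() pieces from the earlier sketches:

    ```cpp
    // Sketch of the central processing step: apply the delta to the cached
    // object, then notify consumers (reuses Instrument, InstrumentUpdate and
    // apply() from the earlier sketches).
    #include <functional>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    class InstrumentStore {
    public:
        void subscribe(std::function<void(const Instrument&)> cb) {
            consumers_.push_back(std::move(cb));
        }

        // The connection code sends only deltas; the store owns the full state.
        void onUpdate(const InstrumentUpdate& u) {
            Instrument& inst = cache_[u.symbol + "@" + u.exchange];
            apply(inst, u);                          // merge the delta
            for (auto& cb : consumers_) cb(inst);    // fan out the new state
        }

    private:
        std::map<std::string, Instrument> cache_;
        std::vector<std::function<void(const Instrument&)>> consumers_;
    };
    ```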

    Regards - great help.