Live data feed options and processing

Discussion in 'Data Sets and Feeds' started by 931, Jul 3, 2020.

  1. 931


    I'm wondering what it takes to gather and process tick data for 1,000-3,000 stocks,
    in terms of computing hardware, software, networking, and data feed.

    The S&P 500 list, for example, is mandatory because of its low-spread stocks.
    Basically, the maximum number of low-spread stocks would be welcome, up to the point where the system bottlenecks.

    What data rate could I expect to see when collecting tick data for all S&P 500 stocks while the market is very active?
    It would also help if you name your data provider, so the format overhead is known.
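    For a rough bound, the expected rate can be estimated from symbol count, per-symbol update rate, and message size. A back-of-envelope sketch in Python; every input number below is an assumption for illustration, not a figure from any provider:

```python
# Back-of-envelope feed-rate estimate. All inputs are assumptions --
# plug in rates measured from your own feed.
def feed_rate_bytes_per_sec(symbols, updates_per_sec_per_symbol, bytes_per_msg):
    """Aggregate wire rate for a flat text/JSON feed."""
    return symbols * updates_per_sec_per_symbol * bytes_per_msg

# Example: 500 symbols, ~5 quote/trade updates per symbol per second
# during an active session, ~120 bytes per JSON-encoded message.
rate = feed_rate_bytes_per_sec(500, 5, 120)
print(f"{rate / 1e6:.1f} MB/s")  # prints "0.3 MB/s" -- modest for a LAN
```

    Even generous multiples of these assumed inputs stay well within gigabit LAN bandwidth, which supports the point later in the thread that parsing, not transport, is where the CPU goes.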

    Today I ran some simulations, using historical tick data as an incoming stream, to test how much the platform can handle.
    It was done over a LAN, and either the software is optimized enough to handle 1,000+ stocks on a regular workstation PC, or the tick data did not contain every tick...

    Packet serialization overhead was minimal compared to data feed providers' formats that need JSON parsers etc.; that part was not considered during the test.
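    To illustrate that gap: the same hypothetical tick encoded as JSON versus a fixed binary layout. The field set and layout are made up for the example:

```python
import json, struct

# Hypothetical tick: symbol id, timestamp (ms), price, size.
tick = {"sym": 42, "ts": 1593792000123, "px": 101.25, "sz": 300}

as_json = json.dumps(tick).encode()  # text: needs a JSON parse on receipt
as_bin = struct.pack("<HQdI", tick["sym"], tick["ts"], tick["px"], tick["sz"])

print(len(as_json), len(as_bin))  # binary is a fraction of the size (22 bytes)
# Decoding the fixed layout is a single unpack, no JSON parser involved:
sym, ts, px, sz = struct.unpack("<HQdI", as_bin)
```

    Beyond the size difference, the fixed layout avoids per-message string scanning and allocation, which is where text-protocol parsing cost comes from.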

    But I run the API connections and incoming-data parsers on a separate computer from the rest of the software.
    This spreads the workload around and supports a hardware firewall with advanced rules, without needing to make new rules for each new API that gets tested.

    I still suspect that at this stage I don't have the networking/computing/software infrastructure to process this amount of real incoming tick data with low delays.

    The biggest bottleneck at the moment with 1,000+ streams is that all incoming data gets evaluated on the same PC at the same intervals,
    causing 100% spikes in CPU usage and delays for orders.
    Not impossible to fix by shifting the data timings so streams are processed at different times, but at some point I will still need to separate things further.
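    One way to sketch that time-shifting: shard the symbol universe into buckets and evaluate only one bucket per scheduler slot, instead of everything at once. A hypothetical Python illustration (names and bucket count are mine):

```python
# Spread evaluation load by hashing symbols into N buckets and
# evaluating one bucket per scheduler slot. With a 1 s cycle and
# 10 buckets, each bucket runs in its own 100 ms slot.
NUM_BUCKETS = 10

def bucket_of(symbol: str) -> int:
    return hash(symbol) % NUM_BUCKETS

def symbols_due(all_symbols, slot: int):
    """Symbols to evaluate in scheduler slot `slot`."""
    return [s for s in all_symbols if bucket_of(s) == slot % NUM_BUCKETS]
```

    Each symbol lands in exactly one bucket, so the peak per-slot workload drops to roughly 1/N of the all-at-once spike, at the cost of up to one cycle of extra evaluation latency per symbol.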

    Also, I'm wondering if any data provider sends out ~1-second-interval bid and ask bar data as a live feed instead of tick data, covering 1,000+ stocks?
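    If no provider offers that directly, the conflation can also be done client-side from the raw quote stream. A minimal sketch in Python; the field names are mine, not any provider's:

```python
def quotes_to_second_bars(quotes):
    """Conflate (ts_ms, bid, ask) quote ticks into per-second hi/lo bars.

    Returns {second: {"bid_hi", "bid_lo", "ask_hi", "ask_lo"}}.
    Field names are illustrative, not from any feed format.
    """
    bars = {}
    for ts_ms, bid, ask in quotes:
        sec = ts_ms // 1000  # truncate timestamp to its second
        b = bars.setdefault(sec, {"bid_hi": bid, "bid_lo": bid,
                                  "ask_hi": ask, "ask_lo": ask})
        b["bid_hi"] = max(b["bid_hi"], bid); b["bid_lo"] = min(b["bid_lo"], bid)
        b["ask_hi"] = max(b["ask_hi"], ask); b["ask_lo"] = min(b["ask_lo"], ask)
    return bars
```

    Doing this on the parser machine would also shrink the stream forwarded over TCP from per-tick messages to one bar per symbol per second.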

    High-quality historical bid-ask data is also very important.
    It would be good if it goes back 10+ years, but it could be kept as a separate feed from the live one.

    What provider could be useful in this scenario?
    Last edited: Jul 3, 2020
  2. ZBZB


  3. 931


    Checked out IQ and Poly before.
    Does Nanex also offer historical bid/ask data?
    If so, how far back?
  4. 2rosy


    You're processing historical tick data, correct? Then the rate at which messages arrived in the real world doesn't matter; all you need to do is keep everything in sync. You can slow everything down and use Excel, or do something debugger-style and step through.
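    That replay idea can be sketched as a generator that preserves the recorded inter-arrival gaps, with the sleep function injectable so a test or debugger-style run can step through instantly. Purely illustrative:

```python
import time

def replay(ticks, speed=1.0, sleep=time.sleep):
    """Replay (ts_ms, payload) ticks preserving inter-arrival gaps.

    speed > 1 compresses time; pass a no-op sleep to step through
    with no waiting at all. Illustrative sketch, not a real harness.
    """
    prev = None
    for ts_ms, payload in ticks:
        if prev is not None and ts_ms > prev:
            sleep((ts_ms - prev) / 1000.0 / speed)  # wait the recorded gap
        prev = ts_ms
        yield payload
```

    Because the delay comes from the recorded timestamps rather than the wall clock of the original session, the same stream can be replayed at 1x for realism or effectively infinitely fast for correctness checks.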
  5. Ponmo


    Nanex currently offers historical market data from January 2004 to the present day.
  6. 931


    Is AlgoSeek's historical data usable as it comes, or does it still need processing afterwards?
    How much do they charge?
  7. If I understand you correctly, any old machine can do this. The amount of data that comes across the wire is trivial.
    I use IQFeed.
    Watching every tick and every update (bid/ask change etc.) for over 1,000 of the most liquid stocks (that's my filter) on an old i7 desktop (3.4 GHz, 16 GB RAM), I run about 3% CPU usage after the first 30 seconds at the open. Approximately 99% of that is parsing the incoming data (I ran a profiler). The rest (1%) is running those 1,000 systems. I have very few graphics (it is possible to graph anything internally, but I have not done so for many years).
    The second 2,000 most liquid symbols will take much less cpu time than the first 1,000. Many won't even trade some days.
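    For what it's worth, parsing one comma-delimited text update of the kind IQFeed sends is cheap per message; the cost comes from the sheer volume. A toy parser in Python (the field order below is hypothetical; the real field list is configurable and documented by the provider):

```python
# Toy parser for a comma-delimited quote update, in the style of a
# text feed protocol. Field order here is an assumption for the
# example, not IQFeed's actual layout.
def parse_update(line: str):
    sym, bid, ask, last, size = line.rstrip("\r\n").split(",")
    return sym, float(bid), float(ask), float(last), int(size)

parse_update("AAPL,364.10,364.12,364.11,100")
```

    The split and the string-to-number conversions are exactly the per-message work the profiler attributes to parsing; a binary protocol would replace all of it with a fixed-offset read.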
    Just make sure you write your code in C++, or in C#, Java, or anything else that runs on the CLR or JVM.
    I don't have experience with scripting languages so maybe someone else can let you know if that would work.

    It's rare, these days, for me to run 1,000 systems. I usually run a setup program the night before to narrow down my symbol list. This might be something to consider.
  8. 931


    You clearly have more efficient algos. ~3% usage is great; I'd assume you wrote it in C/C++ since you recommend it first.

    It seems plausible that parsing the text-based incoming data stream is most of the pre-processing bottleneck.
    The next steps I take after parsing are: send over TCP in a more efficient format -> build hi-lo bars -> data validity checks -> some math before feeding the evaluations.
    Those are quite cheap as well.

    In my situation the biggest bottleneck is evaluating the stocks, caused by constant cache misses due to relatively random memory access patterns.
    Running every tick through is not possible at the moment due to the lag it would create.
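    A common fix for that access pattern is a struct-of-arrays layout: intern each symbol to a dense integer id once, then keep per-symbol state in contiguous typed arrays so evaluation sweeps memory sequentially. Sketched in Python for brevity; the layout idea is what matters, and the real cache-behavior gains show up in C++ or NumPy:

```python
from array import array

# Struct-of-arrays layout: dense integer symbol ids index contiguous
# typed arrays of per-symbol state, instead of hashing strings into
# scattered heap objects on every tick.
symbol_id = {}          # "AAPL" -> 0, "MSFT" -> 1, ... (assigned once)
last_px = array("d")    # contiguous doubles, indexed by symbol id

def intern(sym: str) -> int:
    sid = symbol_id.get(sym)
    if sid is None:
        sid = len(symbol_id)
        symbol_id[sym] = sid
        last_px.append(0.0)  # grow state arrays in lockstep
    return sid

def on_trade(sym: str, px: float):
    last_px[intern(sym)] = px  # O(1) write into the dense array

on_trade("AAPL", 101.5)
on_trade("AAPL", 102.0)
```

    With this layout, a full evaluation pass iterates the arrays in index order, which is the memory access pattern prefetchers and caches handle best.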

    What do you look for when filtering the symbol list?
    Past volume?
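    As a generic illustration of that kind of nightly pre-filter (the criterion and cutoffs below are placeholders, not the earlier poster's actual rules):

```python
# Hypothetical liquidity pre-filter: keep the n symbols with the
# highest trailing average daily volume. The numbers are placeholders.
def top_liquid(avg_daily_volume: dict, n: int):
    """avg_daily_volume: {symbol: shares/day}. Returns the top-n symbols."""
    return sorted(avg_daily_volume, key=avg_daily_volume.get, reverse=True)[:n]

top_liquid({"AAPL": 30e6, "XYZ": 5e4, "MSFT": 25e6}, 2)
```

    Run the night before, this shrinks the live watch list before the open, which is cheaper than filtering in real time.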
    Last edited: Jul 18, 2020
    #10     Jul 19, 2020