Live data feed options and processing

Discussion in 'Data Sets and Feeds' started by 931, Jul 3, 2020.

  1. dholliday

    dholliday

    My platform was written for speed and flexibility. I want to be able to write anything I can dream up. That said, it's so fast that I have never bothered to optimize anything.

    I switched to Java from C++ in 1999. I would not recommend C/C++ for this type of project. Use C# (or other CLR languages) or Java (or other JVM languages) unless you already program in C/C++. There is very little difference in speed. The real use for C is in modifying the Linux kernel for HFT.

    Like you, I put the data in a different format and distribute it to where it is needed. The only data checking I do is to delete any trade whose total daily volume is less than that of the previous trade. I've never checked whether this makes a difference, but the goal is to drop any trades reported late (they are not actionable).
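
    The late-trade filter described above can be sketched roughly as follows. The `Trade` struct and its field names are hypothetical, assuming the feed reports a running total daily volume with each trade:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// A trade event as it might arrive from a feed; the running
// total-daily-volume field is what the filter keys on.
struct Trade {
    std::string symbol;
    double      price;
    int64_t     totalDailyVolume;  // cumulative volume for the day
};

class LateTradeFilter {
public:
    // Returns true if the trade should be kept. A trade whose
    // cumulative daily volume is lower than the last one seen for
    // the same symbol was reported late and is not actionable.
    bool keep(const Trade& t) {
        int64_t& last = lastVolume_[t.symbol];  // starts at 0 for new symbols
        if (t.totalDailyVolume < last)
            return false;                       // late report: drop it
        last = t.totalDailyVolume;
        return true;
    }

private:
    std::unordered_map<std::string, int64_t> lastVolume_;
};
```

    The per-symbol map is the whole state, so this stays O(1) per trade regardless of how many symbols are tracked.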

    How is your data stored that it is not all in memory? I just keep what I need in memory; very fast. You can store an awful lot of doubles and longs in a megabyte of memory, and access can be very fast. Also, some programs and libraries require you to put your data into a specific data structure, which is very bad design. I think TA-Lib was this way, and maybe AmiBroker. Anyway, my data is not in their format, and there is no way I can convert it into their format every time I want to do a calculation and expect it to be fast.

    Many years ago a friend and I implemented the same system, me on my platform and him on AmiBroker (a very nice program for which I have bought several licenses over the years). Since AmiBroker was not built for real-time trading, his code did calculations every 1 to 3 minutes, depending on how long the calculations took. I suspect that his code was single-threaded; throwing all that work at it at one time took a while. Spreading the work out over time and threads makes a big difference.

    Filtering the symbol list:
    Average daily volume
    Price above a certain level
    Stock only
    No REITs
    ETF only (leveraged / not leveraged)
    Industry Groups (haven't actually done this)

    Different systems may work best with different symbol lists.
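
    The filters listed above amount to a predicate over per-symbol metadata. A minimal sketch, with hypothetical field names and thresholds:

```cpp
#include <cassert>
#include <string>

// Hypothetical per-symbol metadata a scanner might maintain.
struct SymbolInfo {
    std::string symbol;
    double avgDailyVolume;   // shares/day
    double lastPrice;
    bool   isStock;
    bool   isReit;
    bool   isEtf;
    bool   isLeveragedEtf;
};

// Criteria mirroring the list above; each system can carry its own.
struct FilterCriteria {
    double minAvgDailyVolume = 1'000'000;  // example threshold
    double minPrice          = 5.0;        // example threshold
    bool   stocksOnly        = true;
    bool   excludeReits      = true;
};

bool passes(const SymbolInfo& s, const FilterCriteria& c) {
    if (s.avgDailyVolume < c.minAvgDailyVolume) return false;
    if (s.lastPrice < c.minPrice)               return false;
    if (c.stocksOnly && !s.isStock)             return false;
    if (c.excludeReits && s.isReit)             return false;
    return true;
}
```

    Keeping one `FilterCriteria` per system makes it cheap to give each strategy its own symbol list from the same metadata.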

    Writing your own code will allow you to do whatever you want, but it is a lot of work.
    Take care and best of luck.
     
    #11     Jul 20, 2020
  2. algoseek

    algoseek Sponsor

    algoseek's historical data is valid as it comes and needs no further processing.
     
    #12     Jul 20, 2020
  3. 931

    931

    Same design philosophy here.
    Using proprietary platforms would have created many limitations with awful workarounds.
    I understood that early enough and never coded anything other than bridges leading out of proprietary apps.

    Then your ideas were optimal from the beginning, or you know how to work around problems.

    I build with Qt5/C++ and maintain it as a cross-platform app for Linux/Windows/Mac.
    Lately I have been using mostly Linux due to better compiler and profiling support.

    All performance-critical parts use parallel+SIMD code.
    I learn and implement new optimization tricks from time to time.

    My ideas are quite inefficient and ML-related: finding non-random features and separating them from the chaos.
    The memory access patterns are not cache friendly, and optimizations won't help much unless the core ideas and access patterns change.


    If the data set is smaller than memory, all backtesting data stays in memory.
    If it is larger, I use an M.2 SSD based cache system that auto-optimizes from past access patterns.

    https://www.elitetrader.com/et/threads/tick-data-storage.346878/
    I discuss some related ideas in that thread; a lot is still open and undone.
    The current plan is to use a motherboard with 2-3 M.2 drives and create a software RAID 0, in hopes of getting closer to RAM speed for sequential reads.
    Maybe you can recommend alternative ideas.
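
    Before committing to OS-level RAID 0, the striping idea can also be prototyped in application code: keep one stripe file per drive and read them on parallel threads, then reassemble in fixed-size chunks. A rough sketch under that assumption (the paths are placeholders; real gains depend on each file living on a separate physical M.2 device):

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <thread>
#include <vector>

// Read several stripe files in parallel and reassemble them into
// one buffer, interleaved in fixed-size chunks (RAID 0 layout).
std::vector<char> readStriped(const std::vector<std::string>& paths,
                              std::size_t chunkSize) {
    std::vector<std::vector<char>> parts(paths.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < paths.size(); ++i) {
        workers.emplace_back([&, i] {
            // Each thread slurps one stripe file (one per drive).
            std::ifstream in(paths[i], std::ios::binary);
            parts[i].assign(std::istreambuf_iterator<char>(in),
                            std::istreambuf_iterator<char>());
        });
    }
    for (auto& w : workers) w.join();

    // Interleave: chunk 0 of drive 0, chunk 0 of drive 1, ...
    std::vector<char> out;
    std::size_t offset = 0;
    bool more = true;
    while (more) {
        more = false;
        for (const auto& p : parts) {
            if (offset < p.size()) {
                std::size_t n = std::min(chunkSize, p.size() - offset);
                out.insert(out.end(), p.begin() + offset,
                           p.begin() + offset + n);
                more = true;
            }
        }
        offset += chunkSize;
    }
    return out;
}
```

    This keeps the layout under your control (chunk size, drive count), which an OS-managed swap or md RAID would decide for you.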

    Lately I have also thought about testing whether I can get decent performance by just setting up a large Linux swap partition on the fastest M.2 drives.
    But then there is no control over how it gets buffered, as the OS decides.
    To my knowledge Linux can also use LZ4 compression for the swap drive, but at 5000 MB/s the CPU becomes the bottleneck and a much simpler algorithm would be needed.

    Some modern CPUs have 60 MB+ caches now, so a ~50 MB working set can probably run very fast as well.

    Do you mean that memory access patterns and performance are better if the data is kept in separate containers, instead of a single container of structs holding everything?
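
    For reference, the distinction in this question is usually called array-of-structs versus struct-of-arrays. Scanning one field over many bars touches far less memory in the second layout; a minimal illustration (the `Bar` fields are generic OHLCV, not anyone's actual schema):

```cpp
#include <cstddef>
#include <vector>

// Array-of-structs: one container of bars; a close-only scan still
// drags every other field through the cache with it.
struct Bar {
    double open, high, low, close;
    long long volume;
};

double sumClosesAoS(const std::vector<Bar>& bars) {
    double s = 0.0;
    for (const Bar& b : bars) s += b.close;  // strided access
    return s;
}

// Struct-of-arrays: one container per field; the same scan reads
// a single dense, cache-friendly (and SIMD-friendly) array.
struct Bars {
    std::vector<double> open, high, low, close;
    std::vector<long long> volume;
};

double sumClosesSoA(const Bars& bars) {
    double s = 0.0;
    for (double c : bars.close) s += c;      // contiguous access
    return s;
}
```

    Both give the same answer; the SoA version simply streams one array, which is also what auto-vectorizers prefer.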

    How long have you been working on your platform? I have spent ~6 years, and a lot could still be optimized and improved. Lately I get some time off due to long test periods.
     
    Last edited: Jul 20, 2020
    #13     Jul 20, 2020
  4. algoseek has live data for institutional users, but as far as I can remember it costs thousands per month for a co-located low-latency feed.
     
    #14     Jul 20, 2020
  5. rkr

    rkr

    Actually, not much at all.

    The "gathering" part is usually an I/O bound operation.

    To simulate what you should experience: I randomly picked a day on the higher side (just some vol spikes I recalled) and the opening +/-30 minutes of the SIPs. You are probably listening to a merged feed from your vendor, so I just counted the A-side interface: combining all ~10k tickers and every event (status messages, BBO, trades) on UTDF (95M), UQDF (1.4G), CTA (290M), and CQS (5.7G) comes to only about 7.48 GB.

    For good measure, I combined this with the same ~10k tickers for the direct feeds with full order book and status messages of most lit venues: BYX Depth (648M), BZX Depth (1.2G), EDGA Depth (509M), EDGX Depth (581M), BX ITCH 5.0 (641M), PSX ITCH 5.0 (287M), NSDQ ITCH 5.0 (2.5G), ARCA XDP (1.5G), ARCA XDP Trades (40M), NYSE OpenBook Ultra (2.4G), NYSE Trades (29M). I don't have IEX's feed. So a total of 18.23 GB.

    Now I stream that 18.23 GB from one machine to my laptop as fast as my WiFi network will allow, losslessly, and it registers 0.0% CPU on my 4-year-old Intel Core i7-7500U to store it.

    It's a different story if you are storing these on a colocated machine and need to buffer for the bursts. There the minimum viable solution is to patch the default kernel network socket, but you can make do with a standard Intel 10G Ethernet adapter, which you can buy for less than $100, and I've seen people do this on a broker's rack with fairly generic Cisco 3064 switches that are past EOL today and sell for about $700. You shouldn't have this requirement, because your data vendor's gateway should effectively act as a messaging queue to offload this for you.

    Now, the "processing" part is entirely a function of your data provider's protocol or SDK and your own application, and it depends entirely on you how efficient you want to be. What matters is not so much which language you use as how efficient your implementation is. And usually the core platform logic has very little variance or tail in latency; it's just the signals.

    I get the sense from your posts, from this thread and the one you've linked, that you enjoy the optimization and are just looking for reference benchmarks. (That's OK and an admirable choice, but not always important in my opinion.) So if it helps, here are a few earnest benchmarks of single signals I've seen... The worst offenders I've seen in production include integer programming for an execution trajectory at runtime, Monte Carlo approximations to variational PDEs that spat out a whole day's execution trajectory on each call, and some exotic pricing models. Naive implementations of ML-type signals I've seen are usually O(n^2) in the data points and mostly do a bunch of matrix operations or heavy resampling on the whole dataset on every evaluation.
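
    The usual fix for that O(n^2) pattern is to make the signal incremental: update running state per tick instead of resampling the whole dataset on every evaluation. A toy illustration with a rolling mean (not any specific signal from this thread):

```cpp
#include <cstddef>
#include <deque>
#include <numeric>
#include <vector>

// Naive version: recompute over the whole window on every tick.
// O(n) per evaluation, hence O(n^2) over a stream of n ticks.
double rollingMeanNaive(const std::vector<double>& window) {
    return std::accumulate(window.begin(), window.end(), 0.0) /
           static_cast<double>(window.size());
}

// Incremental version: O(1) per tick, same result.
class RollingMean {
public:
    explicit RollingMean(std::size_t n) : n_(n) {}

    double update(double x) {
        buf_.push_back(x);
        sum_ += x;
        if (buf_.size() > n_) {   // evict the oldest sample
            sum_ -= buf_.front();
            buf_.pop_front();
        }
        return sum_ / static_cast<double>(buf_.size());
    }

private:
    std::size_t n_;
    std::deque<double> buf_;
    double sum_ = 0.0;
};
```

    The same state-carrying trick applies to variances, regressions, and many matrix-based signals via rank-one updates, which is usually where the "5-10 mics per evaluation" numbers come from.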

    All in all, the slowest I've ever seen it cost in production was about 6-15 seconds per update and this was a state-of-the-art fastest model with one of the largest cross-sections (global multi-asset, multiple portfolios, messy client constraints from different fund classes, individual notional gross positions probably as large as 4B and 1.2T off-balance sheet contractuals in credit counterparty trades). The next slowest I've seen was 1-2 milliseconds mostly moving the entire array across PCIe bus to a GPU on a Java library. Other record slow examples I've seen take 5-10 mics per evaluation using Intel MKL with .NET wrapper or some highly customized econometric model that takes 2-8 mics per evaluation using Eigen C++. These were each on different strategies printing in excess of $100k per day with multi-million dollar annual development budgets.

    You can multiply these by the number of signals you have to see how you compare with the slowest.
     
    Last edited: Jul 20, 2020
    #15     Jul 20, 2020
    Occam, DiceAreCast, 931 and 3 others like this.
  6. 931

    931

    I don't understand all the data/speed terms you use, but I get at least O(n^2) complexity, up to very steep exponential complexity, depending on settings.

    Just tested: ~820 ms average per evaluation of 498 stocks (S&P 500 list) if forcing all of them to run through.
    That is ~1.64 ms per instrument with near-minimal acceptable settings.

    Using an about five-year-old PC: a 5960X @ 4.6 GHz, 25%+ overclocked CPU and cache, tightened memory timings, etc.

    I have not tested much with timeframes where spreads eat everything away.
    But for retail-level conditions, even 100 ms+ per-instrument evaluations are fine IMO, as long as new data can come in at the same time on networking threads.
    It takes much longer than that just to break even with the spread.
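
    Letting new data arrive on networking threads while an evaluation runs is a standard producer-consumer handoff. A minimal sketch with a locked queue (real feed handlers would likely use a lock-free ring buffer, but the structure is the same):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Thread-safe queue: networking threads push ticks while the
// evaluation thread is busy, then it drains everything at once.
template <typename T>
class TickQueue {
public:
    void push(T t) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(t));
        }
        cv_.notify_one();
    }

    // Blocks until at least one item is available, then drains all
    // accumulated items in one batch.
    std::vector<T> drain() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        std::vector<T> out;
        while (!q_.empty()) {
            out.push_back(std::move(q_.front()));
            q_.pop();
        }
        return out;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};
```

    With this shape, a 100 ms evaluation never blocks the feed: pushes keep landing in the queue, and the next `drain()` picks up the whole backlog.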
     
    Last edited: Jul 21, 2020
    #16     Jul 21, 2020
  7. Could you please tell me which data stream provider you used to obtain full book data for that many exchanges? Most only provide top-of-book quote streams, plus full book for one or two exchanges, usually just NASDAQ.
     
    #17     Mar 17, 2021
  8. rkr

    rkr

    I subscribed to all of the exchanges' prop feeds directly.
     
    #18     Mar 17, 2021
    arbs-r-us and howdoyouturnthison like this.
  9. Do you use an intermediary for provisioning? I can't imagine what the total monthly fees and costs for this would be...
     
    #19     Mar 21, 2021
  10. It's not relevant in this context whether you implement in C++, C#, or Java. They all perform as @dholliday suggested. I have similar profiling results. I stream 500+ symbols into my engine on a tick basis.

     
    #20     Mar 29, 2021
    dholliday likes this.