affordable historical futures price data providers

Discussion in 'Commodity Futures' started by fm_88, Sep 16, 2022.

  1. fm_88

    fm_88

    Which affordable historical futures price data providers do you use / recommend? Not only commodity futures - all of them, including financial, index, and rates futures.
     
  2. TrAndy2022

    TrAndy2022

    anfutures(.)com and kibot(.)com if you are looking for cheap tick data (or 1min). Norgate or Metastock (Xenith) [which is basically Reuters data] for daily data.
     
    fm_88 likes this.
  3. NorgateData

    NorgateData Sponsor

    MarkBrown likes this.
  4. MarkBrown

    MarkBrown

  5. Databento

    Databento Sponsor

    Feel free to ping me. We have historical data for the whole gamut:
    - Financials, interest rates, ags, metals, energies, etc. on CME.
    - Any kind of granularity. Daily, 1 second, 1 minute, full order book, tick, and depth snapshots.
    - About 500-700k symbols including the spreads, combination products, and options on futures.

    The main difference from most of the providers listed is that we actually source the data from CME directly - their hand-off is 1 switch hop, 4 nanoseconds away, in their Aurora I data center. We also have extremely fast internal and external compression; one of our engineers was a collaborator with Daniel Lemire, of SIMD bitpacking fame. The historical service streams full exchange order book data at 1.4M data points per second to my laptop on home WiFi. :)
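
    A rough sanity check of what that streaming rate implies for bandwidth - the per-record size and compression ratio below are illustrative assumptions, not quoted figures:

    Code:
    # Rough bandwidth implied by streaming 1.4M records/s.
    # Per-record size and compression ratio are illustrative assumptions.
    records_per_sec = 1_400_000
    bytes_per_record = 48        # assumed size of a normalized binary record
    compression_ratio = 4        # assumed wire-level compression

    raw_bps = records_per_sec * bytes_per_record * 8
    wire_bps = raw_bps / compression_ratio

    print(f"uncompressed: {raw_bps / 1e6:.0f} Mbit/s")   # ~538 Mbit/s
    print(f"compressed:   {wire_bps / 1e6:.0f} Mbit/s")  # ~134 Mbit/s - feasible on good WiFi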

    I can send you an invite for the next test group around Oct 3.
     
    MarkBrown likes this.
  6. M.W.

    M.W.

    Confused - how do your latency and bandwidth relate to historical data? Both are meaningless in the context of historical data, which is what this thread is about.

     
  7. Databento

    Databento Sponsor

    Only bandwidth, not latency.

    Full exchange order book data runs about 98 GB (median) to 201 GB (peak) per day this year, even after normalization and compression, and that figure will only increase over time, so it takes a significant amount of bandwidth to stream it even for historical use. That's a primary reason only a handful of data providers offer historical full order book data over the internet: IP transit is expensive, and even if the vendor is capable of storing the data, it's impractical for the client to receive it over the internet if their bandwidth is low.
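
    To put those sizes in perspective, here's a quick back-of-the-envelope on how long one day of order book data takes to transfer - the day sizes are the figures above; the link speeds are just examples:

    Code:
    # How long one day of full order book data takes to download at example link speeds.
    # Day sizes are the median/peak figures quoted above; link speeds are assumptions.
    day_sizes_gb = {"median day": 98, "peak day": 201}
    link_speeds_mbps = [100, 1_000, 10_000]

    for label, size_gb in day_sizes_gb.items():
        bits = size_gb * 8e9
        for mbps in link_speeds_mbps:
            hours = bits / (mbps * 1e6) / 3600
            print(f"{label}: {size_gb} GB over {mbps} Mbit/s -> {hours:.2f} h")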

    The 1 switch hop also matters - not because of latency, but because of data quality, specifically timestamping determinism. If your timestamps are applied too far downstream in your infrastructure, the network introduces significant variance into them; the timestamps can fluctuate enough to render some use cases pointless. And many of our customers do care a lot about this.

    Without namedropping anyone, one of the reasons we started Databento was that the most popular service for historical order book data would (not sure if they still do this) actually transport their historical data to the London Docklands data center - even if the data originates from ASX - before applying a microsecond-resolution timestamp. This gives a false sense of precision, because the timestamp actually includes a non-deterministic offset that varies by milliseconds when you're going over such long distances.
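
    A minimal sketch of why, assuming a ballpark fibre path length from Sydney to London - the fixed propagation delay is huge relative to 1 us resolution, and even a small fractional variation in it swamps the stated precision:

    Code:
    # Back-of-the-envelope: propagation delay from Sydney (ASX) to London over fibre.
    # Path length and jitter fraction are illustrative assumptions, not measured values.
    path_km = 20_000               # assumed fibre route, longer than the great circle
    speed_in_fibre_km_s = 200_000  # light in fibre travels at roughly 2/3 of c

    one_way_delay_s = path_km / speed_in_fibre_km_s
    print(f"fixed one-way delay: ~{one_way_delay_s * 1e3:.0f} ms")  # ~100 ms

    # Even a 1% variation in that path delay dwarfs the 1 us timestamp resolution.
    jitter_s = one_way_delay_s * 0.01
    print(f"plausible jitter: ~{jitter_s * 1e3:.0f} ms vs 0.001 ms resolution")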

    And this same effect matters, though at smaller time scales, even if you're comparing two data providers that are colocated! For example, we've seen some switches display a dispersion of anywhere from 0.35 to 0.7 us in their port-to-port latencies when under significant load. If the timestamping takes place after a hop through such a switch, that can throw off your backtesting, because that kind of dispersion is hard to parameterize and replicate in your production environment.
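
    Here's a minimal simulation of that effect, using the 0.35-0.7 us dispersion range above and a made-up true inter-message gap - the measured gaps pick up noise on the order of the dispersion itself:

    Code:
    import random

    # Simulate timestamping after a loaded switch: each message picks up a uniform
    # 0.35-0.7 us of transit time before its timestamp is applied (range quoted above).
    random.seed(0)
    true_gaps_us = [5.0] * 10_000  # assumed (made-up) true inter-message gap

    delays = [random.uniform(0.35, 0.70) for _ in range(len(true_gaps_us) + 1)]
    measured_gaps_us = [
        gap + (delays[i + 1] - delays[i]) for i, gap in enumerate(true_gaps_us)
    ]

    errors = [m - t for m, t in zip(measured_gaps_us, true_gaps_us)]
    print(f"max gap error: {max(abs(e) for e in errors):.2f} us")  # approaches 0.35 us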
     
    Last edited: Sep 16, 2022
  8. M.W.

    M.W.

    Bandwidth - I disagree, as I'm sure you guys have taken care of bandwidth issues on your end. The real bottleneck for such large datasets is the trip from your servers to your customers, and that's not in your hands regardless of what bandwidth you have.

    Re timestamps, that's crazy. Timestamps should originate at the order matching engines on the exchanges, nowhere else. Attaching timestamps downstream is plainly and simply wrong on so many levels.

     
  9. Databento

    Databento Sponsor

    Compression, serialization, load balancer, router/firewall throughput, I/O from storage to the API server, the congestion control algorithm on whatever's terminating TCP in the path, etc. all have a significant effect on the message rate you experience on the client side, and these are more likely to be the bottleneck than the IP transit provider. The moment your data provider opts to serve the data as JSON, it's likely going to transfer far fewer records per second than another provider that serves it as CSV or binary.
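
    As a small illustration of the serialization point, here's the same tick encoded as JSON versus fixed-width binary - the field layout is made up for the example, not any provider's actual schema:

    Code:
    import json
    import struct

    # One illustrative tick record; the field layout is made up for this example.
    tick = {"ts": 1663372800123456789, "price": 4385.25, "size": 12, "side": 1}

    json_bytes = json.dumps(tick).encode()
    # Fixed-width binary: u64 timestamp, f64 price, u32 size, u8 side flag.
    binary_bytes = struct.pack("<QdIB", tick["ts"], tick["price"], tick["size"], tick["side"])

    print(len(json_bytes), "bytes as JSON")      # ~68 bytes, plus parsing cost
    print(len(binary_bytes), "bytes as binary")  # 21 bytes, trivially decodable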

    Even if the IP transit provider were the bottleneck as you're stipulating, that's still within our control:
    • Ever wondered why, when you pay your home internet provider (or, for that matter, a commercial internet provider at a data center) for 1 Gbps, you can almost never reach that 1 Gbps in a speed test? Often your provider doesn't peer directly with other providers, or has a slow route to them, so your traffic takes many public hops between source and destination - and that's where the bandwidth decays.
    • This gets worse when your network provider is in peering disputes with others - typically the case with "lower cost" providers like HE and Cogent. In one extreme case we've even seen traffic from Chicago to Aurora take a round trip over the Atlantic.
    The solution is precisely as you've said - to take matters into our own hands: aggregate multiple transit providers and several private peering arrangements, and run the full routing table on our side so there are diverse routes between us and our customers. That's partly why we operate our own AS (AS400138). If you run a traceroute to our historical gateway (hist.databento.com), you'll see that we're directly connected to almost every major network (Lumen, Telia, NTT, Google, Amazon, Akamai, Comcast, etc.) - most of our competitors are connected to only one or two. Let me know if there's another provider that is as densely connected. :)
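
    If you want to check that yourself, something like this runs the traceroute and prints the hops (a rough sketch, assuming a Unix-like machine with the traceroute utility installed):

    Code:
    import subprocess

    # Run a traceroute to the historical gateway and print each hop.
    # Assumes a Unix-like system with the traceroute utility on the PATH.
    host = "hist.databento.com"
    result = subprocess.run(
        ["traceroute", "-n", host], capture_output=True, text=True, timeout=120
    )
    print(result.stdout)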

    If your use case only requires the matching timestamp, that's great: we provide the matching and sending timestamps, along with our own receive and (for real-time) egress timestamps - four timestamps in total.

    That said, in most of the cases we've come across at large trading firms, you actually need the receive timestamp, and some firms are willing to pay 10x as much just for it:
    1. The trivial case is when the venue only provides a low-resolution timestamp or only synchronizes against an NTP source that exhibits high jitter against UTC. This fortunately isn't the case with CME, but we see it often on ATSs and ECNs. In those situations, you might only be getting timestamping precision within 1-5 ms from the venue's native timestamps, whereas our receive timestamps would be accurate to within 1 us of UTC.
    2. You need the receive timestamp to calibrate the delay between the matching timestamp and when the client can realistically expect to act on that information, and to know how stale the data is. This varies with the matching engine and the load on the data gateway, and can run into the milliseconds (see the short sketch after this list).
    3. When you're multiplexing different prop feeds from different venues, as we and our target users do, the receive timestamp is arguably more important, because the venues synchronize against different clock sources and timestamp their data at different frames of reference. Without the receive timestamp it's very hard to interpret muxed feeds.
    4. Some users prefer to backtest against a receive timestamp with guarantees of monotonicity, since not all venues guarantee monotonic timestamps. If properly synchronized against a GPS source provided by the venue, it can be just as accurate as using the matching timestamp.
    5. CME does guarantee monotonicity, but you only need to look back to July 17-22, 2016, when a system error caused the tag 60 timestamps to become inaccurate - having receive timestamps safeguards against such errors.
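
    To illustrate point 2 - a minimal sketch with made-up nanosecond timestamps, where the feed delay on each record is simply the receive timestamp minus the matching timestamp:

    Code:
    import statistics

    # Illustrative records with made-up nanosecond timestamps: the matching-engine
    # timestamp vs. the receive timestamp at the capture point.
    records = [
        {"ts_match": 1_663_372_800_000_000_000, "ts_recv": 1_663_372_800_000_145_000},
        {"ts_match": 1_663_372_800_000_200_000, "ts_recv": 1_663_372_800_000_390_000},
        {"ts_match": 1_663_372_800_000_500_000, "ts_recv": 1_663_372_800_002_100_000},
    ]

    delays_us = [(r["ts_recv"] - r["ts_match"]) / 1_000 for r in records]
    print(f"median feed delay: {statistics.median(delays_us):.0f} us")
    print(f"worst feed delay:  {max(delays_us):.0f} us")  # spikes under gateway load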
     
    Last edited: Sep 17, 2022
  10. M.W.

    M.W.

    Can you please elaborate on your second point and give an example?

     
    #10     Sep 17, 2022