Databento - Direct market data feeds for everyone, built by former HFT traders

Discussion in 'Announcements' started by Databento, Jul 12, 2022.

  1. Databento

    Databento Sponsor

    Yes. The reason being that we see our main value-add as the infrastructure/technology layer, not as a data licensor.

    To be specific, we don't impose any licensing restrictions in our user agreements. For almost every market, once data is older than 24 hours, there's nothing on the market's end that restricts a user from redistributing it or redistributing derived data based on it. I can only think of one market operator (CME) that restricts historical redistribution. If your current vendor does impose restrictions on historical redistribution, there's a high chance those restrictions are coming from the vendor rather than the market.

    As for redistributing live data (24 hours old or newer), that's where markets generally require you to pay for a redistribution license. We simply facilitate the process of obtaining the redistribution license from the market operator and pass through those fees. If a user has a redistribution license from the market, we don't impose any limitations on their redistributing the data. It's quite cost-prohibitive, though: often in excess of $5k MRC per feed for external redistribution. If the purpose is merely internal redistribution, we recommend that most of our users use us as a vendor-of-record and break down the usage by subscriber count.
     
    #21     Jul 14, 2022
    shuraver likes this.
  2. Databento

    Databento Sponsor

    Do you mean metrics data? Or metrics about the service itself (latency, amount of data, etc.)?

    On exchanges that do provide static data like EOD volume, open interest, etc., we pass those on. On top of that, we compute and provide daily liquidity metrics (e.g., event counts, percentiles of touch depth, average spread).
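    To make the metrics above concrete, here's a rough sketch of how daily liquidity metrics like these could be computed from top-of-book quote records. The field names (`bid_px`, `ask_px`, `bid_sz`, `ask_sz`) and the nearest-rank percentile are illustrative choices for this example, not Databento's actual schema or methodology:

    ```python
    from statistics import mean

    def _percentile(sorted_vals, p):
        """Nearest-rank percentile (p in 0..100) on a pre-sorted list."""
        k = round(p / 100 * (len(sorted_vals) - 1))
        return sorted_vals[k]

    def daily_liquidity_metrics(quotes):
        """quotes: iterable of dicts with bid_px, ask_px, bid_sz, ask_sz."""
        spreads, depths = [], []
        for q in quotes:
            spreads.append(q["ask_px"] - q["bid_px"])
            depths.append(q["bid_sz"] + q["ask_sz"])  # size at the touch
        depths.sort()
        return {
            "event_count": len(spreads),
            "avg_spread": mean(spreads),
            "touch_depth_p50": _percentile(depths, 50),
            "touch_depth_p90": _percentile(depths, 90),
        }
    ```

    In practice these would be computed per instrument per session day, over far larger event counts.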
     
    #22     Jul 14, 2022
    shuraver likes this.
  3. Databento

    Databento Sponsor

    We haven't publicly finalized a launch date yet, but it's likely to be between Aug 23 and Sep 13.

    It's quite hard to time our announcements, and I understand that causes frustration for some folks (@M.W.); we hear you. Keep in mind it's mainly because of the sheer number of moving pieces. (1) We plan to do a mass regeneration of a large chunk of data (6+ PB) to make sure everything's clean, and that just takes a long time and has a lot of variance because of the amount of data involved. (2) We have a fairly large waitlist and don't want the service to be underprovisioned for users we've already onboarded. All of our storage and IP transit is self-hosted, so scaling everything is a little trickier than if we were on a cloud platform.
     
    #23     Jul 14, 2022
    jtrader33 likes this.
  4. 2rosy

    2rosy

    I mean latency metrics on the service. Also, does Databento subscribe to any competitors' feeds to compare metrics? For the realtime feed, does the SDK handle reconnects, missed messages, etc.? Will UDP be offered in the future? How large are the messages, and what bandwidth is recommended?
     
    #24     Jul 14, 2022
  5. Databento

    Databento Sponsor

    Here's a diagram breaking down our latency. It's currently ~41 µs through the stack up to our load balancer and firewall. The diagram is a little outdated and shows <2 ms for the load balancer and firewall, but the min/mean/max through that segment is now 0.7/63.7/310 µs after an upgrade we made two weeks ago.

    [IMG: latency breakdown diagram]

    Most of the latency outside our stack will be dominated by distance and your network provider. You can ping/traceroute the public loopback for our Aurora I load balancer (dc3.databento.com), for example. If you're hitting our CME gateway from Aurora I/II or from 350 E. Cermak, it will likely be sub-millisecond or about 1 ms one-way, respectively.

    If you're hitting our NY4 gateway from anywhere on the Equinix campus in Secaucus, it will likely be sub-millisecond one-way. (You'll probably get the best numbers in NY4 and NY2, where we terminate IP transit.)

    We haven't really optimized for latency since the service is initially targeted at internet users. We'll probably push for about 5 µs (2 switch hops, 2 PCIe hops, and some time in userland on the host CPU) when we open up to colo users as well.

    Not really, but our engineers have mostly worked with Activ, Bloomberg B-PIPE, Celoxica, MayStreet, Redline, QuantHouse, and SR Labs (now Exegy). It wouldn't be in good spirit to critique their stacks.

    It uses TCP as the transport protocol, so you won't miss messages unless you lose the connection outright. We provide an intraday replay covering up to the last 24 hours that you can opt into upon subscription, so that can be used for recovery.
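    The recovery pattern described above (TCP stream plus intraday replay) can be sketched roughly as follows. The interface is invented for illustration: `connect(replay_from=...)` stands in for whatever the actual client library exposes, returning an iterable of `(ts_event, payload)` tuples:

    ```python
    import time

    def stream_with_recovery(connect, max_retries=5,
                             backoff=lambda n: min(2 ** n, 30)):
        """Yield (ts_event, payload) messages from a live TCP feed.

        On disconnect, reconnect and request an intraday replay starting at
        the last event timestamp seen, dropping the duplicated overlap, so
        no events are silently lost.
        """
        last_ts = None
        retries = 0
        while retries <= max_retries:
            try:
                conn = connect(replay_from=last_ts)
                for ts, payload in conn:
                    if last_ts is not None and ts <= last_ts:
                        continue          # duplicate from the replay overlap
                    last_ts = ts
                    retries = 0           # a healthy stream resets the backoff
                    yield ts, payload
                return                    # clean end of stream
            except ConnectionError:
                retries += 1
                time.sleep(backoff(retries))
        raise ConnectionError(f"gave up after {max_retries} reconnect attempts")
    ```

    The key idea is that replay-from-timestamp plus deduplication gives at-least-once delivery over TCP without the client needing gap detection of its own.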

    Unfortunately too far away for us to commit right now.

    Most of our binary-encoded messages are ~28 bytes compressed in real-time and ~13 bytes compressed for historical. The uncompressed messages are mostly 56 bytes per event; the format is designed so that each message fits within a single cache line.
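    For intuition, here's a hypothetical fixed-width record packed into 56 bytes. The field layout below is invented for this example (it is not Databento's actual encoding); it just shows how a message can be sized to fit a 64-byte cache line with room to spare:

    ```python
    import struct

    # A made-up 56-byte market-data record layout.
    RECORD = struct.Struct(
        "<"     # little-endian, no implicit padding
        "Q"     # ts_event: nanoseconds since epoch        (8)
        "I"     # instrument_id                            (4)
        "BBBx"  # action, side, flags, 1 byte padding      (4)
        "q"     # price: fixed-point integer               (8)
        "II"    # size, sequence                           (8)
        "qq"    # bid_px, ask_px: fixed-point integers    (16)
        "II"    # bid_sz, ask_sz                           (8)
    )
    assert RECORD.size == 56  # 8 bytes to spare in a 64-byte cache line
    ```

    Fixed-point integer prices and explicit padding keep the record a deterministic size with no parsing ambiguity.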

    Most equities exchanges are probably fine with 5 MB/s. But take full order book on our beefiest feed: that's about 4 billion events per day, with maybe 30% of it clustered in a one-hour period, so you'd need about 20 MB/s to not fall behind. Alternatively, you can pick a subset of symbols to manage the bandwidth (we allow any arbitrary combination of up to 1,000 instruments per subscription if you decide not to listen to every symbol on the market).
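    The 20 MB/s figure follows from simple arithmetic on those numbers (4 billion events/day, ~30% in the busiest hour, 56 uncompressed bytes per event):

    ```python
    EVENTS_PER_DAY = 4_000_000_000  # full order book on the busiest feed
    PEAK_FRACTION = 0.30            # share of the day's events in the peak hour
    BYTES_PER_EVENT = 56            # uncompressed binary record size

    peak_events_per_sec = EVENTS_PER_DAY * PEAK_FRACTION / 3600
    peak_mb_per_sec = peak_events_per_sec * BYTES_PER_EVENT / 1e6
    print(f"{peak_events_per_sec:,.0f} events/s -> {peak_mb_per_sec:.1f} MB/s")
    # -> 333,333 events/s -> 18.7 MB/s, roughly the 20 MB/s quoted above
    ```

    Compression would reduce the wire rate below this, so 20 MB/s is a conservative provisioning target for the peak hour.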

    We also expose CSV and JSON encodings for convenience, but those take up more bandwidth; our client libraries use the binary encoding all of the time.
     
    #25     Jul 14, 2022
    shuraver likes this.
  6. Robert Morse

    Robert Morse Sponsor

    Sophia - Can we assume your live streaming data is subject to non-display fees for equities, options, and futures?
     
    #26     Jul 14, 2022
  7. Databento

    Databento Sponsor

    That depends on the exact market in question.

    Our users will often incur non-display fees for US equities, options, and futures feeds. But what we've seen is that there's a common misunderstanding here among many retail traders (and even vendors): several exchanges determine non-display use by whether you have a designated session port, not by the means of data consumption. This means that if you're feeding the data via API into charting software and automating your execution through that, or you're running an autospreader, or you're consuming the data via API in your own custom application but you're not connected to the exchange on its extranet, there's a fair chance those non-display fees do not apply.

    Counterintuitively, this also means if you're running an execution gateway software that requires you to register a session port, even if you're not actually automating your execution, you might well be subject to non-display fees.

    We take each of these idiosyncratic rules into account rather than adopting a uniform policy when we process our users' applications for real-time licensing.

    Also, for US equities, keep in mind that we only provide prop feeds and don't deal with the UTP/CTA SIPs, so the policies will vary slightly from what you might see from most retail data vendors.

    Lastly, for OTC markets like cash FX, in some cases we have a special agreement with the ECN that lets us bypass typical non-display fees.
     
    #27     Jul 14, 2022
    shuraver and cruisecontrol like this.
  8. Thanks for your reply.

    Will it be possible to select / download ONLY certain types of messages and not pay for the other types?
    In historical? in real time?

    Regarding RAW data: at least for historical, raw packets still compress very well with a big enough block size and a decent codec. If raw isn't provided, then it's on you to make sure every element that any user cares about is in the schema.
     
    #28     Jul 14, 2022
    Databento likes this.
  9. Robert Morse

    Robert Morse Sponsor

    Thank you. Is it also fair to say that the live data is more of an institutional offering and not for the retail market, and that the backtesting data, along with the futures live and backtesting data, might be more apt to target both retail and institutional clients from a cost standpoint?

     
    #29     Jul 14, 2022
  10. Databento

    Databento Sponsor

    No problem.

    Yes, you can select only a specific schema, in both historical and real-time. That's how we homogenize the solution for both institutional and retail users: a retail user can subscribe to only the few things they need to keep costs low, whereas our institutional users often want everything.

    As a side note, if you're curious about the technical details: that's actually one of the reasons our gateway is currently so slow (40+ µs). The parsing and book building happen in sub-300 ns, but we're doing a lot of bookkeeping to export all of the schemas, do all of the transcoding, and filter out separate channels and streams customized for each user.
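    A toy illustration of that per-user filtering step: each subscriber registers the schemas and symbols it wants, and the gateway fans each event out only to matching subscriptions. The class and names here are invented for this sketch; the real gateway does this far more efficiently:

    ```python
    from collections import defaultdict

    class FanOutGateway:
        """Route each event only to subscriptions matching (schema, symbol)."""

        def __init__(self):
            self._subs = defaultdict(list)  # (schema, symbol) -> list of queues

        def subscribe(self, schema, symbols):
            queue = []
            for sym in symbols:
                self._subs[(schema, sym)].append(queue)
            return queue

        def publish(self, schema, symbol, event):
            for queue in self._subs.get((schema, symbol), []):
                queue.append(event)

    gw = FanOutGateway()
    retail = gw.subscribe("trades", ["ESU2"])      # one schema, one symbol
    inst = gw.subscribe("mbo", ["ESU2", "NQU2"])   # full book, more symbols
    gw.publish("mbo", "ESU2", {"px": 4100.25})
    gw.publish("trades", "ESU2", {"px": 4100.25, "sz": 2})
    # retail receives only the trade; inst receives only the book event
    ```

    The cost the post describes comes from doing this matching, plus transcoding, per event across every subscriber.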

    Oh, I think I understand what you're suggesting. I misunderstood and thought you meant the raw multicast packets. But the raw payload, or simply one half of the packets after A/B arbitration, might be possible. That's a workable idea; I'll let the team know and see if it's possible. Thanks for suggesting it!
     
    Last edited: Jul 14, 2022
    #30     Jul 14, 2022
    shuraver and cruisecontrol like this.