I'm curious to hear how you are storing and retrieving your tick data. I recently did a write-up on the method I settled on: "Managing massive amounts of tick data with Python, simply and efficiently". Do you do something similar, or just use flat files or a relational database?
My current opinion is that, from the persistence point of view, collecting data and analyzing it call for different strategies: the datastore you analyze from should sit on top of the system that collects the data. In other words, the interface for reading data for analysis should be as user-friendly as possible, while collecting it in realtime, or reading it from e.g. a TAQ file, can be done with an entirely different "database" system (a rough sketch of the split follows below). I don't care that much how fast something is offline.

I have come to the conclusion that this division hits the right balance, at least for me. The extreme-speed, high-compression databases are too hard to use as a researcher given the existing open source tools, while the friendly stuff can't keep up in realtime environments. I don't want to use Kdb to do research; I want to use Python. On the other hand, I probably don't want to use Python to store massive amounts of realtime data.
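To make the split concrete, here is a minimal sketch of what I mean. The function name and file layout are hypothetical, purely to illustrate a friendly research-facing call sitting on top of whatever the realtime collector actually writes:

# Hypothetical research-facing API: one friendly call, with the
# realtime collection backend (Kdb, flat files, whatever) hidden
# behind it. The name read_ticks and the path layout are my own
# illustrative assumptions, not any particular library's API.
import pandas as pd

def read_ticks(symbol: str, date: str) -> pd.DataFrame:
    # Assume the collector wrote one gzipped CSV per symbol per day;
    # research code never needs to know or care about that detail.
    path = f"ticks/{symbol}/{date}.csv.gz"
    return pd.read_csv(path, parse_dates=["timestamp"])

# Research stays in friendly pandas-land regardless of the backend:
# trades = read_ticks("AAPL", "2015-06-01")
# print(trades["price"].describe())

The point is that the collector and the research interface can each be swapped out independently, as long as this thin read layer stays stable.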
GAT, Arctic does look very cool. It's column-oriented and compressed. I wonder at what granularity they store the data, and whether they support queries with arbitrary start/end times; if so, how? If I can figure out how to access the data from Matlab, I'll install it and give it a try.

nitro, I agree. Collecting realtime ticks for many or all instruments is daunting; even maintaining 100% uptime seems hard. My next step will be to buy daily TAQ updates that I can process overnight to keep my backtests up to date. Then I can simply discard whatever ticks I collect during the day.

Butterfly, perhaps the title is misleading. I'm not talking about collecting realtime ticks; I'm talking about building a disk-based data store that makes it fast to read in all trades for a ticker on a given day. For that purpose, I don't believe there is a faster method. The data can be read with C/C++, but getting it into Python is also fast, since zlib is a compiled library (a rough sketch of that kind of store follows below). Of course, applying backtest logic to the ticks once they're in Python will not be fast, but that isn't the point.

Has anyone here tried HDF5? I tried it once a few years ago and it was very slow. I must have done something wrong, though, because others say it is fast.
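For what it's worth, here is a rough sketch of the kind of per-ticker, per-day zlib store I mean. The record layout and file naming are my own assumptions for illustration, not the exact scheme from the write-up:

import io
import zlib
import numpy as np

# One compressed blob per ticker per day (hypothetical layout).
def write_day(path: str, trades: np.ndarray) -> None:
    # Serialize the structured array to .npy bytes, then compress.
    buf = io.BytesIO()
    np.save(buf, trades)
    with open(path, "wb") as f:
        f.write(zlib.compress(buf.getvalue(), 6))

def read_day(path: str) -> np.ndarray:
    # Decompress the whole day in one shot and rebuild the array.
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    return np.load(io.BytesIO(raw))

# Example usage with a toy trade record type:
# dtype = np.dtype([("ts", "i8"), ("price", "f8"), ("size", "i4")])
# write_day("AAPL-20150601.z", np.zeros(3, dtype=dtype))
# print(read_day("AAPL-20150601.z"))

Because the entire day decompresses in one call inside compiled code, the read path stays fast even though the calling code is Python.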
I certainly wasn't responsible for writing it, as I am not by any means a professional programmer! But I probably did do some beta testing of a very early version (it wasn't called Arctic then, so I can't be sure). GAT