I am collecting historical data, market data and order book data from Interactive Brokers (IB), currently focused on the E-mini S&P 500 (ES). I plan to collect all shares and commodities, and will probably add another data feed, potentially DTN IQFeed. My current implementation stores all data in PostgreSQL. I plan to load the data into a Java class, for example to hold all 1-minute bars of historical data for all ES contracts over the last 2 years. That is about 3.5 million rows, each with date_time, open, high, low, close and volume.

My current thinking is to have a class, e.g. EsHist, which has:

- an array of a class called Row (Row will have public variables to store date_time, open, etc.; public variables allow faster access to them),
- a hashtable of Row objects, referring to the same objects as the array above, with date_time as the key,
- a rolling window of 4 million rows. I will allocate 5 million array elements to hold the 4 million-row rolling window. While the market is active, I will let the array grow past 4 million; when the market closes, I will drop the oldest rows and shift the array to fill the dropped elements, so the start of the rolling window moves back to array[0].

The advantages of this structure (a rough sketch follows below):

- I can iterate sequentially, bar by bar, using the array index.
- I can locate a date_time quickly via the hashtable, which points me to the specific row; from the row I can get its index and iterate sequentially from there.
- The full history stays in PostgreSQL, but the most recent data, the 4 million-row rolling window, is always in memory and re-shifted when the market is closed. The in-memory data will feed the algorithmic trading module.

I am running this Java code on an i7 box with 6 cores and 64 GB RAM on Linux (Ubuntu). Any thoughts or suggestions on how to improve this data structure? Thank you.
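For reference, here is a minimal sketch of the structure described above, assuming the names from the post (EsHist, Row). One small deviation: the hashtable maps date_time to the array index rather than to the Row itself, since the stated goal is to resume sequential iteration from that position. The trim logic after the close is an assumption about how the reshift could work, not the poster's actual code.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Row {
    final LocalDateTime dateTime;
    final double open, high, low, close;
    final long volume;

    Row(LocalDateTime dateTime, double open, double high, double low, double close, long volume) {
        this.dateTime = dateTime;
        this.open = open; this.high = high; this.low = low; this.close = close;
        this.volume = volume;
    }
}

class EsHist {
    private final List<Row> rows = new ArrayList<>(5_000_000);        // sequential access by index
    private final Map<LocalDateTime, Integer> index = new HashMap<>(); // date_time -> array index

    void append(Row row) {
        index.put(row.dateTime, rows.size());
        rows.add(row);
    }

    Row byIndex(int i)               { return rows.get(i); }
    Integer indexOf(LocalDateTime t) { return index.get(t); }

    // After the close: drop the oldest bars so the rolling window starts again at element 0.
    void trimTo(int windowSize) {
        int excess = rows.size() - windowSize;
        if (excess <= 0) return;
        rows.subList(0, excess).clear();                 // shifts the remaining rows to the front
        index.clear();
        for (int i = 0; i < rows.size(); i++) index.put(rows.get(i).dateTime, i);
    }
}
```

Rebuilding the index after the trim is O(n) over 4 million entries, which only happens once per session, so it should be a non-issue on the hardware described.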
You could partition your rows into separate securities by id or symbol for faster access. I do not know the Java environment well, but there are usually ready-to-use containers like vector, queue and map that make life easier; some of them have hashing implemented already, or you can provide a custom hash function. I would suggest a map keyed by symbol id, containing either a map keyed by time or a vector (a vector is faster for sequential access than a map) of bar data structures. A map container is usually sorted by default, which speeds up searches; if that is too slow, use a container with a built-in or custom hash table. Partitioning the in-memory data into smaller chunks may be necessary due to OS limitations; for example, Windows limited the maximum size of a single memory object to 2 GB last time I checked.
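In Java terms, the containers mentioned above map roughly to ArrayList (vector), HashMap (hash table) and TreeMap (sorted map). A possible sketch of the "map keyed by symbol containing a map keyed by time" layout, with an assumed Bar class standing in for the bar data structure:

```java
import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class Bar {
    double open, high, low, close;
    long volume;
}

class BarStore {
    // Outer map keyed by symbol, inner sorted map keyed by bar time.
    private final Map<String, NavigableMap<LocalDateTime, Bar>> bySymbol = new HashMap<>();

    void put(String symbol, LocalDateTime time, Bar bar) {
        bySymbol.computeIfAbsent(symbol, s -> new TreeMap<>()).put(time, bar);
    }

    // The sorted keys make range queries cheap, e.g. all ES bars in one session.
    NavigableMap<LocalDateTime, Bar> range(String symbol, LocalDateTime from, LocalDateTime to) {
        NavigableMap<LocalDateTime, Bar> bars = bySymbol.get(symbol);
        return bars == null ? new TreeMap<>() : bars.subMap(from, true, to, false);
    }
}
```

If purely sequential iteration dominates, the inner TreeMap could be swapped for an ArrayList kept in time order, as the answer notes.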
2 years is about 1 million bars?? Anyway, instead of a hash table you could use an array, int[year][month][day][time], where the ints are indexes into your sequential array; that would be faster than a hashtable lookup. Actually, instead of ints you could hold a direct reference to your Row objects: Row[year][month][day][time].
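A rough sketch of that direct-index lookup, reusing the Row class from the first sketch above; the base year, the three-year window and the minute-of-day resolution are assumptions:

```java
import java.time.LocalDateTime;

class RowIndex {
    private static final int BASE_YEAR = 2015;           // first year held in memory (assumed)
    // [year offset][month 1-12][day 1-31][minute of day 0-1439]
    private final Row[][][][] byTime = new Row[3][13][32][1440];

    void put(Row row) {
        LocalDateTime t = row.dateTime;
        byTime[t.getYear() - BASE_YEAR][t.getMonthValue()][t.getDayOfMonth()]
              [t.getHour() * 60 + t.getMinute()] = row;
    }

    Row get(LocalDateTime t) {
        return byTime[t.getYear() - BASE_YEAR][t.getMonthValue()][t.getDayOfMonth()]
                     [t.getHour() * 60 + t.getMinute()];
    }
}
```

At 1-minute resolution this table holds about 1.8 million references (roughly 14 MB of pointers), which is trivial on a 64 GB box, and every lookup is plain array indexing with no hashing.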
I am currently doing something very similar in Java using the IB API: I collect incoming futures tick data and order book data. I have two simple solutions which work for me.

Firstly, as real-time data comes in, I snapshot it and create an object to represent that snapshot. I append the object to a 'writer' queue which is processed by another thread that writes it to a memory-mapped file. This is simple multi-threaded programming. I ensure that the memory-mapped files are limited in size and create new ones as necessary. I can index and iterate over these data structures as I would with an array, so it's very fast and memory efficient (a sketch of this pattern follows below). See here for a similar technique.

Secondly, there are a couple of tricks to ensure you can process this data in memory. I heavily utilise Java soft references to data structures; this ensures I don't exhaust the heap by loading too many objects into memory. Instead I rely on the Java memory management model to garbage-collect objects when they are not required, freeing up memory. I find this works really well. The upshot is that I can analyse GBs of data simultaneously. I execute quite heavy mathematical processing and have found that nothing comes close to the performance of memory-mapped files and hand-crafted data structures. There are simple third-party memory-mapped databases you could use for persistence, like Guava cache or memdb, and you should explore these.

However, the key to all of this is ensuring your data structures are correctly segmented, for example into hourly or daily chunks, held via soft references. That way you can iterate over your data structures and rely on soft references and the OS (via memory-mapped files) to take care of memory management. So remember to segment your data correctly, and stick to simple objects containing chunks of arrays. This will allow you to scale massively.

Also, don't over-optimise: removing getter and setter methods and relying on direct public access to a variable is over-optimisation. Modern JVMs do this level of optimisation for you. You are compromising the testability of a simple object by exposing inner variables; it's just not required and is hacky at best. I would go ballistic if I found a software developer doing this on my team.
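A hand-rolled sketch of the writer-queue / memory-mapped-file idea described above, using java.nio. The fixed record layout (epoch millis + price + size, 20 bytes) and the class names are assumptions, not the poster's actual format; rolling to a new file when this one fills is omitted for brevity, as the answer describes doing.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class TickWriter implements Runnable {
    static final int RECORD_SIZE = 20;                   // long + double + int

    static class Tick {
        final long timeMillis; final double price; final int size;
        Tick(long t, double p, int s) { timeMillis = t; price = p; size = s; }
    }

    private final BlockingQueue<Tick> queue = new LinkedBlockingQueue<>();
    private final MappedByteBuffer buffer;

    TickWriter(String path, int maxRecords) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            buffer = file.getChannel()
                         .map(FileChannel.MapMode.READ_WRITE, 0, (long) maxRecords * RECORD_SIZE);
        }
    }

    // Called from the IB API callback thread: snapshot the tick and hand it off.
    void submit(Tick tick) { queue.offer(tick); }

    // Writer thread: drains the queue and appends fixed-size records to the mapped file.
    @Override public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Tick t = queue.take();
                buffer.putLong(t.timeMillis).putDouble(t.price).putInt(t.size);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because records are fixed-size, a reader can treat the file as an array and jump to record i at offset i * RECORD_SIZE, which is what makes iteration over these files array-like and fast.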
This is for the order execution module, applying the algo to a rolling window of data. For algo discovery I will use Python or R. Thanks, I will need this when my data size exceeds my RAM size. I am using Linux (Ubuntu), which I believe can address about 16.8 million terabytes. I will use a simple array of float to store the prices, wrapped in a Java class, e.g. called HistRow (rough sketch below). Aha, good idea! That way I can aggregate by dimensions, thanks.
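A possible sketch of that float-array idea, assuming the HistRow name from the reply: one holder with parallel primitive arrays rather than millions of small bar objects.

```java
class HistRow {
    final long[]  epochMinute;                 // bar timestamp as minutes since the epoch
    final float[] open, high, low, close;      // prices as primitive floats
    final long[]  volume;

    HistRow(int capacity) {
        epochMinute = new long[capacity];
        open   = new float[capacity];
        high   = new float[capacity];
        low    = new float[capacity];
        close  = new float[capacity];
        volume = new long[capacity];
    }
}
```

At roughly 32 bytes per bar, a 4 million-bar rolling window is on the order of 128 MB, which fits comfortably in 64 GB of RAM and avoids per-object overhead and pointer chasing.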
I agree. Instead of Java, I would choose Python or R, both for finding the algo and for order execution with the IB connection, since plenty of IB users work in Python/R, a lot more than in Java.
There doesn't seem to be much thought put into a real strategy. Collecting market data is easy; structuring it in a meaningful way so it can be used for a strategy is another question. It has more to do with analytics than with simple storage of data and optimisation of running data.
Thanks for your response. Two weeks ago I was only collecting data; now I have to think about how to process this data to create indicators. Agreed with your post: collecting one-off raw data is easy; updating the raw data with the real-time feed is the next step (optimisation of running data); preprocessing the data to make it ready for analytics, and doing that update from raw data in real time, is harder. One simple example is populating rows where data is missing, e.g. for the 1-minute ES bars, when a particular minute's bar is missing (a sketch of one way to fill such gaps is below).
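One common convention for filling those gaps is to insert a synthetic flat bar carrying the previous close with zero volume; the sketch below does that, reusing the Row class from the first sketch in this thread. The fill rule is an assumption (it also ignores session boundaries such as the ES maintenance break), so it would need adjusting to the actual trading calendar.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

class BarGapFiller {
    // Given bars sorted by time, insert a flat zero-volume bar for every missing minute.
    static List<Row> fillMissingMinutes(List<Row> bars) {
        List<Row> filled = new ArrayList<>(bars.size());
        Row prev = null;
        for (Row bar : bars) {
            if (prev != null) {
                LocalDateTime expected = prev.dateTime.plusMinutes(1);
                while (expected.isBefore(bar.dateTime)) {
                    // Synthetic bar: OHLC pinned to the previous close, zero volume.
                    filled.add(new Row(expected, prev.close, prev.close, prev.close, prev.close, 0));
                    expected = expected.plusMinutes(1);
                }
            }
            filled.add(bar);
            prev = bar;
        }
        return filled;
    }
}
```

Running this once on the historical load and again incrementally as new bars arrive keeps the in-memory window gap-free, so indicator code can assume exactly one bar per minute.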