Are you saying that your test used to take 2 hours just to load the data into memory? Does your new server take 2 hours to start now?
Sorry, I glanced at the wrong figure. 2 hours was for when the socket was taking 10 seconds to pass objects. That said, it does take ~75 minutes for the new server to load. It's roughly 750 files with 2.5M lines per file.
This is a very rough estimate (and I can calculate the exact figure if it's helpful), but the raw files contain a total of ~1.9B lines. The data that makes it into the ArrayLists as OptionQuotes would be approximately 1/5 of that.
Anyone know what this guy is talking about with his hacked comment? Has me a little concerned. Edit: referring to emg above
Your numbers make sense; if you can get the load rate up to ~1M lines per second, you can cut it down to about 30 minutes. I see now how you arrived at this solution, and it looks good at first glance. However, as I said, it won't scale nicely. You mentioned you have not started looking at vol surfaces yet, so I guess you are not doing any complicated computations. If you are not planning on doing complicated stuff, then your current solution is ok. Otherwise you will see your backtest taking a few hours to complete after you have loaded the data. You will need to look at grid computing to solve that, which is a completely different approach from having a single server with all the data in memory.

On another note, I see that there might be a problem with how your data is stored. As I'm storing tick data, the far OTM options with low delta have only a few updates a day. Your files most probably repeat their prices every minute, so there's a lot of bloat in them. If you reformat your data into a stream of updates, you will cut down your backtest time drastically (see the sketch below).
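A minimal sketch of that "stream of updates" idea: walk each minute-bar file and only emit a row when the quote for a contract actually changes. The CSV layout (timestamp, contract, bid, ask) is an assumption; adjust the column indices to your own format.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Collapse minute-by-minute quote files into a stream of updates,
// writing a row only when the quote for a contract actually changes.
// Assumed columns: timestamp,contract,bid,ask
public class QuoteCompactor {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        Map<String, String> lastQuote = new HashMap<>(); // contract -> "bid,ask"

        try (BufferedReader reader = Files.newBufferedReader(in);
             BufferedWriter writer = Files.newBufferedWriter(out)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] f = line.split(",", 4);
                String contract = f[1];
                String quote = f[2] + "," + f[3];
                // Skip rows that merely repeat the previous quote for this contract
                if (!quote.equals(lastQuote.get(contract))) {
                    lastQuote.put(contract, quote);
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }
}
```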
Lol, you consider your suggestion "simple"?

Your idea:
* code up an entirely new app to load data
* handle caching of data
* serve the data through a REST API
* code up a REST client, multithreaded/async in order to not freeze up the main UI thread

My idea:
* Run a RedisDB instance with data persisted/loaded to disk each time the RedisDB server instance is fired up. (Required time to set up: 10 minutes, plus another 10 minutes to initially load and persist the csv-based data into the db.)
* Access the data through a lightweight java client (time required: 20 minutes; a fully functional API exists already, all one needs is to follow the basic examples on the website; see the sketch below).

Now that is efficiency.
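For illustration, here is a rough sketch of that lightweight client using Jedis (one common Java Redis client). The key scheme (one Redis list per contract, e.g. "quotes:SPY_20240119C480") and the CSV column layout are just assumptions for the example.

```java
import redis.clients.jedis.Jedis;
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Push CSV quote rows into Redis lists keyed by contract, then read one contract back.
public class RedisQuoteStore {
    public static void main(String[] args) throws Exception {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Load: append each CSV row to a list keyed by its contract id
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String contract = line.split(",", 3)[1]; // assumes column 2 is the contract id
                    jedis.rpush("quotes:" + contract, line);
                }
            }
            // Read back all rows for one (hypothetical) contract
            List<String> rows = jedis.lrange("quotes:SPY_20240119C480", 0, -1);
            System.out.println("rows loaded: " + rows.size());
        }
    }
}
```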
Again, use a binary flat file structure and implement a smart binary search query mechanism. That way you can easily request data between a start and end time stamp with minimal overhead (faster than even KDB could do it). It keeps the memory footprint to a minimum because you can partition the total request and get packages small enough to be handled in memory. I wrote such a data store and it easily manages around 24 million ticks per second, including deserialization and including the merging and sorting by time stamp of mixed data (meaning, when requesting a snapshot from 5 different files, for example). The algorithm takes advantage of multiple threads, especially when merging and sorting the data (a typical divide-and-conquer situation). A sketch of the core binary search is below.

Keep it lightweight and you can make it faster than even the top contenders in the enterprise business (such as KDB). I never understood the need for KDB to handle queries that should clearly be the domain of the data-requesting process. All the overhead they added makes it slower than a (well) self-written binary data solution, and let's keep in mind that KDB is one of the top performers in this enterprise segment. Of course I understand why KDB did it: they now sell their backtest algorithms that run within KDB for many hundreds of thousands to banks with incapable IT and project management teams and too-big budgets. (Not to mention that you need to train up, or better hire, a highly capable Q programmer, lol. I have seen it first-hand and could not refrain from smiling, because each time a quant needed data in a slightly different way a frantic search started to locate the Q-capable guy, since nobody else understood the ridiculously terse code semantics.)
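A minimal sketch of the binary-search-over-a-flat-file idea, assuming fixed-size 24-byte records (long timestamp, double bid, double ask) already sorted by timestamp; the record layout is an illustration, not the poster's actual format.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Query a sorted flat file of fixed-size records for all rows with
// start <= timestamp < end, using binary search on the file offsets.
public class BinaryQuoteFile {
    private static final int RECORD_SIZE = 8 + 8 + 8; // long ts, double bid, double ask

    public static List<double[]> query(RandomAccessFile file, long start, long end) throws IOException {
        long count = file.length() / RECORD_SIZE;

        // Binary search for the first record whose timestamp is >= start
        long lo = 0, hi = count;
        while (lo < hi) {
            long mid = (lo + hi) >>> 1;
            file.seek(mid * RECORD_SIZE);
            if (file.readLong() < start) lo = mid + 1; else hi = mid;
        }

        // Read sequentially until we pass the end timestamp
        List<double[]> result = new ArrayList<>();
        file.seek(lo * RECORD_SIZE);
        for (long i = lo; i < count; i++) {
            long ts = file.readLong();
            if (ts >= end) break;
            result.add(new double[]{ts, file.readDouble(), file.readDouble()});
        }
        return result;
    }
}
```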
Somehow my earlier posts got lost, so here again the gist of it:
* Use Redis if you must, for some reason, load the complete set of data into memory.
* Write your own binary data store and read from flat files on request to handle data sets that do not fit into memory all at once. Way faster than most commercial in-memory databases.