Are you saying that your test used to take 2 hours just to load the data into memory? Does your new server take 2 hours to start now?
Sorry, I glanced at the wrong figure. 2 hours was for when the socket was taking 10 seconds to pass objects. That said, it does take ~75 minutes for the new server to load. It's roughly 750 files with 2.5M lines per file.
This is a very rough estimate (and I can calculate the exact figure if it's helpful), but the raw files contain a total of ~1.9B lines. The data that makes it into the ArrayLists as OptionQuotes would be approximately 1/5 of that.
Anyone know what this guy is talking about with his hacked comment? Has me a little concerned. Edit: referring to emg above
Your numbers make sense; if you can get the load rate up to ~1M lines per second, you can cut it down to about 30 minutes. I see now how you arrived at this solution, and it looks good at first glance. However, as I said, it won't scale nicely. You mentioned you have not started looking at vol surfaces yet, so I guess you are not doing any complicated computations. If you are not planning on doing complicated stuff, then your current solution is ok. Otherwise you will see your backtest taking a few hours to complete after you have loaded the data. You will need to look at grid computing to solve that, which is a completely different approach from having a single server with all the data in memory.

On another note, I see that there might be a problem with how your data is stored. As I'm storing tick data, the far OTM options with low delta have only a few updates a day. Your files most probably repeat their prices every minute, so there's a lot of bloat in them. If you reformat your data into a stream of updates, you will cut down your backtest time drastically (see the sketch below).
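A minimal sketch of that "stream of updates" idea: walk each minute-bar file and only emit a row when the quote for a contract actually changes. The CSV layout (timestamp, contract, bid, ask) is an assumption; adjust the column indices to your own format.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Collapse minute-by-minute quote files into a stream of updates,
// writing a row only when the quote for a contract actually changes.
// Assumed columns: timestamp,contract,bid,ask
public class QuoteCompactor {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        Map<String, String> lastQuote = new HashMap<>(); // contract -> "bid,ask"

        try (BufferedReader reader = Files.newBufferedReader(in);
             BufferedWriter writer = Files.newBufferedWriter(out)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] f = line.split(",", 4);
                String contract = f[1];
                String quote = f[2] + "," + f[3];
                // Skip rows that merely repeat the previous quote for this contract
                if (!quote.equals(lastQuote.get(contract))) {
                    lastQuote.put(contract, quote);
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }
}
```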
Lol, you consider your suggestion "simple"?

Your idea:
* code up an entirely new app to load data
* handle caching of data
* serve the data through a REST API
* code up a REST client, multithreaded/async in order to not freeze up the main UI thread

My idea:
* Run a RedisDB instance with data persisted/loaded to disk each time the RedisDB server instance is fired up. (Required time to set up: 10 minutes, plus another 10 minutes to initially load and persist the csv-based data into the db.)
* Access the data through a lightweight java client (time required: 20 minutes; a fully functional API exists already, all one needs is to follow the basic examples on the website; see the sketch below).

Now that is efficiency.
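For illustration, here is a rough sketch of that lightweight client using Jedis (one common Java Redis client). The key scheme (one Redis list per contract, e.g. "quotes:SPY_20240119C480") and the CSV column layout are just assumptions for the example.

```java
import redis.clients.jedis.Jedis;
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Push CSV quote rows into Redis lists keyed by contract, then read one contract back.
public class RedisQuoteStore {
    public static void main(String[] args) throws Exception {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Load: append each CSV row to a list keyed by its contract id
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String contract = line.split(",", 3)[1]; // assumes column 2 is the contract id
                    jedis.rpush("quotes:" + contract, line);
                }
            }
            // Read back all rows for one (hypothetical) contract
            List<String> rows = jedis.lrange("quotes:SPY_20240119C480", 0, -1);
            System.out.println("rows loaded: " + rows.size());
        }
    }
}
```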
Again, use a binary flat file structure and implement a smart binary search query mechanism. That way you can easily request data between a start and end time stamp with minimal overhead (faster than even KDB could do it). It keeps the memory footprint to a minimum because you can partition the total request and get packages small enough to be handled in memory. I wrote such a data store and it easily manages around 24 million ticks per second, including deserialization and including the merging and sorting by time stamp of mixed data (meaning, when requesting a snapshot from 5 different files, for example). The algorithm takes advantage of multiple threads, especially when merging and sorting the data (a typical divide-and-conquer situation). A sketch of the core binary search is below.

Keep it lightweight and you can make it faster than even the top contenders in the enterprise business (such as KDB). I never understood the need for KDB to handle queries that should clearly be the domain of the data-requesting process. All the overhead they added makes it slower than a (well) self-written binary data solution, and let's keep in mind that KDB is one of the top performers in this enterprise segment. Of course I understand why KDB did it: they now sell their backtest algorithms that run within KDB for many hundreds of thousands to banks with incapable IT and project management teams and too-big budgets. (Not to mention that you need to train up, or better hire, a highly capable Q programmer, lol. I have seen it first-hand and could not refrain from smiling, because each time a quant needed data in a slightly different way a frantic search started to locate the Q-capable guy, since nobody else understood the ridiculously terse code semantics.)
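A minimal sketch of the binary-search-over-a-flat-file idea, assuming fixed-size 24-byte records (long timestamp, double bid, double ask) already sorted by timestamp; the record layout is an illustration, not the poster's actual format.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Query a sorted flat file of fixed-size records for all rows with
// start <= timestamp < end, using binary search on the file offsets.
public class BinaryQuoteFile {
    private static final int RECORD_SIZE = 8 + 8 + 8; // long ts, double bid, double ask

    public static List<double[]> query(RandomAccessFile file, long start, long end) throws IOException {
        long count = file.length() / RECORD_SIZE;

        // Binary search for the first record whose timestamp is >= start
        long lo = 0, hi = count;
        while (lo < hi) {
            long mid = (lo + hi) >>> 1;
            file.seek(mid * RECORD_SIZE);
            if (file.readLong() < start) lo = mid + 1; else hi = mid;
        }

        // Read sequentially until we pass the end timestamp
        List<double[]> result = new ArrayList<>();
        file.seek(lo * RECORD_SIZE);
        for (long i = lo; i < count; i++) {
            long ts = file.readLong();
            if (ts >= end) break;
            result.add(new double[]{ts, file.readDouble(), file.readDouble()});
        }
        return result;
    }
}
```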
Somehow my earlier posts got lost, so here again the gist of it:
* Use Redis if you must, for some reason, load the complete set of data into memory.
* Write your own binary data store and read from flat files on request to handle data sets that do not fit into memory all at once. Way faster than most commercial in-memory databases.