HDF5 Layout for Multiple Stocks

ModulusFE · Sep 13, 2011

Well, ok but how do you read and write to files with memory alone?

The OP was strictly talking about databases, not in-memory arrays or such. That's another topic completely.

This thread is about databases (such as HDF5).

clearinghouse · Sep 13, 2011

Quote from ModulusFE:

Well, ok but how do you read and write to files with memory alone?

The OP was strictly talking about databases, not in-memory arrays or such. That's another topic completely.

This thread is about databases (such as HDF5).

I'm open to ideas. I'm trying to make incremental improvements to my data management system, but without going so far as to make it a full-time project. I think data management is maybe 10-15% of my business effort right now, which is higher than I'd like it to be, but as a 1-man operation I'm unable to push this forward in leaps and bounds. Other than kdb and Q, very few systems really solve this problem. Certainly no affordable systems.

ModulusFE · Sep 13, 2011

vikana has a valid point of course. RAM is cheap and you certainly need lots of RAM either way.

Personally, I wouldn't want to suggest how much RAM you need, because what I say might be used as a joke in the future, like in the year 2020.

christianhgross · Sep 14, 2011

Quote from clearinghouse:

I was skimming this article here: http://www.puppetmastertrading.com/blog/2009/01/04/managing-tick-data-with-hdf5/

Anyone have any very strong opinions on how to lay out data for stocks? One file per day, or one day with lots of data-sets per day, one symbol per file, etc... Any preferences? I'm more interested in hearing stories about how you did it one way, then realized what a f* up it was before deciding on reorganizing another way.
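To make two of the layouts named in the question concrete, here is a minimal sketch of how a (symbol, day) pair could map to an HDF5 file name and a dataset path inside it. The function names, file-name patterns, and group paths are all hypothetical and chosen only for illustration; no HDF5 library is called, the sketch just computes the names.

```python
from datetime import date


def per_day_layout(symbol, day):
    """Hypothetical layout A: one file per trading day, one dataset per symbol."""
    return f"ticks_{day:%Y%m%d}.h5", f"/{symbol}"


def per_symbol_layout(symbol, day):
    """Hypothetical layout B: one file per symbol, one dataset per trading day."""
    return f"{symbol}.h5", f"/{day:%Y}/{day:%m}/{day:%d}"


day = date(2011, 9, 13)
print(per_day_layout("ABC", day))
print(per_symbol_layout("ABC", day))
```

Roughly speaking, layout A keeps everything for one session in a single file, while layout B keeps the whole history of one symbol in a single file; which is more convenient likely depends on whether reads are mostly by day or mostly by symbol.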

I have gone through many different variations (files, databases, etc).

The answer IMO is to use relational databases, but in very specific ways. For example, when receiving ticks I don't process them; I just write them to the database and keep enough of them in memory as needed for processing.
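A minimal sketch of that pattern, under stated assumptions: SQLite from the Python standard library stands in for the MySQL setup mentioned in this post, and the `TickRecorder` class, the `ticks` table, and its columns are hypothetical names. Every tick is written unprocessed, and only a bounded window of recent ticks stays in memory.

```python
import sqlite3
from collections import deque


class TickRecorder:
    """Persist every tick unprocessed; keep only the newest ones in memory."""

    def __init__(self, db_path=":memory:", window=1000):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS ticks ("
            "symbol TEXT NOT NULL, ts REAL NOT NULL, "
            "price REAL NOT NULL, size INTEGER NOT NULL)"
        )
        # deque with maxlen drops the oldest tick once the window is full
        self.recent = deque(maxlen=window)

    def on_tick(self, symbol, ts, price, size):
        row = (symbol, ts, price, size)
        self.conn.execute("INSERT INTO ticks VALUES (?, ?, ?, ?)", row)
        self.recent.append(row)

    def flush(self):
        self.conn.commit()

    def stored_count(self):
        return self.conn.execute("SELECT COUNT(*) FROM ticks").fetchone()[0]


recorder = TickRecorder(window=3)
for i in range(5):
    recorder.on_tick("ABC", float(i), 10.0 + i, 100)
recorder.flush()
print(recorder.stored_count(), len(recorder.recent))
```

All five ticks end up in the table, while the in-memory window holds only the last three.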

For analysis I employ data mining and map reduce type analysis. It is the only way to navigate huge streams of data.

I don't recommend files because files are hard to back up, move around, and navigate.

I use a tuned MySQL and MongoDB combination.

ModulusFE · Sep 14, 2011

Files are hard to back up? What do you think an RDBMS like MySQL or MSSQL uses? They use files.

christianhgross · Sep 14, 2011

Quote from ModulusFE:

Files are hard to back up? What do you think an RDBMS like MySQL or MSSQL uses? They use files.

SQL databases have procedures that allow me to do failover, master-slave replication, and redundancy! When I back up a SQL database or a NoSQL database I don't actually back up files, I back up data...

Files, on the other hand, need to be copied and checked for consistency. And if you move from one platform to another, you had better make sure that they are not binary, but text-based, and properly text-based.

I also don't have checks for corruption and inconsistency in the data when using files. A SQL database gives me that automatically, as that is its purpose.

So I hope you meant that cynically since backup procedures for files and databases are not even in the same league...

ModulusFE · Sep 14, 2011

SQL makes it easy, that's for sure. The speed is sufficient for most applications.

But as a software developer, you can of course have fail-over and redundancy built into your application and you should.

When you back up an SQL database, you are backing up data to a file (where else).

With a linear file you can also check for consistency. Of course your data is stored in binary format in SQL databases. Everything about a computer is binary, even text files. Endianness doesn't matter when you have your own source code; you can write a conversion.

Checks for corruption should also be handled in your software.
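A minimal sketch of what handling this in your own software could look like, using only the Python standard library. The record layout (timestamp, price, size) and the helper names `pack_tick` and `unpack_tick` are hypothetical. The byte order is fixed explicitly with `<` so the file reads the same on any platform, and a CRC32 is appended to each record so a damaged record is detected on read.

```python
import struct
import zlib

# "<" fixes little-endian byte order regardless of the host platform.
RECORD = struct.Struct("<ddI")  # timestamp (f64), price (f64), size (u32)
CRC = struct.Struct("<I")       # CRC32 of the record body


def pack_tick(ts, price, size):
    """Serialize one tick and append a checksum of its bytes."""
    body = RECORD.pack(ts, price, size)
    return body + CRC.pack(zlib.crc32(body))


def unpack_tick(blob):
    """Verify the checksum, then decode the tick; raise on corruption."""
    body, stored = blob[:RECORD.size], blob[RECORD.size:]
    (crc,) = CRC.unpack(stored)
    if zlib.crc32(body) != crc:
        raise ValueError("corrupt record")
    return RECORD.unpack(body)


blob = pack_tick(1315900800.0, 25.5, 300)
print(unpack_tick(blob))
```

This only detects corruption; recovering from it (for example, from a second copy) is a separate problem that a database's replication features would otherwise cover.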

Of course SQL gives you all these nice things automatically. That IS the nice part about SQL.

But you are not going to get the same speed out of an RDBMS. It's a trade-off. You can have relational data management and all those built-in niceties or you can roll your own. Sometimes it's not worth it, sometimes it is.

christianhgross · Sep 14, 2011

Quote from ModulusFE:

SQL makes it easy, that's for sure. The speed is sufficient for most applications.

But as a software developer, you can of course have fail-over and redundancy built into your application and you should.

When you backup an SQL database, you are backing up data to a file (where else).

With a linear file you can also check for consistency. Of course your data is stored in binary format in SQL databases. Everything about a computer is binary, even text files. Endianness doesn't matter when you have your own source code, you can write a conversion.

Checks for corruption should also be handled in your software.

Of course SQL gives you all these nice things automatically. That IS the nice part about SQL.

But you are not going to get the same speed out of a RDBMS. It's a trade off. You can have relational data management and all those built-in niceties or you can roll your own. Sometimes it's not worth it, sometimes it is.

Yes, SQL gives all of this automatically, and quite frankly I would rather spend my time writing code that solves my trading issues than writing code to keep my program running.

Now about backing up a SQL database, I don't back up to a file. I use the NoSQL approach and just replicate the data to an offsite machine for potential restore.

You do get the speed. Granted, if you want to compare raw to raw, sure, the file approach is faster, but you give up safety. I can use a number of in-memory databases and the system will fly very fast.

Having worked at a major investment bank, I can say that for all of their data, which comes much faster and harder than retail, they used in-memory Sybase databases.

I can understand why people might be tempted to use a file, but in my experience with longer-term algo development, I prefer a database. Of course it is important that you tune the database.

sma202 · Sep 14, 2011

This thread is going off-topic. The question was how to best structure hdf5 for stock time-series. Whether you use a custom-built file or a commercial db is irrelevant; that depends on your own mix of strategy and timeframe.

Frankly, I'd like to hear how other people are using hdf5.

christianhgross · Sep 14, 2011

Quote from sma202:

This thread is going off-topic. The question was how to best structure hdf5 for stock time-series. Whether you use a custom built file or commercial db is irrelevant, that depends on your own mix of strategy and timeframe.

Frankly, i'd like to hear how other people are using hdf5.

Actually I did answer the question in the first response. You receive the data into memory as raw data or a very simple structure. You need enough RAM to keep a day's worth of data. I run two servers for pure data collection.

Then you need another server or servers to persist the data. These servers then run map reduce programs to put the data into a form I need for data mining purposes.

Finally I have servers that run the algos themselves and data mining routines.
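As an illustration of the map-reduce step in that pipeline, here is a minimal single-process sketch that turns raw ticks into one-minute bars per symbol. The tick tuple layout, the bar fields, and the function names are hypothetical; the post does not say which form the data is reduced to, and a real deployment would distribute this work rather than run it in one process.

```python
from collections import defaultdict
from functools import reduce


def map_tick(tick):
    """Map: emit ((symbol, minute), single-tick bar) for one raw tick."""
    symbol, ts, price, size = tick
    minute = int(ts // 60) * 60
    bar = {"open": price, "high": price, "low": price,
           "close": price, "volume": size}
    return (symbol, minute), bar


def reduce_bars(a, b):
    """Reduce: merge two bars, where bar a precedes bar b in time."""
    return {"open": a["open"],
            "high": max(a["high"], b["high"]),
            "low": min(a["low"], b["low"]),
            "close": b["close"],
            "volume": a["volume"] + b["volume"]}


def build_bars(ticks):
    groups = defaultdict(list)
    # Sort by timestamp so open/close come out in time order.
    for key, bar in map(map_tick, sorted(ticks, key=lambda t: t[1])):
        groups[key].append(bar)
    return {key: reduce(reduce_bars, bars) for key, bars in groups.items()}


ticks = [("ABC", 0.0, 10.0, 100), ("ABC", 30.0, 10.5, 200),
         ("ABC", 59.0, 9.8, 50), ("ABC", 61.0, 10.1, 75),
         ("XYZ", 5.0, 20.0, 10)]
bars = build_bars(ticks)
print(bars[("ABC", 0)])
```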