HDF5 Layout for Multiple Stocks

Discussion in 'Programming' started by clearinghouse, Aug 31, 2011.

  1. I was skimming this article here: http://www.puppetmastertrading.com/blog/2009/01/04/managing-tick-data-with-hdf5/

    Anyone have any strong opinions on how to lay out data for stocks? One file per day, one file with many datasets per day, one symbol per file, etc.? Any preferences? I'm more interested in hearing stories about how you did it one way, then realized what a f*-up it was before deciding to reorganize it another way.
  2. sma202


    Depends on the type of data IMHO. If it's just daily closes, you could put it all in one file and one table. For tick data, I think one file with a table for each day would be better. It all depends on how you access the data; if you need multiple datasets, opening multiple files might be slow.
  3. Craig66


    One directory per day with each column broken out into a separate file.
    Try to read and write 50,000 symbols as individual files on NTFS. The process can take several minutes.

    Then try a memory mapped file, which should take less than a second.

    This was discussed here before:

    HDF5 or not, if you are constantly reading and writing many symbols, the best option is one large memory mapped file (can be several terabytes). Otherwise your file system becomes a bottleneck. You will only need to maintain the linear database once every so often by "growing" it.

    A single memory mapped file (or "linear database") is also much faster than a relational database such as MSSQL, MySQL, etc.

    More on that:

    "Having an RDBMS doesn't mean instant decision-support nirvana. As enabling as RDBMSs have been for users, they were never intended to provide powerful functions for data synthesis, analysis, and consolidation (functions collectively known as multidimensional data analysis)." - Ted Codd, inventor of the relational database model, 1993.

    A look at traditional data storage

    SQL databases consist of a set of row/column-based "tables", indexed by a "data dictionary". A table is a “container” that stores data. In practice, a table looks a lot like a spreadsheet: it is composed of rows (records), and each row is composed of columns (fields). A collection of related tables is known as a database.

    Using the very flexible SQL (structured query language), you can retrieve data from any table, or groups of related tables, and have that data presented to you as a “view”.

    This basic functionality, and the flexibility to store and relate almost anything, is what makes the RDBMS model so powerful and so widely used for nearly every serious business application.

    Unfortunately, this “one size fits all” approach to data storage and retrieval is why the RDBMS model fails for financial applications.

    The RDBMS model produces substantial overhead due to its inherent multiple row and table record structures. When you heap indices, clusters, and procedures on top, you create even more overhead which slows down performance considerably.

    Since all RDBMS records are equally “important” to the database, they are not optimized for speed.

    Also, since an RDBMS has no inherent data compression methods, it is usually combined with exception reporting and averaging techniques, which may result in data loss and inaccurately reproduced data.

    RDBMSs are too slow

    The speed of writing to an RDBMS is quite slow (from the perspective of the computer). Major RDBMS vendors often publish benchmarks claiming very high transactions per second (TPS). What they don’t say is that the TPS figure refers to actions performed on the data after it is already in the database, not to the speed at which data is written to the database or retrieved from it. What goes on inside the database is of little interest to the end user. The data acquisition speed, and the actual time it takes to put a set of results on the screen, is where money is made and lost.

    An additional SQL drawback, from the perspective of any financial data application, is that statistics are not automatically calculated by the RDBMS, because SQL mathematics is limited to sums, minimums, maximums, and averages.


    C# supports memory mapped files in .net 4.0. Let me know if you need any pointers (no pun intended).
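    The single-large-file idea above is easy to sketch. The post mentions C# and .NET 4.0; for a self-contained, runnable illustration, here is the same idea using Python's stdlib mmap instead. The file name, record layout, and record count are all assumptions for illustration, not anyone's actual format:

```python
# Minimal sketch: random access into one large pre-allocated file via mmap.
# Every record lives at a fixed, computable offset, so there is no per-symbol
# file open/close and the file system never becomes the bottleneck.
import mmap
import os
import struct

PATH = "quotes.bin"                 # illustrative file name
RECORD = struct.Struct("<qd")       # 8-byte timestamp + 8-byte price
NUM_RECORDS = 1_000_000

# Pre-allocate the file once; "growing" it later is just extending its size.
with open(PATH, "wb") as f:
    f.truncate(NUM_RECORDS * RECORD.size)

with open(PATH, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    # Write and read record 42 directly at its offset.
    RECORD.pack_into(mm, 42 * RECORD.size, 1314748800, 101.25)
    ts, px = RECORD.unpack_from(mm, 42 * RECORD.size)
    mm.close()

os.remove(PATH)                     # clean up the demo file
```

    The same pattern carries over to .NET's MemoryMappedFile or to mmap()/CreateFileMapping in C++.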
  5. So, say I use one huge file. Are you suggesting just one huge file for each symbol, just tack on data one by one? If I want to slice the data from times between t_0 and t_n and "join" it with another symbol in the same time interval t_0 and t_n, I'd need some kind of indexing scheme. How do you propose doing this in a way that doesn't cost me an eternity of development time?

    Right now I have lots of little files everywhere, organized by day, rather than one giant linear file. I was thinking that with HDF5 I could put all the data into one file but have each day be a different dataset, then maybe write some basic indexing scheme to let me assemble slices together across different sets.

    Right now I have a primitive indexing scheme, but it still takes several minutes to assemble the data together correctly. I was hoping that once I built the index, I could also record the results of the matching/indexing back into the file somehow, maybe as its own dataset.

    More comments would be useful.
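    The slice-and-join operation asked about above needs very little machinery once each symbol's ticks are kept sorted by timestamp: two binary searches give the [t_0, t_n] window, and a hash lookup joins on time. A minimal stdlib-Python sketch (symbols, timestamps, and prices are all made up for illustration):

```python
# Slice two sorted tick series to the same time window and inner-join them.
from bisect import bisect_left, bisect_right

def time_slice(times, values, t0, tn):
    """Return (time, value) pairs with t0 <= time <= tn; times must be sorted."""
    lo, hi = bisect_left(times, t0), bisect_right(times, tn)
    return list(zip(times[lo:hi], values[lo:hi]))

def join_on_time(a, b):
    """Inner-join two (time, value) lists on exact timestamps."""
    bt = dict(b)
    return [(t, v, bt[t]) for t, v in a if t in bt]

ibm_t, ibm_p = [1, 2, 4, 7, 9], [10.0, 10.1, 10.3, 10.2, 10.5]
spy_t, spy_p = [2, 3, 4, 9], [130.0, 130.2, 130.1, 130.6]

a = time_slice(ibm_t, ibm_p, 2, 9)
b = time_slice(spy_t, spy_p, 2, 9)
joined = join_on_time(a, b)  # ticks where both symbols traded in [2, 9]
```

    The same index (sorted timestamps per dataset) could itself be written back into the HDF5 file as its own dataset, as the post suggests.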
    I'm referring to writing your own database engine in C++ or C#. One file for all symbols. You need a header in your file with an index that points to the offset for each symbol. Leave empty space after each symbol, allocated for new data, and simply grow the file when space runs low; that could be once every few months, depending on how much space you allocate. Joining is unnecessary: use the header to find the symbol, and to find a specific record, keep your header up to date and triangulate when needed. You should be able to open, navigate, read/write and close the file within a few milliseconds, even if the file is several terabytes. Make sure to use 64-bit integers (long in C#, __int64 or int64_t in C++) for the offsets.
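    A toy version of this header-plus-offsets layout, sketched in Python rather than C++/C# so it stays short and self-contained. The field sizes, symbol limit, and record format are illustrative assumptions, the buffer stands in for the memory-mapped file, and the "grow" step is reduced to an exception:

```python
# Header-indexed linear file: a fixed header maps each symbol to
# (offset, capacity, record count); records are packed back to back,
# with pre-allocated slack after each symbol for new data.
import struct

SYM = struct.Struct("<8sqqq")   # symbol name, offset, capacity, record count
REC = struct.Struct("<qd")      # timestamp, price
MAX_SYMBOLS = 4

def build_header(symbols, slots_per_symbol):
    """Lay out symbols back to back, each with pre-allocated slack space."""
    header, off = {}, SYM.size * MAX_SYMBOLS   # data starts after the header
    for s in symbols:
        header[s] = [off, slots_per_symbol, 0]  # offset, capacity, used
        off += slots_per_symbol * REC.size
    return header, off                           # off == total file size

def append(buf, header, sym, ts, px):
    """O(1) append: the header gives the symbol's region directly."""
    off, cap, used = header[sym]
    if used >= cap:
        raise RuntimeError("out of slack: grow the file and re-index")
    REC.pack_into(buf, off + used * REC.size, ts, px)
    header[sym][2] = used + 1

header, size = build_header(["IBM", "SPY"], slots_per_symbol=1000)
buf = bytearray(size)                            # stand-in for the mmap
append(buf, header, "IBM", 1314748800, 168.4)
ts, px = REC.unpack_from(buf, header["IBM"][0])
```

    In the real thing, buf would be the memory-mapped file and the header would be persisted at offset 0 and kept in sync.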
  7. Interesting. Thanks for your comments.
  8. vikana

    vikana Moderator

    I suggest one file per symbol on disk, and in real time you simply keep as much as needed in memory (I think of it as an in-memory cache). Even low-end desktops these days will take 32 GB of RAM, and most server-class equipment does 128 GB or 256 GB with multiple cores. There is no reason to get fancy; just get the memory.
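    The one-file-per-symbol-plus-cache approach above fits in a few lines. A minimal Python sketch; the on-disk format (one pickle file per symbol) and all names are assumptions chosen for brevity:

```python
# Lazy per-symbol cache: each symbol's history is loaded from disk once,
# then served from RAM on every later request.
import os
import pickle
import tempfile

class SymbolCache:
    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.cache = {}          # symbol -> in-memory tick list

    def get(self, symbol):
        if symbol not in self.cache:
            path = os.path.join(self.data_dir, symbol + ".pkl")
            with open(path, "rb") as f:      # first request hits the disk
                self.cache[symbol] = pickle.load(f)
        return self.cache[symbol]            # later requests are pure RAM

# Demo: write one symbol's file, then read it through the cache.
d = tempfile.mkdtemp()
with open(os.path.join(d, "IBM.pkl"), "wb") as f:
    pickle.dump([(1, 10.0), (2, 10.1)], f)

cache = SymbolCache(d)
ticks = cache.get("IBM")         # loads from disk
ticks_again = cache.get("IBM")   # same object, straight from memory
```

    With 32 GB or more of RAM, the eviction policy can often be "never evict", which keeps the code this simple.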
  9. That's why we use memory mapped files :)
  10. vikana

    vikana Moderator

    Mem-mapped files are also a good choice. One just has to factor in the cost of the I/O interfaces and the mmapping. Performance is better with directly addressable memory, in my experience.
    #10     Sep 13, 2011