HDF5 vs. text file for storing data?

Discussion in 'Data Sets and Feeds' started by Batman28, Jan 29, 2009.

  1. I hear a lot about using HDF5 for storing tick data for back-testing etc., but isn't it far simpler to read and write a plain .txt file? One can easily load data into memory for analysis and save it again, so the question is: what are the advantages of HDF5 over a normal text file?

    appreciate any serious input, thanks
     
  2. 377OHMS

    HDF5 is a binary format, nice small files. Very good for huge data.

    Nothing wrong with ASCII files, but HDF5 is more appropriate for really large data.
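As a rough illustration of the size difference, here is a minimal Python sketch that encodes the same synthetic ticks as CSV text and as fixed-width binary records. The record layout (timestamp, price, size) and the sample values are assumptions made up for the example, and plain `struct` packing stands in for HDF5 here, which adds its own metadata and optional compression on top.

```python
import struct

# Hypothetical record layout: epoch timestamp (float64), price (float64), size (int32).
RECORD = struct.Struct("<ddi")  # 20 bytes per tick, little-endian, no padding


def make_ticks(n):
    """Generate n synthetic ticks; the values are illustrative only."""
    return [(1233221400.0 + i * 0.05, 845.25 + (i % 8) * 0.25, 1 + i % 9)
            for i in range(n)]


def encode_text(ticks):
    """Encode ticks as one CSV line per tick."""
    lines = [f"{ts:.3f},{price:.2f},{size}\n" for ts, price, size in ticks]
    return "".join(lines).encode("ascii")


def encode_binary(ticks):
    """Encode ticks as fixed-width packed records."""
    return b"".join(RECORD.pack(*tick) for tick in ticks)


ticks = make_ticks(100_000)
text_bytes = encode_text(ticks)
binary_bytes = encode_binary(ticks)
print(f"text:   {len(text_bytes):,} bytes")
print(f"binary: {len(binary_bytes):,} bytes")
```

The gap depends heavily on how the text is formatted: ISO date strings and symbol columns widen it, while short numeric columns narrow it.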
     
  3. Could you please elaborate? I mean, how significant would this be for storing tick data, let's say one year of 1-minute data?

    Also are there any functionality advantages?

    thanks
     
  4. 377OHMS

    Well, the files read into Matlab with just a few lines of code and load almost instantaneously. Some of my ASCII processes can take a minute or two to open large files (~750 MB).

    HDF5 files are easier to amend, as each entry is a field with attributes. Changing a few rows in an ASCII file using Matlab is difficult, and people usually resort to a Perl script if they are amending large ASCII data files automatically.

    Dunno, that's all I can think of right now 'cause I just woke up. :D

    Just search Google for HDF5 and you'll see some good format descriptions. You might also consider SQL as an alternative. It's nice to be able to query the data from automated processes.
     
  5. erdewit

    I've played with HDF5 through its Python
    bindings (called PyTables) and found it to be
    unsuitable for a tick database.

    First of all, it is slower than just working with
    regular files.

    Second, the HDF5 database is one huge file that
    is easily corruptible. For example, if you accidentally
    let two processes have write access to
    the database, it will become corrupted.
    Repairing didn't always work for me, so then all
    the data would be lost.

    Third, HDF5 is a hierarchical database where
    objects are retrieved via a path-like key. This is
    the same as with a regular filesystem and offers
    no advantage whatsoever over just a plain
    regular filesystem.

    The only use case for HDF5 is when working with
    datasets that are too large to fit into memory.
    Tick data does not fall into this category.

    What I am using is just simple files. One tick file
    per instrument per day. It's easy to see what's
    going on, it's easy to compress files and make
    incremental backups. Reading speed is 1 M ticks/s
    for text files and 10-20 M ticks/s for binary files.
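The one-file-per-instrument-per-day layout described in this post could be sketched in Python roughly as follows. The directory scheme, the CSV line format, and all function names are assumptions for illustration, not the poster's actual code.

```python
import gzip
import shutil
from datetime import date
from pathlib import Path


def tick_path(root, instrument, day):
    """Hypothetical layout: <root>/<instrument>/<YYYY-MM-DD>.txt"""
    return Path(root) / instrument / f"{day.isoformat()}.txt"


def append_tick(root, instrument, day, timestamp, price, size):
    """Append one tick as a CSV line to that day's file."""
    path = tick_path(root, instrument, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="ascii") as f:
        f.write(f"{timestamp:.3f},{price},{size}\n")


def read_ticks(path):
    """Parse a plain or gzip-compressed tick file into (timestamp, price, size) tuples."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="ascii") as f:
        return [(float(ts), float(price), int(size))
                for ts, price, size in (line.split(",") for line in f)]


def compress_day(path):
    """Gzip a finished day's file and remove the uncompressed original."""
    gz_path = path.with_name(path.name + ".gz")
    with path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    path.unlink()
    return gz_path
```

Because each day is a separate file, an incremental backup only has to copy the files that appeared since the last run.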
     
  6. 377OHMS

    Mostly nonsense.

    I work with terabytes of 20 Hz data. HDF5 is universally preferred for large data. It is not suitable as a database file. You read the entire thing into high core and then work on it. As I said, you might be better off with an SQL database, but if you are just bulk storing large data, most people in the scientific community use HDF5.
     

  7. That sounds good - can you describe your set-up, what language you use to read/write the files? Do you use any specific framework/IDE?

    Thanks in advance


    377OHMS, I'm totally not interested in SQL. It's just useless for time series data the way I see it. Anyway, what exactly do you do with terabytes of data? Is this tick data? What sort of analysis do you do, if you don't mind me asking? Thanks
     
  8. 20 Hz??! What on earth are you doing?
     
  9. dsss27

    Based on the significance of 377 Ohms, likely Beta waves or maybe a close harmonic of the Schumann Resonance :D
    Sorry, I couldn't resist!
     
  10. erdewit

    That sounds good - can you describe your set-up, what language you use to read/write the files? Do you use any specific framework/IDE?

    For the language I'm using Python, with performance-critical parts implemented in C. The C extensions are not written by hand but generated courtesy of Cython. The general design and ideas do not depend on it, though.

    In my design there is an abstract DataStore that gets implemented by a FileStore, a Hdf5Store and a SqlStore. The FileStore has two modes: text and binary. I use the text filestore for capturing ticks during the day because simple text files are just so damn reliable. I can do a tail -f on a file and see the ticks scroll by. The text files are zipped at the end of the day and backed up.
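A minimal Python sketch of that design, assuming a two-method interface. `DataStore` and `FileStore` are the names used in the post; the method names, the tick tuple layout, and the file naming are hypothetical, and the Hdf5Store and SqlStore variants are left out.

```python
import struct
from abc import ABC, abstractmethod
from pathlib import Path

# Hypothetical tick layout: (timestamp float64, price float64, size int32)
RECORD = struct.Struct("<ddi")


class DataStore(ABC):
    """Abstract interface; the method names are assumptions for illustration."""

    @abstractmethod
    def append(self, instrument, day, ticks):
        """Add ticks to the data held for one instrument and day."""

    @abstractmethod
    def read(self, instrument, day):
        """Return all ticks for one instrument and day as a list of tuples."""


class FileStore(DataStore):
    """One file per instrument per day, in either 'text' or 'binary' mode."""

    def __init__(self, root, mode="text"):
        if mode not in ("text", "binary"):
            raise ValueError(f"unknown mode: {mode!r}")
        self.root = Path(root)
        self.mode = mode

    def _path(self, instrument, day):
        ext = "txt" if self.mode == "text" else "bin"
        return self.root / instrument / f"{day}.{ext}"

    def append(self, instrument, day, ticks):
        path = self._path(instrument, day)
        path.parent.mkdir(parents=True, exist_ok=True)
        if self.mode == "text":
            # Human-readable lines, so `tail -f` on the file shows ticks as they arrive.
            with path.open("a", encoding="ascii") as f:
                f.writelines(f"{ts!r},{price!r},{size}\n" for ts, price, size in ticks)
        else:
            with path.open("ab") as f:
                f.writelines(RECORD.pack(*tick) for tick in ticks)

    def read(self, instrument, day):
        path = self._path(instrument, day)
        if self.mode == "text":
            with path.open(encoding="ascii") as f:
                return [(float(ts), float(price), int(size))
                        for ts, price, size in (line.split(",") for line in f)]
        return list(RECORD.iter_unpack(path.read_bytes()))
```

Code that only talks to `DataStore` can capture ticks with a text store during the day and backtest from a binary one without changing.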

    The zipped text files are too slow for backtesting though so I use another filestore with binary files for caching. The binary files are memory mapped (mmap) for ultimate speed.
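Memory mapping a binary cache file could look roughly like this in Python. The fixed 20-byte record layout and the function names are assumptions for the example, and the speed figures quoted in this thread were not measured with this code.

```python
import mmap
import struct

# Hypothetical fixed-width record: timestamp (float64), price (float64), size (int32)
RECORD = struct.Struct("<ddi")


def write_cache(path, ticks):
    """Write ticks as packed binary records."""
    with open(path, "wb") as f:
        for tick in ticks:
            f.write(RECORD.pack(*tick))


def iter_ticks(path):
    """Yield ticks one by one from a memory-mapped file.

    Note: mmap cannot map an empty file, so the file must hold at least one record.
    """
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for offset in range(0, len(mm) - RECORD.size + 1, RECORD.size):
            yield RECORD.unpack_from(mm, offset)


def tick_at(path, index):
    """Random access: unpack only the record at the given index."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        return RECORD.unpack_from(mm, index * RECORD.size)
```

With fixed-width records the byte offset of any tick is just `index * RECORD.size`, so the operating system only needs to page in the parts of the file that are actually touched.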

    I don't use the SqlStore because it's too slow and I don't use the Hdf5Store because it gets corrupted too easily.

    k377: Not sure what you think is nonsense. I'm not dissing HDF5 in general, only its use in a tick database.
     
    #10     Jan 29, 2009