Python advice on CME datafiles...

Discussion in 'Data Sets and Feeds' started by brokershopping, Oct 12, 2005.

  1. Maria,

    For lazy types like me, in spite of my Python, relational databases suit me rather well. Of course, don't let the tail wag the dog!

    Be good,
    nono
     
    #11     Oct 17, 2005
  2. I have "Python in a Nutshell" also by Alex Martelli and have found that it is all I need. On the advice here, though, I'll have to check out the Python Cookbook.

    On the topic of relational databases, I think they have their place. They are good for storing stock metadata and for providing data locking for a multiprocess or clustered application. However, they are not at all good for storing large quantities of time series data. The performance is just inadequate.

    I tried pytables and was very disappointed. HDF5 has no locking and does not guarantee data consistency. If your program fails with an HDF file open, the entire file may be destroyed. The table format was awkward to use and the array format provided no advantages over pickle. Finally, performance was miserable. With a modest-sized database (IIRC a few gigabytes), just opening and closing the file took on the order of a minute.

    I have found pickle (or cPickle) to be both the easiest and most efficient way to store large quantities of time series data. Keeping each time series in its own file lets the filesystem do the free space management, saving the programmer the bother. pickle is very fast and the automatic marshalling makes life easy. However, locking and consistency are the responsibility of the programmer.
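
    A minimal sketch of the one-file-per-series idea, assuming current Python 3 (where cPickle has been folded into pickle). The function names, the .pkl suffix and the sample ticks are hypothetical, made up for illustration. Writing to a temporary file and renaming it over the old copy is one common way to cover the consistency part; locking between processes is not handled here.

```python
import os
import pickle
import tempfile


def save_series(directory, name, series):
    """Pickle one time series to <directory>/<name>.pkl (hypothetical layout).

    The data is written to a temporary file first and then renamed over the
    final path, so a crash mid-write leaves the previous copy untouched.
    """
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, name + ".pkl")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(series, f, protocol=pickle.HIGHEST_PROTOCOL)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, final_path)  # atomic rename within one filesystem
    except BaseException:
        # Clean up the partial temporary file, then re-raise the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


def load_series(directory, name):
    """Read back a series written by save_series."""
    with open(os.path.join(directory, name + ".pkl"), "rb") as f:
        return pickle.load(f)
```

    Usage would look like save_series("data", "ES_2005-10-17", ticks) followed later by load_series("data", "ES_2005-10-17"), with one file per market per day or whatever split suits the access pattern.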

    Martin
     
    #12     Oct 17, 2005
  3. Yeah, the cookbooks are cool.
    I wasn't aware of any inconsistencies. If corruption does occur (it never has), I do have my backup copies.
    How so?
    I've always organized structures in smaller formats (i.e. raw data is one day, one market of tick data per file): more files but the fastest access... or used bound kernel access methods.

    Have you tried the most recent version? Improvements were made to the init and binding process. Their db model is designed for extremely large file and/or node sizes, which they are constantly working on improving. I do wish, though, that they already supported image binary data natively, as HDF does.
    As for comparing with cPickle, this may be true depending on your app and access rate. BTW, I've never tried pickle with nested numerical arrays... Do they need to be flattened? I also wonder whether pickling numarray tables is possible (i.e. calling columns by field "name"). I would imagine that your pickle db structure would also require dicts to organize flat arrays into a table-like structure. It's getting more costly now. Of course there's shelve as well (pickle & rdb).
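
    For reference, a small sketch of what the shelve route could look like in current Python 3. The key naming ("ES/2005-10-17"), the column names and the numbers are all hypothetical, made up for illustration; shelve simply pickles each value and stores it under a string key in a dbm file.

```python
import os
import shelve
import tempfile

# Hypothetical location for the shelf; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "ticks_shelf")

# Each value is pickled transparently; keys must be strings.
with shelve.open(path) as db:
    db["ES/2005-10-17"] = {"time": [1.0, 2.0], "price": [1190.25, 1190.50]}
    db["NQ/2005-10-17"] = {"time": [1.0], "price": [1560.0]}

# Reopen later and pull one column out of one stored table by name.
with shelve.open(path) as db:
    es_prices = db["ES/2005-10-17"]["price"]
    keys = sorted(db.keys())
```

    Note that each lookup unpickles the whole value stored under that key, so the granularity of the keys decides how much gets read per access.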

    I haven't really played with these other avenues. Just like with anything else... "there's definitely more than one way to skin a cat".

    I guess, just like with nono and his rdb, it depends on what you're accustomed to and what your needs are.

    ktm'r
     
    #13     Oct 17, 2005
  4. IIRC deleting or inserting into a table was problematic. I also found that reading in a table was rather inefficient, IIRC each row was a dict? I don't remember the details. I haven't tried any more recent version.

    pickle can marshal just about any data structure you can use in Python.
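
    A quick sketch of that point, assuming current Python 3 and plain built-in containers (the column names and values are made up for illustration): nested lists and a dict of named columns round-trip through pickle without any flattening.

```python
import pickle

# A table-like structure: named columns, one of them nested (lists of lists).
table = {
    "time": [1.0, 2.0, 3.0],
    "price": [1190.25, 1190.50, 1190.25],
    "depth": [[5, 7], [6, 7], [4, 9]],
}

# Serialize to bytes and back; the nesting is preserved as-is.
blob = pickle.dumps(table, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)

# Columns are still addressable by name after the round trip.
prices = restored["price"]
second_depth = restored["depth"][1]
```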

    I agree that everyone's needs are different, and HDF5/pytables is well designed and well constructed as long as its data model is consistent with your needs.

    Martin
     
    #14     Oct 18, 2005
  5. Hi Martin,

    My impression was that your experience with it was more a matter of familiarization. Just like with anything else in the grand scheme of things... to each their own.

    Have a great day. :)
    ktm'r
     
    #15     Oct 18, 2005
  6. I'm tempted to just let that go but... no.

    It was not a matter of familiarization.

    Pytables was and still is inadequate for my needs.

    Incidentally, the straw that broke the camel's back was that it was highly space inefficient. An EArray-based HDF5 representation of my data was far, far larger than the pickled version. IIRC several times larger. That's inexcusable.

    Martin
     
    #16     Oct 18, 2005
  7. Isn't cPickle great? I like pyTables for certain things as well. It really is amazing how well this all works with no payment to anyone required. It seems to me that $soft just keeps making it more difficult to develop for their platforms (roadblocks, continuing fee increases, etc.) while the open source solutions just keep getting better and better. If I have the choice, I no longer choose $soft.
     
    #17     Oct 18, 2005
  8. #18     Oct 18, 2005
  9. Stored in these dark caverns you may find rich veins of Python code, collected caches of Python information, and all manner of sundry Python passageways to explore.

    http://www.vex.net/parnassus/
     
    #19     Oct 18, 2005
  10. Great words of wisdom. You couldn't sum it up better, prt.

    I'm not sure whether $soft can do any better. They are simply too lazy, and dumb scrooges compared to the army of nimble, smart developers working on and improving open source tools like Python.
     
    #20     Oct 19, 2005