Python advice on CME datafiles...

nononsense · Oct 17, 2005

Quote from bali_survivor:

Dad was a database specialist. He reckoned that there was too much hype about the relational databases and that in a lot of cases there is no need for a relational database. Performance wise the relational database is a dog. His notes state that in trading - with the sequential stream of data - there is no need for a relational database and that you'll handicap yourself severely (performance wise) if you were to utilise one.

Maria

Maria,

For lazy types like me, in spite of my Python, relational databases suit me rather well. Of course don't let the tail wag your dog!

Be good,
nono

Sparohok · Oct 17, 2005

I have "Python in a Nutshell" also by Alex Martelli and have found that it is all I need. On the advice here, though, I'll have to check out the Python Cookbook.

On the topic of relational databases, I think they have their place. They are good for storing stock metadata and providing data locking for a multiprocess or clustered application. However they are not at all good for storing large quantities of time series data. The performance is just inadequate.

I tried pytables and was very disappointed. HDF5 has no locking and does not guarantee data consistency. If your program fails with an HDF file open the entire file may be destroyed. The table format was awkward to use and the array format provided no advantages over pickle. Finally, performance was miserable. With a modest size database (IIRC a few gigabytes) just opening and closing the file took on the order of a minute.

I have found pickle (or cPickle) to be both the easiest and most efficient way to store large quantities of time series data. Keeping each time series in its own file uses the filesystem to perform free space management, saving the programmer the bother. pickle is very fast and the automatic marshalling makes life easy. However locking and consistency are the responsibility of the programmer.
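A minimal sketch of the one-file-per-series idea, offered as an illustration rather than as the poster's actual code. The helper names `save_series` and `load_series`, the `.pkl` suffix, and the sample ticks are all assumptions. Since consistency is left to the programmer, the sketch writes to a temporary file and renames it into place, so a crash mid-write should not damage the previous copy. It does no locking between processes. In current Python the plain `pickle` module already uses the C implementation, so there is no separate cPickle import.

```python
import os
import pickle
import tempfile


def save_series(directory, name, series):
    """Pickle one time series into its own file, replacing any old copy."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, name + ".pkl")
    # Write to a temporary file in the same directory first, so a crash
    # mid-write cannot leave a half-written file under the final name.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(series, f, protocol=pickle.HIGHEST_PROTOCOL)
        os.replace(tmp_path, final_path)  # rename over the old file
    except BaseException:
        os.unlink(tmp_path)
        raise
    return final_path


def load_series(directory, name):
    """Read back the series stored under the given name."""
    with open(os.path.join(directory, name + ".pkl"), "rb") as f:
        return pickle.load(f)


# Hypothetical sample data: (timestamp, price, volume) ticks for one contract.
ticks = [(1129560000.0, 1190.25, 12), (1129560001.5, 1190.50, 3)]
data_dir = tempfile.mkdtemp()
save_series(data_dir, "ES_2005-10-17", ticks)
restored = load_series(data_dir, "ES_2005-10-17")
```

Only load pickle files you wrote yourself, since unpickling untrusted data can execute arbitrary code.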

Martin

ktmexc20 · Oct 17, 2005

Quote from Sparohok:

I have "Python in a Nutshell" also by Alex Martelli and have found that it is all I need. On the advice here, though, I'll have to check out the Python Cookbook.

Yeah the cookbooks are

I tried pytables and was very disappointed. HDF5 has no locking and does not guarantee data consistency. If your program fails with an HDF file open the entire file may be destroyed.

I wasn't aware of any inconsistencies. If corruption does occur (it never has), I do have my backup copies.

The table format was awkward to use...

How so?

...and the array format provided no advantages over pickle. Finally, performance was miserable. With a modest size database (IIRC a few gigabytes) just opening and closing the file took on the order of a minute.


I've always organized structures in smaller formats (i.e. the raw data are one day of tick data for one market per file): more files, but the fastest access... or used bound kernel access methods.

Have you tried the most recent version? Improvements were made to the init and binding process. Their db model is designed for extremely large file and/or node sizes, which they are constantly working on improving. I do wish, though, that they already supported image binary data natively, as HDF does.

I have found pickle (or cPickle) to be both the easiest and most efficient way to store large quantities of time series data. Keeping each time series in its own file uses the filesystem to perform free space management, saving the programmer the bother. pickle is very fast and the automatic marshalling makes life easy. However locking and consistency are the responsibility of the programmer.
Martin

As for comparing with cPickle, this may be true depending on your app and access rate. BTW, I've never tried pickle with nested numerical arrays... Do they need to be flattened? Also, I wonder if pickling numarray tables is possible (i.e. calling columns by field "name"). Therefore I would imagine that your pickle db structure would also require dicts to organize flat arrays into a table-like structure. It's getting more costly now. Of course there's shelve as well (pickle & dbm).
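For what it's worth, here is a small sketch of the shelve route, assuming one entry per market symbol with each value a dict of named columns. The symbols, prices, and file name are made up for illustration. shelve pickles each value and stores it under a string key in a dbm-style file.

```python
import os
import shelve
import tempfile

# Hypothetical location for the shelf; shelve may add its own file suffix.
db_path = os.path.join(tempfile.mkdtemp(), "ticks")

# Each key holds one pickled value; here a dict of named columns per symbol.
with shelve.open(db_path) as db:
    db["ES"] = {"time": [1.0, 2.0], "price": [1190.25, 1190.50]}
    db["NQ"] = {"time": [1.0, 2.0], "price": [1560.00, 1560.50]}

# Reopen the shelf and read back a single column by field name.
with shelve.open(db_path) as db:
    symbols = sorted(db.keys())
    es_prices = db["ES"]["price"]
```

Reading `db["ES"]` unpickles the whole entry for that key, so this suits many small-to-medium series better than one huge one.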

I haven't really played with these other avenues. Just like with anything else... "there's definitely more than one way to skin a cat".

I guess, just like with nono and his rdb, it depends on what you're accustomed to and what your needs are.

ktm'r

Sparohok · Oct 18, 2005

IIRC deleting or inserting into a table was problematic. I also found that reading in a table was rather inefficient, IIRC each row was a dict? I don't remember the details. I haven't tried any more recent version.

pickle can marshal just about any data structure you can use in Python.
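As a quick illustration (a sketch with made-up data, using plain lists and dicts rather than numarray objects), nested structures round-trip without being flattened, and a dict gives column access by field name:

```python
import pickle

# Hypothetical table-like structure: named columns plus a nested 2-D block.
table = {
    "time": [1129560000.0, 1129560001.5],
    "price": [1190.25, 1190.50],
    "depth": [[5, 7, 9], [4, 8, 2]],  # nested lists, not flattened
}

# Serialize to bytes and back; the copy is a new, equal object.
blob = pickle.dumps(table, protocol=pickle.HIGHEST_PROTOCOL)
copy = pickle.loads(blob)

prices = copy["price"]               # column access by field name
second_row_depth = copy["depth"][1]  # nested row comes back intact
```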

I agree that everyone's needs are different, and HDF5/pytables is well designed and well constructed as long as its data model is consistent with your needs.

Martin

ktmexc20 · Oct 18, 2005

Quote from Sparohok:

IIRC deleting or inserting into a table was problematic. I also found that reading in a table was rather inefficient, IIRC each row was a dict? I don't remember the details. I haven't tried any more recent version.

pickle can marshal just about any data structure you can use in Python.

I agree that everyone's needs are different, and HDF5/pytables is well designed and well constructed as long as its data model is consistent with your needs.

Martin

Hi Martin,

My impression of your experience with it was more of a familiarization situation. Just like with anything else in the grand scheme of things... to each their own.

Have a great day.
ktm'r

Sparohok · Oct 18, 2005

Quote from ktmexc20:

My impression of your experience with it was more of a familiarization situation.

I'm tempted to just let that go but... no.

It was not a matter of familiarization.

Pytables was and still is inadequate for my needs.

Incidentally, the straw that broke the camel's back was that it was highly space-inefficient. An EArray-based HDF5 representation of my data was far, far larger than the pickled version. IIRC several times larger. That's inexcusable.

Martin

prt_systems · Oct 18, 2005

Quote from Sparohok:

I'm tempted to just let that go but... no.

It was not a matter of familiarization.

Pytables was and still is inadequate for my needs.

Incidentally, the straw that broke the camel's back was that it was highly space-inefficient. An EArray-based HDF5 representation of my data was far, far larger than the pickled version. IIRC several times larger. That's inexcusable.

Martin

Isn't cPickle great? I like pyTables for certain things as well. It really is amazing how well this all works with no payment to anyone required. It seems to me that $soft just keeps making it more difficult to develop for their platforms (roadblocks, continuing fee increases, etc.) and the open source solutions just keep getting better and better. If I have the choice, I no longer choose $soft.

Madison · Oct 18, 2005

in addition to the others listed, here's a good online reference:

http://diveintopython.org/

osorico · Oct 18, 2005

Stored in these dark caverns you may find rich veins of Python code, collected caches of Python information, and all manner of sundry Python passageways to explore.

http://www.vex.net/parnassus/

nononsense · Oct 19, 2005

Quote from prt_systems:

... It really is amazing how well this all works with no payment to anyone required. It seems to me that $soft just keeps making it more difficult to develop for their platforms (roadblocks, continuing fee increases, etc.) and the open source solutions just keep getting better and better. If I have the choice, I no longer choose $soft.

Great words of wisdom. You couldn't sum it up better prt.

I'm not sure whether $soft can do any better. They are simply lazy, dumb scrooges compared to the army of nimble, smart developers working on and improving open-source tools like Python.