storing and accessing option quotes

sle · Feb 3, 2013

Up until now I've only worked with historical data for a small set of underlying assets, mostly indices. Right now I am storing them in a file-system-based hierarchy: a directory for each symbol, with one sub-directory for option chains and one for volatility surfaces. The option chain for each day lives in its own file, and the same goes for volatility surfaces. This gives me granular access across assets and dates and the flexibility to store volatility surfaces in detail. Later I aggregate the volatility surfaces into historical implied volatility files; most of my filters/models deal with the actual option chain and volatility surface files as well as the aggregated data.
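For readers who want to picture the layout, here is a minimal sketch of how such a hierarchy could be addressed from code. The sub-directory names (`chains`, `surfaces`), the ISO-date file names, and the `.csv` extension are all assumptions made for illustration; the post does not say what the real names or file formats are.

```python
from datetime import date
from pathlib import Path

# Hypothetical names: the sub-directory names and the ".csv" extension
# are assumptions, not the poster's actual layout.
CHAIN_DIR = "chains"
SURFACE_DIR = "surfaces"


def daily_file(root: Path, symbol: str, kind: str, day: date) -> Path:
    """Return the path of one symbol's file for one trading day."""
    if kind not in (CHAIN_DIR, SURFACE_DIR):
        raise ValueError(f"unknown kind: {kind}")
    return root / symbol / kind / f"{day.isoformat()}.csv"


def days_available(root: Path, symbol: str, kind: str) -> list[str]:
    """List the ISO dates for which a file exists, in sorted order."""
    folder = root / symbol / kind
    return sorted(p.stem for p in folder.glob("*.csv"))
```

With ISO dates as file names, a plain lexical sort of the names is also a chronological sort, which keeps sequential, date-ordered access simple.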

However, I am now planning to expand to many more underlying assets (from about a hundred to a few thousand) and am starting to wonder whether there will be any performance penalty for storing so many files and directories on disk. In essence, there will be 2k directories, each with 2.5k daily files. Is that going to be an issue? If yes, what are my alternatives (hardware and software)?

luckyputanski · Feb 3, 2013

A solid-state drive is the first thing that comes to mind; also pick a filesystem designed to handle lots of small files.
As for SSDs, I have personal experience: a test that takes 40 seconds on SATA takes about 0.4 seconds on an SSD. That was a lot of random access; your improvement might be better or worse.

Edit: Of course by SATA I meant an old-fashioned spinning disk on SATA (as an SSD can be SATA as well).

sle · Feb 3, 2013

Right now it's all running on Linux under ext4, supposedly a very fast file system. I just looked at SSDs and they are surprisingly cheap (about $1/GB up to 500 GB or so), so that might be the way to go. The EOD option data is only 110 GB, so I can put it onto a dedicated drive and the problem is solved.

hft_boy · Feb 3, 2013

Quote from sle:

Right now it's all running on Linux under ext4, supposedly a very fast file system. I just looked at SSDs and they are surprisingly cheap (about $1/GB up to 500 GB or so), so that might be the way to go. The EOD option data is only 110 GB, so I can put it onto a dedicated drive and the problem is solved.

Yeah, ext4 is quite good (although I have never compared it to anything else myself). I also read that ReiserFS is good here: http://serverfault.com/questions/6711/filesystem-for-millions-of-small-files. Alternatively, just use a database.

If I were you, I would try expanding to the several thousand assets first and see if there is a performance issue for your particular usage pattern, and then figure out how to fix it.
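A rough way to run that experiment is to build a synthetic tree and time some reads. The sketch below is only an assumption about what such a test could look like: the directory and file names, the 1 KB dummy payload, and the sample sizes are all made up, and it reads files in random order, which is probably harsher than a real date-ordered access pattern.

```python
import random
import tempfile
import time
from pathlib import Path


def build_tree(root: Path, n_symbols: int, n_days: int, size: int = 1024) -> list[Path]:
    """Create n_symbols directories with n_days dummy files each."""
    payload = b"x" * size
    paths = []
    for s in range(n_symbols):
        folder = root / f"SYM{s:04d}" / "chains"
        folder.mkdir(parents=True)
        for d in range(n_days):
            path = folder / f"day{d:04d}.csv"
            path.write_bytes(payload)
            paths.append(path)
    return paths


def time_reads(paths: list[Path], sample: int, seed: int = 0) -> float:
    """Return the seconds taken to read a random sample of the files."""
    rng = random.Random(seed)
    chosen = rng.sample(paths, min(sample, len(paths)))
    start = time.perf_counter()
    for path in chosen:
        path.read_bytes()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Small numbers so the sketch finishes quickly; scale toward
    # 2000 symbols x 2500 days to mimic the case in the thread.
    with tempfile.TemporaryDirectory() as tmp:
        files = build_tree(Path(tmp), n_symbols=20, n_days=50)
        elapsed = time_reads(files, sample=500)
        print(f"read 500 of {len(files)} files in {elapsed:.4f}s")
```

Freshly written files are likely still in the operating system's page cache, so a run like this mostly measures warm-cache reads. To say anything about the disk itself, the tree would need to be much larger than RAM or the cache cleared between the build and the read phases.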

sle · Feb 3, 2013

Quote from hft_boy:
Alternatively just use a database.


I want to avoid that; it would be yet another software package to maintain and optimize. The access is non-random.

Quote from hft_boy:
If I were you, I would try expanding to the several thousand assets first and see if there is a performance issue for your particular usage pattern, and then figure out how to fix it.

I had my junior do that and he ran into some sort of issue, but I think it had to do with him trying to delete that directory. I guess I have to do it on my own.

hft_boy · Feb 3, 2013

Quote from sle:

I had my junior do that and he ran into some sort of issue, but I think it had to do with him trying to delete that directory. I guess I have to do it on my own.

Good luck!

cdcaveman · Feb 3, 2013

To me there are many reasons to use a database. I understand that it adds overhead, but I think the advantages are so great that avoiding one is counterintuitive. I've looked into this very issue in some depth. I've done some web development and have seen how quickly a flat-file system gets bloated and inefficient. Besides that, Structured Query Language was the easiest language I've ever learned, and a quick lesson in it would help you get to your data a lot faster. You can filter and even apply logic within the query to pull out the time frame or any range or arrangement you desire. How quickly can you express this: I want the price series on options from 5 days to expiration through expiration, for the last three years? It seems like a lot of work to navigate through flat files and drag and select the ranges. It's much easier to say
SELECT prices
FROM table_options
WHERE field1 BETWEEN value1 AND value2
AND field2 BETWEEN value3 AND value4
AND etc etc.

syntax isn't exactly right but you get the point
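For what it's worth, a runnable version of that kind of query might look like the sketch below. It uses Python's built-in `sqlite3` module rather than MySQL purely so the example is self-contained. The table name, the column names, and the three sample rows are all invented for illustration, and "5 days" is taken to mean calendar days.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE option_quotes (
    symbol     TEXT NOT NULL,
    quote_date TEXT NOT NULL,  -- ISO date, e.g. 2012-12-18
    expiry     TEXT NOT NULL,  -- ISO date
    strike     REAL NOT NULL,
    call_put   TEXT NOT NULL,  -- 'C' or 'P'
    price      REAL NOT NULL
);
CREATE INDEX idx_quotes ON option_quotes (symbol, quote_date);
"""

QUERY = """
SELECT quote_date, expiry, strike, call_put, price
FROM option_quotes
WHERE symbol = ?
  AND quote_date BETWEEN ? AND ?
  AND julianday(expiry) - julianday(quote_date) BETWEEN 0 AND 5
ORDER BY expiry, quote_date
"""


def last_days_before_expiry(conn, symbol, start, end):
    """Quotes for one symbol taken within 5 calendar days of expiry."""
    return conn.execute(QUERY, (symbol, start, end)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Dummy rows with made-up prices, only to exercise the query.
conn.executemany(
    "INSERT INTO option_quotes VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("SPX", "2012-12-10", "2012-12-21", 1400.0, "C", 31.0),  # 11 days out
        ("SPX", "2012-12-18", "2012-12-21", 1400.0, "C", 40.5),  # 3 days out
        ("SPX", "2012-12-21", "2012-12-21", 1400.0, "C", 30.2),  # expiry day
    ],
)
rows = last_days_before_expiry(conn, "SPX", "2010-02-03", "2013-02-03")
for row in rows:
    print(row)
```

Only the second and third sample rows fall inside the 5-day window, so the first one is filtered out. The index on `(symbol, quote_date)` is the sort of thing that lets the database narrow the search without scanning every row.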

You can save queries and reference them instead of writing them out each time. Indexing makes it even faster, on top of how much faster it already is than flat files. There are many arguments for using a database.
There's also Open Database Connectivity (ODBC): you can use an open-source database like MySQL, which has a huge community of help and resources, and it plugs right into Excel with a driver that is very easy to install.

I've done a lot of stuff with databases. They are way easier to back up than 2000 files; they are literally built for redundancy and efficiency.

cdcaveman · Feb 3, 2013

http://www.mysql.com/products/connector/

The reason I personally have looked into MySQL is that it is so widely used and it's free. Most importantly, I plan to host my data remotely so I don't have the responsibility of keeping the database software up to date; a hosting provider handles all that. I just give it a user name and password and efficiently create, read, update, and delete my data. Not only that, I could access it from any computer in the world whenever I wanted, which is very simple and effective, and I can set up routines to back up the database on a regular basis.