EOD database with Excel and Python/MATLAB integration

Discussion in 'Data Sets and Feeds' started by equis, Jul 16, 2012.

  1. equis

    equis

    I am looking for a database solution to query my EOD stocks and options data.

    Below is a description of my data set and of my requirements for the DB system.

    It would be great if you could suggest a DB system which might be suitable for this purpose. Thank you very much in advance.

    DESCRIPTION OF THE DATA SET:
    -----------------------------------------

    STOCKS DATA TABLE:
    - Only EOD data
    - Contains EOD prices & volume data for ~10,000 stocks --> about 30,000,000 records in total
    - About 15 years of EOD data


    OPTIONS DATA TABLE:
    - Only EOD data
    - Options on about 5,000 underliers
    - Data set contains ~100,000,000 record (--> a records is one quote per option per underlyer per day)
    - About 10 years of EOD data


    REQUIREMENTS FOR DATABASE SYSTEM:
    ---------------------------------------------------

    - Integrates well with Python or MATLAB: Easy to read data into Python or MATLAB (very important) and easy to write from these programs back into the DB (a bit less important)
    - Integrates well with Excel (very important): Ideally there is already an Excel add-in which adds functions to Excel which allow me to read data directly from the database into the relevant cell / array in Excel (similar to what the Bloomberg [such as =BDP, =BDH] and Reuters add-in are doing)
    - Database does not require a lot of "low level" work and maintenance
    - Ideally the database is free and open source
    - Performance is not a major issue: Since I am only using EOD data, performance is not my major concern, the above mentioned integration with Excel/Python/MATLAB is more important. Nevertheless, the DB should be performant enough to run "SELECT" queries (such as filtering the options data set by the underlying stock ID) in a reasonable amount of time (<10 sec).
    - It's fine if the DB is non-SQL: Given my bad experiences with SQL databases (MySQL and MS SQL Server) that I have used for far for this purpose (SELECT queries were extremely slow on the ~100m records options data set, some times took >30 min to execute), I would be happy to also consider non-SQL solutions.
    - DB can store multiple fields for the same time series: For example in my stock price table I would like to be able to store both price and volume data for the same day.
     
  2. Currently planning a similar project (EOD stock database for Python and Matlab access).

    I have little practical experience with SQL, but I'm under the impression that it is the best solution. Can't think of any reason why you would need monster queries such as you describe. Speed shouldn't be an issue when you approach your problem from the right angle.
     
  3. After more reading, it seems that HDF5 is more suitable for time series. Well-supported by both Python (PyTables) and Matlab.
     
  4. I just solved for this problem. I am using txt files: 1 per underlying/day for optoins data.
    Stock data through SQL server.

    SQL is too slow for the options data. With txt files, things aren't the fastest, they are easier to deal with. However, once you set it up, it's a pita to move them around and re-create them.
     
  5. Tick accurate sql consolidations are very fast.. ie.. return single record queries across 50,000 instruments x 5 years... 20TB of optimized and properly indexed tickdata takes< 100ms.

    Returning large data sets is limited to network bandwidth. 10gb switches and sata 3 drives are capable of 600mb/sec.

    It takes a lot of infrastructure & skill to implement properly. Historic analysis, testing is different then real time message based processing. Especially when analyzing folio's and pairs.

    Just moving TB's of data is challenging... You really can't have a generalized one size fits all model and expect uber optimization and speed. If you architect the system for specific queries and build out different data sets your performance gains can be in orders of magnitude better.

    Take a look at sqlite inmemory db's and mklite bridge between sqlite and matlab.

     
  6. 2rosy

    2rosy

    youdont need hdf5 for endof day data. any relational db would work fine. you can connect to them with any language with odbc or natively(most of the time)
     
  7. Thanks, interesting info. However, we are talking EOD data here.
     
  8. You don't need a DB layer at all, really, but I might use tick data someday, so it can't hurt to plan ahead.

    I suppose HDF5 is less complex compared to SQL, which is a plus for a DB neophyte like myself.

    BTW, another ET thread on HDF5:
    http://www.elitetrader.com/vb/printthread.php?threadid=152187

    Good thread on the usefulness of DBs:
    http://www.elitetrader.com/vb/printthread.php?threadid=183326
     
  9. I figured that there was some way to do this. I mean some pretty serious corporations use SQL Server for intense tasks. But what you have talked about is way above my knowledge base.

    I chose text files because the filesystem serves as a hashtable: I have date and underlying as keys and then retrieve the whole vol surface.

    In the SQL database I had before, data was added by day in a flat table. If I wanted data for 1 day it was really fast. But if i wanted a timeseries of data it was really slow.