Storing time series data

Discussion in 'App Development' started by ajensen, Feb 21, 2019.

  1. ajensen

    Storing time series data
    by Saeed Amen
    Cuemacro
    February 2, 2019
    Subjects: Data, General, Python
    ...

    When it comes to analysing financial markets, the ingredients are key,
    and the main ingredient is data. With data, however, comes the need
    for storage: to continue our burger metaphor, we need a good “fridge”.
    Most financial data comes in some sort of time series format: you have
    a timestamp, and then some observation, like a price, an economic
    release, a news article, etc. So the question basically amounts to:
    what is the best way to store time series? In this article I’ll go
    through some of the solutions for storing time series data. It is of
    course not an exhaustive list, but hopefully it’s still a start.

    ...

    ****************************************************

    What I do is use CSV or binary files named by the stock symbol. How do you store time series data?
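
    For anyone curious what that looks like in code, here is a minimal
    Python/pandas sketch of the file-per-symbol approach; the data
    directory and the save_bars/load_bars helper names are illustrative
    assumptions, not anything from the article:

    Code:
    import pandas as pd
    from pathlib import Path

    DATA_DIR = Path("data")  # hypothetical layout: one CSV file per symbol

    def save_bars(symbol: str, bars: pd.DataFrame) -> None:
        """Write a DataFrame of bars (indexed by timestamp) to <symbol>.csv."""
        DATA_DIR.mkdir(exist_ok=True)
        bars.to_csv(DATA_DIR / f"{symbol}.csv")

    def load_bars(symbol: str) -> pd.DataFrame:
        """Read the bars back, parsing the first column as a datetime index."""
        return pd.read_csv(DATA_DIR / f"{symbol}.csv",
                           index_col=0, parse_dates=True)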
     
  2. fan27

    I use CSV. It is simple enough and I am not doing a bunch of reads and writes to disk.
     
    tommcginnis likes this.
  3. Snuskpelle

    The key question is whether you can fit all the data into main memory, and whether you have some way to handle the cases where you can't. CSV files named by symbol will get you pretty far, though, and it's what I've used for my amateur stuff. Were I doing HFT analysis, I would be thinking considerably more about storage...
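
    (A minimal sketch of one way to handle the doesn't-fit-in-memory case:
    pandas can stream a big CSV in chunks instead of loading it whole. The
    file name and chunk size below are assumptions.)

    Code:
    import pandas as pd

    # Stream the file in fixed-size chunks rather than loading it at once;
    # "ticks.csv" and the one-million-row chunk size are illustrative.
    total_volume = 0
    for chunk in pd.read_csv("ticks.csv", chunksize=1_000_000):
        total_volume += chunk["volume"].sum()  # any per-chunk aggregation works
    print(total_volume)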
     
  4. tommcginnis

    A lot of generalities flying by here -- in terms of data size, frequency of mass input, definition of "storage" (in size and in duration).... but:

    For a lower-level, general-purpose app, I store, and then tap for one input, a single .csv file.
    The output grows to about 10x the input size, but that is perhaps half graphics.
    Size during computation is not an issue, but if it became one, I would focus on the size, not the readability. (As long as my input files are uncorrupted, I can regenerate anything I wish later...)

    Readability by other programs is a bit of an issue, which .csv spanks.

    I could double or triple operations right now, and not see a storage-related bottleneck, I guess. So the focus is on flexibility and durability. == .csv
     
  5. I am storing time series data in a SQL Server database (I have a lot of experience with SQL Server), and it's working really well.

    I've built a library to make the data available in Python and that works well. I would recommend a relational database for time series data, assuming you are not working at very high frequencies, i.e. bar intervals of a minute or less.
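
    A minimal sketch of what the Python side of such a library can boil
    down to, using SQLAlchemy with the pyodbc driver; the connection
    string, the bars table, and its columns are illustrative assumptions
    rather than the actual schema:

    Code:
    import pandas as pd
    from sqlalchemy import create_engine, text

    # Illustrative SQL Server connection string; adjust server, database
    # and ODBC driver name to your environment.
    engine = create_engine(
        "mssql+pyodbc://user:password@myserver/marketdata"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    )

    def load_bars(symbol: str) -> pd.DataFrame:
        """Pull one symbol's bars into a DataFrame indexed by timestamp."""
        query = text("SELECT ts, [open], high, low, [close], volume "
                     "FROM bars WHERE symbol = :symbol ORDER BY ts")
        return pd.read_sql(query, engine, params={"symbol": symbol},
                           index_col="ts", parse_dates=["ts"])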
     
    tommcginnis and Howard like this.
  6. I use SQLite, with SQLAlchemy to avoid having to code the CRUD operations.

    Here is the worst query, which converts hourly bars to daily ones:

    Code:
    -- CTE: roll the hourly bars up into daily bars. :offsetHour and
    -- :offsetMinute are SQLite date() modifiers (e.g. '+17 hours') that
    -- shift bar timestamps so each trading session maps to one calendar date.
    with daily_bars(time_series_id,time_series_name,open_datetime,close_datetime,low,high,volume,openinterest) as (
    select
        time_series_id,
        time_series.name,
        min(baropen_datetime) open_datetime,  -- first hourly bar of the day
        max(baropen_datetime) close_datetime, -- last hourly bar of the day
        min(low) as low,
        max(high) as high,
        sum(volume) as volume,
        sum(openinterest) as openinterest
    from
        bar_data
    inner join
        time_series
    on
        time_series_id = time_series.id
    where
        time_series.bar_size = 'HOURLY'
        and time_series_id = :time_series_id
        and date(baropen_datetime,:offsetHour,:offsetMinute) >= :start
        and date(baropen_datetime,:offsetHour,:offsetMinute) <= :end
    group by
        time_series_id, date(baropen_datetime,:offsetHour,:offsetMinute)
    order by
        min(baropen_datetime) desc
    limit
        :window  -- keep only the most recent :window days
    )
    -- Join bar_data back in twice: bdo supplies the day's true open (from
    -- the first hourly bar) and bdc the true close (from the last one).
    select
        db.time_series_id,
        db.open_datetime as baropen_datetime,
        bdo.open as open,
        db.high as high,
        db.low as low,
        bdc.close as close,
        db.volume as volume,
        db.openinterest as openinterest
    from
        daily_bars db
    inner join
        bar_data bdo
    on
        bdo.baropen_datetime = db.open_datetime and
        bdo.time_series_id = db.time_series_id
    inner join
        bar_data bdc
    on
        bdc.baropen_datetime = db.close_datetime and
        bdc.time_series_id = db.time_series_id
    order by
        baropen_datetime asc
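
    For a flavour of the SQLAlchemy side, here is a minimal sketch of
    declarative models matching the schema the query implies (time_series
    and bar_data); the column types are assumptions:

    Code:
    from sqlalchemy import (Column, DateTime, Float, ForeignKey, Integer,
                            String, create_engine)
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class TimeSeries(Base):
        """One row per instrument / bar-size combination."""
        __tablename__ = "time_series"
        id = Column(Integer, primary_key=True)
        name = Column(String, nullable=False)
        bar_size = Column(String, nullable=False)  # e.g. 'HOURLY'

    class BarData(Base):
        """One row per bar; columns mirror those used in the query above."""
        __tablename__ = "bar_data"
        id = Column(Integer, primary_key=True)
        time_series_id = Column(Integer, ForeignKey("time_series.id"))
        baropen_datetime = Column(DateTime, nullable=False)
        open = Column(Float)
        high = Column(Float)
        low = Column(Float)
        close = Column(Float)
        volume = Column(Float)
        openinterest = Column(Float)

    engine = create_engine("sqlite:///bars.db")  # file name is illustrative
    Base.metadata.create_all(engine)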
    
    
     
    silveredge and ajensen like this.
  7. Felix168

    Assuming that you are interested in technical analysis, storing the data is unfortunately only half of the problem.

    The stock data most likely also needs to be adjusted, so that you can properly run indicators on it. Adjusting data is an operation which goes _back_ in time, so you need to purge all the quotes you have stored and replace them with freshly adjusted quotes.

    Now you might think that this happens only on rare occasions. If you have stock splits in mind, that's probably true. However, data should also be adjusted for cash dividends (which most data feeds don't do).
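
    To make the "going back in time" concrete, here is a minimal pandas
    sketch of proportional back-adjustment for cash dividends (splits work
    the same way, using the split ratio); the column names are
    assumptions, and real vendors' adjustment rules vary:

    Code:
    import pandas as pd

    def back_adjust(bars: pd.DataFrame, dividends: pd.Series) -> pd.DataFrame:
        """Proportionally back-adjust OHLC prices for cash dividends.

        bars: DataFrame with open/high/low/close columns, indexed by date.
        dividends: cash dividend amounts indexed by ex-date.
        """
        div = dividends.reindex(bars.index, fill_value=0.0)
        prev_close = bars["close"].shift(1)
        # On an ex-date the price drops by the dividend, so every bar
        # *before* that date gets scaled down by this per-day ratio.
        ratio = ((prev_close - div) / prev_close).fillna(1.0)
        # Cumulative factor for each date: product of all later ratios.
        factor = ratio[::-1].cumprod()[::-1].shift(-1, fill_value=1.0)
        return bars[["open", "high", "low", "close"]].mul(factor, axis=0)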

    If you are just a single guy, with no additional infrastructure to support you, maintaining a proper quote database will become a pretty substantial job. That's when it might make sense to choose a data provider that does this work for you.

    I personally use Norgate Data. Their tool, the Norgate Data Updater, maintains a database in some proprietary format. It takes care of the updating by downloading new data on a pre-defined schedule, and (as it seems) they keep the split and dividend information separate from the quotes. Software can then connect to this database through their API and request data with varying degrees of adjustment: unadjusted, adjusted for capital restructuring, or adjusted for capital restructuring and cash dividends.

    FWIW, my TuringTrader project supports Norgate Data: https://www.turingtrader.org/2019/01/now-supporting-norgate-data/


    Cheers, Felix
     
    tommcginnis likes this.
  8. Craig66

    For what it's worth, I use binary files of tick data indexed by symbol and date.
    When you need to test across years of tick data with hundreds of symbols, CSV just isn't going to cut it.
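
    As a rough sketch of that kind of layout, NumPy structured arrays give
    compact fixed-width binary records, one file per symbol per date; the
    record fields and directory naming are assumptions:

    Code:
    import numpy as np
    from pathlib import Path

    # Fixed-width binary tick record; the fields are illustrative.
    TICK_DTYPE = np.dtype([("ts_ns", "i8"),   # timestamp, epoch nanoseconds
                           ("price", "f8"),
                           ("size", "i4")])

    def tick_path(symbol: str, date: str) -> Path:
        # e.g. ticks/AAPL/2019-02-21.bin
        return Path("ticks") / symbol / f"{date}.bin"

    def write_ticks(symbol: str, date: str, ticks: np.ndarray) -> None:
        """Dump a TICK_DTYPE array to its symbol/date file."""
        path = tick_path(symbol, date)
        path.parent.mkdir(parents=True, exist_ok=True)
        ticks.tofile(path)

    def read_ticks(symbol: str, date: str) -> np.ndarray:
        # np.fromfile reads the fixed-width records straight back; for very
        # large files np.memmap avoids loading everything into memory.
        return np.fromfile(tick_path(symbol, date), dtype=TICK_DTYPE)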
     
  9. Howard

    *.csv. But I only use 1-minute data.
     
  10. tommcginnis

    Why for, "worst"?? That looked like a template that would have my Functional Programming friends all tingly with joy!
     
    #10     Feb 21, 2019