Storing time series data
by Saeed Amen, Cuemacro, February 2, 2019
Subjects: Data, General, Python

... When it comes to analysing financial markets, the ingredients are key, and the key ingredient is the data. With data, however, comes the need for storage. We need a good "fridge", to continue the burger metaphor. Most financial data comes in some sort of time series format: a timestamp paired with an observation, such as a price, an economic release, or a news article. So the question essentially amounts to: what is the best way to store time series? In this article I'll go through some of the solutions for storing time series data. It is of course not an exhaustive list, but hopefully it's a start. ...

****************************************************

What I do is use CSV or binary files named by the stock symbol. How do you store time series data?
The key question is whether you can fit all the data into main memory, and whether you have some way to handle the cases where it does not. CSV files named by symbol will get you pretty far, though, and it's what I've used for my amateur stuff. If I were doing HFT analysis, I would think considerably more about storage...
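For concreteness, here is a minimal sketch of the one-CSV-per-symbol approach described in the two posts above, assuming a flat data directory and pandas; the directory layout and column handling are illustrative, not anyone's actual setup.

Code:
# Minimal sketch of per-symbol CSV storage (layout is an assumption).
import os
import pandas as pd

DATA_DIR = "data"  # hypothetical directory holding one CSV per symbol

def save_bars(symbol, bars):
    """Append a DataFrame of bars to the symbol's own CSV file."""
    path = os.path.join(DATA_DIR, symbol + ".csv")
    bars.to_csv(path, mode="a", header=not os.path.exists(path))

def load_bars(symbol):
    """Load one symbol's full history into memory."""
    path = os.path.join(DATA_DIR, symbol + ".csv")
    return pd.read_csv(path, index_col=0, parse_dates=True)

One file per symbol also fits the fit-it-in-main-memory point: you rarely need every symbol loaded at once, only the one you are working on.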
A lot of generalities flying by here -- in terms of data size, frequency of mass input, definition of "storage" (in size and in duration)... but: for a lower-level, general-purpose application, I store, and then tap for input, a single .csv file. The output grows to about 10x the input size, but perhaps half of that is graphics. Size during computation is not an issue, but if it became one, I would focus on the size, not the readability. (As long as my input files are uncorrupted, I can regenerate anything I wish later...) Readability by other programs is a bit of an issue, and that's where .csv wins. I could double or triple operations right now and not see a storage-related bottleneck, I guess. So the focus is on flexibility and durability. == .csv
I am storing time series data in a SQL Server database (I have a lot of experience with SQL Server), and it's working really well. I've built a library to make the data available in Python, and that works well too. I would recommend a relational database for time series data, assuming you are not working at very high frequencies, i.e. bars of a minute or less.
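As a hedged sketch of making SQL Server data available in Python (the poster's actual library isn't shown), something like the following works with SQLAlchemy and pandas; the connection string, table, and column names here are assumptions.

Code:
# Sketch: read a time series from SQL Server into pandas.
# Requires the pyodbc package and a Microsoft ODBC driver.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine(
    "mssql+pyodbc://user:password@server/marketdata"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)  # hypothetical connection string

def load_series(symbol):
    # hypothetical 'prices' table with ts/price/symbol columns
    query = text(
        "SELECT ts, price FROM prices "
        "WHERE symbol = :symbol ORDER BY ts"
    )
    with engine.connect() as conn:
        return pd.read_sql(query, conn, params={"symbol": symbol},
                           index_col="ts", parse_dates=["ts"])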
I use SQLite, with SQLAlchemy to avoid having to hand-code the CRUD operations. Here is the worst query, which converts hourly bars to daily bars:

Code:
with daily_bars(time_series_id, time_series_name, open_datetime,
                close_datetime, low, high, volume, openinterest) as (
    select time_series_id,
           time_series.name,
           min(baropen_datetime) open_datetime,
           max(baropen_datetime) close_datetime,
           min(low) as low,
           max(high) as high,
           sum(volume) as volume,
           sum(openinterest) as openinterest
    from bar_data
    inner join time_series on time_series_id = time_series.id
    where time_series.bar_size = 'HOURLY'
      and time_series_id = :time_series_id
      and date(baropen_datetime, :offsetHour, :offsetMinute) >= :start
      and date(baropen_datetime, :offsetHour, :offsetMinute) <= :end
    group by time_series_id, date(baropen_datetime, :offsetHour, :offsetMinute)
    order by min(baropen_datetime) desc
    limit :window
)
select db.time_series_id,
       db.open_datetime as baropen_datetime,
       bdo.open as open,
       db.high as high,
       db.low as low,
       bdc.close as close,
       db.volume as volume,
       db.openinterest as openinterest
from daily_bars db
inner join bar_data bdo
    on bdo.baropen_datetime = db.open_datetime
    and bdo.time_series_id = db.time_series_id
inner join bar_data bdc
    on bdc.baropen_datetime = db.close_datetime
    and bdc.time_series_id = db.time_series_id
order by baropen_datetime asc
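For readers who haven't used SQLAlchemy this way, here is a minimal sketch of the ORM mapping implied by the query above, which is what removes the hand-written CRUD. The table and column names mirror the query; everything else is an assumption.

Code:
# Sketch: SQLAlchemy ORM models matching the tables in the query above.
from sqlalchemy import (Column, DateTime, Float, ForeignKey,
                        Integer, String, create_engine)
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class TimeSeries(Base):
    __tablename__ = "time_series"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    bar_size = Column(String, nullable=False)  # e.g. 'HOURLY'

class BarData(Base):
    __tablename__ = "bar_data"
    time_series_id = Column(Integer, ForeignKey("time_series.id"),
                            primary_key=True)
    baropen_datetime = Column(DateTime, primary_key=True)
    open = Column(Float)
    high = Column(Float)
    low = Column(Float)
    close = Column(Float)
    volume = Column(Float)
    openinterest = Column(Float)

engine = create_engine("sqlite:///bars.db")  # hypothetical file name
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

With the models in place, inserts and simple reads are one-liners through a Session, and only the odd aggregation query like the one above needs raw SQL.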
Assuming that you are interested in technical analysis, storing the data is unfortunately only half of the problem. The stock data most likely also needs to be adjusted, so that you can properly run indicators on it. Adjusting data is an operation which goes _back_ in time, so you need to purge all the quotes you have stored and replace them with freshly adjusted quotes. Now you might think that this happens only on rare occasions. If you have stock splits in mind, that's probably true. However, data should also be adjusted for cash dividends (which most data feeds don't do). If you are just a single guy, with no additional infrastructure to support you, maintaining a proper quote database will become a pretty substantial job. That's when it might make sense to choose a data provider which does this work for you. I personally use Norgate Data. Their tool, the Norgate Data Updater, maintains a database in some proprietary format. It takes care of the updating by downloading new data on a pre-defined schedule, and (as it seems) they keep the split and dividend information separate from the quotes. Software can then connect to this database with their API and request data with varying degrees of adjustment: unadjusted, adjusted for capital restructuring, or adjusted for capital restructuring and cash dividends. FWIW, my TuringTrader project supports Norgate Data: https://www.turingtrader.org/2019/01/now-supporting-norgate-data/ Cheers, Felix
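To make the "goes back in time" point concrete, here is a hedged sketch of proportional back-adjustment for a single cash dividend; the convention and the numbers are illustrative, not Norgate's actual method.

Code:
# Sketch: back-adjust prices for one cash dividend. Every price
# *before* the ex-date changes, which is why stored quotes must be
# purged and replaced rather than appended to.
import pandas as pd

def back_adjust_dividend(prices, ex_date, dividend):
    """Scale all prices before ex_date by (1 - dividend / last close)."""
    adjusted = prices.copy()
    before = adjusted.index < pd.Timestamp(ex_date)
    last_close = adjusted[before].iloc[-1]  # close on the day before ex-date
    adjusted[before] *= 1.0 - dividend / last_close
    return adjusted

# Example: a 1.00 cash dividend going ex on 2019-01-03.
prices = pd.Series(
    [100.0, 101.0, 100.5, 102.0],
    index=pd.to_datetime(["2019-01-01", "2019-01-02",
                          "2019-01-03", "2019-01-04"]),
)
print(back_adjust_dividend(prices, "2019-01-03", 1.00))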
For what it's worth, I use binary files of tick data indexed by symbol and date. When you need to test across years of tick data with hundreds of symbols, CSV just isn't going to cut it.
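A minimal sketch of what such binary files can look like, using a NumPy structured array with fixed-width records; the record layout and file naming here are assumptions, not the poster's actual format.

Code:
# Sketch: one flat binary tick file per symbol per date.
import numpy as np

TICK_DTYPE = np.dtype([
    ("ts_ns", "u8"),   # epoch timestamp, nanoseconds
    ("price", "f8"),
    ("size", "u4"),
])

def write_ticks(symbol, date, ticks):
    """Dump a structured array of TICK_DTYPE records to disk."""
    ticks.tofile(f"{symbol}_{date}.bin")

def read_ticks(symbol, date):
    """Read back a full day of ticks in one sequential call."""
    return np.fromfile(f"{symbol}_{date}.bin", dtype=TICK_DTYPE)

Fixed-width records mean a day of ticks is a single sequential read with no parsing, which is exactly where CSV falls behind at this scale.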
Why "worst"?? That looked like a template that would have my Functional Programming friends all tingly with joy!