Database organization

Discussion in 'App Development' started by cjbuckley4, Oct 18, 2014.

  1. Risk619

    One important thing to consider about an RDBMS is that it lets you organize things much better. Regarding ten years of tick data across every instrument on every exchange: very rarely will you need that level of detail. On the other hand, if you develop a strategy six months from now, you don't know today what data you'll need to test it.

    I've approached market center data with two minds:

    1) Store everything you can get your hands on, so long as it doesn't slow down anything else you actually need. That is to say that pulling down global tick and market depth data across all instruments is terrific, but it shouldn't negatively impact your strategy's performance.

    2) Your strategy(ies) should be hyper-optimized and largely decoupled from the storage of "everything". If you only need hourly data, figure out a way to make that path as lean and laser-focused as possible. Optimize the hot code paths.

    Having data in an RDBMS means that you can slice and pivot it around very quickly. Depending on your hardware, vendor, and table/indexing strategy, much of the working set actually ends up in the buffer pool, held in RAM, which negates (or at least greatly reduces the significance of) the SSD vs. spindles debate.

    And if you decide six months from now that you want to make candles out of the whole mess and data mine MACD signals, it would take you ~10 minutes to write the query. Fire it off, grab some coffee, and your results will be in front of you.
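
    The MACD half of that is barely any work either: once the candles come back from the query, the indicator is a handful of lines of application code. Roughly something like this (an illustrative C# sketch only; it assumes a plain list of closing prices as input and the usual 12/26/9 periods):

    Code:
    using System.Collections.Generic;

    static class Macd
    {
        // MACD line = EMA(fast) - EMA(slow); signal line = EMA of the MACD line.
        public static (double[] MacdLine, double[] Signal) Compute(
            IReadOnlyList<double> closes, int fast = 12, int slow = 26, int signalPeriod = 9)
        {
            var fastEma = Ema(closes, fast);
            var slowEma = Ema(closes, slow);
            var macd = new double[closes.Count];
            for (int i = 0; i < closes.Count; i++)
                macd[i] = fastEma[i] - slowEma[i];
            return (macd, Ema(macd, signalPeriod));
        }

        // Plain exponential moving average, seeded with the first value.
        private static double[] Ema(IReadOnlyList<double> values, int period)
        {
            var result = new double[values.Count];
            if (values.Count == 0) return result;
            double k = 2.0 / (period + 1);
            result[0] = values[0];
            for (int i = 1; i < values.Count; i++)
                result[i] = values[i] * k + result[i - 1] * (1 - k);
            return result;
        }
    }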
     
    #101     Feb 7, 2015
  2. Butterfly

    Exactly. Anyone serious about trading doesn't bother with silly technical details like SSDs or seek times or binary files; that's the trademark of basement amateurs who have nothing to build except hoards of data in the GBs and TBs to make themselves seem more important than they are, since there is nothing else they can do with that much data. Scanning random prices at millisecond resolution to see what happened in the past reeks of pure amateurism. That's not how you build a successful trading strategy.

    In a professional environment, with real traders, you want to factor your data, not hoard it like a kleptomaniac. You build a proper data warehouse of market data so you can reuse it without having to deal with thousands of flat files that become completely worthless as time goes on.
     
    #102     Feb 7, 2015
    Within a file, data is logically stored in sequence. The file may be physically fragmented across different parts of the disk, but that's where defragmentation comes in. Where multiple files end up on disk is up to your filesystem, and that placement is usually fairly arbitrary.

    I would have to go along with the other guys who favor relational databases, though. Even though binary files are faster, I haven't found them to be worth the time to implement.
     
    #103     Feb 7, 2015
    This has been discussed so many times: an RDBMS is a very bad idea for storing terabytes of time series data. Why do we always come back to this? RDBMSs are not made to handle columnar schemata particularly well.

    Regarding compression of data, I challenge anyone to find a language and implementation that rivals the C# below in terms of speed and length of code. (A one-liner calling a script does not count; you can use standard template libraries, but just loading a Python library and calling 'CompressData[...]' obviously is not a fair comparison.) Keep in mind I do not even optimize by providing mid prices in the original ticks, and there are further optimization opportunities in using structs rather than classes for quotes. Also, the compression can be made a lot faster by fanning out to several tasks/threads for larger datasets.

    Let's compress one year of tick-based data in your SQL RDBMS and measure the time it takes to compress that tick data and load it into memory. A simple binary file plus the algorithm below will run circles around your implementation:

    Code:
    // Bucket the raw ticks into fixed-length bars; frequency (presumably a
    // TimeSpan, e.g. one minute) sets the bar length in DateTime ticks.
    var numberTicks = frequency.Ticks;
    var bars = from rawData in ticks
               group rawData by rawData.TimeStamp.Ticks / numberTicks into tickData
               select new Quote()
               {
                   DataProviderId = dataProviderId,
                   SymbolId = symbolId,
                   QuoteType = QuoteType.OHLC,
                   CompressionType = compressionType,
                   CompressionUnits = compressionUnits,
                   // Reconstruct the bar's start time from the group key.
                   TimeStamp = new DateTime(tickData.Key * numberTicks, DateTimeKind.Utc),
                   // OHLC built from mid prices ((Bid + Ask) / 2).
                   Open = (tickData.First().Bid + tickData.First().Ask) / 2,
                   Low = tickData.Min(bd => bd.Bid + bd.Ask) / 2,
                   High = tickData.Max(bd => bd.Bid + bd.Ask) / 2,
                   Close = (tickData.Last().Bid + tickData.Last().Ask) / 2,
               };
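
    For anyone who wants to compile the snippet on its own, it assumes scaffolding roughly along these lines (illustrative only; the field types are simplified guesses, not the actual classes behind the query):

    Code:
    using System;

    // Simplified stand-ins for the types the query above relies on.
    public class Tick
    {
        public DateTime TimeStamp { get; set; }
        public double Bid { get; set; }
        public double Ask { get; set; }
    }

    public enum QuoteType { OHLC }

    public class Quote
    {
        public int DataProviderId { get; set; }
        public int SymbolId { get; set; }
        public QuoteType QuoteType { get; set; }
        public int CompressionType { get; set; }
        public int CompressionUnits { get; set; }
        public DateTime TimeStamp { get; set; }
        public double Open { get; set; }
        public double High { get; set; }
        public double Low { get; set; }
        public double Close { get; set; }
    }

    // Usage sketch: ticks is an IEnumerable<Tick> loaded however you like,
    // frequency is e.g. TimeSpan.FromMinutes(1); enumerating bars runs the query.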

     
    Last edited: Feb 7, 2015
    #104     Feb 7, 2015
    That surprises me. The speed-up of binary files over an RDBMS is generally > 100x. Try to compress five years' worth of tick data into minute bars alone and you will most likely spend far more time on that exercise than on a simple on-the-fly compression like the one I showed above. The binary file DB I wrote myself does not have all the features one would wish for, but it is fast, fast, and fast.
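
    The core idea is nothing exotic: fixed-size records read sequentially off disk. A stripped-down reader looks roughly like this (an illustrative sketch only, with a made-up record layout of timestamp/bid/ask):

    Code:
    using System;
    using System.Collections.Generic;
    using System.IO;

    // Illustrative layout only: one tick = 8-byte UTC timestamp (DateTime ticks)
    // followed by two 8-byte doubles for bid and ask, i.e. 24 bytes per record.
    public static class BinaryTickFile
    {
        public static IEnumerable<(DateTime Time, double Bid, double Ask)> Read(string path)
        {
            using var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                FileShare.Read, bufferSize: 1 << 20, FileOptions.SequentialScan);
            using var reader = new BinaryReader(stream);

            long records = stream.Length / 24;
            for (long i = 0; i < records; i++)
            {
                var time = new DateTime(reader.ReadInt64(), DateTimeKind.Utc);
                double bid = reader.ReadDouble();
                double ask = reader.ReadDouble();
                yield return (time, bid, ask);
            }
        }
    }

    Reading front to back in one pass like that lets the OS read-ahead do most of the work: no joins, no per-row type conversion, no network hop.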

     
    #105     Feb 7, 2015

  6. Just for kicks, I wanted to try this out. Here's my SQL for EURUSD:
    Code:
    SET STATISTICS TIME ON
    
    DECLARE @ticks int = 100;
    -- Number the ticks in time order so they can be bucketed @ticks at a time.
    with t as (
       SELECT
         *,
         ROW_NUMBER() OVER (ORDER BY Ticks.Time ASC) TickIndex
       FROM dbo.Ticks
       WHERE Ticks.PairId = 188
       AND Ticks.Time BETWEEN '2012-1-1' AND '2012-12-31'
    ), bars as (
       -- Collapse each run of @ticks rows into one bar: time range, Low and High.
       SELECT
         t.TickIndex / @ticks * @ticks TickStamp,
         MIN(time) MinTimeStamp,
         MAX(time) MaxTimeStamp,
         MIN(Bid + Ask) / 2 Low,
         MAX(Bid + Ask) / 2 High
       from t
       GROUP BY t.TickIndex / @ticks
    )
    -- Open and Close need the first/last mid in each bar, hence the two probes.
    select *,
       (SELECT TOP 1 (bid + ask) /2 FROM ticks where Ticks.PairId = 188 AND Ticks.Time >= mintimestamp ORDER BY time asc) [Open],
       (SELECT TOP 1 (bid + ask) /2 FROM dbo.Ticks where Ticks.PairId = 188 AND Ticks.Time <= maxtimestamp ORDER BY Ticks.Time desc) [Close]
    FROM bars

    SQL Server Execution Times:
    CPU time = 0 ms, elapsed time = 0 ms.

    (106152 row(s) affected)

    SQL Server Execution Times:
    CPU time = 13360 ms, elapsed time = 22418 ms.

    Took a while for SSMS to display all 100k rows.
     
    #106     Feb 7, 2015
    You like this idea, don't you? ;-)

    106,000 rows for one year of tick data? Something is obviously not right here. I see around 200,000 data points per day on average for my currency tick data on liquid pairs.

    The time you showed is shockingly high. For 106k data points I can manage under 1 second, even including loading the tick data into memory and compressing it.

    quote="John Tseng, post: 4083971, member: 485453"]Just for kicks, I wanted to try this out. Here's my sql for EURUSD:
    Code:
    SET STATISTICS TIME ON
    
    DECLARE @ticks int = 100;
    with t as (
       SELECT
         *,
         ROW_NUMBER() OVER (ORDER BY Ticks.Time ASC) TickIndex
       FROM dbo.Ticks
       WHERE Ticks.PairId = 188
       AND Ticks.Time BETWEEN '2012-1-1' AND '2012-12-31'
    ), bars as (
       SELECT
         t.TickIndex / @ticks * @ticks TickStamp,
         MIN(time) MinTimeStamp,
         MAX(time) MaxTimeStamp,
         MIN(Bid + Ask) / 2 Low,
         MAX(Bid + Ask) / 2 High
       from t
       GROUP BY t.TickIndex / @ticks
    )
    select *,
       (SELECT TOP 1 (bid + ask) /2 FROM ticks where Ticks.PairId = 188 AND Ticks.Time >= mintimestamp ORDER BY time asc) [Open],
       (SELECT TOP 1 (bid + ask) /2 FROM dbo.Ticks where Ticks.PairId = 188 AND Ticks.Time <= maxtimestamp ORDER BY Ticks.Time desc) [Close]
    FROM bars

    SQL Server Execution Times:
    CPU time = 0 ms, elapsed time = 0 ms.

    (106152 row(s) affected)

    SQL Server Execution Times:
    CPU time = 13360 ms, elapsed time = 22418 ms.

    Took a while for SSMS to display all 100k rows.[/quote]
     
    #107     Feb 7, 2015
  8. =)

    Not only is the time slow, I also had to write those two subqueries to force SQL into two nested loop joins, and it looks ugly as well. (The CTE was probably overkill, though.) I knew SQL would be slower, especially at open/close; I don't know how to get SQL to do this in one pass, whereas GroupBy spools it up for you.

    The reason I don't mind is that my tests take 12 hours, so shaving 16 seconds off is not a big deal.
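
    For reference, the one-pass version is easy on the application side once the ticks are already in time order; something roughly like this (an illustrative C# sketch, with made-up Tick/Bar shapes):

    Code:
    using System;
    using System.Collections.Generic;

    // One forward scan over time-ordered ticks: every N ticks becomes a bar,
    // so open/high/low/close are picked up without any self-joins or lookups.
    public record Tick(DateTime Time, double Bid, double Ask);
    public record Bar(DateTime Start, DateTime End, double Open, double High, double Low, double Close);

    public static class OnePass
    {
        public static IEnumerable<Bar> ToBars(IEnumerable<Tick> ticks, int ticksPerBar)
        {
            int count = 0;
            DateTime start = default, end = default;
            double open = 0, high = double.MinValue, low = double.MaxValue, close = 0;

            foreach (var t in ticks)
            {
                double mid = (t.Bid + t.Ask) / 2;
                if (count == 0) { start = t.Time; open = mid; high = double.MinValue; low = double.MaxValue; }
                if (mid > high) high = mid;
                if (mid < low) low = mid;
                close = mid;
                end = t.Time;

                if (++count == ticksPerBar)
                {
                    yield return new Bar(start, end, open, high, low, close);
                    count = 0;
                }
            }

            // Emit the final partial bar, if any ticks are left over.
            if (count > 0)
                yield return new Bar(start, end, open, high, low, close);
        }
    }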
     
    #108     Feb 7, 2015
    Which was my point all along: the bottleneck usually lies with the strategy code, not the data loading and preparation. But saying that compressing 100k or so data points in SQL does not slow one down enough to matter is not accurate, imho. A SQL database is by definition slower for columnar data structures than a store that is optimized to operate on time series data; I hope we can agree on that. If the compression size is static then one can store the compressed bars in the database, but you still end up with slower queries in SQL than loading the data from a binary data structure.

     
    Last edited: Feb 8, 2015
    #109     Feb 8, 2015
  10. Risk619

    Who's using that level of data in a strategy, though (talking about code running constantly in an alpha/risk/transaction cost/execution/pc model)?

    I can't imagine any theoretical or evidence-based strategy that needs more than a few days of tick data to run, maybe a few weeks at most.

    I'm talking about lugging the data around in an RDBMS for backtesting, in which case it doesn't really matter (within reason) what runs circles around what.

    I believe it's worth pointing out that the Nasdaq exchange itself runs on SQL Server. So to say that a financial application built on an RDBMS can't handle data that itself comes out of an RDBMS seems like a stretch.
     
    #110     Feb 8, 2015