I just edited my original posting; we crossed postings. The intermediate results are actually possible factors. The goal of the research is to determine whether these are important factors or not. At this time we don't know that, but that is common in research. My point is that for research we don't have to worry about such things; they have to be hidden. I don't want to limit myself to the constraints of the database design. If it doesn't work with SQL, I get rid of it.
This is a very interesting topic for me, since I don't have any experience with column-oriented designs. Can I ask how you handle the higher insert cost (or maybe it's not much of a cost, depending on your strategy)? Also, in certain situations, wouldn't the disk seek time be greater than for row-oriented databases if you're pulling multiple columns in parallel?
I have to agree with the OP, though. It sounds a lot like SQL databases without an optimized data store underneath (because that is where the problem comes from) simply are not suitable. Period. I have a similar problem storing tick streams. I end up compressing them and putting them into a blob field, five minutes apiece. It's idiocy to try to store ticks as "one tick per row" and expect great performance. Just too... generic.
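Roughly, the idea looks like this; a simplified sketch only (the Tick layout and the use of zlib are just for illustration, not my actual format): serialize each five-minute bucket of ticks into one byte buffer, compress it, and store that buffer as a single blob.

[code]
// Simplified sketch: pack a 5-minute bucket of ticks into one compressed blob.
// Build with zlib (link -lz). The Tick layout is illustrative only.
#include <cstdint>
#include <cstring>
#include <vector>
#include <zlib.h>

struct Tick {
    std::int64_t timestampMicros;  // exchange timestamp
    std::int32_t priceTicks;       // price in integer ticks to avoid floats
    std::int32_t size;             // trade size
};

// Serialize a bucket of ticks and deflate it into a blob payload.
std::vector<unsigned char> packBucket(const std::vector<Tick>& ticks) {
    if (ticks.empty()) return {};

    // Raw serialization: plain copy of the fixed-size structs.
    std::vector<unsigned char> raw(ticks.size() * sizeof(Tick));
    std::memcpy(raw.data(), ticks.data(), raw.size());

    // Compress with zlib; real code would also store the uncompressed size
    // alongside the blob so it can be inflated later.
    uLongf destLen = compressBound(static_cast<uLong>(raw.size()));
    std::vector<unsigned char> blob(destLen);
    compress(blob.data(), &destLen, raw.data(), static_cast<uLong>(raw.size()));
    blob.resize(destLen);
    return blob;  // write this as one blob row, e.g. keyed by (symbol, bucket start)
}
[/code]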
Yes, I think there are ways like this to work within the slower performance of row-oriented storage and still get good results. It's all a trade-off in my mind, and a good solution might be to use both types of db. I'm getting lots of good feedback on both sides.
In my case it is actually also a size issue. I mean, I store a tick in a handful of bytes (it varies) as a stream. I never search them. So this is a LOT more efficient.
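To show how a tick can come down to a handful of bytes, here is a simplified illustration (not my actual encoding): delta-encode the timestamp and price against the previous tick and write the deltas as variable-length integers.

[code]
// Simplified illustration of a variable-size tick encoding: each tick becomes
// a few bytes by storing zig-zag deltas as LEB128-style varints.
#include <cstdint>
#include <vector>

// Map a signed delta to unsigned so small negative deltas stay small.
static std::uint64_t zigzag(std::int64_t v) {
    return (static_cast<std::uint64_t>(v) << 1) ^ static_cast<std::uint64_t>(v >> 63);
}

// Varint: 1 byte for small values, more bytes only when needed.
static void putVarint(std::vector<std::uint8_t>& out, std::uint64_t v) {
    while (v >= 0x80) { out.push_back(static_cast<std::uint8_t>(v) | 0x80); v >>= 7; }
    out.push_back(static_cast<std::uint8_t>(v));
}

// Append one tick as (time delta, price delta, size); usually just a handful of bytes.
void appendTick(std::vector<std::uint8_t>& stream,
                std::int64_t& prevTimeMicros, std::int32_t& prevPriceTicks,
                std::int64_t timeMicros, std::int32_t priceTicks, std::uint32_t size) {
    putVarint(stream, zigzag(timeMicros - prevTimeMicros));
    putVarint(stream, zigzag(priceTicks - prevPriceTicks));
    putVarint(stream, size);
    prevTimeMicros = timeMicros;
    prevPriceTicks = priceTicks;
}
[/code]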
I am under NDA and cannot reveal a lot, but there is no insert cost at all; we just run up to the performance limits of the disks. There are two ways to access a disk: sequential and random. The way disks are designed, a sequential access that reads a lot of data at once is orders of magnitude faster. On the other hand, if the data are organized naively just for sequential access, that is not the most effective way either, because sometimes you cannot read all the information into memory; sometimes it simply does not fit. Besides, you cannot use vectorized SIMD parallelization with that. So column-oriented sequential access is not the whole picture.

To get maximum performance, the only sure way to know is to benchmark. That is why we made extensive benchmarks to test the real performance with our own data, in increments of 1 MB up to 2 GB. These tests ran for a week. The results fit what we do because the benchmarks are custom-tailored to our needs. The mainstream benchmarks, in my opinion, are useless; too generic. I also suspect the DB vendors design their databases to pass the mainstream benchmark tests and get good reviews. If you make your own tests you get a somewhat different picture. For example, nobody tells you that with .NET 3.5, MS SQL 2008, and MS Server 2008 (i.e. the latest and best versions available) the time to finish the task can grow exponentially; we initially assumed, as common sense, that it would be linear, but it turned out to be exponential. With Sybase it was linear, but they had different problems. No matter how we optimized (with garbage collection, without it), the load was too much for the MS tools. They are simply not designed for such a stressed load.

Bottom line: you cannot do extensive research with a lot of parameters using COTS DB tools. C++ is fine performance-wise (pure C is even better; you don't have the overhead of object creation/destruction), but you need a new DB structure, at least in order to manage a lot of data easily. And this is just for the computations needed to get the screener and backtesting data ready. For the actual screener and backtesting you need a different way to query the data. SQL is a relational data model and is suitable and effective for specific tasks such as "give me a list of all customers from this zip code". But if you need queries such as "how many instruments, and on which D-Day, had a gain of 5% 10 days after D-Day, when such-and-such factors with values 'abcd' were in place up to 10 days ago", SQL is not the best way to do that. Even OLAP is not suitable, because this is a time-series query. OLAP is suitable for inventory-type DB questions like "how many customers, and with which zip codes, had purchases totaling more than such-and-such an amount in dollars"; that is an aggregate query with no time series involved. OLAP is not designed for time series either.
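To illustrate the kind of query I mean, here is a simplified sketch (not our actual code; the column layout, names, and the exact reading of the query are just for illustration) of how such a time-series question maps onto a sequential scan over per-instrument column arrays:

[code]
// Simplified sketch: the "5% gain 10 days after D-Day, given factor 'abcd'
// within the prior 10 days" question as a sequential scan over column arrays.
// Column names and the factor representation are illustrative only.
#include <cstddef>
#include <string>
#include <vector>

struct InstrumentColumns {
    std::string symbol;
    std::vector<double> close;       // one entry per trading day, oldest first
    std::vector<bool>   factorAbcd;  // true on days the factor condition held
};

struct Hit { std::string symbol; std::size_t dDay; };

// Scan each instrument's columns front to back; each column is a contiguous
// array, which is what makes the sequential/vectorized access pattern possible.
std::vector<Hit> scanGainAfterFactor(const std::vector<InstrumentColumns>& universe,
                                     std::size_t lookback = 10,
                                     std::size_t horizon  = 10,
                                     double gain = 0.05) {
    std::vector<Hit> hits;
    for (const auto& inst : universe) {
        const std::size_t n = inst.close.size();
        for (std::size_t d = lookback; d + horizon < n; ++d) {
            // the factor must have been in place at some point up to `lookback` days ago
            bool factorSeen = false;
            for (std::size_t k = d - lookback; k <= d; ++k)
                factorSeen = factorSeen || inst.factorAbcd[k];
            if (!factorSeen) continue;
            if (inst.close[d + horizon] >= inst.close[d] * (1.0 + gain))
                hits.push_back({inst.symbol, d});   // instrument + D-Day that qualified
        }
    }
    return hits;
}
[/code]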
Exactly. When referring to column DBs, it is the storage organization that differentiates them. But this again is the opposite extreme. Usually you get the fastest performance with a fully in-memory organization, but sometimes you cannot just read the full column into memory, especially when you have more columns. It is much better, but we went one step beyond a pure column store after experimenting with it. We invented something like a hybrid store which is dynamically adjusted to the actual data at run time via a dynamic code generator that pings the actual computer's performance at run time.

The traditional DB design is static. No matter how you design your DB, the fact of the matter is that it never meets the requirements of reality. You cannot fit reality into a static design; you need your DB to adjust to reality. Sometimes the best results performance-wise are counterintuitive. Because all we are talking about here is performance, to get the best performance you need to ping your computer's capabilities, and your software has to be able to adjust, performance-wise, to the computer and to the data too. We have this concept of pinging the computer's capabilities with different permutations of data organization, and our software finds the best combination of instructions.

We even created a visual CPU/NVIDIA CUDA GPU SIMD/SIMT simulator and a code generator, to be able to see what actually happens at the byte and bit level inside the processor and to generate code for us automatically. It speeds up development time 10 times. We get a CPU execution-time speedup of 10x from vectorized SIMD parallelization for some of the tasks, and a 100x execution-time speedup with CUDA for some of the tasks. Not all tasks are suitable, but with our simulator and permutation generator it is possible to find automatically, very quickly, what is appropriate for parallelization and how much speedup you can get. The simulator runs through different scenarios, benchmarks them, and makes a list of the fastest, similar to testing trading strategies. I am attaching several screenshots.
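To give a rough idea of the "pinging" concept, a much-simplified sketch (illustrative only, nothing like our actual code generator): time a few candidate layouts or kernels on a sample of the real data on the actual machine, then keep the fastest for the rest of the run.

[code]
// Much-simplified illustration of run-time "pinging": time several candidate
// data layouts / kernels on a sample of the real data and keep the fastest.
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Candidate {
    std::string name;            // e.g. "row-major", "column-major", "blocked"
    std::function<void()> run;   // kernel operating on a sample of the real data
};

// Benchmark each candidate on this machine and return the index of the fastest.
std::size_t pickFastest(const std::vector<Candidate>& candidates, int repetitions = 5) {
    using clock = std::chrono::steady_clock;
    std::size_t best = 0;
    double bestMs = 1e300;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        const auto start = clock::now();
        for (int r = 0; r < repetitions; ++r) candidates[i].run();
        const double ms =
            std::chrono::duration<double, std::milli>(clock::now() - start).count();
        if (ms < bestMs) { bestMs = ms; best = i; }
    }
    return best;  // the layout/kernel used for the rest of the run
}
[/code]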
Very interesting stuff. It sounds like you know what you want out of the system, and it's good to learn why SQL, .NET, etc. didn't work out for you.
SQL, .NET, etc. work for specific tasks, and they were designed for moderate stress loads. In fact MS has an HPC server based on Windows, but sometimes it is better to begin from scratch.

My point is that something interesting is happening in the time we are living in: virtually all the old structures are changing and being replaced at once, almost simultaneously. If you read the latest developments about Goldman Sachs, you can see that all this program trading insanity will end very soon. Up to now, whoever had the connections to install infrastructure in the basements around the stock exchanges, to get the lowest possible latency, wins just from order routing. All this program trading is not trading at all. One of the program trading concepts is to ping for orders and cancel your request within 1 second. Say you do not get an order at the 20.5 level; you go down to 20.4. Eventually you get an order at 20.3. Your computer places 100 more test orders, gets them, and now you know that on the other side there is another computer, possibly an institutional one, with a big order. Now, you are a broker-dealer, you get an incentive for the order flow, and you have a low-latency infrastructure to ping whether you can get all possible orders at 20.1. Eventually you get all of them and sell them to this institutional computer. This is like cornering the market for several seconds. Plus you get your order-flow incentive from the exchange. Now, who can do that? Only the big boys in close proximity to the exchange. They have sub-millisecond latency thanks to the laws of physics, can cancel the order within 1 second, and can perform all of these tricks. And they could not pull off these tricks if they had not pushed for the rules allowing them.

But what does this have to do with trading? I believe this is part of the problem with what happened recently: too many computers outbidding each other too quickly, and the one physically closest to the exchange wins. Now, Goldman Sachs had a leak; computer code was stolen. It is a fact that program trading reporting from the NYSE has gone back to weekly reports instead of the daily reports we had until now. GS had more than 50% of this volume. Something happened. All this has to end, and trading and investment have to return from insanity to common sense. The only way it ends is when the likes of GS fail.
I can't say I am 100% following all that you say. You are saying quite a bit in a short amount of discussion, but with multiple read-throughs I think I am catching most of it. It seems that much of what you are describing is actually an opinion about what types of data you think are useful for creating a stock trading strategy. Not that there is anything wrong with that topic, but there are two separate topics here. First, input data and calculated indicators. Second, backtesting a specific trading strategy that uses the input data and calculated indicators. Other strategy researchers, I am sure, will have other ideas about what input data they need for their backtesting and scans. The 1,000 columns of data that you discuss are what you are using for your research, but the next strategy researcher will have completely different ideas about what they want as input to backtesting stock strategies and scanning. Also (in your full original post that I trimmed down, quoted above), you discuss a specific analysis strategy as well as some specific user interface. All of this is a very wide discussion, and the topic I am interested in is specifically the data handling, computation speed, and strategy backtesting (and scanning) capability part of it. That seems to be the main focus of this thread, and it is the part that interests me. So, on to that topic...

It seems you are saying that what you do is generate large quantities of data (the ~1,000 columns), then query the data to do scans and backtests. Certainly this is an approach, but I think not the only approach to the task. Let me share something that has always helped me when coding the rules of trading strategies. I tell myself: think bar by bar, as if the decisions were being made in real-time actual trading. Sometimes there are decisions to be made at multiple points in the bar: on the open, during the bar, after the close. Write the trading rules code to do the processing in sequential order, as if you were trading the strategy in real time. After all, the only goal of this is the ability to trade a strategy in real time. Any decision that actually can be made in real-time trading can, by definition, also be made at a point in time in a simulation of trading. Do this one day at a time, then move forward to the next day. I find that thinking in this way helps me to code complex strategies.

From this, another way to approach your task (as opposed to generating 1,000 columns of data and then querying the data) would be to start with the trading rules of the strategy or the criteria for a specific scan. Write code to process only the specific data needed for a specific backtest or scan (or write code to process lots of various data, with user interface options enabling processing of a subset of the data for a specific backtest, similar to your '"abcd", 5% gain 5 days after D-Day' options). Write code to move through the data forward from the first date to the last date as if you were actually trading. With this approach, much of your 1,000 columns of data could be dynamically created on the fly, processed, and discarded; then you move on to the next bar. This would greatly reduce the quantity of stored data. What I am describing is how most backtesting software does it (a sketch follows below). I suspect that no single strategy or scan would actually consume 1,000 items of information.
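To make the bar-by-bar idea concrete, here is a minimal sketch (illustrative only; the names and the moving-average crossover rule are placeholders, not any particular product's code). The point is that each bar's indicator values are computed, consumed, and then discarded rather than stored.

[code]
// Minimal sketch of bar-by-bar processing: indicators are computed on the fly,
// the trading decision is made, and the intermediate values are discarded.
// Names and the crossover rule are placeholders.
#include <cstddef>
#include <deque>
#include <vector>

struct Bar { double open, high, low, close; };

class MovingAverage {                       // rolling indicator, updated per bar
    std::deque<double> window_;
    double sum_ = 0.0;
    std::size_t length_;
public:
    explicit MovingAverage(std::size_t length) : length_(length) {}
    double update(double close) {
        window_.push_back(close);
        sum_ += close;
        if (window_.size() > length_) { sum_ -= window_.front(); window_.pop_front(); }
        return sum_ / window_.size();
    }
};

// Process the bars in the same order real-time trading would see them.
void backtest(const std::vector<Bar>& bars) {
    MovingAverage fast(10), slow(40);
    bool inPosition = false;
    for (const Bar& bar : bars) {           // strictly forward in time
        const double f = fast.update(bar.close);
        const double s = slow.update(bar.close);
        if (!inPosition && f > s)      inPosition = true;   // enter
        else if (inPosition && f < s)  inPosition = false;  // exit
        // nothing from this bar is stored; move on to the next bar
    }
}
[/code]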
I can't see how a single trading strategy could consume that much input data, and you seem to be saying the same thing in your post about '"abcd", 5% gain 5 days after D-Day'. I interpret this to mean that any single test or scan will focus on a subset of factors like the ones you are listing. So the alternative approach might be to dynamically calculate only the data needed for the specific backtest or scan, bar by bar, as I describe above (with some of the fundamental data you mention coming from disk). Move forward, calculate much of this input data on the fly, consume it, discard what you are finished with, and move on to the next bar. This is how I have approached this type of task, and it works very well and with very good performance. What you refer to as "sliding window calculations of 5,10,20,40,80,160,240,360 trading days" is, I believe, what I would term an indicator of these various lengths. It is no problem to calculate multiple indicator lengths on the fly, bar by bar (see the sketch after this post).

I am coming from a completely different direction on this whole topic versus all the other posters: all of this discussion about how to construct tables to get performance out of SQL; high-speed servers (Dell PowerEdge 2850, 2x 3 GHz XEON CPUs, 4 GB RAM, RAID 2 x 75 GB HDD); SIMD; NVIDIA CUDA GPU parallel computing with up to 240 processors; SSDs. This whole thread strikes me as a very detailed explanation, based on extensive experience, of why SQL is not the right tool for the job of processing portfolios of market data. I have encountered this before when talking to others. SQL seems like a great idea at first because much of the low-level nuts-and-bolts work is already done for you, but what you point out in detail is what I have heard before: performance is a huge problem. I have instead taken the approach of custom data handling, hand-coded from scratch in C++ specifically for time-series market data, and I get excellent performance on this type of processing, even with a consumer-level desktop or laptop computer with 1 GB to 3 GB of memory.

What is the flaw in what I say? I am proposing that this is a better approach. You have briefly mentioned the quantity of market data and various performance times with the various approaches you have tried, but I am unclear about the specifics. As a performance benchmark: for example, how many stocks do you run the strategy against? How many years is the time span of the backtest? How long does that backtest take, and on what kind of hardware is it running?

- Bob Bolotin
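And to show what I mean about multiple sliding-window lengths being no problem to compute on the fly, a small illustrative sketch (again placeholder code, not my actual implementation): all eight window lengths are maintained as tiny rolling states in one forward pass, so nothing needs to be precomputed and stored as extra columns.

[code]
// Illustrative sketch: sliding-window changes over 5,10,20,40,80,160,240,360
// trading days, all maintained in one forward pass as small rolling states.
#include <cstddef>
#include <deque>
#include <vector>

// Rolling N-day price change, updated one bar at a time.
class RollingChange {
    std::deque<double> window_;
    std::size_t length_;
public:
    explicit RollingChange(std::size_t length) : length_(length) {}
    // Returns close[t]/close[t-N] - 1 once N+1 closes have been seen, else 0.
    double update(double close) {
        window_.push_back(close);
        if (window_.size() > length_ + 1) window_.pop_front();
        return (window_.size() == length_ + 1) ? close / window_.front() - 1.0 : 0.0;
    }
};

// Walk the closes bar by bar; every window length is available at each bar.
void scanBarByBar(const std::vector<double>& closes) {
    const std::size_t lengths[] = {5, 10, 20, 40, 80, 160, 240, 360};
    std::vector<RollingChange> state;
    for (std::size_t n : lengths) state.emplace_back(n);
    for (double close : closes) {
        for (RollingChange& rc : state) {
            const double change = rc.update(close);
            (void)change;   // feed into the scan/backtest criteria here, then discard
        }
    }
}
[/code]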