Next, the topic of daily trade signal generation. I am having trouble understanding the benchmark descriptions that I quote from above. However, one thing I pull out of it (from the part of your full text that I quote above) is that you seem to be saying your goal was to calculate daily trade signals against 500 stocks in 2 hours maximum, and Sybase was not able to accomplish that on a high speed server (Dell PowerEdge 2850, 2 x Xeon 3 GHz CPUs, 4 GB RAM, RAID 2 x 75 GB HDD). Furthermore, Microsoft SQL Server was 8 - 10 times slower than Sybase. So let me modify the test program from my previous post to match this benchmark as closely as I can.

I am still using my test strategy that calculates 358 "sliding window" indicator values per day, even though I contend that 358 input variables is far more than 99% of actual trading strategies would use. I trimmed the number of stocks down from 1,800 to 500. So the benchmark test is a moving average crossover system plus 356 extra moving average calculations each day, with lengths ranging from 5 to 360 inclusive. These 358 moving average calculations are done for each stock, each day. I am running this on 500 stocks and 13 years of EOD data. The test also applies a fixed fractional, portfolio-level money management strategy and calculates combined portfolio trading results for the portfolio of 500 stocks over the 13 years (or, more exactly, a trading period a little less than 12 years due to the 360-day ramp-up of the longest moving average).

On my $479.99 consumer-level Windows Vista laptop with 3 GB of memory, the above calculation finished in 43 seconds - less than 1 minute. There is no database involved. Rather, I am calculating on the fly using the PowerST software, which is a written-from-scratch C++ application. Take 43 seconds / 60 seconds = 0.7167 minutes. Take 2 hours = 120 minutes. Take 120 minutes / 0.7167 minutes = 167.43. It works out that PowerST is 167.43 times faster than your Sybase benchmark. Then you said that Sybase is 8 - 10 times faster than MS SQL Server; call it 9 times faster. That makes PowerST roughly 1,507 times faster than MS SQL Server. I am as surprised by these comparison numbers as I imagine you are.

Also, the above are conservative numbers, heavily biased against PowerST, because there are multiple worst-case assumptions in my comparison test. You are saying Sybase couldn't finish in 2 hours, whereas these calculations assume it did finish in 2 hours. You were running on a high speed server and I am running on a cheap consumer-level laptop. This test processes 13 years of EOD data, whereas for daily trade signal generation 1 to 3 years of data is usually enough. Still, I am 167 times faster than the fastest commercial off-the-shelf (COTS) database (to use your term) that you found in your testing.

This seems hard to believe, but given my interpretation of the performance you are describing, and my best approximation of the test you describe, these are the numbers. Certainly you can't still say that the software design approach I described in my first post on this topic would be too slow. The reality seems to be that it is more than 167 times faster than the fastest commercial SQL database you were able to find. What am I missing? Can you see flaws in my reasoning or tests?

Well, to do some analysis: a simple moving average is not very calculation intensive. More calculation-intensive input variables would absorb time, but that is only CPU grinding.
My test above does all of the data handling, including combined portfolio processing, so more calculation-intensive applications would only add whatever time is actually spent on the extra CPU work. Besides, calculation-intensive input variables can be pre-calculated, and my viewpoint is that many (probably most) stock trading applications don't need massive calculation of hundreds of input variables. On balance, this would seem to conclusively demonstrate the speed potential of custom C++ code versus SQL. In fact, that would seem to be an understatement. - Bob Bolotin
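P.S. For anyone who wants to picture the kind of pass I am describing, here is a minimal sketch (an illustration only, with made-up names - this is not the actual PowerST code): one walk over each stock's closing prices, maintaining a running sum for every window length, so each of the 356 extra moving averages costs only an add, a subtract, and a divide per bar.

```cpp
#include <cstddef>
#include <vector>

// Illustration only (not PowerST code): one pass over a stock's closing
// prices, maintaining a running sum for every window length from 5 to 360
// so each moving average costs O(1) work per bar.
void run_sliding_averages(const std::vector<double>& closes)
{
    const int kMinLen = 5, kMaxLen = 360;
    const int kCount  = kMaxLen - kMinLen + 1;       // 356 averages
    std::vector<double> sums(kCount, 0.0);
    std::vector<double> avgs(kCount, 0.0);

    for (std::size_t i = 0; i < closes.size(); ++i) {
        for (int len = kMinLen; len <= kMaxLen; ++len) {
            double& sum = sums[len - kMinLen];
            sum += closes[i];                         // add newest close
            if (i >= static_cast<std::size_t>(len))
                sum -= closes[i - len];               // drop oldest close
            if (i + 1 >= static_cast<std::size_t>(len))
                avgs[len - kMinLen] = sum / len;      // window is full
        }
        // ...here the trading rules would look at avgs[] for this bar,
        // e.g. test a crossover and hand any signal to the portfolio
        // level money management code.
    }
}
```

The crossover rules and the fixed fractional money management sit inside the same per-bar loop, so no intermediate results ever need to be written to or read from a database.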
In 2008 I worked on a project very similar to what you describe. I could type another post describing how I went about it... but enough typing for now. However, let me mention this before you say I am only talking about backtesting, not scanning: yes, my posts above discuss only backtesting and trade signal generation, but the software design approach I described in my first post on this topic can also handle scanning as you describe above. With some encouragement I may continue this discussion into that topic on another day... - Bob Bolotin
Hi, I would say too that this is a very interesting discussion. With this quick note I would like to clarify some misunderstandings which affect the whole discussion; this should explain my position better and save you time. The topic of this discussion was SQL, and it attracted my attention because of my previous experience with it. SQL was just an illustration; we are in no way using it in our development. Actually, we did internal research showing how much better our approach is compared to the traditional one. Some of the results I described in my posts. The results are as astonishing as you find them, and we needed them as proof of how much faster our product is. This was required by some prospective parties, and that is why we needed to create equivalent code with the same functionality on each of these platforms, including ours, in order to compare them scientifically.

About COTS solutions - typically MS SQL and Sybase are used on powerful servers. If they are used on conventional computers the way we wanted, they simply cannot be used productively because they are too slow. Without that hardware, the above-mentioned benchmarks on COTS solutions would take much more time to finish. These benchmarks were intentionally run on that hardware because this is the way those products were designed to be used, and otherwise some professionals could argue that the results we got were because we didn't use appropriate hardware for the task.

Our solution needs only commodity computers. It is designed in such a way that if more powerful hardware is available, it uses it. My idea is that contemporary computers are underutilized, and if the software adapts to the hardware, one can extract maximum performance. Maximum performance is what is needed for large scale screening and backtesting. Many more calculations can be performed if the performance approaches the theoretical maximum. That is why I mentioned SIMD and NVIDIA CUDA programming. If a C++ solution (I would say that pure C is much faster, and inline assembly the fastest) finishes relatively simple calculations in 15 minutes, a faster solution can do much more and more powerful calculations in that time - or make those simple calculations finish in under 1 minute. With that response time it is much easier to explore different strategies.

Now, if we move to processing intra-day data, the data are > 500 times more; if you add options, the data are 1,000 times more; economic data and banking info add more and more. If you want to analyze more factors you need much more processing power, and better performance only helps.

I also mentioned screening. For screening you need an easy way to find your market events based on the factors you want to test; backtesting comes later, when you actually test your strategy. With the screeners I see today you can screen only for today - you cannot go back in history. Screening is more computing intensive and needs more power. The bottom line is: it does not hurt to extract the maximum theoretical performance.
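To make the SIMD point concrete, here is a minimal SSE2 sketch (an illustration only, not our engine's code, and the function name is made up): a single instruction operates on four floats at once, which is exactly the kind of underutilized capability of contemporary computers I mean.

```cpp
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>

// Illustration only: add two arrays of prices element-wise, four floats
// per instruction with SSE2, falling back to scalar code for the tail.
void add_prices(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);                  // load 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));       // 4 adds at once
    }
    for (; i < n; ++i)                                    // scalar tail
        out[i] = a[i] + b[i];
}
```

CUDA takes the same idea much further by running thousands of such lanes in parallel on the GPU, which is why it matters once the data grow to intra-day and options scale.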
...for single user PCs there's a fast (free) alternative: SQLite. I have played a little with SQLite and used it during a multi-step analysis of price data sets in the form of a so-called "in-memory" or RAM database. You can try my Zen Analyzer v1.0 freeware (implemented in PowerBasic - which is fast like C or C++ - plus SQLite) with your own data sets, and you will see it's lightning fast even on a usual home PC: http://www.zentrader.de/html/support1.html bye, Volker
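P.S. A minimal sketch of the in-memory usage via the standard sqlite3 C API (an illustration only, not the Zen Analyzer code; the table and values are made up): the special filename ":memory:" keeps the whole database in RAM.

```cpp
#include <cstdio>
#include <sqlite3.h>

// Minimal illustration of SQLite as an in-memory ("RAM") database.
int main()
{
    sqlite3* db = nullptr;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) {
        std::fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        sqlite3_close(db);
        return 1;
    }

    // Hypothetical EOD table just to show the mechanics.
    const char* sql =
        "CREATE TABLE eod (symbol TEXT, day TEXT, close REAL);"
        "INSERT INTO eod VALUES ('XYZ', '2009-01-02', 100.0);";

    char* err = nullptr;
    if (sqlite3_exec(db, sql, nullptr, nullptr, &err) != SQLITE_OK) {
        std::fprintf(stderr, "exec failed: %s\n", err);
        sqlite3_free(err);
    }

    sqlite3_close(db);
    return 0;
}
```

With a price history loaded this way, multi-step analysis runs entirely against RAM; the obvious trade-off is that the data set has to fit into memory.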
The problem with this is that the quantity of data thstart and I are discussing may not fit into memory. We are discussing larger scale testing that can easily exceed physical memory. thstart already discussed SQLite in previous posts to this thread. He said:
Ok, let me talk about screening. I previewed in my last series of posts that I had not yet discussed that topic. You said this previously: In 2008 I worked on a project very similar to what you describe. I was working with a trader who had developed strategies using a one-day-at-a-time screener such as you describe. However, before he started working with me he couldn't find a way to backtest the strategies. He wanted to backtest to see how the various strategies performed over a longer time period, which strategies performed better or worse than the others, to backtest complete strategies against a large portfolio of stocks, and to fine tune. Without software (i.e. with a one-day-at-a-time scanner) this is an impossible task, as you seem to agree with in what you say quoted above.

I programmed this using the same on-the-fly, first-bar-to-last-bar, non-database approach I have been talking about in all of my posts. Essentially it does a scan each bar and logs the results to a report. Due to the vast quantity of information this generates, I created options to enable and disable levels of detail and to specify the date range to log. The result is a report showing scan results each day. It is essentially the same as running a one-day-at-a-time scanner each day, but it is run automatically, from a single menu item click, for every day in the portfolio of data files.

However, these scan results were only the first step. The next goal was to take the results of the scans and formulate them into a complete trading strategy, including trade entries and exits, and money management. The goal was to backtest that complete strategy applied to a large universe of stocks and ETFs. It was a smooth transition from scans to backtest, which you say quoted above is one of your goals. In fact, the same code is used to scan and examine scan results, and then backtest. I believe that this is extremely similar to what you describe above, and it was done using the software design that I keep talking about.

But you keep saying this can't be done with good performance using my approach. Well, on this scanning project I did the programming myself, so I can actually run the software. Similar to the benchmarks in my previous posts, I am again running this on my $479.99 consumer-level Windows Vista laptop with 3 GB of memory. This time it is 1,851 stocks and 14 years of data. It runs in under 3 minutes. The performance is excellent. You said "Screening is more computing intensive and needs more power". I just ran a screening application against 14 years of data for 1,851 stocks in 3 minutes. It seems that there is no performance problem with my approach.

To continue the quote from your discussion about screening... In this scanning project it isn't necessary to write code for each strategy. Rather, it is very similar to the interface that you describe, with setup options. You do, however, need to write code to process the setup options. But that can't be much different from what you are doing. It is very much like what you say about check boxes to specify parameters to the scan.

I wish I could post a setup screen from the scanning project that I am talking about, but it is proprietary. However, here is a snapshot from a different PowerST software user, for a completely different application, that I do have permission to make public. It is not the same application, but it gives a flavor for what I am saying: This is something that a PowerST user did on his own.
He sent me this snapshot after it was finished to show me what he had done. It is a completely different application, but you can see that he created a quantity of options so that he can select components, in this case to construct variations of a trading system. The proprietary scanning project I worked on in 2008 is similar in that it has a setup form with options to enable or disable various scan criteria (as well as entries, exits, and money management), very similar to the user interface that you describe in your posts.

Note that I say above that a user built this setup form (and the underlying processing) on his own. The PowerST software is not limited to any supplied set of options. This kind of thing is completely user programmable. The person who created the snapshot was interested in the trading system components project shown in the screen snapshot. On the other hand, the screening project I talk about in this post was interested in mixing and matching screening criteria, including backtesting over large portfolios of stocks and ETFs.

In other words, my point is that if the capability being sought can be done within a general purpose backtesting software like PowerST, that is an advantage versus more specialized software: one tool for a wider scope of research. The PowerST software is a general and flexible tool that can be adapted to new and different strategy testing needs or input data. For example, the scanning application that I discuss in this post was not an originally anticipated goal for the software. Rather, when I encountered a trader who needed this capability, the software was able to adapt to accomplish this scanning style of research and testing, and that was done at the end-user programming level.

In conclusion, contrary to what you say, I believe that I have accomplished the kind of screening that you are talking about using my non-database-oriented software design, and with very good performance. - Bob Bolotin
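P.S. To give a flavor of the "scan each bar, then reuse the same code to backtest" idea, here is a hypothetical sketch (my own illustration for this post - not the proprietary 2008 project's code and not the actual PowerST programming interface). The criteria are switched on and off by setup options, and the same per-bar test either logs a scan hit or hands a signal to the trading side.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical setup options, much like check boxes on a screener form.
struct ScanOptions {
    bool use_new_high = true;    // close is a 250-bar high
    bool use_volume   = true;    // volume above its 50-bar average
    bool log_results  = true;    // write scan hits to the report
    bool take_trades  = false;   // false = pure scan, true = backtest
};

struct Bar { double close; double volume; };

// One stock, one pass, first bar to last bar.  The same criteria test
// drives either a scan report or trade entries, which is what makes the
// transition from scanning to backtesting a smooth one.
void scan_stock(const char* symbol, const std::vector<Bar>& bars,
                const ScanOptions& opt)
{
    for (std::size_t i = 250; i < bars.size(); ++i) {
        bool hit = true;
        if (opt.use_new_high) {
            for (std::size_t j = i - 250; j < i; ++j)
                if (bars[j].close >= bars[i].close) { hit = false; break; }
        }
        if (hit && opt.use_volume) {
            double sum = 0.0;
            for (std::size_t j = i - 50; j < i; ++j) sum += bars[j].volume;
            if (bars[i].volume <= sum / 50.0) hit = false;
        }
        if (!hit) continue;
        if (opt.log_results)
            std::printf("%s: scan hit at bar %zu\n", symbol, i);
        if (opt.take_trades) {
            // ...hand the signal to entry/exit and money management code
        }
    }
}
```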
This is exactly the same thing that attracted my attention to this thread. This thread has discussed the problems with general purpose commercial SQL databases, and we are now discussing alternative software design approaches. But you keep telling me that my non-database design approach won't do the job - that it will be too slow: Real world experience says that my non-database approach does work, and with excellent performance. So I have been defending my design approach.

I don't think I said anything about the software that you are developing in my previous posts. I did a comparison of what can be done with commercial SQL databases versus custom, written-from-scratch C++ code, based upon my interpretation of the Sybase / MS SQL benchmarks that you mentioned, compared against specific benchmark tests with my "from scratch" C++ application. I am still very unclear about the performance of your proprietary design. You said: But I am still unclear what your benchmark test is, on what kind of computer you are running, and whether this 10 times speedup is with the extra GPU hardware you have been discussing. My test found PowerST to be not 10 times faster than Sybase, but 167 times faster than the Sybase numbers you mentioned. And that was with multiple worst-case assumptions in my comparison test, including that the Sybase benchmark was running on a high end server and my test was on a low end consumer-level laptop. In a direct comparison on the same computer, PowerST would probably be much more than 167 times faster.

But this is getting off topic. The point of my posts is to talk about an alternative, non-database-oriented software design approach, and to show that this software design approach can have excellent performance, which I think I have demonstrated despite you saying it cannot be done. Again you say "relatively simple calculations", possibly in reference to my benchmark test? Yes, the calculations were simple, but I ran them against a large quantity of data, with 358 moving average calculations per day for each stock. The benchmark is a large quantity of simple calculation, designed to show large scale processing. There is no reason it would not extend to "more powerful calculations". Also, I don't see any reason why my software design would not scale to the larger scale testing you describe, but on the other hand it seems that we are both at this time targeting large scale EOD backtesting and scanning (i.e. the capability to handle large numbers of stocks and large quantities of input data). - Bob Bolotin
It seems to me you do custom software projects. With your source code and your expertise you surely can do that and do it well. Also, if your customers have access to your source code they surely can adapt it to their needs if they know C++ well enough. I am talking about a screener, not a scanner. But let's not go deeper into so many details. SQL database engines are not appropriate for these tasks because of their inefficient data store; the SQL language itself is not necessarily bad as a query language. The problem is that you keep talking about your non-database approach - this is not so important. I know you don't use a COTS database and you know I don't use a COTS database, so there is no reason to clarify this anymore. I believe that, at least for good data organization, an efficient data engine is still needed, no matter what we call it - column store, row store or hybrid, whatever. In my previous posts I explained the differences, thinking this would be of interest to the public. From the responses of several forum members I see they began researching column stores, at least to expand their horizons. If you use just a flat file - good luck with that approach.

3 minutes is fine, but I suppose it depends on the amount of calculation, and if you include options and intra-day data it would definitely be more than 3 minutes. So what will you do in that case? Also, it is clear that if you create a custom program for each specific case it will be fast and undisputed if written in C++. What I am talking about is an engine as fast as custom C++ programming, minus the time needed to write a program for every case and to make customizations for every user - that is not efficient. This engine can have an optimized data store, etc., but the main point is that it is parameterizable and does not need custom programming. It adapts to the data and the target computer automatically and extracts maximum performance without custom programming. Also, we don't do this only for the stock market; it can be applied to other large scale projects.

It looks fine. The user is obviously a good programmer, knowing a lot about C++ and your code.
Cannot be done with COTS databases. This is still the topic of this discussion. I don't question your abilities.
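For readers following the column store versus row store point mentioned above, here is a minimal C++ sketch of the layout difference (an illustration only, not our actual engine's data store): a row store keeps each record's fields together, while a column store keeps each field in its own contiguous array, which is what makes scanning one field across millions of rows so much faster and compresses better.

```cpp
#include <string>
#include <vector>

// Row store: each record keeps all of its fields together, which favors
// fetching whole records one at a time.
struct RowBar {
    std::string symbol;
    double open, high, low, close;
    long long volume;
};
typedef std::vector<RowBar> RowStore;

// Column store: each field lives in its own contiguous array, which favors
// scanning a single field (say, close) across millions of rows, and similar
// values sitting next to each other compress well.
struct ColumnStore {
    std::vector<std::string> symbol;
    std::vector<double> open, high, low, close;
    std::vector<long long> volume;
};
```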
Actually, that is not correct. My primary business is selling a backtesting product named PowerST, which is an end-user programmable product. Some, but not all, PowerST customers may ask me to help with their strategy programming. In one of my previous posts I said "I understand that one PowerST user is doing this. I don't know exactly why because their strategy is proprietary and I am not involved in the programming, and they don't give me details.". For that customer, I don't even know their strategy because I am not at all involved in the programming myself.

Also, although source code is available, it is not expected that PowerST customers will dig into the source code. The source code is more like an insurance policy. I think this will help to explain: I would never expect a customer to purchase the source code until they have been a user of the non-source version for a year or two. Eventually, when they are heavily using and dependent on PowerST, and if they are managing tens or hundreds of millions dependent on the software, they might want the source code purely as an insurance policy so that they have control. But the source code doesn't give any additional strategy testing capability, or really even any additional flexibility. It is not a source-code-oriented product or a custom program. It is a fully documented end-user environment.

PowerST is not a custom program for each specific case. The end user only specifies the rules of the trading strategy. For most trading strategies it is only a small amount of code, say 100 lines, although it can progress beyond that for specialized needs. I have had strategies of as much as a few thousand lines of code, but that is very unusual. For the screen snapshot strategy from my last post I can only guess, since I have never seen the code, but I would expect it probably progressed into a couple thousand lines of code to implement all of those options. But that is an extreme case, with that very large set of options to choose from. I was trying to show that it can be done with options, but more typically a user writes a smaller amount of specific code for a specific strategy. So it is nowhere near a custom programming project for each project. An analogy might be SQL and the underlying database: the end user writes SQL statements and doesn't need to write the database engine code. In the same way, PowerST is a backtesting engine and the end user writes trading rules code.

You keep saying this about the data engine, and I keep responding by describing an alternative approach, and you go back and say the same thing again. We are starting to go around in circles. Or now you say it is not important. I believe the thread was 1) commercial SQL is not the right tool for the task, then 2) discussion about alternative approaches. I have been presenting an alternative approach, which is very much on topic and indeed important to anyone with an interest in software design for large scale testing. On the other hand, we have at this point both discussed our approaches in detail, so I agree that the topic is exhausted (which I think you are also saying in your last post) and we should quit the discussion.

Something that may help to clarify is that PowerST does have an efficient data engine built into the software. It is a hand-coded, from-scratch C++ engine specifically designed for working with time series data. But that is under the surface.
On the surface is a programming environment for specifying trading rules, and under the surface the software does the data handling in a fast and efficient way. But the end user is barely even aware of how the underlying data handling is done.

I think the core difference between our approaches is that you have two steps: you generate data, then query the data. Your second step makes your data engine highly visible. I do everything on the fly. On the fly, the software feeds the data needed for a specific strategy into the data engine. The data engine processes it and the results are displayed. But the data engine is hidden from view. What is visible is a programming environment for specifying the rules of trading strategies. In other words, I process on the fly. You break it into two steps of generating 1,000 columns of data, then querying a specified subset of that data for any one strategy. It's just two different approaches to the task.

Thanks, but I don't need luck! It is already all working, as I keep saying, and has been for years. The people who wrote this paper: http://www.trendfollowing.com/whitepaper/Does_trendfollowing_work_on_stocks.pdf have been running a hedge fund on PowerST since 2005. Besides, to characterize PowerST as a "flat file approach" is not doing the software justice. PowerST is a whole strategy testing environment for working with time series data, including a rich end-user programming environment. Which is another way to say that it is a backtesting program - but a more technical and extendable backtesting program, based upon C++.

Anyway, enough of this discussion. We are going in circles. In one of your earlier posts you said: This hit close to home because I am the developer of a fast screener and backtesting platform. So I decided to post and start asking questions about what you were wanting to accomplish. So far I have not heard anything discussed that cannot be handled very well by PowerST, which is a commercial product. Then, to convince you of that, I went into explaining my design, and we got off onto the subject of software design for large scale strategy testing. Interesting topic, but I think we both agree that we have exhausted the discussion. I am willing to leave it at that: there can be multiple valid approaches to a software design. - Bob Bolotin
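P.S. As a closing illustration of the "engine under the surface, rules on the surface" split, here is a hypothetical sketch (illustration only; this is not the actual PowerST programming interface, and every name here is made up): the engine owns the data handling and calls the user's rules once per bar, on the fly.

```cpp
#include <cstddef>
#include <vector>

struct Bar { double open, high, low, close; };

// The user's side: the rules, and nothing else.
class StrategyRules {
public:
    virtual ~StrategyRules() {}
    // Called once per bar; the rules decide the desired position.
    virtual int onBar(const std::vector<Bar>& history) = 0;
};

// A user's entire "program" can be this small: a moving average crossover.
class Crossover : public StrategyRules {
public:
    int onBar(const std::vector<Bar>& history) override {
        if (history.size() < 50) return 0;
        double fast = 0.0, slow = 0.0;
        for (std::size_t i = history.size() - 10; i < history.size(); ++i)
            fast += history[i].close;
        for (std::size_t i = history.size() - 50; i < history.size(); ++i)
            slow += history[i].close;
        return (fast / 10.0 > slow / 50.0) ? 1 : -1;   // long or short
    }
};

// The engine's side (greatly simplified): it walks the bars in order and
// feeds them to the rules, so the data handling stays hidden from the user.
std::vector<int> run_engine(const std::vector<Bar>& bars, StrategyRules& rules)
{
    std::vector<Bar> history;
    std::vector<int> positions;
    for (std::size_t i = 0; i < bars.size(); ++i) {
        history.push_back(bars[i]);
        positions.push_back(rules.onBar(history));
    }
    return positions;
}
```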