Hi Bob, thanks for your response. I believe whatever trading strategy we create can be based only on the available data. The list above covers a lot of different angles and gives a good enough picture for creating a trading strategy with a 3-5 day holding period from buying to selling. If we base our decisions on roughly the same data, the trading decisions become a matter of interpretation. They depend on money management issues too. So potentially there can be as many trading strategies as there are traders. My idea is that everybody creates their own trading strategy. What is needed is an easy way to try and test different strategies. In the end it comes down to the input data you have. If you slice and dice the different independent sources of data thoroughly enough, you can later play with more variables. The problem with existing tools is that you cannot do this efficiently. While indeed off topic, this is important too: you have to be able to analyze, interpret and visualize all of this data.

I agree. Yes, this is one approach. If you pre-generate ~1,000 columns in advance - the permutations you will need later - it makes the next phase, the actual screening and back testing, much easier. There is no need to generate every possible permutation, just the ones that will make sense later. Performance-wise, having them pre-computed is the fastest way to screen and back test later. Why? Because you pre-generate one time, but screen and back test many times. If you generate your data dynamically every time you screen and back test, the performance is not the best. In the end it comes down to the need for a better database engine specific to these kinds of tasks.

This is very true. I would add that when I have all the data sliced and diced with our software, I can go back in time and inspect, bar by bar, every factor I pre-calculated in advance for every trading day as it occurred. Absolutely - this is a very simple and effective way to analyze the history, and that is how we work too.
What I needed too, after analyzing the factors day by day, is a way to find similar events using the same factors - e.g. a screener with the factors as parameters. What is more, I needed a screener working on all the available history. The screeners we analyzed work only on today's data, because the amount of data is too much to store for the entire available history. The problem is that I cannot know whether the screening criteria work if I cannot go back in history to see what happened next. This is different from back testing. For back testing you are testing a specific strategy after you have formulated it. I consider screening the phase where you are not yet committed to a specific strategy but are still researching in order to formulate a better one. With back testing you can then test it; the two have to help each other, and you need to be able to switch easily between screening and back testing.

I agree; the only difference is that our approach is data driven. I don't want to write code for each strategy I am testing. It is much more flexible and more productive if the end user just picks, for example with check boxes, which pre-calculated data he will use and formulates the relationships between those data. As for the ~1,000 columns: it is neither practical nor efficient to use all of them in a specific screen and back test, but they are plenty to pick from when testing one idea or another. The trick is to compute just enough permutations, not all of them. This is done one time but used many times, and only part of the data is used at any particular time. The point of the SQL discussion was that SQL prevents data from being organized and used effectively this way. I believe a screener and back tester working together on much more data, as I described, is the better idea.

I agree it is a fast and effective approach for small amounts of data. If you calculate data on the fly you eliminate the DB data handling needed with pre-generated data.
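The data-driven screening idea above - pre-compute factor columns once, then let the user combine any subset into a screen over the full history and see what happened next - can be sketched roughly like this. The factor names, thresholds, and synthetic prices are all invented for illustration; this is just the shape of the approach, not anyone's actual software.

```python
# Hypothetical sketch of the data-driven screener described above: factor
# columns are pre-computed once over the whole history, then any checked-off
# subset can be combined into a screen. All names and values are made up.

def precompute_factors(closes):
    """Pre-compute a few example factor columns for the whole history (done once)."""
    n = len(closes)
    factors = {
        "ret_5d": [None] * n,   # 5-day percent return
        "ma_20": [None] * n,    # 20-day simple moving average
    }
    for i in range(n):
        if i >= 5:
            factors["ret_5d"][i] = (closes[i] - closes[i - 5]) / closes[i - 5]
        if i >= 19:
            factors["ma_20"][i] = sum(closes[i - 19:i + 1]) / 20.0
    return factors

def screen(closes, factors, criteria, look_ahead=5):
    """Scan ALL history: find days matching every criterion and record what
    happened `look_ahead` days later (the 'see what happened next' part)."""
    hits = []
    for i in range(len(closes) - look_ahead):
        if all(factors[name][i] is not None and test(factors[name][i])
               for name, test in criteria.items()):
            fwd = (closes[i + look_ahead] - closes[i]) / closes[i]
            hits.append((i, fwd))
    return hits

# Usage: a screen built from two checked-off factor columns.
closes = [100 + (i % 7) - 0.1 * i for i in range(60)]   # synthetic prices
factors = precompute_factors(closes)                     # pre-generated once
result = screen(closes, factors,
                {"ret_5d": lambda v: v < 0,   # down over the last 5 days
                 "ma_20": lambda v: v > 90})  # but still above a level
```

The pre-computation runs once; the `screen` call is cheap and can be repeated with different factor combinations, which is the "generate one time, screen many times" argument from the post.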
On the other hand, it is slow. That is why we do it with parallel computing - SIMD on the CPU and CUDA on NVIDIA GPUs. The latest NVIDIA cards have a lot of memory, and we tested an approach holding all data in memory. The computations are an order of magnitude faster than on the CPU, so this is a feasible approach. The problem is that not all kinds of computations are appropriate for the GPU; some are better suited to vectorized SIMD computation on the CPU, though that is not as fast as NVIDIA CUDA. So we have a hybrid approach: pre-calculate part of the data using fast parallel SIMD CPU code, leave the most frequently used on-the-fly computations to NVIDIA CUDA, and that way we get the best of both worlds. Combining this with an effective hybrid column-based database covers efficient storage optimized for this kind of data too.

It is not a problem at small scale, but when you have more instruments and more history it is not fast enough, and you cannot build a screener that way. A screener built that way is basically a sequential scan, day by day and instrument by instrument, and it would be slow.
I believe it was a good idea to describe our experience, because this happens often and it takes a lot of time to understand what actually happens and why. It explains better why we decided to create a better solution, which we may eventually offer to others too. Basically, this is the reason we currently don't have a good enough understanding of what happens in the market - we cannot process the data fast enough to make an educated decision:

1) There is a data explosion. I believe the decimalization of stock market prices is part of the problem. The possible prices are now far more numerous than before - possibly 10 times more. That is a huge increase in data which just cannot be handled and processed fast enough.

2) The OS we mostly use is getting slower, not faster. MS keeps piling on features, and it gets more bloated every single month with patches, etc.

3) The microprocessor vendors cannot just increase the clock speed anymore - otherwise the computers would burst into flames. They are talking about water cooling, etc. to solve the power problem. A bright light comes from NVIDIA CUDA, and it will solve a lot of specific problems.

It is all related, and it means a new approach is needed to handle all of this. We wanted to be able to test at least the 500 S&P stocks. It turned out the extensive tests with all the factors we wanted could not finish in a reasonable time. We are in Orange County, CA. The data vendor sends us the processed data at 4:00 PM - they are very good (CSI) and have a reliable service. Downloading the data takes, say, 15 minutes; that is as fast as you can get it. So, give or take, our processing can begin at 4:30 PM. It has to finish in reasonable time in order to produce a list of possible instruments for the next day. We need a very fast screener and back tester too. If the processing and analysis take a long time, it simply is not possible to finish on time.
You cannot sit until 10:00 PM doing that every day. So it turns out the basic requirement is this: no matter what the processing and analysis are, they have to finish within 1 hour, 2 hours at most. It turned out that with the hardware mentioned - a high-speed server, Dell PowerEdge 2850, 2x XEON 3 GHz CPUs, 4 GB RAM, RAID 2 x 75 GB HDD, which is a reasonable investment - you cannot do much with the MS or even Sybase tools; they are too slow. If you calculate all of these ~1,000 columns just for 50 years of DJIA history, with MS tools it takes 1 hour; with Sybase, 5 minutes. If you pre-calculate and store the history up to last year's end, then calculate only the year-to-date data and merge it with the pre-calculated history, the merge takes only 5-10 minutes. And that is just for the DJIA. So we got rid of all those tools and created our own. We also decided on a hybrid approach: pre-calculate part of the data and use NVIDIA CUDA for the on-the-fly computations. For comparison, take just one operation - loading data into the store: the MS tools (MS SQL, .NET) show exponential dependency on the amount of data you LOAD into the DB; Sybase shows a relatively linear dependency and is 8-10 times faster than MS; with THStart (our software) it is linear and 10 times faster than Sybase, and you can figure out roughly how much faster that is relative to MS.
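The "pre-calculate history through last year-end, then merge only year-to-date" idea above can be sketched as follows. This is a minimal stand-in, not THStart's actual mechanism: the factor (a 5-day return), the JSON cache file, and the bar indexing are all invented for illustration.

```python
# Minimal sketch of the incremental update described above: history through
# last year-end is computed once and cached; each daily run recomputes only
# the year-to-date slice and merges it. Factor and file format are hypothetical.
import json, os, tempfile

LOOKBACK = 5  # bars of history needed to compute the factor for one bar

def factor(closes, i):
    """5-day percent return - a stand-in for one pre-computed column."""
    return (closes[i] - closes[i - LOOKBACK]) / closes[i - LOOKBACK]

def build_cache(closes, path):
    """One-time pre-calculation over the full history (slow, done once)."""
    col = {i: factor(closes, i) for i in range(LOOKBACK, len(closes))}
    with open(path, "w") as f:
        json.dump(col, f)

def merge_ytd(closes, path, year_start):
    """Daily run: load the cached history, recompute only the year-to-date
    bars, and merge - much cheaper than rebuilding the whole column."""
    with open(path) as f:
        col = {int(k): v for k, v in json.load(f).items()}
    for i in range(max(LOOKBACK, year_start), len(closes)):
        col[i] = factor(closes, i)
    return col

# One-time build on history through "last year-end", then a fast daily merge.
path = os.path.join(tempfile.mkdtemp(), "factor_cache.json")
closes = [100.0 + 0.1 * i for i in range(280)]
build_cache(closes[:250], path)        # full pass, done once
column = merge_ytd(closes, path, 250)  # daily pass over only 30 new bars
```

The daily pass touches only the new bars (plus none of the cached ones), which is why the post reports the merge taking minutes where the full rebuild took an hour.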
Both these arguments are, frankly, bollocks. Sorry.

To the first: I don't see a 10x data explosion from decimalization of stocks. Maybe one should get a grip on how to represent stock prices. NxCore (a high-end data feed) still does this in a nice way - every price is a 32-bit integer representing the price in TICKS. The symbol knows how to transpose this into a float value. That said, for most math it is good to keep the calculations in float or double, and that was the same before, too - math libraries were never designed to deal properly with "1/xths". So, sorry, decimalization has done nothing.

The OS you are using is not getting slower and slower. In fact, the OS is totally beside the point for anything CPU heavy. If you install it properly (Server Core for servers not using .NET - until R2 is out this month, after which it can handle .NET - or at least non-Aero on the server), the CPU and memory usage of the OS has, as a share of the machine, decreased tremendously. In fact, it is barely noticeable. My CPU allocation on an idle system normally stays below 1% for the OS, and the memory usage is a joke, too, compared to what I have in computers these days. The OS may have a lot of features that are nice, but note that most of them are either irrelevant for a duty server and/or not actually using a lot of resources when not in use. And yes, my bar is pretty high - I have nothing against the OS using 512 MB. After all, a heavy-duty workstation has between 8 GB and 16 GB of memory to start with. That said, I wish Server 2008 R2 would come out soon; then the .NET framework is installable on Server Core. Btw., "slow" is not an issue of the .NET framework, and HPC Server would incidentally not help at all (the comment made further up basically shows the poster has no clue what HPC is all about - totally clueless). Oh, and processors are getting more and more powerful.
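The tick-as-integer representation described above can be sketched like this. To be clear, this mirrors the idea, not NxCore's actual API: each price travels as a 32-bit integer count of ticks, and the instrument knows its tick size, so floats only appear at the display/math boundary.

```python
# Sketch of the integer-tick price representation described above: store each
# price as an integer number of ticks (fits easily in 32 bits), and let the
# symbol transpose it into a float. This mirrors the idea, not any feed's API.

class Symbol:
    def __init__(self, name, tick_size):
        self.name = name
        self.tick_size = tick_size  # e.g. 0.01 for a decimalized US stock

    def to_price(self, ticks):
        """Transpose the integer tick count into a float price for display/math."""
        return ticks * self.tick_size

    def to_ticks(self, price):
        """Store a quoted price as an exact integer number of ticks."""
        return round(price / self.tick_size)

ibm = Symbol("IBM", 0.01)
ticks = ibm.to_ticks(123.45)   # exact integer, no float representation issues
price = ibm.to_price(ticks)    # back to a float only when needed
```

The point of the argument: the storage footprint per price is a fixed 32 bits regardless of how fine the price grid is, so decimalization does not inflate the stored data.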
One just has to be smart enough (and that is not always simple) to program an algorithm / platform in a way that allows all those cores to be used. Not always feasible. But if I look at today's 6-core processors and compare them to the dual core I still use in my workstation... then... there is a LOT more power. Blaming crappy programming on "the OS" or "decimalization" is not objective, and neither is blaming faster and faster processors. Same with the whole SQL debate - it's pretty obvious that this was simply the wrong tool for the job. Bad news: using the wrong tool is never satisfactory. SQL is a generic data store, and generalization has costs.

CUDA - offloading processing to a higher-end graphics card (and they have interesting amounts of extremely high-speed memory these days) - may really be a solution, though. SIMD is a joke compared to the gains one can get from actually using the calculation units on a graphics card. I would still be doubtful about the memory (especially since graphics cards really shine on floats / doubles, not integers), but the fact is that IF the analysis can be offloaded to the graphics card... there are a TON of processing units there that can work in parallel. What I would see as possible negatives of CUDA in this scenario:

* Data amount. I am not sure how much data we are talking about, but... the memory in graphics cards is pretty limited.
* It does not solve the data storage problem. That said, that is a separate issue entirely (storage can happen in the background).

For most cases I would say CUDA is total overhead and wasted. I do day trading, and my trading platform really never spikes the CPU for calculations. But a screener checking many, many, many data series... that is really sensible. That may be a lot of parallel data operations happening. Definitely worth a try.
It has to do with compression. Ten times more possible numbers means less compression; the more similar the data, the better the compression. And 32 bits is still a lot.

Float and double in financial computation are not the best idea. Mine are very high too. It is not the best idea for high performance, but we have to live with it. I was referring to the fact that the processor clock cannot be increased anymore, so they are adding more cores instead. But that requires a better way to program, and today's compilers are not good enough at producing parallel programs that extract the potential of many cores. That was the original subject of the thread, and we tried to fit into that box. Also, it is obvious now, but it was not so obvious in the beginning.

It gives good results for us on some specific tasks. Some tasks are still better for SIMD, some better for NVIDIA CUDA; we try to use the best of both worlds. Float and integer calculations take the same amount of time on NVIDIA CUDA - please do your homework first. There are reasonably priced CUDA cards with enough memory. Try the FX 1800 - it has 64 processors and ~800 MB of memory, draws only 59 W, and fits in most computers without needing a bigger power supply. At the other end is the Tesla, with 4 GB and 240 processors. Be careful when choosing the computer to install the card in: make sure it has at least two 16x PCIe slots, and preferably is a recent model, because older PCIe interfaces actually run at half the possible I/O speed. Theoretically you can get 6-7 GB/s transfer between the card and the motherboard; in practice you get 4-5 GB/s. But you need software for general-purpose computing on CUDA, otherwise it will be just a GPU to you. You don't need this for the trading itself, but for screening and back testing it will help a lot.
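The compression point above can be illustrated with a toy experiment. Assumptions are labeled: zlib is just a stand-in codec, the random-walk price series is synthetic, and the 0.125/0.01 tick sizes stand in for pre- and post-decimalization grids. The same price path quantized to a coarser grid has fewer distinct values and more repeated samples, so it compresses smaller.

```python
# Toy illustration of the compression argument above: the same synthetic price
# path snapped to eighths (few distinct values) vs. pennies (many distinct
# values), each packed as 32-bit ints and compressed. zlib is a stand-in codec.
import struct, zlib, random

random.seed(42)
steps = [random.gauss(0, 0.05) for _ in range(10000)]  # one shared price path

def packed_ticks(tick):
    """Snap the walk to a grid of `tick` and pack the counts as 32-bit ints."""
    price, ticks = 100.0, []
    for s in steps:
        price += s
        ticks.append(round(price / tick))  # integer tick count
    return struct.pack(f"{len(ticks)}i", *ticks)

eighths = len(zlib.compress(packed_ticks(0.125)))  # coarse pre-decimal grid
pennies = len(zlib.compress(packed_ticks(0.01)))   # fine decimalized grid
print(eighths, pennies)
```

With the coarse grid most consecutive samples land on the same tick, giving the compressor long runs to exploit; the penny grid carries more entropy per sample, so the compressed output is larger - which is the "more possible numbers, less compression" claim in miniature.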
Not obvious at all... that's why I started the thread. I'm still not convinced that people are fully exploring what SQL has to offer. For instance, there are potential solutions to the "disk access" problem that don't require you to abandon SQL. I'm not saying SQL will ever be faster in a head-to-head race against a completely custom solution. But writing that custom solution has big costs too. My trading needs are not at the millisecond level, and I wouldn't recommend SQL for that job. If most people reading this are trying to scalp ticks or play market maker, then building something like thstart suggests would probably be worth the time. But for anything in the minute range or beyond, SQL can get it done, and quite fast, IF you are willing to experiment and get creative with how you're storing and accessing data. For my needs, the difference in speed wouldn't be very noticeable, and it doesn't justify my going down the C++/GPU road. YMMV. Also, I wasn't originally thinking of backtesting applications, since I don't do much backtesting myself (flames welcome)... but I'm still interested in reading about it.
That is right. This is true. By SQL I was referring mostly to the SQL data management implementation, not to SQL as a language. Screening and back testing are where my focus was in this entire discussion. A large amount of data is generated there, and it needs better performance. For moderate data sizes SQL is fine. If the amount of data fits in memory, the free SQLite is a better solution - it has an in-memory feature - but SQLite requires more low-level programming. Screening and back testing are essential for experimenting with and testing your trading strategy. The trading itself is relatively simple and boring: if you have discipline, you get the signals according to your trading strategy and just execute, or have somebody execute for you. It does not matter much who executes if you follow a procedure.
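For reference, the SQLite in-memory feature mentioned above looks like this via Python's standard `sqlite3` module. The table layout and the query are just an example, not anyone's actual schema.

```python
# Example of the SQLite in-memory feature mentioned above, using Python's
# built-in sqlite3 module. The bars table and the query are invented.
import sqlite3

conn = sqlite3.connect(":memory:")  # the whole database lives in RAM
conn.execute("CREATE TABLE bars (symbol TEXT, day INTEGER, close REAL)")
conn.executemany(
    "INSERT INTO bars VALUES (?, ?, ?)",
    [("IBM", d, 100 + d * 0.5) for d in range(10)] +
    [("MSFT", d, 30 - d * 0.1) for d in range(10)],
)

# A trivial aggregate over the stored history: price spread per symbol.
spread = dict(conn.execute(
    "SELECT symbol, MAX(close) - MIN(close) FROM bars GROUP BY symbol"
).fetchall())
```

The `:memory:` connection string is what makes this an in-memory database; everything else is ordinary SQL, which is the appeal (and, per the post, the extra programming burden falls on everything around it).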
I partially agree. Not to get too far off-topic, but my personal feeling is that after a certain point (which varies with every strategy), backtesting quickly devolves into curve-fitting. For most of my strategies, which I intentionally keep very simple, that point is reached pretty early and I haven't had the need for elaborate back tests that SQL can't handle. So I think we may be on the same page...just that our needs and strategies seem to be different.
Simpler is always better, and no curve fitting, of course. You said it: that point is reached pretty early because SQL engines cannot handle more of it. We didn't want to be limited by SQL or anything else.
thstart: This is an interesting discussion, because it seems you and I are in exactly the same line of work. We have both spent years working on software for large-scale strategy backtesting. Absolutely! There is great interest value and information value in this thread, at least for me. It is a topic I have not previously heard discussed in so much detail and with so much experience, and it is clearly important for anyone interested in large-scale strategy backtesting. What we are discussing is software design for large-scale backtesting (and scanning), and I imagine there are a lot of people on this board interested in that. This has bugged me for a long time. I have talked to people using SQL for this kind of processing and have been told that SQL is slow, but you have really dug into the topic and are discussing your findings in detail. As I said in my last post: "This whole thread strikes me as a very detailed explanation, based upon extensive experience, of why SQL is not the right tool for the job of processing portfolios of market data." You are saying that you agree with this - that (to use your term) Conventional Off the Shelf databases are the wrong tool. That conclusion led you to develop an alternative approach, which you are only partly disclosing because of a non-disclosure agreement, which is fine. However, we still have significantly different viewpoints about implementations of alternative approaches. You have approached the task in a completely different way than I did. The approach to backtesting that I was describing in my last post (with your responses quoted above) is not a theoretical discussion. I have personally developed backtesting software named PowerST based upon this approach; it has been in use by real-world customers for years, and the performance is excellent (as I will detail below). Your approach requires a database engine where mine does not.
Remember, that was the source of your problems. It was the performance of the database engine that drew you into three years of work on solving performance problems, not to mention that your approach still requires expensive hardware (a high-speed Dell PowerEdge 2850 server with 2x XEON 3 GHz CPUs, 4 GB RAM, RAID 2 x 75 GB HDD, SIMD, NVIDIA CUDA GPU parallel computing with up to 240 processors, SSDs). On the other hand, there are real-world users of the software I developed getting excellent performance on this kind of testing with low-cost mainstream computers. Here is a recent quote from a user: Also, there is another real-world PowerST user who runs the software against the entire universe of US stocks, past and present, every trading day. I mentioned to him a while back that I had some ideas for performance improvements, and he said not to bother doing it for him, because performance is not a concern of his. Originally, a number of years ago, he was running on a laptop with 1 GB of memory. More recently, for reliability and availability, they moved to a shared virtual Windows server in a data center, which I understand costs $100/month. He said that after moving to the server, his daily run of PowerST now takes... I forget the exact number, but it was in the range of 5 to 15 minutes. His point was that the data updates take much longer than running the backtesting software, and that the backtesting run is fast enough that any potential performance improvement isn't worth much effort. I need to break this into multiple posts due to post length limits. The next post continues...
Continuing my previous post: However, you talk about "sliding window calculations of 5,10,20,40,80,160,240,360 trading days", which I term indicator calculations of those various lengths. The users I refer to above are not running against the hundreds or 1,000 columns of input data that you talk about. I think it would be extremely unlikely for any one strategy being backtested, or any scan, to use that many input variables, and you have indicated agreement with this, saying your goal is to have 1,000 columns of input data available but that a strategy will use some subset. You gave an example of "cabd, 5% Gain 5 Days after D-Day" options, which I quoted in my last post. However, even though I think it is extreme, to stress the performance capability I put together a test strategy that does a large quantity of calculations. I took the moving average crossover system that you used as an example in a previous post and added additional moving average calculations of lengths 5 to 360, step 1, to be similar to your "sliding window calculations of 5,10,20,40,80,160,240,360 trading days". These moving averages are calculated each day. So that is 5 to 360 inclusive, plus the original two moving averages for the crossover system = (360-5)+1+2 = 358 simple moving average indicator calculations per day. In your terminology, this is 358 columns of sliding window calculations. I ran this on a 1,800-stock subset of Russell 5000 stocks, with a date range from November 1994 through December 2007 (don't ask me why! it is the data I happen to have available). So this is 13 years of EOD stock data. I ran it on my day-to-day Windows Vista 32-bit laptop, dual core, 3 GB memory, which I paid $479.99 for in November 2008 (my point being that this is a very low-end consumer-level laptop). Calculation time for 358 simple moving average sliding-window calculations per stock per day, for 1,800 stocks and 13 years of data: around 2 minutes. As I say, the performance is excellent.
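Part of why this benchmark is so fast is that a simple moving average of every length can be maintained as a running sum, costing O(1) work per length per bar instead of re-summing each window every day. A minimal sketch of that technique (this is the standard running-sum idea, not PowerST's actual code):

```python
# Sketch of how all the 5..360-day moving averages above can be computed
# cheaply: one running sum per window length, updated in O(1) per bar.
# This shows the standard technique, not PowerST's implementation.

def sliding_averages(closes, lengths):
    """For each bar, return {length: SMA} for every length that has enough
    history, maintaining one running sum per length."""
    sums = {n: 0.0 for n in lengths}
    out = []
    for i, c in enumerate(closes):
        row = {}
        for n in lengths:
            sums[n] += c
            if i >= n:
                sums[n] -= closes[i - n]   # value that just left the window
            if i >= n - 1:
                row[n] = sums[n] / n
        out.append(row)
    return out

# The bulk of the post's 358 columns: lengths 5 through 360 (356 averages).
lengths = list(range(5, 361))
rows = sliding_averages([float(x) for x in range(1, 400)], lengths)
```

At 356 lengths x ~3,300 bars x 1,800 stocks, that is on the order of two billion additions and subtractions, which a 2008-era dual-core laptop can indeed chew through in a couple of minutes.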
And this is on a cheap consumer-level laptop. I understand that a Core 2 Duo processor and/or 64-bit with a little more memory would improve this performance substantially (I believe that is the environment the above quote about "7000 markets from a 22,000 database of symbols takes about 3 minutes" came from). With this kind of performance there is no need for a database and all of the huge performance problems it has caused you. It is true, however, that a simple moving average is not a calculation-intensive task. What about something more calculation intensive? You mentioned the approach of pre-calculating to save time in later backtesting, scans, and trade signal generation. It isn't hard to pre-calculate calculation-intensive indicators and put the results into a flat file (or flat files). I understand that one PowerST user is doing this. I don't know exactly why, because their strategy is proprietary; I am not involved in the programming, and they don't give me details. My assumption is that they must have some kind of highly calculation-intensive indicators. I assume the point is that certain parts of their strategy are already fixed, so there is no need to recalculate those parts with each backtest. Instead, they decided to calculate those input variables once and save the results, then read the results from disk as input to subsequent backtests. This is similar to what you describe doing, but it can be done with flat files, which are then very fast to read back into subsequent backtests. To explain some more: PowerST has facilities to calculate indicators and log the data items to file(s). It would only be a few lines of code to do that. There are also facilities to help read the input data and make it part of subsequent backtests.
It would be a very small amount of code for the generation and later consumption of the pre-calculated data, because the backtesting software is already suited to this kind of task of processing time-series data. It doesn't need a database. Rather, you can run the same portfolio one time to pre-calculate into flat files, then backtest reading the pre-calculated input from the files. The trick is to order the data in the flat file in the order it will be consumed in the subsequent backtests, and the backtesting software has the capability to do that. Once the pre-calculated data is in a flat file, there is nothing faster than reading a flat file intermixed with the backtesting calculation processing. However, the customer I mention above is the only PowerST user I have heard of who has bothered to pre-calculate indicators. The reason is that unless there is some extreme amount of number crunching, calculating everything on the fly (the approach I described in my first post on this topic) is plenty fast enough. Again, going back to my benchmark test: 1,800 stocks, 13 years, 358 various-length indicator calculations per day for each stock, on a consumer-level laptop, takes only around 2 minutes. I believe I have demonstrated above that I am actually getting much better performance calculating on the fly than you are getting with your database approach. Besides, if necessary for performance reasons, it is easy enough to pre-calculate to flat files and then consume the pre-calculated input data in subsequent backtests.

- Bob Bolotin
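The flat-file round trip described above can be sketched as follows. The indicator, the CSV layout, and the (day, symbol) ordering are invented for illustration; the key point from the post is writing rows in the exact order the backtest will consume them, so later runs are one sequential read with no database and no seeking.

```python
# Sketch of the pre-calculate-to-flat-file idea above: run the portfolio once,
# writing one row per (day, symbol) in consumption order, then stream the rows
# straight back during later backtests. Indicator and layout are invented.
import csv, io

def precalc(history, out_file):
    """One-time pass: write indicator values in the order the backtest reads."""
    w = csv.writer(out_file)
    symbols = sorted(history)
    days = len(next(iter(history.values())))
    for day in range(days):                      # outer loop: time,
        for sym in symbols:                      # inner loop: symbols -
            closes = history[sym][:day + 1]      # matching the backtest's order
            expensive = sum(closes) / len(closes)  # stand-in "expensive" indicator
            w.writerow([day, sym, expensive])

def backtest_input(in_file):
    """Later runs: purely sequential read of the pre-calculated column."""
    for day, sym, value in csv.reader(in_file):
        yield int(day), sym, float(value)

# Round trip through an in-memory buffer standing in for a file on disk.
buf = io.StringIO()
history = {"IBM": [100.0, 101.0, 102.0], "MSFT": [30.0, 31.0, 32.0]}
precalc(history, buf)
buf.seek(0)
rows = list(backtest_input(buf))
```

Because the write order matches the read order, the consuming backtest never has to index or join anything - it just interleaves `next(...)` calls with its own calculations, which is the "nothing faster than reading a flat file" claim in the post.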