SQL for trading...?

Discussion in 'Trading Software' started by CloroxCowboy, Jul 1, 2009.

How do you feel about SQL?

  1. Good tool for trading.

    27 vote(s)
    47.4%
  2. Awful tool for trading.

    11 vote(s)
    19.3%
  3. Never tried it / no opinion.

    18 vote(s)
    31.6%
  4. Heard so many bad things I'd never want to try it.

    1 vote(s)
    1.8%
  1. Spinlocks need the right hardware support, such as an atomic "test and set" instruction. In general you need some assembly language code.

    And you are quite right that MS didn't invent them - like just about everything that comes out of MS. Test and set has been in processor instruction sets "forever" - probably 40 years or more - precisely for this kind of use.
     
    #61     Jul 5, 2009
  2. As this thread is already well off topic, some more info on SSDs:

    http://www.fusionio.com/PDFs/Fusion_Specsheet.pdf

    Incredible performance, and strictly for the well-heeled at around $30 per GB.

    It is a fair assumption that this cost will drop a lot over the next few years. For analysis of large amounts of time series data, you may not be in the race without hardware of this sort.
     
    #62     Jul 5, 2009
  3. PowerST

    thstart -

    I find this thread very interesting. I believe that I understand everything being discussed... except your very first post that discusses the definition of the task you are trying to solve.

    My confusion is over what you mean when you say that the algorithm generates ~1,000 columns. Is there a trading strategy whose output is 1,000 columns of data which you then process in some way? I find that unlikely. I suspect that instead you are saying that you need to generate a table with ~1,000 columns of time series data that is the input to the trading strategy. Is that correct?

    What are these ~1,000 columns? Is it simply EOD data for 1,000 stocks since 1928, used as input to a trading strategy for backtesting? You seem to be saying that in your later posts.

    Are you saying that the entire input data set you are processing is only 200MB? If so, that would seem to me to be very small, given that these days even consumer-level desktop and laptop computers have a minimum of 2 or 3 GB of memory.

    What I am getting at is that I am questioning whether SQL is the right tool for the task you want to accomplish. But before expressing possibly uneducated opinions, my question is: what is the task you want to accomplish? If it is only to backtest 1,000 stocks, I suspect you may be on the wrong track.

    So, my question is, stepping back, what is the task you are seeking to accomplish?
     
    #63     Jul 7, 2009
  4. thstart

    Thanks - it is a lot of effort to formulate the problem and solve it - it took me more than three years to figure out how to make a good screener and backtesting platform.

    Let me try to explain better.

    After spending a lot of time learning technical analysis and strategies, I found that the way the computations are usually done is not satisfactory for me.

    To make a long story short, take the overused moving average as an example: it is not calculated correctly, and after watching real-time charts for a long time I believe it is actually readjusted post factum. I didn't believe this at first, but I double-checked many times using the Schwab real-time tools - I mark a point where the moving average used to be, the chart scrolls, and after some time the already-calculated point moves up or down - it readjusts itself as more data come in. The moving average is just one example of a simple but widely followed indicator.

    Next, staying with the moving average example: a simple trading strategy is to take two moving averages - the faster one crossing the slower one from below means buy, the faster one crossing the slower one from above means sell. This is a very simplified strategy for the sake of illustration, and it is also very popular. I like the simplicity, but there were a few problems (see the sketch after this list):

    1) The way it is usually calculated would give wrong signals.

    2) It is very hard to write a robust, good screener and backtesting software the way it is usually described. By good I mean having at least some basic scientific principles involved. The way it is done today - checking whether the fast MA was below the slow MA one day ago and is above it today, etc., or similar methods of event recognition - is very approximate, hard to code, and does not give repeatable results. It has to be a data-driven approach.

    3) Just a slow and a fast MA are not enough factors to analyse. One needs a better periodicity indicator, more time windows, and also more independent factors, not only MAs.

    4) A very overlooked concept is decimal computation. I personally have not seen a product that performs screening and backtesting using decimal arithmetic - it is all calculated in floating point. When the output of one computation feeds into the next, and so on, floating point introduces an error that accumulates, and the results will be incorrect.
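
    To make points 2) and 4) concrete, here is a rough Python sketch of the crossover rule computed with decimal arithmetic. It is only an illustration with made-up window lengths, not our actual (patent pending) method:

        from decimal import Decimal, getcontext

        getcontext().prec = 28  # fixed decimal precision instead of binary floating point

        def sma(values, window):
            """Simple moving average over the last `window` values, in Decimal."""
            out = []
            for i in range(len(values)):
                if i + 1 < window:
                    out.append(None)  # not enough history yet
                else:
                    chunk = values[i + 1 - window : i + 1]
                    out.append(sum(chunk) / Decimal(window))
            return out

        def crossover_signals(closes, fast=5, slow=20):
            """Return (index, "buy"/"sell") wherever the fast SMA crosses the slow SMA."""
            closes = [Decimal(str(c)) for c in closes]  # str() avoids inheriting float error
            f, s = sma(closes, fast), sma(closes, slow)
            signals = []
            for i in range(1, len(closes)):
                if None in (f[i - 1], s[i - 1], f[i], s[i]):
                    continue
                if f[i - 1] <= s[i - 1] and f[i] > s[i]:
                    signals.append((i, "buy"))
                elif f[i - 1] >= s[i - 1] and f[i] < s[i]:
                    signals.append((i, "sell"))
            return signals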


    I would add that some of the methods described are patent pending.

    In order to backtest a trading strategy, I believe what is needed is as follows:

    1) A better way to analyse the stock data time series.

    2) A better method for classifying, querying, comparing, filtering, and visualizing temporal and spatial relationships in time series data.

    3) A way to display, control, and visualize information in a synchronized manner.

    4) A very fast, high-performance parallel computation, storage, and retrieval engine optimized for large-scale time series.

    I believe we have created an innovative way of doing the following:
    1) Analysis of time series based on:

    1.1) OHLC data - discovering periodicity relationships in one stock or across several stocks.

    1.1.1) Sliding window calculations over 5, 10, 20, 40, 80, 160, 240, and 360 trading days. These are based on a regression on each of O, H, L, C with a very fast proprietary algorithm (a generic rolling-regression sketch appears after this list). There are also 5 more calculations derived from these - their total and several combinations of them - plus a calculation of the difference between the price and these values for each trading day.

    1.1.2) Rate of change calculations on each of O, H, L, C and on the results of 1.1.1), with a very fast proprietary algorithm.

    1.2) Volume data - rate of change calculations.

    1.3) Options volume data - rate of change calculations for 3 groups: Market Makers, Firms, Retail.

    1.4) Insider trading info, with the aggregate total number of shares, rated and broken down by the type of each participant - Officer, Director, etc.

    1.5) Bank information - the ratio between deposits and loans, etc.

    1.6) Interest rates - rate of change.

    1.7) Economic indicators - rate of change and computations similar to 1.1).

    1.8) OHLC relationships within the same day and versus the day before - like (H-L)/O, (H-L)/O_1DayAgo and many others - Pivot Points, for example.

    1.1)-1.7) are relatively long-term; 1.3) and 1.8) are short-term indicators.

    I believe they give a good mix of data influencing the trading decision from different, independent points of view.

    It approaches a scientific way of analysis as closely as possible.
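
    For anyone trying to picture 1.1.1), here is a generic rolling least-squares sketch in plain numpy. This is just the textbook calculation over the window lengths listed above, not our proprietary algorithm:

        import numpy as np

        WINDOWS = [5, 10, 20, 40, 80, 160, 240, 360]  # trading-day windows from 1.1.1)

        def rolling_slope(series, window):
            """Least-squares slope of the last `window` points, for every day.

            Returns an array aligned with `series`; the first window-1 entries are NaN.
            """
            y = np.asarray(series, dtype=float)
            x = np.arange(window, dtype=float)
            x_centered = x - x.mean()
            denom = (x_centered ** 2).sum()
            out = np.full(len(y), np.nan)
            for i in range(window - 1, len(y)):
                chunk = y[i - window + 1 : i + 1]
                out[i] = (x_centered * (chunk - chunk.mean())).sum() / denom
            return out

        def regression_columns(close):
            """One column of rolling slopes per window length, as a dict of arrays."""
            return {"slope_%d" % w: rolling_slope(close, w) for w in WINDOWS}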

    2) Having all these data, which are basically EOD time series data points, they have to be sliced into factors in order to do a factor analysis. The factor analysis basically says that such and such factors influenced what happened. This is not meant in the sense of causality, but it is easier to represent it that way:

    When we have such and such factors at D-Day, 1 day before, ... 10 days before, this is what happened 1 day after, ... 10 days after D-Day.

    D-Day is the day of interest.

    In order to analyse what happened up to 10 days after D-Day, we added calculations of the gain for each of the 10 days after D-Day.
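
    Conceptually (the column names here are invented for this post, not our real schema), precomputing those forward gains amounts to:

        import numpy as np

        def forward_gains(close, horizon=10):
            """Percent gain from each D-Day's close to 1..horizon days after it.

            gain_k[i] = (close[i + k] / close[i] - 1) * 100; NaN where no future data exist.
            """
            close = np.asarray(close, dtype=float)
            n = len(close)
            cols = {}
            for k in range(1, horizon + 1):
                g = np.full(n, np.nan)
                g[: n - k] = (close[k:] / close[: n - k] - 1.0) * 100.0
                cols["gain_%dd_after" % k] = g
            return cols  # the ~10 extra columns stored alongside each D-Day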

    Up to now we have the following calculations for every single trading day, going back as far as historical data are available:
    1) ~10 columns for gains based on the closing price - i.e. together with each D-Day the 1-day-after ... 10-days-after gains are stored precalculated, so you can build the screener easily later.

    2) ~400 columns of data based on 1.1.1) and 1.1.2), for D-Day and the 10 days before it.

    3) ~400 columns of data based on 1.2)-1.8).

    The remaining ~200 columns cover point 2) - a way of classifying, querying, comparing, filtering, etc. the data for D-Day and the 10 days before it.

    Basically this is a very simple method for representing the order between the data points at D-Day - the day of interest.

    If you look at the attached screenshot you can see the letters abcdefghij - 10 letters representing the 5, 10, 20, 40, 80, 160, 240, 360 data points for the currently analysed D-Day and also for the 10 days before it.
    9876543210 represents the days, 0 being D-Day.

    The orange cross shows the intersection at the D-Day of interest. If you look at the two pictures you can see that these D-Days had similar characteristics - after them there was a significant gain. Now if you look at the letters at the right of each, you can see similar patterns, though not identical of course. The most prominent similarity is the brown and black c & a in the upper box. Then in the middle box, h, g & f are at the top.

    These letters representing the order are some of the factors I am talking about.
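
    Roughly speaking - the naming here is simplified for this post, not our actual implementation - encoding the order of the window values as letters for one day looks like this:

        LETTERS = "abcdefghij"  # one letter per series, in a fixed assignment

        def letter_order(values, letters=LETTERS):
            """Encode the rank order of one day's series values as a string.

            `values` holds one number per series (e.g. the 5-, 10-, ... 360-day
            calculations for a given D-Day); the result lists the letters from the
            largest value down to the smallest, e.g. "cabd...".
            """
            pairs = [(value, letters[i]) for i, value in enumerate(values)]
            pairs.sort(key=lambda p: p[0], reverse=True)
            return "".join(letter for _, letter in pairs)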


    Now the final idea is this: for each trading day of one instrument there are about ~1,000 columns of calculations. Not all of them are used directly; some are intermediate but needed further on. These computations have to finish in a reasonable time in order to have a trading decision ready for the next day, and that can only be done with high performance computing if you want to run the computations for more than one stock - say 5,000 stocks.

    We wanted to do high performance computing, but on the technology available today, not on supercomputers. I believe even a powerful workstation is still underutilized. For example, vectorized SIMD parallelization, introduced 10 years ago and available in every Intel or AMD processor today, is very underutilized and used mostly for graphics applications, but there is a way to use it for general computing with appropriate programming. We can also use an NVIDIA CUDA-enabled card, if available, for superior performance in general purpose computations - they are not expensive and have up to 240 processors today.
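
    As a trivial example of what I mean by using the hardware for general computations - numpy is only a stand-in here; the point is expressing the work as whole-array operations that compiled, SIMD-friendly loops (or hand-written SSE intrinsics, or a CUDA kernel) can process:

        import numpy as np

        def rate_of_change(close, lag=1):
            """Rate of change over `lag` days for a whole price history at once.

            One vectorized expression replaces an explicit Python loop over days,
            so the per-element work happens in a compiled loop rather than the interpreter.
            """
            close = np.asarray(close, dtype=float)
            out = np.full(len(close), np.nan)
            out[lag:] = (close[lag:] - close[:-lag]) / close[:-lag]
            return out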

    Now, after we have these data as factors, we can build a screener based on factor analysis. In the screener you select, for example, a specific letter order on the closing price, say "cabd", and a 5% gain 5 days after D-Day. A screener made this way is very fast because it is basically searching for letters ordered in a specific way - in our case "cabd". The result of the screening is a list of all instruments and the D-Days when this happened. You click on a selected D-Day and you can see it on the chart.
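
    In simplified Python (the real screener works on the precalculated columns described above; the field names here are invented):

        def screen(rows, prefix="cabd", min_gain=5.0, horizon=5):
            """Select (symbol, D-Day) pairs whose letter order starts with `prefix`
            and whose precalculated gain `horizon` days after D-Day is at least min_gain (%)."""
            hits = []
            for row in rows:  # each row is one instrument on one D-Day
                if (row["letter_order"].startswith(prefix)
                        and row["gain_%dd_after" % horizon] >= min_gain):
                    hits.append((row["symbol"], row["date"]))
            return hits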

    Now all these data have to be visualized easily - which is point 3), a way to display, control, and visualize information in a synchronized manner.

    If you look at the screenshot at the right, you can see the idea. You want to analyse one instrument at different time frames at once.

    You select the instrument and 4 windows are opened simultaneously with predetermined time frames, say window #:
    1) 1 Week period
    2) 1 Month period
    3) 1 Quarter period
    4) 1 Year Period

    You click on one of them with the left mouse button to select a D-Day of interest. The chart cursor is automatically positioned in the 3 remaining windows at the same D-Day, so you can see it better in relation to the corresponding time period.

    Then you click in another window with the left mouse button and select a new D-Day of interest; the remaining 3 windows are synchronized in the same way.

    In the same way you can analyse several instruments, each at a different time frame in a different window. When you click on a D-Day in one window, the cursor is automatically updated in all windows simultaneously. For this you again need high performance computing to be fast enough.
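
    Our implementation is not matplotlib, but just to sketch the synchronization logic in a few lines (a toy example, not our product):

        import matplotlib.pyplot as plt

        fig, axes = plt.subplots(2, 2)  # e.g. week / month / quarter / year views
        cursors = [ax.axvline(x=0, color="orange") for ax in axes.flat]

        def on_click(event):
            """Move the vertical cursor in every panel to the clicked D-Day."""
            if event.inaxes is None or event.xdata is None:
                return
            for cursor in cursors:
                cursor.set_xdata([event.xdata, event.xdata])
            fig.canvas.draw_idle()

        fig.canvas.mpl_connect("button_press_event", on_click)
        # plt.show() would then display the four synchronized panels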

    I believe you understand better now what we accomplished.
     
    #64     Jul 7, 2009
  5. thstart

    It is ~1,000 columns per stock. I created a benchmark to see how long it would take to generate these ~1,000 columns just for EOD DJIA data, because it has the longest history - 1928-2009.


    DJIA 1928-2009 (about 80 years) - the OHLCV input is 20,212 lines (EOD - End of Day records), a ~20KB input .CSV file. The generated file for half of this (about 40 years) is ~200MB. I processed only half of the input data because, for the purpose of the benchmark, the Microsoft tools took a very long time to compute the data (~1 hour with 2 CPUs vs. ~5 min for Sybase Anywhere with only 1 CPU allowed by the licensing), so for the entire period the generated data should be roughly ~500MB just for the 1928-2009 DJIA.

    I made these benchmarks to measure how long it would take to compute my data (~1,000 columns) for just 1 instrument, depending on the number of input records. That way I would know how long it would take if I had more instruments. The idea is for all these computations to finish in a reasonable time, so that the data are ready for the next trading day.


    I am adding the attachment now
     
    #65     Jul 7, 2009
  6. That's what I suspected. It's no wonder SQL Server wasn't performing well if you're trying to store hundreds of columns worth of intermediate results...why would you need to do that? Back to my point about good database design. How can you expect the database to function correctly when it's bogged down with intermediate and final results all jammed into one bloated table? Granted, I'm just going by your summary, but that doesn't sound like good table design.

    For example, why not store the bank information, options data, economic indicators, etc. in separate tables with sensible keys to link them? Split the data by category, IOW. That would perform several orders of magnitude faster than the 1,000-column monster table you're referring to. And it doesn't have to be a database with 1,000 tables of one column each - more like 20-50 tables with 20-50 columns each. That shouldn't be too much of a maintenance nightmare.
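
    Just to illustrate the kind of split I have in mind - the table and column names below are made up, since I obviously don't know your schema (SQLite is used only to keep the example self-contained):

        import sqlite3

        conn = sqlite3.connect(":memory:")  # stand-in for the real server
        conn.executescript("""
            -- one narrow table per category, keyed by (symbol, trade_date)
            CREATE TABLE price_daily (
                symbol     TEXT NOT NULL,
                trade_date TEXT NOT NULL,
                open REAL, high REAL, low REAL, close REAL, volume INTEGER,
                PRIMARY KEY (symbol, trade_date)
            );
            CREATE TABLE regression_factors (
                symbol     TEXT NOT NULL,
                trade_date TEXT NOT NULL,
                slope_5 REAL, slope_10 REAL, slope_20 REAL,  -- ... more windows
                PRIMARY KEY (symbol, trade_date)
            );
            CREATE TABLE forward_gains (
                symbol     TEXT NOT NULL,
                trade_date TEXT NOT NULL,
                gain_1d REAL, gain_5d REAL, gain_10d REAL,
                PRIMARY KEY (symbol, trade_date)
            );
        """)

        # A screen then joins only the tables (and reads only the columns) it needs:
        rows = conn.execute("""
            SELECT g.symbol, g.trade_date
            FROM forward_gains AS g
            JOIN regression_factors AS r
              ON r.symbol = g.symbol AND r.trade_date = g.trade_date
            WHERE g.gain_5d >= 5.0 AND r.slope_20 > 0
        """).fetchall()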

    Interesting thread!
     
    #66     Jul 7, 2009
  7. thstart

    This is not one 1,000-column monster table - MS SQL does not allow that many columns in one table; it allows only up to ~500.

    I was referring to a total of 1,000 columns of data, but not in one table - they are of course organized into many tables. I don't want to give a long description here of the internal structure, indexes, database optimizations, etc.
     
    #67     Jul 7, 2009
  8. I was just referring to what you said earlier, the reason you gave that row-oriented databases were too slow...

    You're saying two different things. I realize that Sql Server won't allow you to create a 1000 column table, but look at your response from 7/1 and explain to me why SQL won't do what you need IF the design were a better one. You said that disk access was the problem, and my point is that well-designed tables and queries can solve that issue without having to use column-oriented data storage.
     
    #68     Jul 7, 2009
  9. thstart

    I am referring to aggregation. If you have several tables which together have ~1,000 columns, and you want to access just one column from each table, you have to join all of the tables to extract that one column from each. In practice, extracting just one column from a row-oriented table means reading all the columns of that table, so with the join you would effectively touch all 1,000 columns.

    No matter how the relational database is organised, there is always a trade-off: grouping some columns into one table is good for one type of query but not for others. You can duplicate columns across tables to eliminate joins, but that is not a good idea either - it is not normalized.

    This is why Mr. Codd proposed OLAP.

    If the DB is column-oriented, you access only the columns you need.
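
    A toy illustration of the access pattern (pure Python, just to show the idea, nothing to do with any particular engine):

        # Row-oriented: each record is stored (and read) as a whole row,
        # so touching one field still drags in the other ~999 values.
        row_store = [
            {"date": "day 1", "close": 100.0, "slope_20": 1.3},   # ...imagine ~1,000 keys per row
            {"date": "day 2", "close": 101.5, "slope_20": -0.7},
        ]
        closes_from_rows = [row["close"] for row in row_store]    # every full row is read

        # Column-oriented: each column is stored contiguously on its own,
        # so a query touching "close" reads only the "close" column.
        column_store = {
            "date": ["day 1", "day 2"],
            "close": [100.0, 101.5],
            "slope_20": [1.3, -0.7],
            # ...imagine ~1,000 separately stored columns
        }
        closes_from_columns = column_store["close"]               # only this column is read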
     
    #69     Jul 7, 2009
  10. I guess we would need to get too far into the details of your design to answer this question... I do understand what you're saying about the joins, but I'm still not sure why you'd need to store intermediate results, for instance. My thought is that perhaps not all 1,000 columns need to be in the tables that are being joined? Or perhaps not all the tables need to be joined for each aggregation?

    Again, I understand if you don't want to go into specifics, it's just what came to mind reading your posts.
     
    #70     Jul 7, 2009