SQL for trading...?

Discussion in 'Trading Software' started by CloroxCowboy, Jul 1, 2009.

How do you feel about SQL?

  1. Good tool for trading.

    27 vote(s)
    47.4%
  2. Awful tool for trading.

    11 vote(s)
    19.3%
  3. Never tried it / no opinion.

    18 vote(s)
    31.6%
  4. Heard so many bad things I'd never want to try it.

    1 vote(s)
    1.8%
  1. Ok, I see what you mean by a sliding window now. When I do a moving calculation like that, I usually get good results with something like the query below. This moving average example is taken from chapter 8 of the "Transact-SQL Cookbook":

    SELECT x.Id, AVG(y.Sales) moving_average
    FROM BulbSales x, BulbSales y
    WHERE x.Id>=6 AND x.Id BETWEEN y.Id AND y.Id+5
    GROUP BY x.Id
    ORDER BY x.Id


    The key idea is to inner join your data table to itself, then use the WHERE clause to define the starting point and width of the window. As you mentioned, I like having the freedom to vary the width, so in this example I'd replace the numbers 6 and 5 with parameters and package it as a SQL Server function, as sketched below.
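    A minimal sketch of that function, assuming SQL Server 2005+ and the same BulbSales(Id, Sales) layout (dbo.MovingAvg, @StartId and @Offset are just illustrative names):

    -- Illustrative inline table-valued function wrapping the self-join above;
    -- @StartId replaces the literal 6 and @Offset replaces the literal 5.
    CREATE FUNCTION dbo.MovingAvg (@StartId INT, @Offset INT)
    RETURNS TABLE
    AS
    RETURN
    (
        SELECT x.Id, AVG(y.Sales) AS moving_average
        FROM BulbSales AS x
             INNER JOIN BulbSales AS y
               ON x.Id BETWEEN y.Id AND y.Id + @Offset
        WHERE x.Id >= @StartId
        GROUP BY x.Id
    );
    GO

    -- Same result as the query above: a 6-row window starting at Id 6.
    SELECT * FROM dbo.MovingAvg(6, 5) ORDER BY Id;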
     
    #11     Jul 1, 2009
  2. thstart

    Yes, this is possible with SQL, but it is slow because of those joins. The main reason COTS databases are slow is that they are designed for online transaction processing, i.e. they are row-oriented. If you have a table with OHLCV in 5 different columns, then every time you perform, say, a sliding-window calculation on one column, the database engine reads all the columns of each row it touches and then selects the one of interest. You can say you need only C (the closing price), but internally the database reads the whole OHLCV row and then isolates the closing price. This is one of the reasons it is so slow.

    The solution is a column-oriented database: you access only the column you need, and by storing it that way you can compress it better, since it is not interleaved with other columns. Better compression means reading it faster from disk, which is the slowest operation. If reading the compressed data and decompressing it is faster than reading it uncompressed, you get faster access to the data. But decompression is not always faster, so there is a need for faster algorithms. This is part of what we have been doing for about 2 years.
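    A rough illustration of the narrow-column idea in SQL Server 2008 terms (DATA_COMPRESSION = PAGE is a real option there; the ClosePrice table itself is only an example): keep the closing prices in their own compressed table, so a scan touches a narrow, well-compressed structure instead of full OHLCV rows.

    -- Illustrative only: closing prices held in their own narrow table,
    -- page-compressed so fewer pages come off the disk when it is scanned.
    CREATE TABLE dbo.ClosePrice
    (
        Id         INT   NOT NULL PRIMARY KEY,
        ClosePrice MONEY NOT NULL
    )
    WITH (DATA_COMPRESSION = PAGE);

    This is only an approximation with a narrow row-store table, of course, not a true column store.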
     
    #12     Jul 1, 2009
  3. You're not required to read all columns when you join. Just use two subqueries that reference only the close price if that's all you need for the calculation. Assuming the BulbSales table has 200 columns, we could rewrite the example as:

    SELECT x.Id, AVG(y.Sales) moving_average
    FROM (SELECT Id, Sales FROM BulbSales) AS x,
         (SELECT Id, Sales FROM BulbSales) AS y
    WHERE x.Id>=6 AND x.Id BETWEEN y.Id AND y.Id+5
    GROUP BY x.Id
    ORDER BY x.Id
    OPTION (MERGE JOIN)


    If we also assume that Id is the only column in the primary key, this query shouldn't be too bad.
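    One way to actually get that narrow read in SQL Server (an addition for illustration, not from the post above) is a covering nonclustered index, so both subqueries can be answered from the two-column index instead of the 200-column rows:

    -- Covering index holding only the two columns the query touches.
    CREATE NONCLUSTERED INDEX IX_BulbSales_Id_Sales
        ON BulbSales (Id)
        INCLUDE (Sales);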
     
    #13     Jul 1, 2009
  4. thstart

    In fact, internally the database is still reading all the columns and then using only the ones you need in the query. Because the database is stored on disk, and reading from disk is the slowest operation, this is why it is so slow.

    If you don't believe it, you can run some benchmarks and see for yourself. You can also read the patents filed for databases and look at how they read the data when you run a query.
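    For example, a quick check along those lines in SQL Server, using SET STATISTICS IO and SET STATISTICS TIME (DailyBars is an assumed wide OHLCV table, not something from this thread):

    -- Rough benchmark sketch: compare the I/O of reading one column
    -- versus every column from the same wide row-oriented table.
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;
    SELECT ClosePrice FROM dbo.DailyBars;   -- "just the C column"
    SELECT *          FROM dbo.DailyBars;   -- every column of every row
    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;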

    What I am saying is that COTS databases are basically hitting a wall with regard to performance on analytical queries. They are built to be good at online transaction processing, but not at analytical tasks.
     
    #14     Jul 1, 2009
  5. A flat file would be stored on a disk also, wouldn't it?
     
    #15     Jul 1, 2009
  6. thstart

    Yes, but you can store each column in a separate file. That way, if you want a sliding window on C (the closing price), you load a file containing just the closing prices, not all the columns.

    It compresses better because similar values sit next to each other, and so it loads faster. It requires less memory and can stay in memory for subsequent access.

    This is the basis of column-oriented databases.

    If you have 1,000 columns as a result of extensive research, your SQL queries will be painfully slow. To access just 1 column, you read all 1,000 columns in each row. You can limit the range of rows read to make things faster, but you cannot limit the number of columns read with each row; you read all of them. That is how the data is stored on disk.
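    A sketch of that split in ordinary T-SQL (Bars_Close and Bars_Volume are illustrative names): one narrow table per column, all keyed on the same Id, with the sliding-window query from post #11 pointed at the narrow table.

    -- Illustrative per-column layout: each field lives in its own narrow table.
    CREATE TABLE dbo.Bars_Close  (Id INT NOT NULL PRIMARY KEY, C FLOAT  NOT NULL);
    CREATE TABLE dbo.Bars_Volume (Id INT NOT NULL PRIMARY KEY, V BIGINT NOT NULL);

    -- The moving-average query now touches only the closing prices.
    SELECT x.Id, AVG(y.C) AS moving_average
    FROM dbo.Bars_Close AS x
         INNER JOIN dbo.Bars_Close AS y
           ON x.Id BETWEEN y.Id AND y.Id + 5
    WHERE x.Id >= 6
    GROUP BY x.Id
    ORDER BY x.Id;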

    So-called COTS online transaction processing (OLTP) databases were created for cases where, for example, you have customers and you store all the data for each customer in one row. The row ID is your customer number, and in each row you have Address, Phone, etc.
    The access pattern is row by row, one at a time. This organization is fine if you collect customer data and want to retrieve it by ID number: given the ID, you get all the data for that customer very fast.
    But if you want to do analysis, for example how many customers bought something, this organization is not very efficient. This is why online analytical processing (OLAP) databases exist. But if you try those too, you will see they are not exactly appropriate for analyzing trading data, e.g. time series. They are more appropriate for data warehouses, inventory analysis, etc. They are also very expensive.
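    The two access patterns side by side (the Customers table and its columns are only an example):

    -- OLTP-style access: one row fetched by key; a row store handles this well.
    SELECT Address, Phone
    FROM dbo.Customers
    WHERE CustomerId = 12345;

    -- Analytical access: every row is touched, and in a row store every column
    -- of every row is read from disk even though only one flag is needed.
    SELECT COUNT(*)
    FROM dbo.Customers
    WHERE HasPurchased = 1;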

    We evaluated, tested, and benchmarked most of the COTS solutions available so far and created an internal report on performance benchmarks and cost. We also analyzed most of the database patents.

    The bottom line is that they are not appropriate for time series analysis or large data sets; they are very expensive, they do not fully utilize processor resources, there are limits on the number of CPUs used, and some of them charge for additional CPUs.

    No one ever tries to use the SIMD processing capabilities that have already been available for 10 years and which are one way to do parallel processing via vector operations. The truth is that SIMD programming is very hard.

    Also, in MS SQL there is a limit on the number of columns you can create in a table (1,024 in SQL Server 2008). The other option is to create one narrow table per column, but SQL is not good at that either; the maintenance would be a nightmare.

    The best way to get maximum performance is a column-oriented database optimized for vectorized SIMD and NVIDIA CUDA GPU parallel computing. That is what we are doing.
     
    #16     Jul 1, 2009
  7. Great, thanks for the input. I don't agree with your views, but it's nice to hear how other people are processing their data.
     
    #17     Jul 1, 2009
  8. nitro

    I don't understand why the programmer has to worry about any of this. This seems to be an implementation detail. You may pass a hint to the database that it should structure itself as column or row oriented per table. Heck, a database may even be able to intelligently rewrite itself on the disk based on use statistics, or whether it has been mostly a read or read/write db. The reason for column vs row is to minimize seeks on the disk. This layout on the disk should not be the worry of the application, and every database should support both on the table level.
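    What that per-table hint might look like, in purely hypothetical DDL (the STORAGE option is invented for illustration; no such syntax exists in SQL Server 2008):

    -- Hypothetical syntax only: a per-table row/column storage hint.
    CREATE TABLE dbo.Quotes
    (
        Id INT    NOT NULL PRIMARY KEY,
        O  FLOAT  NOT NULL,
        H  FLOAT  NOT NULL,
        L  FLOAT  NOT NULL,
        C  FLOAT  NOT NULL,
        V  BIGINT NOT NULL
    )
    WITH (STORAGE = COLUMN);   -- invented option; the engine picks the layout per table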

    The CUDA part is cool.

    C-Store is an open-source column-oriented DB:

    http://db.csail.mit.edu/projects/cstore/

    and so is MonetDB

    http://monetdb.cwi.nl/
     
    #18     Jul 1, 2009
  9. thstart

    It is not as simple as that when you are trying to do actual work.

    The column-oriented concept is only part of the picture. It solves the problem of accessing only the columns you need.

    But a pure column-oriented DB is a simplification. The data actually has to be column-oriented but grouped for optimal SIMD operations.

    If you do not code with the underlying architecture in mind, you cannot get maximum performance, and performance is usually critical when analyzing large data sets.
     
    #19     Jul 1, 2009
  10. nitro

    I see, so your point is, [queries] have to be parallelizable over many columns and tables [on a join], and just doing the column-oriented part is only half the problem?
     
    #20     Jul 1, 2009