SQL for trading...?

Discussion in 'Trading Software' started by CloroxCowboy, Jul 1, 2009.

How do you feel about SQL?

  1. Good tool for trading.

    27 vote(s)
    47.4%
  2. Awful tool for trading.

    11 vote(s)
    19.3%
  3. Never tried it / no opinion.

    18 vote(s)
    31.6%
  4. Heard so many bad things I'd never want to try it.

    1 vote(s)
    1.8%
  1. As a former DB2 DBA I use DB2 Express. One of the indirect trading functions I use DB2 tables and SQL for is loading TradeStation optimization results into a table and then using SQL to filter out the rows that fit my trading plan requirements.
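
    Roughly the idea, sketched here with SQLite's C API rather than my DB2 setup, and with made-up report columns (net_profit, max_drawdown, profit_factor) standing in for the actual optimization fields:

        #include <sqlite3.h>
        #include <cstdio>

        // Callback: print one optimization run that passed the SQL filter.
        static int print_row(void*, int argc, char** argv, char** col)
        {
            for (int i = 0; i < argc; ++i)
                std::printf("%s=%s  ", col[i], argv[i] ? argv[i] : "NULL");
            std::printf("\n");
            return 0;
        }

        int main()
        {
            sqlite3* db = nullptr;
            if (sqlite3_open("optruns.db", &db) != SQLITE_OK) return 1;

            // Keep only the parameter sets that fit the trading-plan rules.
            // Thresholds and column names are hypothetical.
            const char* sql =
                "SELECT param1, param2, net_profit, max_drawdown, profit_factor "
                "FROM opt_runs "
                "WHERE net_profit > 0 "
                "  AND max_drawdown < 5000 "
                "  AND profit_factor >= 1.5 "
                "ORDER BY net_profit DESC;";

            char* err = nullptr;
            if (sqlite3_exec(db, sql, print_row, nullptr, &err) != SQLITE_OK) {
                std::fprintf(stderr, "query failed: %s\n", err);
                sqlite3_free(err);
            }
            sqlite3_close(db);
            return 0;
        }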
     
    #21     Jul 1, 2009
  2. thstart

    Column-oriented storage solves only the disk-read problem: you read only the columns you need, and you read them faster because they compress better. That works only if you also have a fast decompression algorithm, ideally a parallel one.

    Yes, the data also have to be grouped properly to feed the SIMD registers (for example as structures of arrays), sliced to fit the CPU cache, and arranged in appropriate memory access patterns to get maximum performance.
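
    A rough sketch of what I mean by structure of arrays for bar data (field names are only illustrative):

        #include <vector>

        // Array-of-structures: one bar per element. A scan over closes drags
        // open/high/low/volume through the cache as well.
        struct BarAoS { double open, high, low, close; long long volume; };

        // Structure-of-arrays: each field is its own contiguous column, so a
        // scan over closes touches only closes and feeds SIMD loads directly.
        struct BarsSoA {
            std::vector<double>    open, high, low, close;
            std::vector<long long> volume;
        };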

    I believe we have powerful computers that are still underutilized. Vectorized SIMD operations were introduced 10 years ago but are still not used well. If you look at the .ASM code generated by the MS or Intel C++ compilers, you can see that even with the /O2 option for max speed, the SIMD operations generated are far from optimal. Too often the generated code accesses memory many times, whereas done properly that data could be kept temporarily in SIMD registers for reuse. Memory access being much slower than register access, this hurts performance. And that is when you work with C++.

    With .NET it gets worse, because you have a lot of overhead, context switching, and garbage collection. I ran extensive benchmarks in 1MB increments of data, and the processing time rises exponentially; when the data reaches ~100MB it gets very slow and impractical to use. The performance meter shows .NET is actually using the available processors (2 CPUs on my server) and doing a higher-level parallelization, but it is still very slow for large data sets. Too much overhead, because it is a general-purpose rather than a custom-tailored solution.
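
    As a hand-rolled illustration (mine, not compiler output), here is an SSE version of a demeaned sum of squares over a float price column, where each 4-float block is loaded once and reused for both the subtract and the multiply:

        #include <immintrin.h>
        #include <cstddef>

        // Demeaned sum of squares over a float price column. Each 4-float
        // block is loaded into an XMM register once and reused for both the
        // subtract and the multiply, instead of being re-read from memory.
        float sum_sq_dev_sse(const float* px, std::size_t n, float mean)
        {
            const __m128 vmean = _mm_set1_ps(mean);
            __m128 vacc = _mm_setzero_ps();
            std::size_t i = 0;
            for (; i + 4 <= n; i += 4) {
                const __m128 v = _mm_loadu_ps(px + i);   // one load
                const __m128 d = _mm_sub_ps(v, vmean);   // reuse the register
                vacc = _mm_add_ps(vacc, _mm_mul_ps(d, d));
            }
            float lanes[4];
            _mm_storeu_ps(lanes, vacc);
            float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
            for (; i < n; ++i) {                         // scalar tail
                const float d = px[i] - mean;
                sum += d * d;
            }
            return sum;
        }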

    If this happens with vectorized instructions, what can you expect from multi-core and multi-processor optimizations? Try the new Intel or MS parallelizing compilers and see how hard it is to make them work properly. Also, the performance gain from 2 cores is not comparable to what you can get from NVIDIA CUDA-capable GPUs, where you can have up to 240 processors. Personally, I still haven't heard of a COTS database capable of doing all of this.

    Just run some benchmarks yourself on your own data, with different increments, workloads, and computations, and don't trust the conventional, mostly useless published benchmarks.

    We coded the same algorithm with both MS tools and Sybase tools - 10 different sliding-window back regressions (one is sketched further below), a 10-day gain computation before and after each trading day, plus proprietary factor data so we can build screeners.

    The data generated for each trading day comes to ~1,000 values (columns), about 200MB uncompressed for the 1928-2009 DJIA. Computations on up to 140MB of data (half of this data) take ~1 hour using MS tools and ~5 minutes using Sybase tools limited to 1 CPU.
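
    For reference, a bare-bones sketch of one such sliding-window back regression (my illustration here, not our production code): for each day, fit close = a + b*x over the previous W closes, with x = 0..W-1, and keep the OLS slope b.

        #include <vector>
        #include <cstddef>

        // For each day t, fit close = a + b*x over the previous W closes
        // (x = 0..W-1) and keep the OLS slope b. Plain O(n*W), which is
        // fine for daily data back to 1928.
        std::vector<double> rolling_slope(const std::vector<double>& close, int W)
        {
            std::vector<double> slope(close.size(), 0.0);
            if (W < 2 || close.size() < static_cast<std::size_t>(W)) return slope;

            // x = 0..W-1 inside every window, so its sums are constants.
            double sx = 0.0, sxx = 0.0;
            for (int k = 0; k < W; ++k) { sx += k; sxx += double(k) * k; }
            const double denom = W * sxx - sx * sx;

            for (std::size_t t = W - 1; t < close.size(); ++t) {
                double sy = 0.0, sxy = 0.0;
                for (int k = 0; k < W; ++k) {
                    const double y = close[t + 1 - W + k];
                    sy  += y;
                    sxy += k * y;
                }
                slope[t] = (W * sxy - sx * sy) / denom;
            }
            return slope;
        }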

    Sybase is much better: it has more built-in formulas, and the computation time does not grow exponentially, it is almost linear. But for full utilization of 2 CPUs, for example, you pay twice as much.

    Also, even 5 minutes of processing is a lot if you want quick strategy testing. The screeners (queries) are slow too because the store is not column oriented. Sybase IQ is column oriented but costs about $40K and needs a database for storage. The Sybase RAP trading solution with all the options (DB, IQ, etc.) comes to about $100K.

    The max configuration for MS tools (Enterprise edition) is about $25K (unlimited CPUs); Sybase SQL Anywhere for 2 CPUs is $40K. That is without the support option. A 2-CPU machine is a reasonable configuration today, and I don't see how they can cram many more CPUs into a server without setting it on fire - too much power consumption. So that is what you get - too slow for the money. The ROI is questionable.

    The test machine was:
    Dell PowerEdge 2850 server, 64-bit MS Server 2008; Intel Xeon 2 x 3.00 GHz; 4GB RAM

    We are now exploring NVIDIA CUDA-capable GPUs. The most powerful (240 processors) use a lot of power too and need a computer that can handle ~800W. There are capable mid-range professional GPUs with more memory and a good mix of memory/computation/PCIe bandwidth - for example, we are now testing the FX 1800: 64 processors, 800MB of memory, only 50W. Converting algorithms to it is not easy, but the speedup we get for some of them is ~100 times.
    They can do this because the GPU processors are very simple and thus can be packed in much greater numbers. Currently they are still on a 65nm process and the most powerful of them dissipate a lot of power, but the move to 45nm will make them better.
     
    #22     Jul 1, 2009
  3. nitro

    Thanks. That's interesting.

     
    #23     Jul 1, 2009
    If that is the case, it may be a solution looking for a problem. For a small-to-moderate-sized database, SSDs are starting to look very attractive. The access times are absurdly low (~0.1 - 0.2 ms) and the IO rates absurdly high.

    It would be very interesting to hear about practical experience with these things.
     
    #24     Jul 1, 2009
  5. nitro

    They are extremely expensive for very little capacity. Also, it sounds like storage logic is only half the reason for column-oriented DBs.
     
    #25     Jul 1, 2009
    How much capacity do you want? I've been reading some of the reviews lately. One standout seems to be the G.SKILL Falcon: close to the Intel SSD in performance and much cheaper. Here in Aus you can pick up 128GB for AUD 425.

    I'm not sure how that compares with enterprise-grade 15K RPM SCSI disks, but it may not be much more expensive per GB.
     
    #26     Jul 1, 2009
  7. thstart

    SSDs right now are reported to deteriorate over time, with access times increasing after use. It is still not clear why, but this info is not widely publicized. They are still expensive, and indeed storage logic is only half of the story.

    If you format your disk properly - with the right sector size and alignment - and read larger blocks of data at a time, modern disk caching and read-ahead are very powerful.

    The problem is that most computers' hard drives are not configured properly. For example, we own a Dell PowerEdge 8250 server with RAID controllers, etc., but the disks were not properly aligned:

    Partition 2    Primary    136 GB    31 MB offset  <-- odd alignment

    It should be 32 MB. We reformatted it properly, also changing the cluster size to 64KB. It is a tricky business, but when you do it right you can read up to 200MB/s, sometimes more, if you organize the data around these parameters. This is only possible if you are creating your database from scratch. Disk I/O, being the slowest part, is one of the most critical, and the database layout must fit both the optimal disk parameters and a storage format that suits the SIMD processing later in order to get maximum performance.
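
    The reading side of it, sketched with plain C++17 (on Windows you would use _aligned_malloc and CreateFile with FILE_FLAG_NO_BUFFERING instead, but the idea is the same): allocate a buffer aligned to the 64KB cluster size and read the file in large multiples of it so the read-ahead can stream sequentially. The 8MB request size is just an illustrative choice.

        #include <cstdio>
        #include <cstdlib>
        #include <cstddef>

        // Sequential reads in large, cluster-sized multiples from an aligned
        // buffer. The 64KB figure matches the cluster size mentioned above;
        // the 8MB request size is only an illustrative multiple of it.
        int main(int argc, char** argv)
        {
            if (argc < 2) return 1;

            const std::size_t kCluster = 64 * 1024;        // NTFS cluster size
            const std::size_t kChunk   = 128 * kCluster;   // 8MB per request

            // C++17 aligned allocation; size must be a multiple of alignment.
            char* buf = static_cast<char*>(std::aligned_alloc(kCluster, kChunk));
            if (!buf) return 1;

            std::FILE* f = std::fopen(argv[1], "rb");
            if (!f) { std::free(buf); return 1; }

            std::size_t total = 0, got = 0;
            while ((got = std::fread(buf, 1, kChunk, f)) > 0)
                total += got;                              // process the chunk here

            std::printf("read %zu bytes\n", total);
            std::fclose(f);
            std::free(buf);
            return 0;
        }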

    I can tell you it is possible to extract a lot of performance while still using a not-very-expensive computer.
     
    #27     Jul 1, 2009
  8. I feel a little embarrassed mentioning this (what with all the high powered DB guys posting here) but I've been using SQLite for my OHLCV data. I'm a software guy by trade but my specialty is not databases (embedded/DSP). But the price analysis apps I've been writing for my own use were getting unwieldy with .csv files. So I learned all I could about DB design and I'm glad I took the plunge. Managing this data is *so* much easier now. Plus I wrote a nice C++ wrapper around the SQL stuff to better integrate with my existing apps.

    While SQLite isn't any kind of enterprise solution, it certainly got the job done for me.
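
    For anyone curious, the core of it looks roughly like this with the raw SQLite C API (table layout and symbol are made up; a real wrapper would add RAII and error handling):

        #include <sqlite3.h>
        #include <cstdio>

        // One bars table keyed by (symbol, day), a prepared INSERT, and a
        // ranged SELECT. Table layout and values are made up for illustration.
        int main()
        {
            sqlite3* db = nullptr;
            if (sqlite3_open("ohlcv.db", &db) != SQLITE_OK) return 1;

            sqlite3_exec(db,
                "CREATE TABLE IF NOT EXISTS bars ("
                "  symbol TEXT NOT NULL, day TEXT NOT NULL,"
                "  open REAL, high REAL, low REAL, close REAL, volume INTEGER,"
                "  PRIMARY KEY (symbol, day));",
                nullptr, nullptr, nullptr);

            // Insert one daily bar (in practice, reused in a loop over the CSV).
            sqlite3_stmt* ins = nullptr;
            sqlite3_prepare_v2(db,
                "INSERT OR REPLACE INTO bars VALUES (?,?,?,?,?,?,?);",
                -1, &ins, nullptr);
            sqlite3_bind_text  (ins, 1, "DIA", -1, SQLITE_STATIC);
            sqlite3_bind_text  (ins, 2, "2009-07-01", -1, SQLITE_STATIC);
            sqlite3_bind_double(ins, 3, 84.12);
            sqlite3_bind_double(ins, 4, 85.30);
            sqlite3_bind_double(ins, 5, 83.90);
            sqlite3_bind_double(ins, 6, 84.95);
            sqlite3_bind_int64 (ins, 7, 1234567);
            sqlite3_step(ins);
            sqlite3_finalize(ins);

            // Pull closes for a date range.
            sqlite3_stmt* q = nullptr;
            sqlite3_prepare_v2(db,
                "SELECT day, close FROM bars WHERE symbol=? AND day>=? ORDER BY day;",
                -1, &q, nullptr);
            sqlite3_bind_text(q, 1, "DIA", -1, SQLITE_STATIC);
            sqlite3_bind_text(q, 2, "2009-01-01", -1, SQLITE_STATIC);
            while (sqlite3_step(q) == SQLITE_ROW)
                std::printf("%s  %.2f\n",
                    reinterpret_cast<const char*>(sqlite3_column_text(q, 0)),
                    sqlite3_column_double(q, 1));
            sqlite3_finalize(q);
            sqlite3_close(db);
            return 0;
        }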
     
    #28     Jul 1, 2009
  9. thstart

    We tested SQLite too. It is very good for small databases that can fit in memory. It is very popular because it is free and has a very small footprint; that small footprint is the main reason it is fast. If you have more data and more columns, they will not fit in memory, because SQLite is still a conventional row-oriented DB and you cannot isolate one column and work only on it. It is only as good as the amount of memory you have installed and whether the data you want to test fits in that memory.

    We approached Dr. Richard Hipp about adding sliding windows the way Sybase does, but he has no plans to do it. Its development is currently stabilized, and that is it. For pure SQL tasks it is OK.

    Of the databases we evaluated, only Sybase SQL Anywhere 11 has the in-memory capability, but with that option it costs twice as much, which was prohibitive for testing.
     
    #29     Jul 1, 2009