Backtesting Pains: How to store data locally for multiple backtests at wire speed

Discussion in 'Data Sets and Feeds' started by sanjay_arora, Jul 10, 2009.

  1. Hello All

    As I get deeper into technical trading, I find that retail trading tools do not cover the full range of user requirements. Back-testing and optimizing multiple scenarios in multiple ways is either not possible in these systems or is simply too slow.

    One major problem is that if one runs multiple simulations on the same data set, or on permutations of the same data set, the software tends to query it again and again from the data provider's server.

    It was quite surprising to me that one cannot buy the historical data once, update it daily from a real-time data provider, and keep a local data store that serves queries at wire speed rather than being limited by internet speed, especially when more than one machine is used for simulation, back-testing and optimization.

    Maybe this sort of thing is possible with non-retail software.

    What I envisage is something like this:

    - A data storage server that loads offline historical data for any/all symbols (one time) and then queries the live data provider for tick data on each symbol and stores it on a daily basis.
    - Preferably open source. May be free or paid, but ideally free, GPL or a community effort.
    - Preferably an appliance/user-friendly machine or a user server running Linux and an RDBMS such as PostgreSQL. A user-specified OS/RDBMS may be possible if the software is built exclusively on RDBMS triggers and multi-platform languages like Python/Perl/Java.
    - Capability to load historical tick data from a wide variety of file formats, downloaded online or supplied on CD/DVD.
    - Historical data for an entire exchange (all scrips), or for all symbols to be traded, loaded in tick format.
    - Capability to clone/split the incoming data stream in real time into two threads: one to store the data and a second to serve it to any local user who wants it raw/unprocessed on a tick-by-tick basis.
    - The data storage server should have data transformation procedures, implemented as triggers, to build downstream data as required by the various client applications, e.g. 1/3/5/15/30-minute bars or 20/60/1500-tick bars. This derived data should be available live and independently of the base tick data it comes from, with no on-the-fly transformation, for speed reasons (see the sketch after this list).
    I am assuming that speed is vital and hard disk space merely a commodity. One has to be aware, however, that years of historical data, kept in multiple permutations, in populated views, along with the required indexes and other RDBMS-generated data, would make for one hell of a storage management requirement.
    - The server should have views, preferably materialized ("populated") views rebuilt each weekend, so that repeated recomputation is avoided for speed.
    - The data transformation procedures should allow a rarely used data type (say, 20,000-tick bars) to be dropped on admin/management confirmation, and a new data/bar type to be created on demand from the user's software after management confirmation: disk space management issues.
    - The data server should have pre-defined data cleansing procedures for clearing out bad ticks.
    - The data server should store all incoming real-time feeds as compressed CSV files, to be uploaded to a geographically remote server, so that the data need not be bought again if database recovery is ever required.
    - The data server should serve local clients after authentication and keep a record of the type, symbol and date range queried by each user/strategy, for management analysis against simulations/actual trades. I don't really know what use this will eventually be, other than tailoring the server rollout to user requirements, but one never knows.
    - Each data query and delivery thread should be forked separately and be multi-processor/multi-thread friendly.
    - The database server should have a good connection pooling system.
    - If possible, data should be streamed to the client rather than returned as a recordset. FIX? CSV strings over a TCP/IP socket? Any specific protocol?
    - It should be possible to transform the data from the native storage format to the required delivery format, e.g. MetaStock, TradeStation, etc. I should note that I don't know anything about any software's data format, so maybe I don't have the right idea here.
    - Server capacity should be measurable in connections and in the number of symbols/data bars delivered (or deliverable) per second on given hardware.
    - The server should have a native delivery method, in addition to normal JDBC/ODBC pooled connections and streaming delivery, for those who want to write custom connectors for higher speed.
    - The server should be able to stitch together successive futures contract periods and smooth the joins, giving an extended history for futures contracts (a rough back-adjustment sketch also follows below).
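
    To make the trigger idea above concrete, here is a minimal sketch, assuming PostgreSQL (9.5 or later for ON CONFLICT) and Python with psycopg2. Every table, column and function name is illustrative only, not taken from any existing product:

        # Sketch only: a tick table plus a trigger that maintains 1-minute bars.
        import psycopg2

        DDL = """
        CREATE TABLE IF NOT EXISTS ticks (
            symbol text NOT NULL,
            ts     timestamptz NOT NULL,
            price  numeric NOT NULL,
            volume bigint NOT NULL
        );

        CREATE TABLE IF NOT EXISTS bars_1min (
            symbol text NOT NULL,
            bar_ts timestamptz NOT NULL,            -- minute boundary
            open numeric, high numeric, low numeric, close numeric,
            volume bigint DEFAULT 0,
            PRIMARY KEY (symbol, bar_ts)
        );

        CREATE OR REPLACE FUNCTION update_bar_1min() RETURNS trigger AS $$
        BEGIN
            INSERT INTO bars_1min (symbol, bar_ts, open, high, low, close, volume)
            VALUES (NEW.symbol, date_trunc('minute', NEW.ts),
                    NEW.price, NEW.price, NEW.price, NEW.price, NEW.volume)
            ON CONFLICT (symbol, bar_ts) DO UPDATE SET
                high   = GREATEST(bars_1min.high, EXCLUDED.high),
                low    = LEAST(bars_1min.low, EXCLUDED.low),
                close  = EXCLUDED.close,
                volume = bars_1min.volume + EXCLUDED.volume;
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        DROP TRIGGER IF EXISTS ticks_to_1min ON ticks;
        CREATE TRIGGER ticks_to_1min AFTER INSERT ON ticks
            FOR EACH ROW EXECUTE PROCEDURE update_bar_1min();
        """

        conn = psycopg2.connect("dbname=market")    # connection string is an assumption
        with conn, conn.cursor() as cur:
            cur.execute(DDL)

    The same pattern extends to other bar sizes; whether per-row triggers can keep up with a full-exchange tick stream is exactly the kind of performance question I expect to hear about.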
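
    And for the futures point, a rough illustration of back-adjusting the older contract by the price gap at the roll date, so the stitched series has no artificial jump. The prices below are made up; in practice they would come from the tick store, and ratio adjustment is an equally common alternative:

        # Sketch only: back-adjusted continuous futures series from two contracts.
        def stitch(front, back, roll_date):
            """front/back: {date: settlement price}; roll happens on roll_date."""
            gap = back[roll_date] - front[roll_date]           # jump at the roll
            continuous = {d: p + gap for d, p in front.items() if d <= roll_date}
            continuous.update({d: p for d, p in back.items() if d > roll_date})
            return dict(sorted(continuous.items()))

        old_contract = {"2009-06-17": 100.0, "2009-06-18": 101.0, "2009-06-19": 102.0}
        new_contract = {"2009-06-19": 104.0, "2009-06-22": 105.0, "2009-06-23": 104.5}

        # Old prices are shifted up by the 2.0 roll gap; the join is smooth.
        print(stitch(old_contract, new_contract, "2009-06-19"))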

    If someone knows of an open-source project, or even a commercial project, that provides this sort of functionality, please advise.

    I would also like to hear from people who have input on the required feature set, or comments on anything I am getting wrong.

    I personally feel that a data server as outlined above, a grid-based back-testing/simulation server that can be expanded as needed, an OMS, an automated trading server, an accounting back office and the clients would together form a basic professional trading framework.

    I would really like your input on this first issue that I am tackling. Please do contribute to this thread.

    With best regards,
    Sanjay.
     
  2. thstart

    We ran extensive benchmarks using COTS software for screening and back-testing: .NET, MS SQL, Sybase Anywhere. You can get some info in this thread:
    http://www.elitetrader.com/vb/showthread.php?s=&threadid=168685


    Retail trading tools are not really focused on screening and back-testing; they are focused on trading. Trading itself is a relatively simple, automatable procedure once a good trading strategy and strict money management rules are in place. A good strategy, however, means extensive research and experimentation on historical data. We had the same question three years ago and found that the available tools were not what we were looking for.

    You can buy tick-by-tick data from third parties, but it costs a lot. The problem is that even if you get the data, you practically cannot process it fast enough with the available software. You need high-performance computing on an affordable server or workstation.

    I believe EOD data is where you should begin, expanding to tick-by-tick later.

    With tick-by-tick data you will suffer from data overload; it is too much to download and process every day if you want to cover a lot of instruments.

    I don't believe you will find quality open-source software for such an important task.

    Better to simply forget about using a COTS database and open source for this. The problem is that the storage and management model of traditional databases is not appropriate for this kind of processing.

    This would be terabytes of data. You had better be close to the exchanges if you want to do this.

    Yes - all of your processing has to finish in a reasonable time in order to be able to use your results for actual trading.

    Please look at www.thstart.com
    We have been developing a screening and back-testing platform for three years, after evaluating the available options on the market from the point of view of capabilities and price. It turned out we had to develop everything from scratch in order to achieve our goals. At some point we may offer it to the public.

    The task was to process EOD data every day after the market close and finish the screening and back-test in a reasonable time. We needed ~1,000 parameters derived from OHLCV. The amount of data generated for one benchmark test, just the DJIA from 1928-2009, explodes from 20 KB to ~500 MB. Extrapolated to the full market, that is a lot of data.

    To make a long story short: the MS tools took ~1 hour to process 50 years of data, showed an exponential rise in run time, and we did not continue the test; it was not feasible. The Sybase tools took 5 minutes for the same period and scaled roughly linearly, but that was still too slow if we wanted to process more data.

    We developed a time-series-optimized database from scratch. It:
    - is focused on time-series analysis rather than being general-purpose;
    - uses parallel computing where possible without the need for expensive servers, exploiting the computing power already available (vectorized SIMD instructions are still underutilized) and modern NVIDIA CUDA accelerators when appropriate;
    - implements data compression and decimal (note: decimal rather than float) calculations directly on compressed data, without the uncompress -> calculate -> compress cycle;
    - has unique feature representations that are reused later for the actual screening and back-testing;
    - uses a hybrid of pre-calculation and on-the-fly computation;
    - uses a hybrid data store optimized for later SIMD processing.
    Besides the data-processing innovations, we also innovated in data visualization, with synchronized multi-window manipulation of one or several instruments in different time frames simultaneously. Have a look at www.thstart.com for more info; we update it often.
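
    As a rough, generic illustration of what the columnar/vectorized style buys you (NumPy standing in for hand-tuned SIMD here; none of this is our actual code):

        # Sketch only: one OHLCV-derived parameter (a 20-day simple moving average
        # of the close) computed over the whole column at once, instead of looping
        # row by row. NumPy dispatches the arithmetic to vectorized machine code.
        import numpy as np

        rng = np.random.default_rng(0)
        close = 100 + np.cumsum(rng.normal(0, 1, 20_000))     # fake close prices

        window = 20
        csum = np.concatenate(([0.0], np.cumsum(close)))
        sma = (csum[window:] - csum[:-window]) / window        # all windows in one pass

        print(sma.shape, sma[:3])

    The same whole-column style applies to any of the ~1,000 OHLCV-based parameters mentioned above.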
     
  3. Actually, I was thinking of a more modular approach, or rather one extra module: the data server, to replace the feed from the broker/data provider.

    Basically, the goal is to buy a historical tick feed and keep adding to it daily using a real-time feed provider, and then to manufacture from this raw data all the kinds of bars a trader may require, avoiding the high fees and slow access times for the historical data needed for extensive back-testing.

    To me, this server is just a data feed server; the back-testing infrastructure would be separate, most probably a grid of inexpensive machines running Linux or another license-free OS, so that new servers can be added to the grid as needed, at hardware and maintenance cost only.

    I would be really interested to know how the data feed providers store and serve their data, because that is all I intend with this server: essentially a proxy server for the feed, but with full history, so that nothing except real-time data needs to be queried from the data provider and everything else is served from the local data store to multiple clients on the LAN at LAN speeds.

    Sanjay.
     
  4. For most tests (those that run linearly over one tick stream, i.e. one symbol) you do not even need a data server to query in "real time".

    If one sets up a test to, for example, optimize algorithms and will run the same system 100 times over a data set, the test system could download the tick stream and write it to a file once, then test against that file. ;)
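
    Something like this, where fetch_ticks() and run_backtest() are only placeholders for whatever the real provider API and strategy code look like:

        # Sketch only: download the tick stream once into a gzip'd CSV cache,
        # then run the back-test N times against the local file instead of
        # hitting the data provider again on every optimization pass.
        import csv, gzip, os

        CACHE = "ES_2009-07-10.csv.gz"               # hypothetical file name

        def fetch_ticks(symbol, day):
            """Placeholder for the provider call; yields (time, price, size)."""
            yield from [("09:30:00.001", 918.25, 3), ("09:30:00.004", 918.50, 1)]

        def cached_ticks(symbol, day):
            if not os.path.exists(CACHE):            # download exactly once
                with gzip.open(CACHE, "wt", newline="") as f:
                    csv.writer(f).writerows(fetch_ticks(symbol, day))
            with gzip.open(CACHE, "rt", newline="") as f:
                for ts, price, size in csv.reader(f):
                    yield ts, float(price), int(size)

        def run_backtest(params, ticks):
            return sum(1 for _ in ticks)             # placeholder "strategy"

        # 100 optimization runs, all reading from the local file at disk speed.
        for run in range(100):
            run_backtest({"run": run}, cached_ticks("ES", "2009-07-10"))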

    Not saying the server would not be useful (I am working on the same thing right now), just that... well... there are optimizations that go even further.
     
  5. promagma

    Maybe TickZoom... if you can code a little C#?
     
  6. kdb+
     
  7. I am looking at a scenario with multiple users and multiple test cases, where the data for all of them should live on the office server.

    Again, data may be queried for one or many symbols, by one or many people/strategies/test-case programs, and repeated multiple times for the same or similar data/time series. All of this needs to happen at the fastest speed possible, so the data should be on a server in the office, doing away with slow internet connectivity, at least for historical data.

    Real-time data would be proxied/relayed by this server, in addition to being added to the database.
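
    Something along these lines is what I have in mind; a rough single-process sketch, with the protocol (one CSV line per tick over TCP) purely an assumption and provider_ticks() standing in for the real feed API:

        # Sketch only: incoming ticks are appended to a local file (a stand-in
        # for the database insert) and echoed to every connected LAN client as
        # one CSV line per tick over TCP. No authentication or error handling.
        import socket, threading, time

        clients = []
        clients_lock = threading.Lock()

        def provider_ticks():
            """Placeholder feed: yields (symbol, price, size) forever."""
            while True:
                yield ("INFY", 1830.50, 10)
                time.sleep(1)

        def accept_clients(port=9100):
            srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind(("0.0.0.0", port))
            srv.listen(5)
            while True:
                conn, _ = srv.accept()
                with clients_lock:
                    clients.append(conn)

        def relay():
            with open("ticks.csv", "a") as store:    # stand-in for the DB insert
                for symbol, price, size in provider_ticks():
                    line = f"{symbol},{price},{size}\n"
                    store.write(line)
                    store.flush()
                    with clients_lock:
                        for c in clients[:]:
                            try:
                                c.sendall(line.encode())
                            except OSError:          # client disconnected
                                clients.remove(c)

        threading.Thread(target=accept_clients, daemon=True).start()
        relay()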

    Is this a sound requirement, or am I simply clamoring for a feature nobody needs?
     
  8. rosy2

  9. That's exactly how TickZoom works. It has multiple components.

    The data server/trade server stores tick data to file.

    The data engine converts it to all kinds of bars. The principal challenge in your design is performance. TickZOOM solves the performance issues and is extremely fast, and it offers a variety of techniques to optimize the process. The data engine exists as a DLL, with an API DLL containing only interfaces.

    The rest of it (trading, historical testing, optimizing, portfolio trading, etc.) is a separate library built against the data engine API interfaces.

    As a trader, in TickZOOM you develop a plugin or plugins containing your strategies, indicators, portfolios, etc.

    Back to your data servers: you propose adding machines to a grid to collect more data.

    You can do that if it's the cheapest way, but you must understand that the real bottleneck (if your software is fast) will be the disk I/O needed to write out all the data. That can also be solved by adding disks and network infrastructure rather than machines.

    The raw, real-time data from providers always comes in the form of ticks, but the providers all have slightly different data formats or APIs for getting the real-time tick feed. Still, they carry the same fundamental information, so TickZOOM defines a common data format.
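
    Just to illustrate the idea generically (this is not TickZOOM's actual data structure, only a toy Python sketch with invented field names and wire formats): each provider adapter normalizes its own format into one shared tick type, and everything downstream only ever sees that type.

        # Sketch only: a provider-neutral tick plus one adapter per feed.
        from dataclasses import dataclass

        @dataclass
        class Tick:
            symbol: str
            time_us: int        # microseconds since epoch
            price: float
            size: int

        def from_provider_a(msg: dict) -> Tick:
            # hypothetical provider A sends dict-like messages
            return Tick(msg["sym"], msg["t"], msg["px"], msg["qty"])

        def from_provider_b(line: str) -> Tick:
            # hypothetical provider B sends pipe-delimited text
            sym, t, px, qty = line.split("|")
            return Tick(sym, int(t), float(px), int(qty))

        # Both normalize to the same structure, so storage, bar building and
        # strategies never need to know which feed the tick came from.
        print(from_provider_a({"sym": "MSFT", "t": 1, "px": 23.5, "qty": 100}))
        print(from_provider_b("MSFT|2|23.51|200"))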

    TickZOOM actually goes one step further and allows using Level II data to build ticks that include five levels of DOM, which can be invaluable in many trading scenarios.

    Wayne
     
  10. Sanjay, there's also a project on sourceforge that does a lot of what you describe. It's somewhat outdated and entirely written in C on *Nix.

    But it was interesting to learn all of this.

    It sounds like you want to build this yourself. Keep in mind that it's impossible to build this module totally isolated from the historical testing and real-time trading components.

    It's important to build and test the historical testing, bar data generation and real-time trading functionality together, to avoid "painting yourself into a corner". That can happen either through missed functionality or requirements, or through performance issues that force you to rewrite much of your system.

    TickZoom was generally built this way, but some items were overlooked, discovered later, and parts of the system refactored.

    One example was multi-symbol support, which was added recently and required refactoring parts of the data engine, since multi-symbol operation had never really been tested.

    Get in touch if you would rather start with 90% of this already working: you can get the source and go from there, as well as support and custom development from us if you need it.

    Otherwise, I wish you success in your development!

    Sincerely,
    Wayne
     