By the way, the queuing happens because quote collection and writing to file happen in one thread, while feeding the real-time clients happens in a different thread. The first thread writes ticks to file in batches every few seconds. If it finds the file locked for reading, it starts queuing and retries every few seconds. I added that so it wouldn't die or lose ticks while I copy a file or zip it up, etc. Similarly, when the other thread, which feeds the real-time clients, starts reading the same file, it has the duplicate effect until it finishes reading history. After that it reads ticks in real time and no longer needs the history data. Sincerely, Wayne
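A minimal sketch of the batch-and-retry writer thread described above, assuming serialized ticks arrive on a queue from the feed thread. Names like BatchingTickWriter are illustrative, not TickZOOM's actual code:

[code]
// Hypothetical sketch of the batching writer loop: flush every few seconds,
// and if the file is locked (e.g. being copied or zipped), keep queuing
// and retry on the next cycle so no ticks are lost.
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileLock;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

class BatchingTickWriter implements Runnable {
    private final BlockingQueue<byte[]> incoming;            // serialized ticks from the feed thread
    private final List<byte[]> pending = new ArrayList<>();  // ticks held while the file is busy
    private final String path;

    BatchingTickWriter(BlockingQueue<byte[]> incoming, String path) {
        this.incoming = incoming;
        this.path = path;
    }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(3000);          // write in batches every few seconds
                incoming.drainTo(pending);   // move accumulated ticks into the batch
                tryFlush();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void tryFlush() {
        try (FileOutputStream out = new FileOutputStream(path, true)) {
            FileLock lock = out.getChannel().tryLock();
            if (lock == null) return;        // file locked by a reader: keep queuing, retry next cycle
            try {
                for (byte[] tick : pending) out.write(tick);
                pending.clear();             // batch written, nothing lost
            } finally {
                lock.release();
            }
        } catch (IOException e) {
            // leave pending intact and retry on the next cycle
        }
    }
}
[/code]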
Hmm, price and volume pattern queries, like reversals and large block trades, aren't terribly esoteric. From the sound of it, you've come up with a highly optimized format that lends itself quite well to stream-oriented reads. How do you handle data corrections, late trades, and similar updates that a provider might send? Also, how easy will it be to do walk-forward testing if the data is arbitrarily chunked into 100MB files? The daily trade volume of the SPY dwarfs most small-cap issues.
Final decision on the database: anyone who wishes can build a better data storage solution and contribute it to the project. As you saw on the site, we plan to have rewards and benefits for contributors. In the meantime I'll have it organize files in the file system. Your better solution needs to beat the default solution on these requirements:

1. As fast or faster at loading ticks. (Challenging.)
2. Easy to put new blobs into the storage.
3. Easy to get blobs out to share with someone else.
4. Able to locate the specific blobs requested, based on a symbol and date range, in less than 3 seconds.
5. Able to stream ticks from the blob WHILE it's loading into memory. (This is the hard one.)

TickZOOM will have all that with a simple file system organization; a sketch of how requirement 4 falls out of the directory layout follows below. NOTE: Let me know if you think of other requirements the software must meet. Wayne
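To make requirement 4 concrete, here is a minimal sketch of resolving a symbol/date-range request against a plain directory layout. The layout Data/<SYMBOL>/<YYYYMM>.tck and the file extension are assumptions for illustration, not TickZOOM's actual on-disk format:

[code]
// With one directory per symbol and one blob per month, locating blobs is
// direct path construction rather than a scan, so it stays well under 3 seconds.
import java.io.File;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

class BlobLocator {
    static List<File> locate(File root, String symbol, YearMonth from, YearMonth to) {
        List<File> blobs = new ArrayList<>();
        File dir = new File(root, symbol);
        for (YearMonth m = from; !m.isAfter(to); m = m.plusMonths(1)) {
            File blob = new File(dir, String.format("%04d%02d.tck", m.getYear(), m.getMonthValue()));
            if (blob.exists()) blobs.add(blob);  // missing months are simply skipped
        }
        return blobs;
    }
}
[/code]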
Can you elaborate on the relevance of this? I missed something here. Thanks!

Jprad, those are excellent questions. The ticks currently support last trade price and last trade size; for forex, TickZOOM synthetically extrapolates them from the DOM. Can you define a late trade and a data correction? I will assume it's simply a tick sent with a timestamp out of sequence. I'm guessing: a trade occurred 2 minutes ago that was missed, so the exchange sends out a tick for that trade with a timestamp of 2 minutes earlier? Is that right? Does it ever identify which tick was corrected or replaced in some way? Either way, it's not hard to handle that logic. If a tick comes in that has an earlier timestamp than the previous tick, we go back and add it to the relevant bars. Every bar has both a start tick time and an end tick time, so the engine simply finds the bar in which it belongs and calls a correction method which makes any adjustments to the volume, high/low, etc. (see the sketch after this post). That will happen so rarely as to not affect performance.

NOTE: TickZOOM collects ticks to file in RAW format without any filtering, formatting, etc. That way we can truly simulate a production environment in our historical tests, including data corrections, late trades, etc.

I'm not sure I see the point here. For clear terminology, I'm calling the sections within a tick file chunks, and it seems we're calling the tick file a BLOB, which it is. That part is all transparent to the engine and the historical testing. All it knows is that it gets a stream of ticks to process. The data loader can load multiple BLOBs or parts of BLOBs to satisfy the date range and symbol requested.

TickZOOM doesn't have walk forward yet, so I'm basing my answer on this definition: http://www.tradersstudio.com/Overview/tabid/68/Default.aspx?PageContentID=23 (It claims no software does walk-forward testing yet. Is that true?) Anyway, even with that kind of year-after-year walk-forward testing, the I/O of ticks is transparent and irrelevant. What everyone needs to understand is that the bottleneck is the CPU. Notice: if I only stream the 159MB file into memory in TickZOOM with the engine disabled, that takes about 5 seconds. If I load the whole file into memory first and then run the engine, the engine takes 40 seconds to process that data. The point is that every time a new walk-forward test starts, it starts reading the file again from the beginning. That's important for eliminating the memory wall most software hits when loading ticks into memory.

So let's forget about I/O. In most systems, I/O is the bottleneck, but in tick processing it will not be the case for a long time. Besides, if we could get the CPU to process 10 million ticks as fast as the data can be loaded, then it would only take 5 seconds. That would be a good thing. Wayne
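A sketch of the late-trade handling described above, under Wayne's stated assumption that a correction is just a tick whose timestamp precedes the previous tick. The Bar and Tick classes and the correct() method are illustrative, not TickZOOM's actual API:

[code]
// Each bar knows its start and end tick time, so a late tick can be routed
// to the bar it belongs in, and that bar's volume and high/low adjusted.
import java.util.List;

class LateTickHandler {
    static void onTick(List<Bar> bars, Tick tick, long previousTickTime) {
        if (tick.time >= previousTickTime) return;  // normal, in-sequence tick
        for (int i = bars.size() - 1; i >= 0; i--) {
            Bar bar = bars.get(i);
            if (tick.time >= bar.startTime && tick.time <= bar.endTime) {
                bar.correct(tick);                  // adjust volume, high/low, etc.
                return;
            }
        }
    }
}

class Bar {
    long startTime, endTime;
    double high, low;
    long volume;

    void correct(Tick tick) {
        volume += tick.size;
        if (tick.price > high) high = tick.price;
        if (tick.price < low) low = tick.price;
    }
}

class Tick {
    long time;   // timestamp, e.g. epoch millis
    double price;
    long size;
}
[/code]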
Here is another file system database (directory trees): TREE Data Server, http://tree.sourceforge.net/ . The TREE Data Server captures real-time financial data from one or several datafeed services, archives the data in a historical database, and makes both live and archived data available to client applications. The system can be used for real-time charting, an ATS, a tick feed simulator, etc., or in any situation in which multiple clients need real-time access to the tick stream or archived tick data. In addition, the archived tick stream and tick data are available for offline data analysis and backtesting. In the current distribution, a subset of the TREE Data Server collection of applications (C/C++) has been updated to support the following platforms and datafeed:

Platforms (32-bit): Linux x86 (Ubuntu 7.10), MacOSX PPC (10.4.11), Windows XP SP2 x86 (MinGW port, see tree/doc/ib.notes)
Datafeed: real-time tick (snapshot) data from the Interactive Brokers TWS API
This is the most interesting suggestion yet. Are you the author? Would you like to make it compatible with TickZOOM? It has a few issues that can probably be resolved:

1st: It has a different binary format for ticks than TickZOOM. I'm sure a "take the best of both" approach would work to make them the same.

2nd: It filters the ticks. This is very bad. It's critical that the ticks stored on file be raw ticks, exactly as fed by the exchange. Tick filtering must be done by the engine on the fly during historical testing, since that is how it will work in production during live trading (a minimal sketch of the idea follows below). That's a key mistake of most systems like those you might build on NinjaTrader: you test on clean data, then hit production and choke on some bad ticks. So tick filtering is a KEY part of making sure the system works reliably against raw ticks in historical testing.

3rd: It might be good to compare the tick filter algorithms and take the best of both. Judging by the comments, the one there looks similar to TickZOOM's anyway.

4th: Also, does it do streaming? I assume it can be modified to start feeding ticks onto the tick queue for the engine at the same time as it loads them into memory.

This is just to get the discussion going. Sincerely, Wayne
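A minimal sketch of on-the-fly filtering as described in point 2: the file keeps raw ticks, and the filter runs in the engine as each tick arrives, identically in backtests and live trading. The spike test here is a made-up example, not TickZOOM's or TREE's actual filter algorithm:

[code]
// Stateful filter applied per tick inside the engine. A tick that jumps
// more than 10% from the last accepted price is treated as a bad tick.
class TickFilter {
    private double lastGood = Double.NaN;

    // Returns true if the price should be passed on to the strategy.
    boolean accept(double price) {
        if (!Double.isNaN(lastGood) && Math.abs(price - lastGood) / lastGood > 0.10) {
            return false;   // implausible jump: reject, keep lastGood unchanged
        }
        lastGood = price;
        return true;
    }
}
[/code]

Because the same accept() call runs during historical replay and live feeds, the backtest exercises exactly the filtering logic production will use.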
Storing tick data in binary format in files (one file per symbol) is the way to go for tick stream playback, back-testing, etc. However, there are many use cases where offline analysis of the tick data is necessary, and the binary files are not normally amenable to that kind of analysis. The solution, or rather ONE solution, is to have the necessary tools to easily transfer data from the binary format to an RDBMS etc. when needed (see the sketch after this post). The data conversion/transfer time is not normally critical for offline analysis. Naturally, the API for data persistence can be pluggable so that other storage mechanisms can be deployed as the user deems fit. First, you need to define what is needed in this API.

As far as a proprietary binary file format goes, one symbol per file is not a bad idea. Symbols can be put onto different disks if needed as disk I/O becomes a bottleneck. The alternative is a virtual file system within a single file on disk, but that cannot be defragmented as easily and doesn't have the above-mentioned benefits. You may also want to consider an appropriate level of compression: not for saving disk space, as that is cheap, but for reducing the time taken to read the data from disk.

Proprietary binary file formats for storing tick data have been done a bunch of times, and you aren't inventing anything new here, but it's nice to see you being so enthusiastic about it. BMe.
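A minimal sketch of the export idea BMe raises: walk a binary tick file and emit rows an RDBMS can bulk-load. The record layout (long epoch-millis timestamp, double price, int size) is an assumption for illustration, not any real TickZOOM or TREE format:

[code]
// One-shot converter: binary tick file in, CSV suitable for COPY/LOAD DATA out.
// Usage: java TickExporter ticks.bin ticks.csv
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintWriter;

class TickExporter {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
             PrintWriter out = new PrintWriter(args[1])) {
            out.println("time,price,size");   // header row for the bulk import
            while (true) {
                try {
                    long time = in.readLong();
                    double price = in.readDouble();
                    int size = in.readInt();
                    out.printf("%d,%f,%d%n", time, price, size);
                } catch (EOFException eof) {
                    break;                    // end of file reached
                }
            }
        }
    }
}
[/code]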
I agree with everything, including that I'm not inventing anything new by using binary. But there's one thing I hope you will elaborate on, and one correction to make. See below:

Q: What kind of analysis? I confess I'm clueless about any use of tick data other than historical testing, and that is most excellently done in binary format.

Q: You "sound" as though this is something you have actually done. And yet, being an RDBMS expert myself and having tried to load tick data into several, I found it impractical to load one tick per row. Most databases have trouble after a few million rows of data. In this case, for one symbol we're talking 100 million rows (when storing every single change of the DOM) per year. I worked at one company with THE largest Oracle installation in the country, and they had to move to Teradata to handle that kind of volume. Forget about Berkeley DB, MySQL, etc. So you "sound" like you've done this, but I can't see how that's possible. So please elaborate.

Oh really? Not critical? Did you catch the number of ticks? Let's say it only takes you 10 milliseconds to process each tick. Let's do the math: on 10 million ticks, that will take 27+ hours to process (the arithmetic is spelled out below). I beg to differ; it seems just the opposite. Real-time processing is far easier. That's why so many platforms do that but fail at historical testing of ticks. It's MUCH more important to have the fastest possible speed during offline testing. TickZOOM handles ticks at 1 microsecond per tick, which means in real time it can handle 500,000 to 1 million ticks per second. No exchange will generate that many ticks per second even if you're tracking dozens of symbols. So it's the offline analysis where speed is most critical.

Certainly. Agreed. That won't be an issue.

Again, people keep focusing on the wrong problem. Remember, it takes only 3 seconds for TickZOOM to load an entire 100MB file, but it takes the CPU 40 seconds to process that file through the engine. So what happens when you add another symbol? The CPU will now take 80 seconds while loading takes only 10 seconds. In other words, until we get massively parallel processing (and even then), disk still won't be the bottleneck. The disk will continue to be faster than the processing. That's due to limitations in parallel processing of time series data (it has dependencies).

Okay, but we're still working on the wrong problem. The bottleneck here is the CPU. Decompressing, therefore, will make it slower, not faster, because it takes more CPU to perform the decompression while loading.

Thanks for your comments. And I'm sure you're much smarter and more skilled at databases than I am; I mean that sincerely. However, this is not a database problem. It's a CPU problem. And, obviously, I do okay in that area (but I can do better).
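For the record, the arithmetic behind those two per-tick figures, spelled out as a runnable snippet (the 10 ms and 1 microsecond times are the ones quoted above):

[code]
// 10 ms/tick over 10 million ticks versus 1 microsecond/tick.
class TickMath {
    public static void main(String[] args) {
        long ticks = 10_000_000L;

        double slowSeconds = ticks * 0.010;          // 10 ms/tick -> 100,000 s
        System.out.printf("10 ms/tick: %.1f hours%n", slowSeconds / 3600);  // ~27.8 hours

        double fastSeconds = ticks * 0.000001;       // 1 us/tick -> 10 s
        System.out.printf("1 us/tick: %.0f seconds (%,d ticks/s)%n",
                fastSeconds, (long) (1 / 0.000001)); // 1,000,000 ticks/s
    }
}
[/code]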
Folks, please keep the ideas and suggestions coming. However, also realize that most of us programmers have worked on database systems and have database expertise. So what's happening here is "we have a hammer, so everything looks like a nail." This is by no means a database, RDBMS, or OODBMS problem. The problem here is CPU. CPU. Remember your performance tuning training in school, the three potential bottlenecks: memory, CPU, and I/O? We're all so used to I/O being the bottleneck because we use RDBMSs and OODBMSs all the time, and that is necessary for referential integrity, fast searches, and other reasons. But how does any of that apply here? It doesn't.

Now... many want TickZOOM to become the de facto standard for open source trading systems. That will never happen if we keep focusing on non-problems. Your help and ideas will be greatly appreciated on this project, but let's focus on the real issues. How about broker interfaces? TickZOOM only integrates with MB Trading right now. How about adding indicators? Or a better optimizing algorithm? Or walk-forward optimization? Frankly, be my guest, but you'll be wasting any time you spend on a better data storage solution. Sincerely, Wayne
IB has more clients on this board than any other broker. You will get more feedback and participation if IB is connected. Furthermore, if you want to reach critical mass quickly, IB is your ticket.