Man....Go is FAST!

Discussion in 'App Development' started by fan27, Dec 17, 2016.

  1. fan27

    Very good points, bjohnson777! For my website, algorithmic processing speed and memory footprint are certainly concerns, but the main bottleneck/cost factor will be the data. I am looking at AWS, and their DynamoDB looks interesting but could be quite pricey. Another option would be to store ticker data in flat files in CSV format and keep a copy in an in-memory cache server. I have not thought all of this through, but you are right, planning this out is key!!!
     
    #21     Dec 27, 2016
  2. Properly stored local data will be far faster than any DB. Local disk space is far cheaper than a database. I've been doing some personal programming for SierraChart that I've open-sourced. Since you're likely to still have space constraints, I'd recommend converting to a smaller binary format. It will also load faster than converting text to floats and ints from a CSV. I have source code here as an example of the SCID format:
    https://www.sierrachart.com/SupportBoard.php?ThreadID=18894

    SCID uses a double for the date stamp, the same format used in spreadsheets. That's probably a bit slow to convert, so changing that to split ints for the date and time would be better in your case. That's what I use in my own scratch-pad C++ Linux program that I do various tests in.
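
    Since you're writing your site in Go, here's a rough sketch of what I mean. The Bar layout below is made up for illustration (it is not the SCID layout, and your fields will differ); the point is that fixed-size records go to and from disk with no text parsing:

    package main

    import (
        "encoding/binary"
        "fmt"
        "os"
    )

    // Bar is a hypothetical fixed-size record: split ints for the date and
    // time instead of a spreadsheet-style double, plus OHLCV fields.
    type Bar struct {
        Date   uint32 // YYYYMMDD
        Time   uint32 // HHMMSS
        Open   float32
        High   float32
        Low    float32
        Close  float32
        Volume uint32
    }

    func main() {
        bars := []Bar{
            {20161229, 93000, 2265.25, 2266.00, 2264.75, 2265.50, 1200},
            {20161229, 93100, 2265.50, 2265.75, 2264.50, 2264.75, 950},
        }

        // Write the records in little-endian binary, packed back to back.
        f, err := os.Create("ES_20161229.bin")
        if err != nil {
            panic(err)
        }
        if err := binary.Write(f, binary.LittleEndian, bars); err != nil {
            panic(err)
        }
        f.Close()

        // Read them back: no text parsing, just fixed-position loads.
        in, err := os.Open("ES_20161229.bin")
        if err != nil {
            panic(err)
        }
        defer in.Close()
        loaded := make([]Bar, len(bars))
        if err := binary.Read(in, binary.LittleEndian, loaded); err != nil {
            panic(err)
        }
        fmt.Println(loaded[0])
    }

    (encoding/binary goes through reflection; a hand-rolled decoder over a byte slice would be faster still, but even this skips all the text conversion.)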

    If space is really an issue and you have more CPU time than disk space, you can gzip (faster, less compression) or xz (much slower, higher compression) the data files to make them much smaller. If you need to compress data, you'd probably want to rewrite your web interface as a queue system, where someone enters test parameters and waits a few minutes for the report to be generated. I ran into that approach recently with a genetic-testing web site.
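
    In Go the compression layer just slots in as another reader/writer. A minimal sketch, reusing the made-up Bar record from above (gzip only, since xz isn't in the standard library; error handling trimmed):

    package main

    import (
        "compress/gzip"
        "encoding/binary"
        "fmt"
        "os"
    )

    // Bar is the same made-up fixed-size record as in the earlier sketch.
    type Bar struct {
        Date, Time             uint32
        Open, High, Low, Close float32
        Volume                 uint32
    }

    func main() {
        // Write through a gzip writer: smaller on disk, more CPU to read back.
        out, _ := os.Create("ES_20161229.bin.gz")
        zw := gzip.NewWriter(out)
        binary.Write(zw, binary.LittleEndian, []Bar{{Date: 20161229, Time: 93000, Close: 2265.5}})
        zw.Close()
        out.Close()

        // Reading decompresses on the fly; the decoding code itself is unchanged.
        in, _ := os.Open("ES_20161229.bin.gz")
        defer in.Close()
        zr, err := gzip.NewReader(in)
        if err != nil {
            panic(err)
        }
        defer zr.Close()
        var b Bar
        binary.Read(zr, binary.LittleEndian, &b)
        fmt.Println(b)
    }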

    Getting back to DBs, technically a file system is a database of files. Keep your intraday data files in individual days or weeks (larger preferred). Keep your daily files in months or years (larger preferred). Name each file something that's easily parsed by a program. Most modern file systems have 4k blocks. Having data files much smaller than that slows down read efficiency, which is why I recommend the larger grouping I just mentioned. It's probably faster for your program to skim through the data to a specific date than to do multiple file system read operations looking for a file of that date... at least up to a point. This is one of the bottlenecks you'll need to look out for.

    Also make use of directories to group your data files accordingly. You want something that's easily traversable by a program, but at the same time, not too deep and not too many files per directory. That will help optimize the OS file system calls.
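
    Something along these lines, for example. The symbol/year/ISO-week layout is just one hypothetical scheme; pick whatever grouping matches your data volume:

    package main

    import (
        "fmt"
        "path/filepath"
        "time"
    )

    // intradayPath builds a hypothetical layout with one file per ISO week,
    // grouped by symbol and year, so a program can walk it predictably,
    // e.g. data/intraday/ES/2016/ES_2016-W52.bin
    func intradayPath(root, symbol string, t time.Time) string {
        year, week := t.ISOWeek()
        name := fmt.Sprintf("%s_%d-W%02d.bin", symbol, year, week)
        return filepath.Join(root, "intraday", symbol, fmt.Sprintf("%d", year), name)
    }

    func main() {
        t := time.Date(2016, time.December, 29, 0, 0, 0, 0, time.UTC)
        fmt.Println(intradayPath("data", "ES", t)) // data/intraday/ES/2016/ES_2016-W52.bin
    }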

    If you've got the RAM, the Unix variants offer a RAM disk option. It is FAST. And when I say fast, I really mean FAST. It won't be appropriate for your larger data files, but it will be for the smaller stuff you access often (mainly config, potential job queues, and temp output holding while building a report). It will also help with disk longevity by not pounding the drive with lots of small reads and writes (really bad for SSDs). Of course RAM is volatile and will need to be backed up periodically. A looping shell script to copy or tgz to disk every X minutes is trivial to write. The counter-argument is to just depend on file system buffering. Depending on flush times and available buffer memory, that may not always behave as expected. If the kernel decides to commit a write, that's a relatively expensive operation compared to just holding the temp files in a RAM disk and committing a write once they're properly finished.

    All this should give you plenty to think about for your future design. Like I mentioned before, you don't have to implement everything at once. Just be aware of where you're going and the potential bottlenecks.
     
    #22     Dec 28, 2016
    userque and fan27 like this.
  3. Zzzz1

    I went through all this years ago and came to the conclusion that file I/O and the efficiency of reading data structures (unless one really messes it up, such as reading time-series data through SQL or other relational DB structures) is the last thing one should look to optimize. I tweaked my binary database to read time-series data at insane speeds (close to the file-I/O cap, on then-top-notch hardware and technology). It turned out to be the last thing I had to worry about. What really slows down the processing of historical data is the actual strategies in back-testing, for example. The algorithms inside those strategies are several orders of magnitude slower than the data stream that feeds them with time-series data.

     
    #23     Dec 28, 2016
    fan27 likes this.
  4. @Zzzz1: Which language were you using?

    There is a massive difference between loading raw data off a disk and getting it from a database. Since you seem to know so much, why don't you take us through the internals of each process? ...all the way down to the kernel level...

    And we're talking servers here, not workstations. Servers will always have a higher load, so don't forget to include that in your explanation.
     
    #24     Dec 29, 2016
  5. Zzzz1

    I am not sure why you appear so agitated. This is not even an issue of language choice, much less of kernels. Any modern language today can read binary data from disk and deserialize it at a rate of several million data points per second. When you run even a single strategy that uses computationally intensive algorithms, your rate of processed data points can easily drop below one million per second. Hence my saying that the throughput of data imports is usually not the bottleneck. I believe the real bottleneck lies in unoptimized, inefficient algorithms in the code that consumes the data.

    Your points about servers, desktops, workstations, and choice of hardware don't change what I said above. Most commodity hardware suffices to load data faster than it can be consumed, regardless of language.

    P.S.: I agree with the points you made for the case where you iterate an empty consumer over your imported data structures. For bragging rights, everything you said is most certainly valid. But that is not reflective of real-world use cases, where you load data for the purpose of streaming it to a consumer that uses it as the input to its algorithms.

    Regarding your raw-data import vs. DB import, I do not see your point. A DB either accesses files on disk or accesses memory; that only determines the isolated data-loading step. But you are not done after loading a bunch of binary data: you still need to deserialize it and feed the resulting data structures into algorithms, and those algorithms are far more computationally intensive than the load itself, whether it comes from a columnar database (in memory or file based) or from raw data files.
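
    If you want to see where the time actually goes, time the two stages separately. A rough sketch in Go, since that is what fan27 is using (the Bar layout and the "strategy" are made up, and encoding/binary decodes via reflection, so a hand-rolled decoder would be faster). Run it and the numbers on your own hardware tell you which stage dominates:

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "time"
    )

    // Bar is a made-up fixed-size record.
    type Bar struct {
        Date, Time             uint32
        Open, High, Low, Close float32
        Volume                 uint32
    }

    // toyStrategy stands in for real strategy code: a 20-bar rolling mean
    // with a signal count. Real strategies do far more work per bar.
    func toyStrategy(bars []Bar) int {
        signals := 0
        var sum float64
        for i, b := range bars {
            sum += float64(b.Close)
            if i >= 20 {
                sum -= float64(bars[i-20].Close)
                if float64(b.Close) > sum/20 {
                    signals++
                }
            }
        }
        return signals
    }

    func main() {
        const n = 1_000_000
        bars := make([]Bar, n)
        for i := range bars {
            bars[i] = Bar{Date: 20161229, Time: uint32(i), Close: float32(2265 + i%10)}
        }

        // Serialize once so the decode step can be timed on its own.
        var buf bytes.Buffer
        binary.Write(&buf, binary.LittleEndian, bars)
        raw := buf.Bytes()

        start := time.Now()
        decoded := make([]Bar, n)
        binary.Read(bytes.NewReader(raw), binary.LittleEndian, decoded)
        fmt.Println("decode:  ", time.Since(start))

        start = time.Now()
        toyStrategy(decoded)
        fmt.Println("strategy:", time.Since(start))
    }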

     
    Last edited: Dec 29, 2016
    #25     Dec 29, 2016
  6. So you don't know. You're trying to present yourself like you have real-world server experience when you don't. You deflect to other issues. I see this kind of answer all the time. Language makes a huge difference; I already noted that with Python in a previous post. You seemed to agree.

    Let's go over the rest...

    Why are you indicating that disk speeds are so blindingly fast? They may be able to burst data from their cache through the south bridge, to the north bridge, and into a DMA address somewhere in RAM, but that is rarely sustainable beyond a burst. The better SSDs can probably do 300-500 MB/s in a quick burst. Sustained will be a lot lower. That is hardly keeping up with the processor. For the sake of argument, let's stick with SSDs, since rotational platter-based disks are much slower... but they're still in use for most data-storage back ends, so they shouldn't be totally ignored.

    A semi-current processor will be 3 GHz or faster. Let's stick with the 3 GHz number for easy math. That's 3 billion clock ticks per second. The actual instruction throughput will be even higher, depending on instructions per cycle (IPC) and optimizations. I don't have current numbers, but L2 cache should be over 30 GB/s and RAM speed should be over 5 GB/s. I'm willing to admit that memtest numbers can be a little off, but the point is that internal CPU speeds are hugely faster than any disk. The CPU waiting for a file to load can add up to billions of clock ticks over the total length of the file. On a 3 GHz processor, a 20 ms delay means the CPU has to wait for 60 million clock ticks.

    Some might argue to just run multiple instances of the program so that other instances fill in the gaps during kernel waits. That only works up to a point. If the data set is large, a lot of instances will be fighting for the disk bandwidth. This is part of the reason I recommended converting CSV to a smaller binary format. It is also why, in some cases, compressed files may load faster than larger uncompressed ones: they get into system RAM faster, which offsets the extra CPU cycles needed to uncompress them. But as I hinted above, sometimes this works, many times not.

    With that being said, anyone who thinks getting their data over a network connection (either DB or a shared file system) is fast needs their head examined. This hasn't been mentioned in this thread yet, but I know someone will be thinking about it. Gigabit NICs are fast, but they will never compete with a built-in disk controller. Even if the remote system the data is being fetched from is equally fast, it will still have to go through everything I've stated so far PLUS the networking overhead. If the server isn't being heavily used, this isn't much of a problem. If it is heavily used, it is.

    Going back to my kernel statements, the file system kernel driver will be responsible for multiple reads (and maybe a few status update writes) from various locations on the disk to find the disk data and get it into system RAM. Remember these will go back and forth through the motherboard bridges. If the disk data isn't too fragmented, this is relatively fast in disk terms (not CPU terms). The kernel file system driver will execute in some kind of loop that will add up to thousands of lines of code being executed (not including the necessary wait times for the disk to catch up). These add up.

    Back to databases. The better databases (usually the more expensive ones) will bypass the file system totally and write to a raw partition directly. This eliminates the file system overhead and potentially some file fragmentation issues. The rest of the databases will use the file system like any other file. They will issue a file seek, find the data, read through multiple tables scattered about, start assembling the data, read more, repeat the assembly, and then return the requested data to the program, usually through some kind of socket. Depending on the database server programming and data design, there might also be some temporary files and queue files scattered about. This has a HUGE overhead, especially if the data being fetched is hundreds of thousands of little OHLC bars. If there are multiple requests being made to the database server (like from dozens of the same program being run or other paying clients), there will be a lot of disk thrashing (minimized with SSDs).

    From your point of view, you've got a database on a workstation that's minimally used. You're not running multiple instances of the same program. It will look deceptively fast. Don't go telling others that they can scale this up while ignoring real issues.

    One thing you are quite right about is code efficiency when processing large amounts of data. I asked about your programming language and got ignored and deflected. Going back to my Python example, being 30-50x heavier than C/C++ makes a huge difference in language selection. Where people get deceived into thinking that other (generally interpreted) languages are fast is that they tend to return results faster than the user expects. There's nothing wrong with that on single-use individual workstations. I do that on mine all the time. What I keep indicating is that this mentality will kill a heavily used server, for the reasons already mentioned. If your code runs fast on your system, great! But don't mislead others when it comes to server scaling.

    You've also mentioned that loading data from CSV files is fast enough. For a single workstation, yes. I do this with my scratch-pad program from a RAM disk. For light server use, you can get away with it. For heavy server use, NO! Why? Break it down. 1) CSV files are larger than their binary counterparts. See above about disk bandwidth. 2) Every CSV field has to be parsed. You may write one line of code to do this, but it expands to a dozen or so processor instructions. 3) Every CSV value, once parsed, has to be converted back into an integer or float. That will take another dozen or so processor instructions. Compare that to a binary load. The data block has everything in fixed positions. There is no data conversion, just variable loading. These operations take a few processor instructions per variable. That is a huge difference. Parsing hundreds of thousands of bars translates to millions of variables. On small data sets, this doesn't add up to much. On large data sets, it does. This is simple math. If you're using Python, multiply the CSV overhead by 30 (or maybe more).
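
    To make that concrete in Go (a made-up 7-field bar layout; error handling kept minimal), here are the two code paths side by side. Every Split and strconv call on the CSV side is work the binary path simply never does:

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "strconv"
        "strings"
    )

    // Bar is the same hypothetical fixed-size record used earlier.
    type Bar struct {
        Date, Time             uint32
        Open, High, Low, Close float32
        Volume                 uint32
    }

    // parseCSVBar does what every CSV row costs: split the text, then
    // convert each field back into an integer or float.
    func parseCSVBar(line string) (Bar, error) {
        f := strings.Split(line, ",")
        if len(f) != 7 {
            return Bar{}, fmt.Errorf("expected 7 fields, got %d", len(f))
        }
        date, err := strconv.ParseUint(f[0], 10, 32)
        if err != nil {
            return Bar{}, err
        }
        tm, err := strconv.ParseUint(f[1], 10, 32)
        if err != nil {
            return Bar{}, err
        }
        var px [4]float64
        for i := 0; i < 4; i++ {
            px[i], err = strconv.ParseFloat(f[2+i], 32)
            if err != nil {
                return Bar{}, err
            }
        }
        vol, err := strconv.ParseUint(f[6], 10, 32)
        if err != nil {
            return Bar{}, err
        }
        return Bar{uint32(date), uint32(tm), float32(px[0]), float32(px[1]),
            float32(px[2]), float32(px[3]), uint32(vol)}, nil
    }

    func main() {
        // Text path: split plus a numeric conversion per field.
        b1, err := parseCSVBar("20161229,93000,2265.25,2266.00,2264.75,2265.50,1200")
        if err != nil {
            panic(err)
        }

        // Binary path: the record is already in its final layout; just copy it in.
        var buf bytes.Buffer
        binary.Write(&buf, binary.LittleEndian, b1)
        var b2 Bar
        binary.Read(&buf, binary.LittleEndian, &b2)

        fmt.Println(b1 == b2) // true: same bar, far fewer instructions on the binary path
    }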

    Everything I've mentioned so far falls under the category of "Performance and Tuning". That topic is also much larger than what I've mentioned here.

    I keep telling fan27 to keep these points in mind when programming because if his site really takes off, he will need his server(s) to perform maximally. Inefficient programs in inefficient languages will mean more servers/cloud time will have to be purchased, and costs will be unnecessarily high.

    On a personal note, it also bugs me when I'm browsing the web and hit some poorly scripted site that I need information from, but it's cratering under its own weight. As a business, it's bad to piss off your customers like that, especially if they're paying for access.

    So... fan27, keep what I've said in mind. If your site takes off, you'll be running into these issues in the future. Plan ahead and you'll do great. Ignore them and you'll run into some nasty problems just like others have.

    Zzzz1: I don't really mean to bash you, but you're the one who stood out the most. There's a lot more to admin than what's on the surface. To be fair, I had more programming experience than admin experience when I started taking over servers many years ago. I had some rough times to get through, but the programming experience was useful for watching server execution and figuring out where the processes were slowing down and hanging. Once I had that experience, I could tell clients why they were having performance problems and what needed to be done to fix them. They liked that, compared to the useless answer of "just buy more expensive hardware".


     
    #26     Dec 29, 2016
  7. Zzzz1

    You either don't want to get my point or can't, because you apparently have zero experience with trading-system and backtesting architectures. What you wrote above may all be true or not; I maintain, however, that it is completely irrelevant here.

    You seem to want to ignore that his application's purpose is not to load data as fast as possible. His overall goal is to do something with that data: he wants to feed trading-strategy algorithms with it. No matter how fast you get the data into your system, it will just sit there, unused and waiting to be processed, because the bottleneck is most often the strategy algorithms. In this specific context it is irrelevant whether you can read raw data at 5 million records/second or 15 million records/second when your strategy code can only process 2 million records/second.

    Is that now clearer for you?

     
    #27     Dec 29, 2016
  8. fan27

    @bjohnson777, @Zzzz1, I think you both have some great points.

    If I may attempt to summarize:

    Zzzz1: Avoid premature data optimizations.

    bjohnson777: Architect your system so that it can scale.

    My goal is to do both. For example, my marketdata package currently reads from CSV, but it is written in such a way that the public method that does the reading takes a "ReadTickerData" interface as a parameter. The value passed in is a struct that implements that interface by reading from CSV, but I anticipate that may need to change in the future, at which point all I will have to do is write a new struct that implements the interface and reads the ticker data from whatever the source may be (in-memory cache, NoSQL DB, compressed data on disk, etc.). Also, the site will have a micro-service architecture, so if I determine that some of the number crunching needs to be in a lower-level language such as C++, that can be done in its own service without having to rewrite the entire application.
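
    Roughly like this, as a sketch (the type, method, and field names below are illustrative, not my actual code, and it assumes a simple date,open,close CSV layout):

    package marketdata

    import (
        "encoding/csv"
        "io"
        "strconv"
    )

    // Bar is a stand-in record; the real fields don't matter for the shape of the design.
    type Bar struct {
        Date        string
        Open, Close float64
    }

    // ReadTickerData is the interface the public reading method accepts.
    // The method name here is made up for illustration.
    type ReadTickerData interface {
        Read(symbol string) ([]Bar, error)
    }

    // CSVTickerData is the current implementation. A future struct backed by an
    // in-memory cache, a NoSQL DB, or compressed binary files would satisfy the
    // same interface without touching the callers.
    type CSVTickerData struct {
        Source io.Reader // e.g. an opened CSV file
    }

    func (c CSVTickerData) Read(symbol string) ([]Bar, error) {
        rows, err := csv.NewReader(c.Source).ReadAll()
        if err != nil {
            return nil, err
        }
        bars := make([]Bar, 0, len(rows))
        for _, r := range rows {
            open, err := strconv.ParseFloat(r[1], 64)
            if err != nil {
                return nil, err
            }
            cls, err := strconv.ParseFloat(r[2], 64)
            if err != nil {
                return nil, err
            }
            bars = append(bars, Bar{Date: r[0], Open: open, Close: cls})
        }
        return bars, nil
    }

    // LoadTickerData is the public entry point: it only knows about the interface,
    // never about the concrete data source.
    func LoadTickerData(r ReadTickerData, symbol string) ([]Bar, error) {
        return r.Read(symbol)
    }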

    Anyhow, good points gents!

    fan27
     
    #28     Dec 29, 2016
  9. Zzzz1

    Makes sense. Always optimize after you figure out there is a need for optimization. Don't waste time on needless things when you can spend it on essentials.

    Can you confirm that your data import need is to stream to strategies for backtest purposes?

     
    #29     Dec 29, 2016
  10. fan27

    Yes....that is correct. Data being loaded is for backtesting.
     
    #30     Dec 29, 2016