Tick data storage

Discussion in 'Data Sets and Feeds' started by 931, Jun 29, 2020.

  1. 931

    Please.

    How do I make that person angry to the point where he attacks my ideas and arguments, so that I get better info and learn more?
     
    #11     Jun 29, 2020
  2. I was slapping my own head because of my carelessness. Not angry at you!

    GAT
     
    #12     Jun 29, 2020
  3. 931

    "as well as obviously disk usage" depends on reader perspective if its taken as lower or higher.
    Lower disk usage is benefit.
     
    #13     Jun 29, 2020
  4. 931

    The "text" based non-standard text format is for long term storage reliability and only used for initially loading data in. Atm it gets stored on slower drives with 550MB/s read.

    It is kept as simple and foolproof as possible.
    It also protects me in case I implement some more complex float compression algorithm, convert all the data, and delete the original readable data,
    only to then find out there was some mistake and a big portion of the data is flawed: 6+ months of downloading again....

    Also, in more compressed formats, if some bits get corrupted, maybe 10,000x more data is lost than the bits that were actually corrupted.
    If some "key" bits get corrupted, probably everything is lost or highly obfuscated.

    What I am looking for is available components for building a multithreaded in-memory database that buffers float-format data from/to disk and from/to RAM at the maximum speed a reasonable amount of effort can achieve.

    I would hope to find a full package that is close to my specifications and open source, but I don't think something like this is readily available as a lightweight library.

    If better solutions are available, I'll rethink my plan.
    In the end, the purpose is feeding data to other algos working on other threads with decent speed and efficiency.
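
    Roughly the kind of interface I imagine the other threads calling, as a minimal C++ sketch (all names here are my own placeholders, not from any existing library):

        // Minimal sketch of a thread-safe block cache other threads can call.
        // Block, TimeRange and loadFromDisk are placeholders, not a real library.
        #include <cstdint>
        #include <memory>
        #include <shared_mutex>
        #include <string>
        #include <unordered_map>
        #include <vector>

        struct TimeRange { int64_t begin_ns; int64_t end_ns; };

        struct Block {
            TimeRange range;
            std::vector<float> ticks;   // plain floats, ready for calculations
        };

        class BlockCache {
        public:
            // Worker threads request a symbol/period; the shared_ptr pins the
            // block for as long as they hold it, so it cannot be evicted mid-use.
            std::shared_ptr<const Block> request(const std::string& symbol, TimeRange r) {
                const std::string k = key(symbol, r);
                {
                    std::shared_lock lock(mutex_);           // many readers in parallel
                    auto it = blocks_.find(k);
                    if (it != blocks_.end()) return it->second;
                }
                std::unique_lock lock(mutex_);               // exclusive lock for inserts
                auto it = blocks_.find(k);                   // another thread may have won
                if (it != blocks_.end()) return it->second;
                auto block = loadFromDisk(symbol, r);
                blocks_[k] = block;
                return block;
            }

        private:
            static std::string key(const std::string& symbol, TimeRange r) {
                return symbol + "/" + std::to_string(r.begin_ns);
            }
            std::shared_ptr<const Block> loadFromDisk(const std::string&, TimeRange r) {
                return std::make_shared<Block>(Block{r, {}});   // stub: parse "text" file
            }                                                   // or read the M.2 cache

            std::shared_mutex mutex_;
            std::unordered_map<std::string, std::shared_ptr<const Block>> blocks_;
        };

    The shared_ptr keeps a block pinned while a worker thread uses it, so the cache cannot free it mid-calculation; that is the "lockable / requestable by other threads" behaviour from the specs further down.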

    So far I haven't found what I'm searching for in database software.
    With hardware it's much easier to find the best readily available parts and tweak them.

    Large enough fast storage drives are available, and larger SSDs tend to have higher read speeds (but M.2 slots on motherboards are limited).

    With RAM sticks it's a different story: you cannot take a 512 GB RAM kit and overclock it much, compared to smaller memory sets with more headroom.

    So far I have gotten RAM read speed up to 80 GB/s by overclocking the CPU cache and tweaking RAM timings.

    Max SSD read speed should be ~5 GB/s once a newer M.2 drive gets installed.


    Current software specs:

    Support for mid-file time-series data loading from a compressed text-like format, at 2x disk speed or better compared to the commonly used ASCII format.

    Jumping through a file and finding the correct timestamps should be fast (HDD excluded, SSD chosen).
    A later version may generate separate files that hold timestamp-position info to reduce seeking (an indexed method should find positions faster); see the index sketch below, after the specs.

    Data blocks are lockable, usable and requestable by other threads.
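
    The separate index files could be as simple as sorted (timestamp, byte offset) pairs that get binary-searched before seeking. A sketch; the format is my own invention, not a standard:

        // Sketch of the separate index-file idea: sorted (timestamp, byte offset)
        // pairs, one per block. The format is my own invention, not a standard.
        #include <algorithm>
        #include <cstdint>
        #include <iterator>
        #include <vector>

        struct IndexEntry {
            int64_t timestamp_ns;   // first tick time in the block
            int64_t file_offset;    // byte position where that block starts
        };

        // Binary search: byte offset of the block that contains time t.
        int64_t offsetFor(const std::vector<IndexEntry>& index, int64_t t) {
            if (index.empty()) return 0;
            auto it = std::upper_bound(index.begin(), index.end(), t,
                [](int64_t value, const IndexEntry& e) { return value < e.timestamp_ns; });
            if (it == index.begin()) return index.front().file_offset;
            return std::prev(it)->file_offset;   // last block starting at or before t
        }

        // Usage: load the whole index once (it stays small), then seek straight
        // to offsetFor(index, queryTime) in the data file instead of scanning.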

    Working principles:

    On opening the app, the data on the M.2 drives gets evaluated and the valid data left from previous runs is built into the in-RAM database (in other words, a map of where to get what).
    Also, free any data blocks on disk if the app crashed before and left half-broken junk.
    (It would be nice to have this be a quick 100-500 ms loading operation, as I debug a lot and open/close/crash.)
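
    For the startup scan I imagine something like this: each block file on the M.2 drive ends with a small footer (magic marker + payload size) that is written last, so anything left half-written by a crash fails a cheap check and gets deleted. Sketch only; the directory layout and footer format are assumptions:

        // Startup scan sketch: walk the flat cache directory on the M.2 drive and
        // keep only blocks that were completely written. Each block file is assumed
        // to end with a footer (magic marker + payload size) written last, so a
        // crash mid-write leaves a file that fails this cheap check and is deleted.
        #include <cstdint>
        #include <filesystem>
        #include <fstream>
        #include <string>
        #include <unordered_map>

        namespace fs = std::filesystem;

        constexpr uint64_t kMagic = 0x54494B424C4F434Bull;   // "TIKBLOCK", made-up marker

        struct Footer { uint64_t payload_bytes; uint64_t magic; };

        // Returns the "where to get what" map: block name -> path of a valid file.
        std::unordered_map<std::string, fs::path> scanCache(const fs::path& dir) {
            std::unordered_map<std::string, fs::path> valid;
            for (const auto& entry : fs::directory_iterator(dir)) {
                const auto size = fs::file_size(entry.path());
                Footer f{};
                bool ok = size >= sizeof(Footer);
                if (ok) {
                    std::ifstream in(entry.path(), std::ios::binary);
                    in.seekg(static_cast<std::streamoff>(size - sizeof(Footer)));
                    in.read(reinterpret_cast<char*>(&f), sizeof(Footer));
                    ok = in.good() && f.magic == kMagic &&
                         f.payload_bytes + sizeof(Footer) == size;
                }
                if (ok) valid[entry.path().filename().string()] = entry.path();
                else    fs::remove(entry.path());            // crash leftover, free it
            }
            return valid;
        }

    Since only file sizes and footers are read, the scan should stay within the quick-startup budget even with many blocks.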


    During runtime, if some part of the software requests some data period for some symbol, it gets loaded from the "text" file, or, if it was previously saved to the M.2 drive, from there as "compressed float" at higher speed. (The plan is to use large enough M.2 drives to store all data that has been used at least once.)

    Loaded data is stored in RAM in regular float format until 95%+ of memory gets filled...

    When memory is full, the least-used blocks get compressed from RAM to RAM (in whatever compressed format), but they should come back in float format for calculations.

    If compressed data in RAM grows beyond some parameter (let's say 50% RAM usage), then in the next stage, instead of compressing RAM to RAM, the blocks evaluated statistically as least used get saved to disk.
    Either in the same format, or in whatever compression format could benefit from the reduced bandwidth compared to in-memory compression.

    Data that gets used most remains in memory, uncompressed.
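
    Written out as a rough sketch, that eviction ladder could look like this (the compress/write calls are placeholders; the 95% and 50% figures are the same thresholds as above, and 64 GB of RAM is just an example):

        // Rough sketch of the eviction ladder: hot blocks stay as plain floats in
        // RAM, the least-used blocks get compressed RAM-to-RAM, then spill to the
        // M.2 drive once compressed data itself passes its limit. The compress()
        // and writeToDisk() bodies are placeholders.
        #include <cstddef>
        #include <list>
        #include <vector>

        struct CachedBlock {
            std::vector<float> floats;        // uncompressed, ready for calculations
            std::vector<char>  compressed;    // filled when evicted from the hot tier
            bool onDisk = false;
        };

        struct CachePolicy {
            double ramFullAt       = 0.95;    // start evicting at 95% RAM usage
            double compressedLimit = 0.50;    // max 50% of RAM for compressed blocks
        };

        class TieredCache {
        public:
            void maybeEvict() {
                while (floatBytes_ > policy_.ramFullAt * ramBytes_ && !lru_.empty()) {
                    CachedBlock* victim = lru_.back();       // statistically least used
                    lru_.pop_back();
                    if (compressedBytes_ < policy_.compressedLimit * ramBytes_) {
                        victim->compressed = compress(victim->floats);   // RAM -> RAM
                        compressedBytes_ += victim->compressed.size();
                    } else {
                        writeToDisk(*victim);                            // RAM -> M.2
                        victim->onDisk = true;
                    }
                    floatBytes_ -= victim->floats.size() * sizeof(float);
                    victim->floats.clear();                  // drop the float copy
                    victim->floats.shrink_to_fit();
                }
            }

        private:
            // Placeholders: a real version would use an actual codec and async I/O.
            std::vector<char> compress(const std::vector<float>& f) {
                const char* p = reinterpret_cast<const char*>(f.data());
                return std::vector<char>(p, p + f.size() * sizeof(float));
            }
            void writeToDisk(const CachedBlock&) { /* pwrite into the big cache file */ }

            CachePolicy policy_;
            std::size_t ramBytes_        = 64ull << 30;  // assume 64 GB installed (example)
            std::size_t floatBytes_      = 0;            // bytes held as plain floats
            std::size_t compressedBytes_ = 0;            // bytes held RAM-compressed
            std::list<CachedBlock*> lru_;                // back = least recently used
        };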


    I plan to add a performance throttling method that gives priority to other threads and only takes higher priority if buffers start to run out (to achieve max speed for the other threads working on the data), plus auto-optimization for speed (to auto-configure various parameters).
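
    The throttling could be as simple as the prefetch/maintenance thread backing off while buffers are comfortably full and only working flat out when they run low. Sketch; the threshold and the refill function are made up:

        // Throttling sketch: the prefetch thread backs off while buffers are
        // comfortably full so the compute threads get the CPU, and only works
        // flat out when buffered data runs low.
        #include <atomic>
        #include <chrono>
        #include <thread>

        std::atomic<double> bufferFill{1.0};   // fraction of buffers filled, set by the cache
        std::atomic<bool>   running{true};

        void refillOneBlock() { /* placeholder: read + decompress the next block */ }

        void prefetchLoop() {
            using namespace std::chrono_literals;
            while (running.load()) {
                if (bufferFill.load() > 0.5) {
                    std::this_thread::sleep_for(10ms);   // plenty buffered: stay out of the way
                } else {
                    refillOneBlock();                    // running low: take priority
                }
            }
        }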


    One idea was making a single file that uses the full size of the disk and handling all buffered data in that one file, kept constantly open by the process.

    I'm not sure how big the speed benefit of keeping a single file on disk would be, but there is no open/close file bottleneck.
    Directing all the bits/bytes on disk myself might seem like overcomplication, but with a simple concept it's not too bad.
    Could the speed benefit be worth the effort on Linux and Windows if loading small chunks?
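
    On Linux the "one big file, always open" idea maps almost directly onto pread()/pwrite(), which take their own offsets, so there is no per-request open/close and no shared seek position for threads to fight over (on Windows the equivalent would be ReadFile/WriteFile with an OVERLAPPED offset). Sketch:

        // Linux sketch of the "one big file, always open" idea: open once, then
        // read or write any block by byte offset with pread()/pwrite(). No
        // open/close per request and no shared seek position between threads.
        #include <fcntl.h>
        #include <unistd.h>
        #include <cstddef>
        #include <cstdint>
        #include <vector>

        class BigCacheFile {
        public:
            explicit BigCacheFile(const char* path) {
                fd_ = ::open(path, O_RDWR);          // kept open for the process lifetime
            }
            ~BigCacheFile() { if (fd_ >= 0) ::close(fd_); }

            // pread() takes its own offset, so concurrent readers do not interfere.
            std::vector<char> readBlock(int64_t offset, std::size_t length) const {
                std::vector<char> buf(length);
                ssize_t n = ::pread(fd_, buf.data(), length, offset);
                buf.resize(n > 0 ? static_cast<std::size_t>(n) : 0);
                return buf;
            }

            void writeBlock(int64_t offset, const std::vector<char>& data) const {
                ::pwrite(fd_, data.data(), data.size(), offset);
            }

        private:
            int fd_ = -1;
        };

    I do not know yet how much this wins over many small files on either OS; it mainly removes the open/close overhead and the filesystem's per-file bookkeeping.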

    Another idea is to redesign things and keep most floats going to RAM in a different, <32-bit format, and decompress the float data just in time before usage to get a performance benefit.
    Stock data does not really use many decimals (usually 2, max 5-6), and there may be efficient methods for in-cache float compression, or even generation from a constant, a multiplier, and an integer format?

    Could this be beneficial for time-series market data, or will the CPU architecture be faster if floats are loaded as plain floats from RAM?
    It seems possible after some ideas involving vectorized operations, but I don't know how to calculate how many operations each approach would take, so I just plan to test...
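
    The constant + multiplier + integer idea could be as simple as storing each price as a 16-bit tick count relative to a base price and expanding back to floats just before use. Sketch, not benchmarked; the tick size and the 16-bit width are assumptions:

        // Constant + multiplier + integer: prices stored as 16-bit tick counts
        // relative to a base price. Tick size and the 16-bit width are assumptions;
        // overflow checking is left out of the sketch.
        #include <cmath>
        #include <cstddef>
        #include <cstdint>
        #include <vector>

        struct QuantizedSeries {
            float base;                    // constant: first price of the block
            float tick;                    // multiplier: e.g. 0.01f for 2-decimal prices
            std::vector<int16_t> steps;    // integer: (price - base) / tick, rounded
        };

        QuantizedSeries quantize(const std::vector<float>& prices, float tick) {
            QuantizedSeries q{prices.empty() ? 0.0f : prices[0], tick, {}};
            q.steps.reserve(prices.size());
            for (float p : prices)
                q.steps.push_back(static_cast<int16_t>(std::lround((p - q.base) / tick)));
            return q;                      // 2 bytes per price instead of 4
        }

        // Decompress just in time, right before the calculation needs the floats.
        // A simple loop like this tends to auto-vectorize well.
        std::vector<float> dequantize(const QuantizedSeries& q) {
            std::vector<float> out(q.steps.size());
            for (std::size_t i = 0; i < q.steps.size(); ++i)
                out[i] = q.base + q.tick * q.steps[i];
            return out;
        }

    Whether this beats plain floats depends on whether the halved memory traffic outweighs the extra multiply-add per value, which is exactly what I plan to test.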

    What ready-made library might save me from those plans?
     
    Last edited: Jun 29, 2020
    #14     Jun 29, 2020
  5. Farmas

    An excellent topic with useful information. For me, this question is very relevant right now. I also studied some information on cooling and types of hard drives here. I will share my implementation soon.
     
    #15     Apr 2, 2021
  6. 931

    Some M.2 drives run slightly faster when hot.
    I was thinking about cooling but found out it's better at a constant high temperature, unless it overheats and thermal throttling kicks in.
     
    #16     Apr 5, 2021