best way to hack up an rdbms to approximate something like kdb for poor/dumb people

Discussion in 'Automated Trading' started by garchbrooks, Mar 23, 2010.

  1. No shit. If you have half a fucking brain, you start out within a budget, with specs that allow expansion, and then add as you go.
    You don't read very well, do you? The guy says he's on a limited budget. I recommended a motherboard to him that can obviously expand, dumbass.

    I said to get 4GB for starters and THEN, IF IT IS TOO SLOW, to add 16GB along with MySQL Cluster 7 so that he can keep the bulk of the data in memory.

    Any fucking idiot who has built a historical quote application would know that you keep the data in-core. FUCK IO.
     
    #21     Mar 24, 2010
  2. garch, can you provide some more details on what you are trying to store? I am in no position to give a definitive answer, as I have only very recently started to put together something very similar to yours.

    Remember, this is experimental / proof-of-concept stage. The system (definitely qualifies as cheap / old):
    mysql 5 on ubuntu 64
    Core 2 CPU 6300 @ 1.86GHz (dual core)
    4GB RAM
    commodity SATA disks in raid1 (linux software raid)

    At present I am storing futures data only (well, I have only started getting data from here & there and loading it, changing the database schema as I go along - strictly a work in progress). I plan to store daily & 1-minute resolution data only. Tick data, if I store it in the future, would go in a) binary files with (possibly) links from within mysql, or b) binary fields within mysql itself. Perhaps a day's worth of ticks on one ticker in one file / row. I think trying to store one tick per row is simply not workable.

    I am also storing one symbol per table. _ESZ09_m, for example, stores all 1-minute data for ESZ09 from listing to expiry. About 100k+ rows per table. For stocks, my estimate is.. let's see.. IBM, 20 years of 1-minute data = 20*250*390 = 1,950,000 rows, ignoring non-RTH data - still not too bad. I can do a select * on that table in about 0.59 seconds, and with a "tradedate between 'this' and 'that'" clause it is still under a second.
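
    A rough sketch of such a per-symbol table and a range query (the column names are illustrative, not my exact schema):

    Code:
    CREATE TABLE _ESZ09_m (
        tradedate  DATE          NOT NULL,
        tradetime  TIME          NOT NULL,
        open       DECIMAL(12,4),
        high       DECIMAL(12,4),
        low        DECIMAL(12,4),
        close      DECIMAL(12,4),
        volume     INT UNSIGNED,
        PRIMARY KEY (tradedate, tradetime)
    );
    
    -- range queries stay fast because they hit the primary key
    SELECT * FROM _ESZ09_m
    WHERE tradedate BETWEEN '2009-06-01' AND '2009-06-30';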

    I also have a bunch of mysql stored functions and procs that string series together to produce adjusted / unadjusted continuous contracts - those take a bit of time, but that's processing time, and I do not need to run them very often.
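
    For the curious, a back-adjusted splice over a single roll would look roughly like this (the table names, the roll date, and the single-roll simplification are all illustrative, not my actual procs):

    Code:
    -- adjustment = close-to-close gap between new and old contract at the roll
    SET @adj = (SELECT z.close - u.close
                FROM _ESZ09_m z JOIN _ESU09_m u
                  ON z.tradedate = u.tradedate AND z.tradetime = u.tradetime
                WHERE z.tradedate = '2009-09-10'   -- example roll date
                ORDER BY z.tradetime DESC LIMIT 1);
    
    SELECT tradedate, tradetime, close + @adj AS close  -- shift the old contract
    FROM _ESU09_m WHERE tradedate <  '2009-09-10'
    UNION ALL
    SELECT tradedate, tradetime, close
    FROM _ESZ09_m WHERE tradedate >= '2009-09-10';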

    Are you seeing similar times or better / worse?
     
    #22     Mar 24, 2010
  3. What is your definition of "tick data"? Tick data, as I know it, refers to one trade. It has no OHLC, just one price.
     
    #23     Mar 24, 2010
  4. rosy2

    there's a developer edition of kdb that's free.
     
    #24     Mar 24, 2010
  5. End User understands that the Download Kdb+ Software contains a feature which will automatically cause the Download Kdb+ Software to temporarily time out every two hours and to cease functioning after a specified period :p
     
    #25     Mar 24, 2010
  6. I've got one-minute tick data for around 3000 stocks. The total set size for me is 200 million rows. Indexing with 'btree' on just 50 million rows takes about 4 hours (per index) on a 3.2GHz Pentium-D with 4GB of RAM and a crappy 300GB PATA drive hanging off the PCI bus (extra disk controller). After indexing, fetching around 80k rows took 60 seconds if the query was done on the ticker. An infobright LOAD DATA INFILE takes 7 minutes; without infobright, the load takes so long it's unbearable (>30 min).
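
    For reference, the operations I'm timing look something like this (table and file names changed for illustration):

    Code:
    -- the slow per-column index build
    CREATE INDEX idx_ticker USING BTREE ON bars_1m (ticker);
    
    -- the bulk load
    LOAD DATA INFILE '/data/bars_1m.csv'
    INTO TABLE bars_1m
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';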

    I took some forum member advice and reallocated some hardware to the cause (2-terabyte serial ATA drives on a box with 8GB of RAM and 8 cores), and I'm still not getting the results I want, although they are better. Just the 'load data infile' operation takes forever, and infobright still fails. I could go out and spend a few hundred bucks, but I'd rather not spend money if I can avoid it. I'd sooner reduce the dataset than buy new hardware, because I'm trying to stick to a very carefully budgeted trading plan. Partly it's a matter of fiscal discipline, but I also feel this is a problem worth attacking for personal skills development. If I still can't avoid the problem after exhausting many methods, then I will just cough up the money and save myself the time.
     
    #26     Mar 24, 2010
  7. Sorry, OHLC data. I meant to say 'homogenized tick data', where ticks are snapped to the last tick in each 1-minute interval. I have the original ticks, but I am fairly confident that dataset is just too big to deal with on the hardware I have.
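
    For the record, that snapping can be done in mysql itself - a sketch that works on MySQL 5 (no window functions), assuming a raw 'ticks' table with a tstamp column (names illustrative):

    Code:
    -- keep only the last tick in each 1-minute bucket
    -- (ties on identical timestamps would need a tiebreaker)
    SELECT t.*
    FROM ticks t
    JOIN (SELECT MAX(tstamp) AS last_ts
          FROM ticks
          GROUP BY DATE_FORMAT(tstamp, '%Y-%m-%d %H:%i')) m
      ON t.tstamp = m.last_ts;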
     
    #27     Mar 24, 2010
  8. Most definitely split your table. 3000 stocks? 3000 tables. Just take MSFT or whatever and stick it in an MSFT_m table with the same schema. MSFT, GOOG, AMZN etc. really have no business being in the same table, do they? The speed improvement... well, just try it out.

    I have a few small java programs that manage the loading / maintenance of data. They are also responsible for creating tables for tickers that do not already exist. Example: XYZ starts trading. When I load the first day's data, a record for XYZ is automatically inserted into the Instruments table, _XYZ_m & _XYZ_d are automatically created, etc., and the data is loaded.
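
    The SQL those programs issue boils down to something like this (assuming template tables to copy the schema from, and a unique key on Instruments.symbol - a sketch, not my exact code):

    Code:
    CREATE TABLE IF NOT EXISTS _XYZ_m LIKE _template_m;
    CREATE TABLE IF NOT EXISTS _XYZ_d LIKE _template_d;
    INSERT IGNORE INTO Instruments (symbol) VALUES ('XYZ');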
     
    #28     Mar 24, 2010
  9. Actually, if that splitting works out, I would suggest splitting further by year, 5 years, whatever. Then add some stored procs. If you split MSFT into yearly tables and are consistent with the naming, like _msft_m_2008 etc., it is quite easy to have a stored proc like:

    getallmindata("msft", from, to)

    getallmindata can easily figure out which tables it needs to look at, make a union, and send the result back to you.
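
    On MySQL 5 you'd build the union with dynamic SQL, since table names can't be bind parameters - a rough sketch (the tradedate column and varchar sizes are assumptions):

    Code:
    DELIMITER //
    CREATE PROCEDURE getallmindata(IN sym VARCHAR(16),
                                   IN d_from DATE, IN d_to DATE)
    BEGIN
      SET @q = '';
      SET @y = YEAR(d_from);
      WHILE @y <= YEAR(d_to) DO
        SET @q = CONCAT(@q, IF(@q = '', '', ' UNION ALL '),
                        'SELECT * FROM _', sym, '_m_', @y,
                        ' WHERE tradedate BETWEEN ''', d_from,
                        ''' AND ''', d_to, '''');
        SET @y = @y + 1;
      END WHILE;
      PREPARE stmt FROM @q;   -- PREPARE requires a session variable
      EXECUTE stmt;
      DEALLOCATE PREPARE stmt;
    END //
    DELIMITER ;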

    Focus on the schema changes, not more hardware, methinks.
     
    #29     Mar 24, 2010
  10. What are the implications of splitting the tables? This is where my inexperience becomes a problem. Would it be possible to do something like (in pseudocode):

    Code:
    Using prices from stock A and prices from stock B:
    
    For a given date range:
       For a given set of times on each date:
          stocktuple = assemble tuple (price from stock A, price from stock B)
          tuplelist.append( stocktuple )
    
    Where the given date range could come from a news event, a backtesting range, an earnings day, an expiration day, etc.
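
    In SQL terms, I imagine that would be a join on the shared date / time columns - something like this (illustrative table and column names)?

    Code:
    SELECT a.tradedate, a.tradetime,
           a.close AS price_a, b.close AS price_b
    FROM _MSFT_m a
    JOIN _GOOG_m b
      ON a.tradedate = b.tradedate AND a.tradetime = b.tradetime
    WHERE a.tradedate BETWEEN '2009-10-01' AND '2009-10-31'
      AND a.tradetime BETWEEN '09:30:00' AND '16:00:00';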
     
    #30     Mar 24, 2010