best way to hack up an rdbms to approximate something like kdb for poor/dumb people

garchbrooks · Mar 23, 2010

Yeah, I asked it. I have a ridiculous amount of data I would like to "join" together like in SQL and I'm a poor trash retail trader who learned about the stock market last week. I've seen a few threads on people doing various things for tick databases, but let me be real specific:

1) Want the ease of SQL-like queries
2) Need about 500-700 million rows, and about 50-80GB of data per "instance" (instance = research project)
3) Query time for a slice of time (at most a few months) for any given ticker should be a few seconds at most
4) Needs to be usable for a guy who knows NOTHING about databases, other than the online tutorials about SQL
5) Works on a non-server computer that's about 3-4 years old, 4GB of ram tops
6) Learning curve can't be so high that I spend a month reading manuals

I'm homogenizing tick data into time intervals, so it doesn't have to be tick for tick. I just need something that works with a reasonable amount of speed.

Simple and "fast enough" is all I care about. The system does not have to be available during the day and is for research. I've read the threads on ET already, but am not competent enough to understand all of the lingo and implications of design decisions, so please respond as if you are addressing a total beginner.

What is the best way to get as close as possible to what I want, using the cheapest, most-free stuff, as quickly as possible?

Thank you.

Baywolf · Mar 23, 2010

What format is the data in? You might be able to import it directly in to SQL Server Express, which is free, and quite powerful.

garchbrooks · Mar 23, 2010

Quote from Tight Face:

What format is the data in? You might be able to import it directly in to SQL Server Express, which is free, and quite powerful.
More...

Whatever format I want, really. The data is just sitting in raw binary records, which I've written some adapters for to transform into CSV, XML, you name it.

I've tried a few SQL implementations, but what happens is that when I do a select, even after indexing on such a large amount of data, the queries take on the order of minutes to pull out what I want. I've tried a few different things, but none of the hacks have really given me something I think is very workable.

Corey · Mar 24, 2010

Well, it really comes down to access patterns. How do you plan on accessing the data? What data will be frequently accessed together? Are calls to data clustered? Can we take advantage of caching (memcache could be useful)?

You could take a perusing through High Performance MySQL.

Personally, I have been looking into working with Cassandra so I can scale horizontally with ease and not have to worry about setting up RAID systems -- everything is distributed and persistant.

MonetDB is one that people seem to talk about a bit as an open-source column-oriented database, but I haven't seen it in use yet.

PocketChange · Mar 24, 2010

SQLite is simple and fast and already on your system if you use firefox.
Just add this SQLite plug in and your off.
https://addons.mozilla.org/en-US/firefox/addon/5817

Optimizing your wishlist is another story

Baywolf · Mar 24, 2010

Quote from garchbrooks:

even after indexing on such a large amount of data, the queries take on the order of minutes to pull out what I want.
More...

OK? So you are working with a lot of data. Its going to take time. Welcome to the world of research. Queue up some jobs and execute overnight. That way you can analyze batch data sets instead of doing everything ad hoc.

NetTecture · Mar 24, 2010

You are fucked. I also would love a chatp car (not more than 500 USD) that uses 1 liter of fuel per 800 miles and goes 400 miles per hour. Wont happen, though.

You need a high end server for that (dual opteron, lots of RAM - possibly 32gb up) and / or a TON of fast discs to get the IO filled, because yuor local disc just takes a lot too much time to load the data. Even saturating a SAS link (1gb per second roughly - takes a LOT of discs to do) you would take about 50-80 second to just load the data. Do you are in for a high end RAID controller and distributing the data over - hm - 4-6 links, and even then the reading takes time.

Alternatively you could get a proper server with 96+ gb RAM and a decent high end database server would then cache all the database in that RAM.

See, no chance with a "non server computer that is a couple of years old". I won't get my car either.

FutsTrader111 · Mar 24, 2010

I definitely would go Ubuntu 64 bit with AMD 965 x4 Phenom II and 4GB ram and 1TB SATA for starters. Use MySql Community server 64 and if too slow ptut 16GB RAM and use Mysql 7.0 Cluster.

Although not a fair comparison, I'm running AMD Phenom II X4 965, 3 GB RAM, 10yr 30min data on Win XP 32-bit SP3, Mysql Community Server 5.x and its fast enough. Total time to do a select on all data is a mere 0.002s (roughly 60K rows)

NetTecture · Mar 24, 2010

You are totallly off with the drives - they will SUCK. You need as many IO as you can get, which means, from hig hcost to lowe rperformance:

* SSD
* SAS Drives, 15k RPM
* WD Velociraptor SATA drives, 10k RPM, 300gb.

The WD Velociraptor has easily twice the IO performance as your typical 1tb drive.

For database work - sorry - IO is core. Like REAL core. And even then you need - as I demonstrated - a battery of SAS backplane systems (into which you can put SATA drives) to get the table scan time down into a manageable level. This is simply too much data.

I would also say using 4gb ram on such a server is bordering sabotge - when a modern consumer board can handle 16gb already.

jagadish · Mar 24, 2010

Check http://www.mcobject.com/in_memory_database to see if it helps