I actually do this for a lot of my core algorithms. It is prohibitively expensive to keep appending data to queues/collections and re-run the same calculations over the adjusted dataset, rather than adjusting the computed value directly for the data points that drop out and the new ones that come in. But not all algorithms lend themselves to such an approach. Kalman filters are one example that does: no time series needs to be retained to generate new estimates. Most moving averages (to name a very simple example) can also be updated without having to keep the entire data collection in memory.
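To make that concrete, here is a minimal sketch in Python (class and variable names are just for illustration) of a moving average updated by subtracting the point that drops out and adding the new one, instead of recomputing over the whole window:

```python
from collections import deque

class RollingMean:
    """Fixed-window moving average updated incrementally: each new tick
    adjusts the running sum for the point that drops out and the point
    that comes in, so nothing is recomputed over the full window."""

    def __init__(self, window):
        self.window = window
        self.buf = deque()    # only the current window is retained
        self.total = 0.0

    def update(self, x):
        self.buf.append(x)
        self.total += x
        if len(self.buf) > self.window:
            self.total -= self.buf.popleft()   # drop the oldest point
        return self.total / len(self.buf)

# usage
rm = RollingMean(window=3)
for price in [10.0, 11.0, 12.0, 13.0]:
    print(rm.update(price))
```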
As mentioned above, R keeps everything in memory, so your algo will be limited by that. The easiest solution, IMHO, is to load your data into a relational database (say MySQL) running locally and let it do the sorting. You just write ORDER BY in the SQL when you pull the data from the database into R. If you index your table correctly, sorting in MySQL will be faster than in R, and you are limited by the size of your hard drive/SSD instead of memory.
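Roughly what that workflow looks like — here I'm using SQLite via Python's sqlite3 purely as a stand-in for a local MySQL instance, and the table/column names are made up:

```python
import sqlite3

con = sqlite3.connect("quotes.db")          # on-disk database, not RAM-bound
con.execute("""CREATE TABLE IF NOT EXISTS quotes (
                   symbol TEXT, ts INTEGER, close REAL)""")
# index on the sort key so ORDER BY can walk the index instead of sorting
con.execute("CREATE INDEX IF NOT EXISTS idx_close ON quotes(close)")

# ... bulk-load the data here, e.g.
# con.executemany("INSERT INTO quotes VALUES (?, ?, ?)", rows); con.commit()

# pull the data back already sorted; the analysis layer (R, Python, ...)
# never has to hold more than the rows it fetches
for symbol, ts, close in con.execute(
        "SELECT symbol, ts, close FROM quotes ORDER BY close DESC"):
    pass  # feed into the ranking logic
```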
You are comparing apples and oranges here. If it's just sorting/ranking of a metric, then the same can be done on millions of stocks in R without any significant memory footprint.
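Just for a sense of scale, a throwaway sketch (Python/NumPy here; in R the equivalent is a one-liner with order() or rank()) — a few million doubles are only tens of MB:

```python
import numpy as np

# a metric for a few million "stocks": ~8 bytes each, roughly 24 MB in total
metric = np.random.default_rng(0).standard_normal(3_000_000)

order = np.argsort(metric)              # stock indices from lowest to highest metric
ranks = np.empty_like(order)
ranks[order] = np.arange(len(metric))   # rank of each stock under that metric
```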
Yes, you are right, it will work for millions of stocks, given enough disk space, CPU speed and memory, since even a DB requires those. I just described a solution that will scale better than a pure R approach, in response to the original question by R1234; I wasn't really trying to compare anything.
No, it won't be any different whatsoever. The working set ends up in RAM regardless, either in the buffer/page cache of the DB or in actual memory pages within the process space of R. Indexing only helps if you want a subsection of the data, and ORDER BY is not going to result in magically faster sorting. Things still need to be sorted.
This is, in fact, wrong. A relational database like MySQL, Oracle, SQL Server, etc., unlike R, does not keep the entire data set in memory (unless you are using a memory table, which is a special case), but only a limited amount of data, based on the index configuration, the DB settings (memory allocated to the index cache) and the available memory on your machine. It can also use disk space as "overflow" storage for the part of an index that does not fit into memory. So typically only part of the index is kept in RAM, unless all of it fits. In addition, a database index takes much less space per record than the record itself, unless you create an index on EVERY column in the table, which usually does not make sense. So, while you are right that both R and a database require an amount of RAM proportional to the amount of data, the database can use memory more efficiently and will not use as much of it as R would.
Well, obviously a DB uses a non-memory-based backing store and doesn't hold the entire thing in RAM; trust me, I understand this and use the technology daily. The point here is that the OP wants the most computationally efficient solution for sorting data. My point about "all in RAM" is that, assuming he has ample memory (which he probably does), it's going to end up in RAM eventually. Indexes are just a key reference to a row; if he needs the row data, that's not a magic win. If the data is not in the buffer cache of the DB, it's also not a magic win, because it'll be gated by the speed of storage on retrieval. I do not believe his issue is how to warehouse this data; if it were that type of thing, I'd surely recommend a database as well. His issue is how to efficiently sort it on a continual basis. That's why I said he needs to change the algorithm. A heap/tree structure is designed for this type of thing, given that his insert rate is probably significantly lower than his read rate, i.e. he inputs data and wants to analyze it a bunch of different ways.
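Something along these lines is what I mean — a toy Python sketch (everything here is illustrative) that keeps the values ordered as they arrive, so reads never trigger a full re-sort. At real scale you'd swap the plain list for a proper balanced tree or indexed container; this only shows the shape of the idea:

```python
import bisect

class LiveRanking:
    """Keep metric values ordered as they arrive so that reads
    (rank lookups, top-N scans) never require a full re-sort.
    Illustration only: insort is O(n) per insert; a balanced tree
    or skip list would be the natural replacement at scale."""

    def __init__(self):
        self.values = []            # always kept in ascending order

    def insert(self, x):
        bisect.insort(self.values, x)

    def rank(self, x):
        return bisect.bisect_left(self.values, x)   # values strictly below x

    def top(self, n):
        return self.values[-n:][::-1]               # n largest, descending

r = LiveRanking()
for v in [5.0, 1.0, 9.0, 3.0]:
    r.insert(v)
print(r.top(2), r.rank(3.0))   # [9.0, 5.0] 1
```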
Nor do R, Python, C++, C# or what have you necessarily keep the whole dataset in memory. You can iterate over chunks, aggregate metrics and end up with a much smaller memory footprint than SQL could ever offer. SQL is in any case out of place for anything time-series related; I thought we discussed this on this site ad infinitum. Also keep in mind that a) memory is dirt cheap these days, b) R is free of charge and can run in a 64-bit environment, c) SQL Server is very limited in its free edition (CPU core limit, data set size limit, memory consumption limit, ...), d) SQL probably takes two orders of magnitude more time to set up properly (install SQL Server, configure tables and schemata, indexing, hooking up to the database, ...). In R or Python you get all of that out of the box, free of charge. Beats SQL hands down if you ask me.
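For example, streaming aggregation in Python (file and column names are made up) — only the running aggregates ever live in memory, never the raw file:

```python
import csv
from collections import defaultdict

# Per-symbol aggregates built one record at a time while streaming the file;
# the full dataset is never held in memory.
count = defaultdict(int)
total = defaultdict(float)

with open("ticks.csv", newline="") as f:       # file name is illustrative
    for row in csv.DictReader(f):              # reads the file row by row
        sym = row["symbol"]
        count[sym] += 1
        total[sym] += float(row["price"])

means = {sym: total[sym] / count[sym] for sym in count}
```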
I fully agree. The problem here is not data storage, it is efficient computations (sorting)...SQL should not even be on the list for anything time-series related anyway...
Technically, SQL is just a standard for querying; the actual storage of the data can be implemented in a multitude of ways. There are plenty of non-SQL and SQL-fronted storage libraries that could also be used (e.g. BDB), allowing the programmer to have small per-record overhead and fast access. At the end of the day, it's all data.
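For instance, Python's standard library ships a dbm-family key/value store in that spirit (used here merely as a stand-in for something like BDB): no SQL layer, just keyed records with raw bytes. The file and key names are made up:

```python
import dbm

# Plain key/value storage: small per-record overhead, direct keyed access.
with dbm.open("ticks_kv", "c") as db:
    db[b"AAPL:2024-01-02"] = b"185.64"
    db[b"AAPL:2024-01-03"] = b"184.25"

with dbm.open("ticks_kv", "r") as db:
    print(float(db[b"AAPL:2024-01-02"].decode()))   # 185.64
```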