Large sorting in R

Discussion in 'App Development' started by R1234, May 1, 2015.

  1. R1234

    R1234

     A lot of my current strategies are cross-sectional in nature. They rank a universe of stocks on several factors daily, then take a weighted average of the various factor ranks to come up with a final score for each stock.

     I do this for a universe of several hundred stocks daily in Excel/VBA without too many problems (though it's not lightning fast).

     Now I need to do the same across several thousand stocks globally, and my Excel/VBA framework chokes on the data. The historical daily data goes back to the 1990s and there are about 3,500 data series, so it's a pretty big dataset.

    Does this sound like something that the R framework can handle with ease or will I be up against similar issues with choking and memory hogging?

     I know there's a steep learning curve with R and I don't want to learn it unless I think it will be useful in this type of research.

    Thanks for any insights you can give me...
     
  2. 2rosy

    2rosy

     I know you can do it in Python with the pandas or Blaze libraries backed by HDF5 or bcolz. Maybe Excel is loading everything into memory.
     
    eusdaiki likes this.
  3. R can handle large datasets well; its limits are set by your hardware. As a rule of thumb, you should have about twice as much memory as your dataset.

     Learning R is not really a big task. Try "swirl"; it makes things a lot easier. It's not exhaustive but should get you started.
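     To get going with swirl from the R console:

     Code:
         install.packages("swirl")
         library(swirl)
         swirl()   # launches the interactive lessons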

     As 2rosy has said, Python is great as well.
     
  4. R1234

    R1234

    thanks rosy & nach. I will explore your suggestions.
     
  5. Databases handle sorting and selecting within large data sets quite handily. No need to hold all the data in RAM on a monster computer to crunch it.
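     For instance, from R you can push the sorting and selecting down to SQLite through DBI (the file, table and column names here are just assumptions):

     Code:
         library(DBI)

         con <- dbConnect(RSQLite::SQLite(), "prices.db")
         top <- dbGetQuery(con,
           "SELECT date, ticker, value
              FROM factors
             WHERE date = '2015-05-01'
             ORDER BY value DESC
             LIMIT 100")
         dbDisconnect(con)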
     
    eusdaiki likes this.
  6. i960

    i960

     This has nothing to do with databases whatsoever; technically these are all databases at a raw level. The main issue here is the sorting algorithm used (quicksort vs. heapsort vs. merge sort) and how the data is fed in and the results collected.
     
  7. Yes, that's a bare minimum. To be comfortable you should have several times more.
     
  8. windwine

    windwine

     In both R and Python you can use an SQL database. An R package called "data.table" is very fast at dealing with big tables that fit into your RAM. If you have 16GB of memory I guess your factor ranking could be done pretty easily in R.
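     For example, a minimal data.table sketch, assuming a long-format table with date, ticker, metric and value columns (the names are placeholders):

     Code:
         library(data.table)

         # fread is data.table's fast CSV reader
         dt <- fread("factors.csv")   # assumed columns: date, ticker, metric, value

         # Cross-sectional percentile rank within each date/metric group
         dt[, pct_rank := frank(value) / .N, by = .(date, metric)]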
     
  9. R can handle millions of data points within a time series and can sort and rank both within a single time series and across time series. You did not specify whether you also want to use R to compute your metrics, or whether you will load daily metric data into R and just want to do the ranking in R.

     Both should be doable in R without breaking a sweat, but as someone else pointed out, for the former you may want to make sure you have sufficient memory. A quick back-of-the-envelope calculation indicates that even loading the complete daily price data for all the stocks in your universe into R should not consume too much memory:

    Example:

    20 years of daily data, 5000 stocks, 10 metrics each day:

     250 [daily data points per year] * (10+4) [10 metrics plus 4 open/high/low/close points] * 20 [years] * 5000 [stocks] = 350 million data points. Assuming a very liberal 64 bits per floating-point value, that comes to around 2.6 gigabytes, which should be your ceiling purely for the data set. Depending on how efficiently you construct your computations within R, you should do perfectly fine with the 64-bit version of R and even 8 GB of main memory.
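     The same arithmetic in R, just to make the numbers concrete:

     Code:
         points <- 250 * (10 + 4) * 20 * 5000   # ~350 million data points
         bytes  <- points * 8                   # 8 bytes per 64-bit double
         bytes / 2^30                           # ~2.6 GB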

     You can actually do this quite efficiently, given that, as you stated, you come up with one rank (percentile or some other normalized rank) for each stock as a function of the different metrics each day.
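     As a rough sketch of that daily scoring step (the column names and factor weights below are placeholders, not your actual factors):

     Code:
         # Percentile rank within one day's cross-section
         pct_rank <- function(x) rank(x, ties.method = "average") / length(x)

         score_one_day <- function(day) {
           ranks <- sapply(day[c("mom", "val", "qual")], pct_rank)  # stocks x metrics
           w <- c(mom = 0.5, val = 0.3, qual = 0.2)                 # assumed weights, sum to 1
           data.frame(date = day$date, ticker = day$ticker,
                      score = as.numeric(ranks %*% w))
         }

         # df is assumed to hold one row per stock per day with the metric columns above
         scores <- do.call(rbind, lapply(split(df, df$date), score_one_day))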

    Hope this helps.


     
  10. i960

    i960

     Not sure if R can handle this, but a common approach when dealing with large working sets and a single, fixed sort order is to use something called a priority queue or heap, which is usually implemented as a form of tree. Data is inserted into the heap such that each insert pays the cost of keeping things ordered, so reads from the heap cost significantly less than constantly qsort()ing the data when needed.

     However, if you need to sort things differently on a continual basis then it's not an option. If you only need one sort order, and only that one, then it's definitely an option, provided you have a way of expressing/storing it that way in R.
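     A rough base-R sketch of the idea, purely illustrative (these helpers are not from any package):

     Code:
         # Minimal binary min-heap kept in an environment
         heap_new <- function() { h <- new.env(); h$v <- numeric(0); h }

         heap_push <- function(h, x) {
           v <- c(h$v, x); i <- length(v)
           while (i > 1) {                      # sift up while smaller than parent
             p <- i %/% 2
             if (v[p] <= v[i]) break
             tmp <- v[p]; v[p] <- v[i]; v[i] <- tmp; i <- p
           }
           h$v <- v; invisible(h)
         }

         heap_pop <- function(h) {
           v <- h$v; top <- v[1]
           v[1] <- v[length(v)]; v <- v[-length(v)]
           i <- 1; n <- length(v)
           repeat {                             # sift down toward the smaller child
             l <- 2 * i; r <- l + 1; s <- i
             if (l <= n && v[l] < v[s]) s <- l
             if (r <= n && v[r] < v[s]) s <- r
             if (s == i) break
             tmp <- v[s]; v[s] <- v[i]; v[i] <- tmp; i <- s
           }
           h$v <- v; top
         }

         # Inserts pay the log-time cost; pops come back in ascending order
         h <- heap_new()
         for (x in c(5, 1, 9, 3)) heap_push(h, x)
         heap_pop(h)   # 1
         heap_pop(h)   # 3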
     
    #10     Aug 4, 2015
    volpunter likes this.