R for datamining/backtesting/trading

ssrrkk · Mar 9, 2012

Quote from bs2167:

Agree with the approach overall - in my experience, being able to quickly test a novel idea far outweighs the disadvantages of having to develop/maintain an additional code base for live trading (which very rarely changes).

I've never used Python...if you have a minute, would be interested to get your take on what it does better / worse than R that makes it worthy of inclusion in your process.

I have debated with myself extensively when I decided to include python into my workflow. There are a few reasons I chose to do so.

First, the ability to create a class hierarchy in Python means that I can build a quick, well-organized library of reusable code. In theory I could perhaps do this in R by building an R package, but I never really learned that aspect of R. I could also be wrong, but the R package framework doesn't appear to be completely object-oriented.

Second, most of what I do is not easily expressed as efficient vectorized code unless I think and plan hard about it, so I end up writing loops over minute bars anyway. In that case, Python (+ NumPy) is, in my experience, much faster than R.

Third, I have all my data stored in a MySQL database (both market data and performance data for my models). I found that I could quickly write a "Database" class in Python that wraps many of the MySQL query and data-loading functions I need for backtesting, whereas doing the same in R felt a little more cumbersome (of course, again, if I had learned how to build an R package I probably could have done this in R).
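A minimal sketch of what such a wrapper class might look like; this is not the poster's actual code. It assumes the standard-library sqlite3 module as a stand-in for MySQL so that it runs anywhere, and the table name minute_bars, its columns, and the method names are all hypothetical.

```python
import sqlite3


class Database:
    """Thin wrapper around a DB connection for loading backtest data.

    Assumption: sqlite3 stands in for MySQL; the schema is made up.
    """

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS minute_bars ("
            "symbol TEXT, ts TEXT, close REAL)"
        )

    def insert_bars(self, symbol, bars):
        """Store (timestamp, close) pairs for one symbol."""
        self.conn.executemany(
            "INSERT INTO minute_bars (symbol, ts, close) VALUES (?, ?, ?)",
            [(symbol, ts, close) for ts, close in bars],
        )
        self.conn.commit()

    def load_closes(self, symbol, start, end):
        """Return [(ts, close), ...] with start <= ts <= end, ordered by time."""
        cur = self.conn.execute(
            "SELECT ts, close FROM minute_bars "
            "WHERE symbol = ? AND ts BETWEEN ? AND ? ORDER BY ts",
            (symbol, start, end),
        )
        return cur.fetchall()


# Hypothetical usage with made-up prices.
db = Database()
db.insert_bars("MSFT", [("2012-03-09 09:30", 31.9), ("2012-03-09 09:31", 32.0)])
print(db.load_closes("MSFT", "2012-03-09 09:30", "2012-03-09 09:31"))
```

The point of the wrapper is that every backtest script calls load_closes instead of repeating the SQL.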

Fourth, I have also built a web server to monitor my live / forward-test trading. The Java trading code runs, and the web server reads its log file to show me the P&L, holdings, and even charts and graphs. I implemented all of my web server scripts in a WebServer Python class that also uses many of the database query and load functions. I also wrapped a lot of the plotting functions using matplotlib.

I guess a fifth important reason is that I am very comfortable programming in Python, and I can do things extremely quickly in it. I am also comfortable with R except for the package part, which means I can write reusable code much better in Python.

Before I started doing this in Python, I had hundreds of R scripts scattered about, and I had been rewriting very similar stuff over and over. That is when I decided I needed to start reusing code.

In practice, building a code base in Python is not a hugely organized or extensively planned endeavor for me either. All I do is this: I create a base class that I think is conceptually necessary and implement the minimally required methods in it. Then, in a particular analysis or backtest script, I instantiate that class and use its methods to implement what I need. After doing this a few times, I will recognize that there is a group of similar operations I repeatedly perform on the data in my base class. That is when I "elevate" that code to a member function of the class. So slowly, as time goes by, I accumulate useful functions in the form of methods of that class. I also try to make it a habit to write a simple sentence in the triple-quoted docstring of each function so that later I can extract it into a set of web pages via pydoc or Sphinx.
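The workflow above can be sketched roughly as follows. The class name BarSeries and its methods are hypothetical, chosen only to illustrate a small base class, a helper that was later "elevated" to a method, and the one-sentence docstrings that pydoc can pick up.

```python
class BarSeries:
    """Base class holding a list of closing prices for one instrument."""

    def __init__(self, closes):
        self.closes = list(closes)

    def returns(self):
        """Simple bar-to-bar returns."""
        c = self.closes
        return [c[i] / c[i - 1] - 1 for i in range(1, len(c))]

    # "Elevated" here after the same loop showed up in several scripts.
    def rolling_mean(self, window):
        """Mean of the last `window` closes at each bar, once enough bars exist."""
        c = self.closes
        return [
            sum(c[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(c))
        ]


# A particular analysis script just instantiates the class and uses it.
series = BarSeries([10.0, 11.0, 12.0, 13.0])
print(series.rolling_mean(2))
print(BarSeries.rolling_mean.__doc__)  # the docstring pydoc would extract
```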

ssrrkk · Mar 9, 2012

Quote from caementarius:

Very insightful post. Appreciate it. I worry about model risk in moving from one framework to another - but I think your way might be most practical.

Regarding model risk, yes, there is always the problem of introducing hard-to-catch bugs or unexpected statistical errors in the translation process. But I think a lot of this can be prevented by good programming and testing habits, e.g., always checking baselines or controls (cases where you know what the right answer should be). Additionally, it helps to build a collection of tested classes and functions in each phase and reuse those when you can. But if you can't, then you have the flexibility to branch out in whatever form you please.
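As one illustration of checking baselines or controls, here is a minimal sketch. The function backtest_pnl and its position convention are assumptions made for this example, not anything from the thread; the idea is only that each control has an answer known in advance.

```python
def backtest_pnl(prices, positions):
    """P&L of holding positions[i] units from bar i to bar i + 1."""
    return sum(
        positions[i] * (prices[i + 1] - prices[i])
        for i in range(len(prices) - 1)
    )


def check_baselines():
    """Run controls whose correct answers are known beforehand."""
    # Control 1: flat prices must give zero P&L whatever the positions are.
    assert backtest_pnl([100.0] * 5, [1, -1, 2, 0, 1]) == 0.0
    # Control 2: always long one unit equals last price minus first price.
    assert backtest_pnl([100.0, 101.0, 103.0], [1, 1, 1]) == 3.0
    # Control 3: never holding a position gives zero P&L.
    assert backtest_pnl([100.0, 90.0, 120.0], [0, 0, 0]) == 0.0
    return True


check_baselines()
```

Running the same controls against both the research code and the translated live-trading code is one way to catch discrepancies between the two.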

jtrader33 · Mar 9, 2012

Quote from ssrrkk:

I have debated with myself extensively when I decided to include python into my workflow. There are a few reasons I chose to do so.

....


Thanks for taking the time with your reply. Not to put words in your mouth, but interestingly it seems as if Python may have become the most indispensable of the three languages for your work.

On the other side of things, have you considered using Python to run the strategies live instead of Java? I've read that it might be slow for such an application, but to be honest I don't have enough programming experience to understand why.

(Apologies for the multiple questions and the partial thread hijacking, but I've read so many good things about the language that I also want to try and understand its limitations.)

caementarius · Mar 9, 2012

Quote from bs2167:

(Apologies for the multiple questions and the partial thread hijacking, but I've read so many good things about the language that I also want to try and understand its limitations.)

For what it's worth, I'm interested as well. I saw a post by rosy2 a while back demonstrating some Python, and it looked pretty slick. I'll try to find it.

Edit - here it is, using Python and pandas, from:
http://www.elitetrader.com/vb/showthread.php?s=&threadid=235027

Quote from rosy2:

import pandas as pd
from pandas.io.data import DataReader

symbols = ['MSFT', 'GOOG', 'AAPL']
data = dict((sym, DataReader(sym, "yahoo")) for sym in symbols)
panel = pd.Panel(data).swapaxes('items', 'minor')
close_px = panel['Close']
rets = close_px / close_px.shift(1) - 1
rets.corr()

that will be $10,000

ssrrkk · Mar 10, 2012

Quote from bs2167:

Thanks for taking the time with your reply. Not to put words in your mouth, but interestingly it seems as if Python may have become the most indispensable of the three languages for your work.

On the other side of things, have you considered using Python to run the strategies live instead of Java? I've read that it might be slow for such an application, but to be honest I don't have enough programming experience to understand why.

(Apologies for the multiple questions and the partial thread hijacking, but I've read so many good things about the language that I also want to try and understand its limitations.)

I would say Python is my main backtest engine, but because I am so used to plotting things and doing quick stats in R, I still use R a lot for just playing with an idea on a few days or a few months of data at a time. For example, I use R the very first time I want to check a new idea, which is great because I can check things with a few interactive lines of code. But it is slower, and the historical data handling is a little more cumbersome. Again, this is my fault, as I haven't fully developed a good SQL interface for it. (Often I write a five-line Python script using my database classes to output a text file of the data I need, then I read that into R using read.table and analyze my idea.)
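The hand-off from Python to R described above might look something like this sketch. The function name dump_for_r, the column names, and the file name are hypothetical, and the rows are made-up data standing in for whatever the database classes return.

```python
import csv


def dump_for_r(rows, path, header=("ts", "close")):
    """Write rows as a tab-delimited text file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)


# Hypothetical usage:
# dump_for_r([("2012-03-09 09:30", 31.9), ("2012-03-09 09:31", 32.0)], "bars.txt")
```

On the R side, a file written this way can be loaded with read.table("bars.txt", header = TRUE, sep = "\t").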

By the way, I might have given the impression that I have solved every development problem I have, which is far from the truth. I still struggle with it every day, and sometimes I feel like I am spinning my wheels just to get a simple thing going.

Regarding replacing Java for live trading, I don't see any need for that, as I am NOT constantly redeveloping my live trading platform. The live trading platform is a different thing to me, with very different considerations. It has to do with communicating with TWS, order handling, real-time data handling, building bars from ticks, account querying and cash management, logging, etc. None of this changes every time I have a new strategy. Of course, the core loop / signaling class overlaps with the backtested strategy, and that is the part of the code I change when I have a new idea to try.
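Of the platform pieces listed above, building bars from ticks is the easiest to show in a few lines. The sketch below is in Python rather than the poster's Java, and the function name, tick format, and bar layout are all assumptions made for illustration.

```python
from datetime import datetime


def build_minute_bars(ticks):
    """Aggregate (timestamp, price) ticks, sorted by time, into 1-minute OHLC bars."""
    bars = {}
    for ts, price in ticks:
        # Truncate the timestamp to the start of its minute.
        minute = ts.replace(second=0, microsecond=0)
        bar = bars.get(minute)
        if bar is None:
            # First tick of the minute opens a new bar.
            bars[minute] = {"open": price, "high": price, "low": price, "close": price}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price
    return bars


# Made-up ticks: two in the 09:30 minute, one in the 09:31 minute.
example_bars = build_minute_bars([
    (datetime(2012, 3, 9, 9, 30, 5), 10.0),
    (datetime(2012, 3, 9, 9, 30, 40), 10.5),
    (datetime(2012, 3, 9, 9, 31, 2), 10.2),
])
```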

The main reason I use Java is that the TWS API comes in Java or C++ only. In addition, TWS itself is written in Java, so I decided to use the native interface. There is something called IBPy that wraps all the API calls in Python, but it looks half finished and the authors seem to have abandoned the project some years ago. As for performance, there is no question Python will be slower, and I am not sure how well it will deal with the multi-threaded nature of TWS API apps. For example, all API apps must implement callbacks that TWS invokes after requests are sent to it. Sometimes I might make 30 or 50 data requests to TWS, so I will be receiving a huge number of callbacks (hundreds or thousands per second) as tick data comes in for any of those 50 instruments. I would imagine that is a big load for Python to handle, and in addition I am not sure how well Python can handle thread synchronization. Java has absolutely no problems with this and is rock solid.
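For reference, one common way to synchronize callback traffic in Python is to have the callback do nothing but push onto a thread-safe queue and let a single worker thread process it. The sketch below does not use the TWS API or IBPy; the class TickHandler, the callback name on_tick, and the feeder threads are hypothetical stand-ins, and it says nothing about whether throughput would be adequate for a real feed.

```python
import queue
import threading


class TickHandler:
    """Receives callbacks on any thread and processes them on one worker thread."""

    def __init__(self):
        self.ticks = queue.Queue()  # thread-safe hand-off between threads
        self.last_price = {}        # touched only by the worker thread
        self.worker = threading.Thread(target=self._run)
        self.worker.start()

    def on_tick(self, symbol, price):
        """Hypothetical callback: keep it cheap, just enqueue the tick."""
        self.ticks.put((symbol, price))

    def stop(self):
        self.ticks.put(None)        # sentinel tells the worker to exit
        self.worker.join()

    def _run(self):
        while True:
            item = self.ticks.get()
            if item is None:
                break
            symbol, price = item
            self.last_price[symbol] = price


def demo():
    """Simulate five instruments delivering 1000 callbacks each, concurrently."""
    handler = TickHandler()

    def feed(symbol):
        for i in range(1000):
            handler.on_tick(symbol, float(i))

    feeders = [threading.Thread(target=feed, args=(f"SYM{n}",)) for n in range(5)]
    for t in feeders:
        t.start()
    for t in feeders:
        t.join()
    handler.stop()
    return handler.last_price
```

Because only the worker thread touches last_price, no explicit locks are needed beyond the queue itself.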

stevegee58 · Mar 10, 2012

If any of you are interested, a fellow wrote some MetaTrader (MT4) support for R that lets you call R functions through a DLL interface.

He also provided a couple of examples called Trend-O-Mat and Arb-O-Mat.

caementarius · Mar 11, 2012

Prompted by an earlier post, I decided to look further into the quantstrat module. There isn't documentation with the module itself, but it looks like the place to start is the set of quantstrat papers from Guy Yollin:
http://www.r-programming.org/papers

These presentations have a lot of example code and lay things out pretty well.

There's reference to quantstrat handling both streaming and historical data - so it sounds like it might help with my original query.

johnnyqpublic · Mar 12, 2012

Quote from caementarius:

Prompted by an earlier post, I decided to look further into the quantstrat module. There isn't documentation with the module itself, but it looks like the place to start is the set of quantstrat papers from Guy Yollin:
http://www.r-programming.org/papers

These presentations have a lot of example code and lay things out pretty well.

There's reference to quantstrat handling both streaming and historical data - so it sounds like it might help with my original query.

Yes, this is almost certainly the way to go if you're going to backtest in R.

If you have questions about quantstrat or related things, and you know how to use IRC or want to learn, you can come visit the #r-finance channel on Freenode. Several of us who use R for finance work (including one of the quantstrat authors!) hang out there.

hft_boy · Oct 27, 2012

Quote from ssrrkk:

The main reason I use Java is that the TWS API comes in Java or C++ only. In addition, TWS itself is written in Java, so I decided to use the native interface. There is something called IBPy that wraps all the API calls in Python, but it looks half finished and the authors seem to have abandoned the project some years ago. As for performance, there is no question Python will be slower, and I am not sure how well it will deal with the multi-threaded nature of TWS API apps. For example, all API apps must implement callbacks that TWS invokes after requests are sent to it. Sometimes I might make 30 or 50 data requests to TWS, so I will be receiving a huge number of callbacks (hundreds or thousands per second) as tick data comes in for any of those 50 instruments. I would imagine that is a big load for Python to handle, and in addition I am not sure how well Python can handle thread synchronization. Java has absolutely no problems with this and is rock solid.

Well, there are only official implementations in Java and C++. The API itself is not particularly hard to understand or implement (although it is tedious). You can also check out IBrokers for R; I think the implementation is pretty good. I don't really know how slow Python is, but it should be fast enough to handle IB data. IB sends tick data out at 250 ms intervals, so with 50 instruments you would probably be getting ticks about every 5 ms on average (250 ms / 50). Maybe less if the ticks trigger multiple callbacks.

Also, you can use async I/O to make your application single-threaded. I believe this is the approach taken by IBrokers (since R is single-threaded). Not that the Java implementation handles any sort of synchronization between threads anyway.
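A rough sketch of the single-threaded, async I/O idea using Python's asyncio. This is not how IBrokers or the TWS API is implemented; tick_source is a made-up stand-in for a network feed, and the prices are invented. It only shows that several feeds can be consumed on one thread without locks.

```python
import asyncio


async def tick_source(symbol, prices, out):
    """Stand-in for a network feed: pushes (symbol, price) onto a queue."""
    for price in prices:
        await out.put((symbol, price))
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would


async def run_feed(feeds):
    """Consume ticks from all feeds on one thread; return last price per symbol."""
    out = asyncio.Queue()
    producers = [
        asyncio.create_task(tick_source(symbol, prices, out))
        for symbol, prices in feeds.items()
    ]
    last = {}
    total = sum(len(prices) for prices in feeds.values())
    for _ in range(total):
        symbol, price = await out.get()
        last[symbol] = price  # no locks needed: only one thread ever runs
    await asyncio.gather(*producers)
    return last


def demo_async():
    """Run two interleaved made-up feeds to completion."""
    return asyncio.run(run_feed({"MSFT": [31.9, 32.0], "AAPL": [545.0, 545.2]}))
```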