Vectorized backtesting with pandas

Discussion in 'App Development' started by nooby_mcnoob, Apr 30, 2019.

  1. Just wanted to submit this since I've found it useful for my own work.

    TL;DR: compute returns at signal time for various holding periods and be satisfied that it's 80% accurate

    One of the issues with backtesting is that it can be time-consuming. Backtesting is never 100% accurate, nor should you try to make it so, because then you're probably wrong.

    I usually aim for 80-90% accuracy in anything I try because the remaining 10-20% will likely not give me any huge returns. If you want 100% accuracy, this will likely be useless.

    In order to backtest a strategy, you have to decide what you want out of it. For me, this is:

    I don't want to wait for 10 minutes to backtest against 15 years of data, so quick turnaround is important. 15 seconds is my cutoff.

    This means event-based backtesters are probably out of the question since they need to go through each bar one at a time. We need something that can work faster.

    Vectorization is (basically) performing multiple operations at the same time, usually across a whole array. If you have an N-element array and want to multiply it by 3.5, there are two ways to go about it:

    1. Loop through each element and multiply it by 3.5
    2. Multiply each element by 3.5 at the same time

    The latter requires hardware support or, at the very least, can be done in native code.

    With pandas, operations are vectorized when they can be broadcast, that is, applied to a whole column or array in one call. Fortunately, many operations in pandas are conducive to broadcasting. This includes the basic arithmetic operations. So the above operation (A*3.5) runs in compiled NumPy code rather than in a Python loop, and that code may in turn use SIMD instructions in hardware. This is what makes pandas faster than looping over Python lists.
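    To make the two approaches concrete, here is a minimal sketch. The Series `a` and its size are made-up example data, not anything from my own setup.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: an N-element Series.
N = 1_000
a = pd.Series(np.arange(N, dtype="float64"))

# 1. Loop through each element in Python and multiply it by 3.5.
looped = pd.Series([x * 3.5 for x in a], index=a.index)

# 2. Let pandas/NumPy multiply the whole array in one call.
vectorized = a * 3.5

# Both approaches give the same values; only the speed differs.
assert looped.equals(vectorized)
```

    On large arrays the second form is typically much faster, since the per-element work happens outside the Python interpreter.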

    Now, in order to backtest a strategy, you need to know your entries and subsequent returns. Exits can be done this way as well, but I haven't bothered with it yet.

    My process is something like this:

    1. Identify positioning (position sizing can be done too, I haven't bothered): -1, np.nan, 1 for short, none, long
    import numpy as np
    import pandas as pd

    bars = pd.DataFrame(....)  # bars with close, ema15 and ema30 columns
    long = bars.ema15 > bars.ema30 # or whatever
    short = bars.ema15 < bars.ema30
    bars['signal'] = np.nan  # no position by default
    bars.loc[long,'signal'] = 1
    bars.loc[short,'signal'] = -1
    2. Identify subsequent returns
    # want to look at returns after holding for N periods
    for i in range(1,N+1):
        # return after holding for i periods
        # Note the negative shift: that looks into the future. OMG.
        bars[f'return_{i}'] = bars.signal*(bars.shift(-i).close - bars.close)/bars.close
    3. ???

    4. ???

    5. Profit? Maybe? Probably not.
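    Steps 1 and 2 can be put together into something runnable. This is only a sketch: the random-walk prices, the use of `ewm` to build `ema15`/`ema30`, the value `N = 5`, and the final `summary` are all assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic close prices (a random walk), purely for illustration.
rng = np.random.default_rng(0)
close = 100 + rng.normal(0, 1, 500).cumsum()
bars = pd.DataFrame({"close": close})

# Assumption: ema15/ema30 are plain exponential moving averages of the close.
bars["ema15"] = bars.close.ewm(span=15, adjust=False).mean()
bars["ema30"] = bars.close.ewm(span=30, adjust=False).mean()

# Step 1: positioning (1 = long, -1 = short, NaN = none).
bars["signal"] = np.nan
bars.loc[bars.ema15 > bars.ema30, "signal"] = 1
bars.loc[bars.ema15 < bars.ema30, "signal"] = -1

# Step 2: return after holding for i bars, for i = 1..N.
N = 5  # hypothetical maximum holding period
for i in range(1, N + 1):
    # The negative shift looks into the future, hence the 'return_' prefix.
    bars[f"return_{i}"] = bars.signal * (bars.close.shift(-i) - bars.close) / bars.close

# Mean return per holding period; NaN rows (no signal, or no future bar) are skipped.
summary = bars[[f"return_{i}" for i in range(1, N + 1)]].mean()
print(summary)
```

    The last `i` rows of each `return_{i}` column are NaN because there is no bar that far in the future, and `mean()` skips them by default.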

    I often identify any columns looking into the future with a 'return_' prefix or a 'future_' prefix so I don't accidentally use them anywhere else.

    Hope this is useful to someone, would love to hear any criticisms.
    Last edited: Apr 30, 2019
  2. jharmon


    Fail. It doesn't matter how long your backtest takes - the only thing that matters is that you can calculate an entry signal before you need to pull the trigger when your system is live.
  3. Is that even relevant to what I'm talking about? How?
  4. d08


    Doing things slower can sometimes be beneficial; you won't win on speed anyway. It gives you time to think, so you plan your testing more carefully. That said, I use pandas as well, not always vectorized.
  5. You're not wrong but most of my time is spent thinking. I hate transcribing the ideas and then having to wait for an hour for results.

    I also have code that does a more thorough backtest, but it takes a long time. I only reach for it if I think I've found something.
  6. What do you use pandas for? Research? Trading? Analysis? All?
  7. d08


    Everything, the non-core functions are also useful. I like its efficiency but I'm not always smart enough to take full advantage of it.
  8. It takes practice. I find Jupyter invaluable for quick one-offs.
  9. Metamega


    Depending on the data used, I can’t see the benefit of event-driven vs. vectorized.

    If you're using minute data and assuming values besides acting on open and close values, you're kind of guessing entries/exits to an extent.

    Unless you use tick data, there's no difference.

    Using EOD data and vectorizing, and using open/close or a cross of a previous bar for instance, you should get the same results.
  10. Event-based backtesting lets you do things you can't easily do with vectorized tests. For example, building up state. Conceptually, it is also easier for other people to understand.
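    As a rough illustration of what "building up state" means, here is a sketch of a bar-by-bar loop with a trailing stop. The function name `run_event_backtest`, the 5% stop, and the toy prices are all hypothetical. The exit depends on the highest price seen since entry, which is awkward to express as a single vectorized column.

```python
import pandas as pd

def run_event_backtest(close: pd.Series, entry: pd.Series, stop_pct: float = 0.05) -> list:
    """Walk bar by bar: enter long on an entry flag, exit on a trailing stop."""
    trades = []
    in_position = False
    entry_price = peak = 0.0
    for price, wants_entry in zip(close, entry):
        if in_position:
            peak = max(peak, price)  # state carried from bar to bar
            if price <= peak * (1 - stop_pct):
                # Stop hit: record the trade's return and go flat.
                trades.append((price - entry_price) / entry_price)
                in_position = False
        elif wants_entry:
            in_position = True
            entry_price = peak = price
    return trades

# Toy data, purely for illustration.
close = pd.Series([100.0, 102.0, 110.0, 104.0, 103.0, 108.0])
entry = pd.Series([True, False, False, False, True, False])
trades = run_event_backtest(close, entry)
```

    With this toy data, the first position is stopped out at 104 after peaking at 110, and the second position is still open when the data ends, so only one trade is recorded.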
    #10     Apr 30, 2019