Vectorized backtesting with pandas

Discussion in 'App Development' started by nooby_mcnoob, Apr 30, 2019.

  1. IAS_LLC

    You CAN vectorize a backtest, to a certain extent. Generate your signals in serial, then vectorize the fill modeling and do path-length dependency correction.

    I do it all the time, and it saves me a ton of time. And no, I don't care to elaborate further.
     
    #51     May 9, 2019
    nooby_mcnoob likes this.
  2. Then what's the thread or the entire site good for? It's not like anyone is asking you for your "trade secrets". It's impossible to imagine what you might mean with

    * generate signals in serial
    * vectorize the fill modeling
    * path-length dependency correction

    But it surely sounds cute. A simple example would make it crystal clear.

    I am by the way not asking for myself; I run backtests in a pure event-driven architecture with tens of millions of tick-based data points per symbol, and it performs perfectly fine. I am not interested in a vectorized approach because it is technically impossible for me: I use a portfolio backtest approach where multiple strategies are even dependent on each other.

     
    #52     May 9, 2019
  3. Hey @GRULSTMRNN, tell us more about this event-based backtester. What does it mean to "perform perfectly fine", and do you use it to hypothesize or to run full tests? To be clear, the original post was meant to help me hypothesize, not to run full backtests.
     
    #53     May 9, 2019
  4. IAS_LLC

    I'm not going to recreate what's already in the open-source literature (for free) because you lack the imagination to think about ways of doing things differently from your current methods. See one of the first few chapters of the Marcos López de Prado book for some inspiration on the subject.

    If you'd like, I can provide a minimal example with fully commented source code. Just Google-Pay me $1489.95 first.
     
    #54     May 9, 2019
    Translation: you are talking out of your ass and are just too damn proud to admit it. Thanks for the confirmation. It's just sad there are so many posers and liars on this site.
     
    #55     May 10, 2019
  6. For the benefit of others, here is a link to someone who implemented a vectorized backtest in Python. In the following I outline everything that went wrong with this approach:

    https://tim-zhang.com/2016/06/12/algo-series-2-vectorized-backtesting-module/

    The astute reader will notice that the backtest results are completely unrealistic. I am not talking about a deviation of a few percent, or even dozens of percent, from realistic performance metrics, but completely and utterly wrong results: one year of 1-minute data, 95,000+ trades (on average a trade every 4 minutes), with double-digit Sharpe ratios. Anyone who can think a little further immediately realizes the origin of the problem. Every vectorized backtest assumes that there are no path dependencies, which in plain English means that what happened yesterday or a minute ago has zero bearing on decision-making today. In this particular example the algorithm trades at each point in time without any knowledge of whether a trade was taken a minute before or an hour before. This leads to tons of trades and hence ties up tons of margin/capital.
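
    To make the failure mode concrete, here is a toy sketch of what such a vectorized backtest typically looks like (my own illustration with made-up data, not the code from the linked post):

    Code:
    import numpy as np
    import pandas as pd
    
    # Toy "vectorized backtest": a moving-average crossover signal
    # evaluated at every bar, with no notion of an existing position.
    rng = np.random.default_rng(0)
    prices = pd.Series(100 + rng.standard_normal(10_000).cumsum())
    
    fast = prices.rolling(10).mean()
    slow = prices.rolling(50).mean()
    signal = np.sign(fast - slow)   # +1 long, -1 short, at EVERY bar
    
    returns = prices.pct_change()
    # Shift to avoid look-ahead, then multiply: that's the whole "backtest".
    pnl = (signal.shift(1) * returns).cumsum()
    
    # The flaw: every bar is treated as a fresh, fully funded trade.
    # Nothing here knows whether we are already long, how much capital
    # is tied up, or what happened a minute ago.
    round_trips = signal.diff().abs().sum() / 2
    print(f"implied round trips: {round_trips:.0f}")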

    That is why I said in my first post in this thread that vectorized backtests only make sense when there is no path dependency, and that such an assumption is completely unrealistic in financial trading. Capital is limited, and so are risk limits. That means that when an algorithm decides whether to buy, sell, or do nothing, it must know how much capital is currently in use and whether it is currently long, short, or flat. But a vectorized backtest does not allow for knowledge of state.
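
    For contrast, a minimal stateful loop (again just a sketch of the idea, reusing the series from the snippet above) performs exactly the check that a vectorized pass cannot:

    Code:
    def stateful_backtest(prices, fast, slow):
        # Toy event-style loop: every decision sees the current
        # position, so a signal that is already expressed in the
        # book is ignored instead of stacking a new trade each bar.
        position, entry, pnl = 0, 0.0, 0.0
        for t in range(len(prices)):
            if np.isnan(slow.iloc[t]):
                continue  # skip the indicator warm-up period
            desired = int(np.sign(fast.iloc[t] - slow.iloc[t]))
            if desired == position:
                continue  # the state check vectorization cannot make
            if position != 0:
                pnl += position * (prices.iloc[t] - entry)  # close old
            position, entry = desired, prices.iloc[t]       # open new
        return pnl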

    In summary, my point was that a vectorized backtest does not make sense in financial trading other than for some rudimentary initial idea profiling, for example to visualize when signals were triggered. For any backtest that incorporates performance metrics and involves the utilization of capital and risk/reward information, a vectorized approach will always fail.

    Don't believe anyone who only uses buzzwords and is otherwise unable to walk through a simple example. Chances are high that such a person either does not know what he is talking about, or is intentionally misleading, or both.
     
    #56     May 10, 2019
    jharmon likes this.
  7. Wow, pretty much what I said: that it was about idea validation. I think this guy just reads what he wants to read. He may not only be retarded but blind.
     
    #57     May 10, 2019
  8. IAS_LLC

    You're right. My mistake. It was my ass talking again.
     
    #58     May 10, 2019
  9. I was using Dask on and off (just converting from pandas when necessary), but I found the overhead when running with multiple processes a bit too much. I found a happy medium by just writing my own grouped apply. Thought it could be helpful to someone.

    In particular, I cannot use the chunksize argument of pool.starmap/map because it quickly runs out of memory (I think it processes the arguments eagerly or something). It isn't optimally parallel, since at times it sits more idle than it ideally would, but it does let me backtest an intraday strategy over 10 years in ~25 seconds on my Threadripper vs. about 5 minutes serially.

    Code:
    from multiprocessing import cpu_count, Pool
    import pandas as pd
    
    def _func(func, name, group):
        # Return the result together with its group key so results
        # can be matched back up after the parallel map.
        return func(group), name
    
    def df_parallel_apply(grouped, func):
        # Dispatch one batch of groups at a time; handing all groups
        # to starmap at once materializes the arguments eagerly and
        # runs out of memory.
        chunksize = cpu_count()
        results = []  # list of (result, key) tuples
        with Pool(chunksize) as p:
            args = [(func, name, group) for name, group in grouped]
            while args:
                batch, args = args[:chunksize], args[chunksize:]
                results += p.starmap(_func, batch)
        if not results:
            return pd.DataFrame()
        ret_list, index = zip(*results)
        # groupby('col') stores a str key, groupby(['col']) a list.
        keys = grouped.keys if isinstance(grouped.keys, list) else [grouped.keys]
        if not isinstance(ret_list[0], (pd.DataFrame, pd.Series)):
            # Scalar results: one row per group, indexed by group key.
            df = pd.DataFrame(list(ret_list), index=list(index))
            df.index.names = keys
        else:
            # Frame/Series results: stack them with the group keys as
            # the outer index levels.
            df = pd.concat(ret_list, keys=index)
            df.index.names = keys + list(ret_list[0].index.names)
        return df
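
    For completeness, a quick usage sketch (made-up data and a module-level helper, since multiprocessing cannot pickle lambdas; on Windows, wrap the call in an if __name__ == '__main__' guard):

    Code:
    def mean_over_std(g):
        return g['ret'].mean() / g['ret'].std()
    
    df = pd.DataFrame({
        'symbol': ['AAPL', 'AAPL', 'AAPL', 'MSFT', 'MSFT', 'MSFT'],
        'ret':    [0.010, -0.020, 0.004, 0.005, 0.003, -0.001],
    })
    grouped = df.groupby('symbol')
    result = df_parallel_apply(grouped, mean_over_std)
    print(result)   # one scalar row per symbol, indexed by 'symbol'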
    
    Some Threadripper porn (note that significant data-copying overhead still exists - that's roughly the red). I think this is due to suboptimal grouping, as far as computation is concerned.

    [attached screenshot: Selection_736.png]
     
    #59     May 19, 2019
  10. It went to $274. Not exactly $260, but I did make money on the way down, aside from the obviously retarded way I went about it <3
     
    #60     Nov 3, 2019