pandas too slow for event driven backtesting

Discussion in 'App Development' started by Marshall, Feb 10, 2022.

  1. Marshall

    Marshall

    Hi, I'm looking for suggestions on how to store minute-resolution data in memory for better backtesting performance. I am writing my own event-driven backtesting framework in Python, just for fun. The data is read into memory as a whole and stored as a pandas DataFrame, and indexing is then performed with .loc[] on this DataFrame during the backtest's main loop.

    The DataFrame has a two-level MultiIndex: a datetime level and a symbol level. On every time step, I have to perform an indexing operation to get the historical 1-min data for certain stocks over certain time periods (and the time periods are monotonically increasing).

    My main issue is that the pandas indexing/slicing operation is way too slow, and I know there is no perfect remedy for it.

    So I wonder if there are any suggested alternatives for this kind of purpose; maybe I should not store my minute data as a pandas DataFrame. Is a numpy ndarray or structured array a better choice? But then there is no easy solution for time-series indexing.
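    To make the pattern concrete, here is a minimal sketch of the setup being described. The symbols, column name, and toy values are my own assumptions, not from the original post:

    ```python
    import numpy as np
    import pandas as pd

    # Hypothetical minute bars for two symbols (toy data, assumed layout).
    idx = pd.MultiIndex.from_product(
        [pd.date_range("2022-02-10 09:30", periods=5, freq="1min"),
         ["AAPL", "MSFT"]],
        names=["datetime", "symbol"],
    )
    df = pd.DataFrame({"close": np.arange(10.0)}, index=idx)

    # The pattern in question: a labeled slice executed on every bar of the
    # main loop -- this is the .loc[] call that dominates the runtime.
    window = df.loc[
        (slice("2022-02-10 09:30", "2022-02-10 09:32"), "AAPL"), "close"
    ]
    ```

    Each such call pays the full cost of MultiIndex label resolution, which is why doing it once per bar adds up.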
     
  2. R1234

    R1234

    Looks like you might gain speed by vectorizing your code, if at all possible.
    I avoid loops in Python.
     
  3. Zwaen

    Zwaen

    Aren’t tibbles the way to go (instead of dataframes)?
     
  4. d08

    d08

    So you're looping with .loc? That seems like a terrible idea. At least try to use itertuples if vectorizing is out of the question. What I did was use vectorized (numpy) solutions where possible and elsewhere switch to numba, and within numba-compiled functions use numpy where possible and sensible.
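    As one illustration of the "vectorize where possible" advice, here is a per-bar computation (a running mean, my own example) replaced by a single numpy pass over the whole array. The same function body would also compile cleanly under numba's @njit if the loop form were kept:

    ```python
    import numpy as np

    # Vectorized rolling mean via cumulative sums: one numpy pass instead of
    # a Python-level loop over every bar.
    def rolling_mean(closes, window):
        csum = np.cumsum(closes)
        out = np.empty_like(csum, dtype=float)
        # Before a full window exists, average whatever has been seen so far.
        out[:window] = csum[:window] / np.arange(1, window + 1)
        # From then on, each mean is a difference of two cumulative sums.
        out[window:] = (csum[window:] - csum[:-window]) / window
        return out

    closes = np.arange(10.0)
    means = rolling_mean(closes, 3)  # means[-1] is the mean of 7, 8, 9
    ```

    The cumsum trick trades a loop for two array passes, which is usually a large win at minute resolution.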
     
    Marshall and nooby_mcnoob like this.
  5. Marshall

    Marshall

    Yes, I think numpy ndarray is the correct direction.
    I have timed numpy and pandas indexing. For numpy, indexing takes on the order of 100 nanoseconds to 1 microsecond, but for pandas the labeled indexing method .loc[] can take 100 microseconds up to 10 milliseconds. That is roughly 1000 times faster than pandas indexing!!
    But I am still not sure how to switch to a numpy ndarray, as it only supports integer-location-based indexing. So if I want to perform label-based indexing, like selecting certain time periods and certain assets as I did in pandas with .loc[], I'm not sure how to achieve that in numpy.
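    One common way to emulate .loc[] on plain ndarrays (a sketch of one possible design, not the only one) is to keep the datetime axis as a sorted datetime64 array and translate label slices to integer positions with np.searchsorted, which is O(log n), plus a small symbol-to-column dict:

    ```python
    import numpy as np

    # Assumed layout: rows are minutes (sorted), columns are symbols.
    times = np.arange("2022-02-10T09:30", "2022-02-10T09:40",
                      dtype="datetime64[m]")
    symbols = {"AAPL": 0, "MSFT": 1}          # symbol -> column index
    closes = np.arange(20.0).reshape(10, 2)   # toy close prices

    def loc_slice(start, end, symbol):
        # Binary-search the sorted time axis to turn labels into positions.
        i = np.searchsorted(times, np.datetime64(start), side="left")
        j = np.searchsorted(times, np.datetime64(end), side="right")
        return closes[i:j, symbols[symbol]]

    # Equivalent of df.loc[(slice(start, end), "AAPL"), "close"]:
    window = loc_slice("2022-02-10T09:30", "2022-02-10T09:32", "AAPL")
    ```

    Because the backtest's time periods are monotonically increasing, the searchsorted calls could even be replaced by a pair of running integer cursors, making each lookup O(1).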
     
  6. Marshall

    Marshall

    I know about the datetime64 data type in numpy, but that is not helping. I am not trying to perform arithmetic calculations on datetimes, just trying to find a better way to get time-series data during backtesting :(
     
  7. I will never understand why people use Python for trading.
     
    dholliday likes this.
  8. Marshall

    Marshall

    Not everyone knows how to write C/C++. Python is not the most efficient way, but it is convenient.
     
    Also, I will never understand why people think that Python is easier than C++ or C#.
    The amount of effort you need to get things done in Python would let you learn any language.
     
    #10     Feb 11, 2022