Hi, looking for suggestions on how to store minute-resolution data in memory for better backtesting performance. I am writing my own event-based backtesting framework in Python, just for fun. The data is read into memory as a whole and stored as a pandas DataFrame, and indexing is then performed with .loc[] on this DataFrame during the backtest main loop. The DataFrame has a two-level MultiIndex: a datetime level and a symbol level. On every time change, I have to perform an indexing operation to get the historical 1-minute data for certain stocks over certain time periods (the time periods are monotonically increasing). My main issue is that the pandas indexing/slicing operation is way too slow, and I know there is no perfect remedy for it. So I wonder if there are any suggested alternatives for this kind of purpose; maybe I should not store my minute data as a pandas DataFrame. Is a numpy ndarray or structured array a better choice? But then there is no easy solution for time-series indexing. A stripped-down version of the layout and the per-step lookup I mean is sketched below.
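For context, here is a minimal sketch of that setup; the symbols, column name, and dates are just placeholders, not my actual data:

```python
import numpy as np
import pandas as pd

# Hypothetical minute bars for a couple of symbols, two-level MultiIndex
idx = pd.MultiIndex.from_product(
    [pd.date_range("2023-01-02 09:30", periods=390, freq="min"),
     ["AAPL", "MSFT"]],
    names=["datetime", "symbol"],
)
df = pd.DataFrame({"close": np.random.rand(len(idx))}, index=idx)

# Something like this runs on every step of the backtest main loop
window = df.loc[(slice(pd.Timestamp("2023-01-02 09:30"),
                       pd.Timestamp("2023-01-02 10:00")), "AAPL"), :]
```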
Looks like you might gain speed by vectorizing your code, if at all possible. I avoid loops in Python.
So you're looping with .loc? That seems like a terrible idea. At least try to use itertuples if vectorizing is out of the question. What I did was to use vectorized solutions (numpy) where possible and otherwise switch to numba, and within the numba-compiled functions use numpy where possible and sensible. A toy sketch of the numba part is below.
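A minimal sketch of what that can look like (a toy rolling mean, not anything from the original post): the loop stays a plain Python loop, but numba compiles it, and numpy arrays are used inside.

```python
import numpy as np
from numba import njit

@njit
def rolling_mean(prices, window):
    # Plain Python loop, compiled by numba; numpy array ops used inside
    out = np.full(prices.shape[0], np.nan)
    acc = 0.0
    for i in range(prices.shape[0]):
        acc += prices[i]
        if i >= window:
            acc -= prices[i - window]
        if i >= window - 1:
            out[i] = acc / window
    return out

prices = np.random.rand(1_000_000)
fast = rolling_mean(prices, 20)  # first call compiles; later calls run at native speed
```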
Yes, I think a numpy ndarray is the right direction. I have timed the speed of numpy and pandas indexing. For numpy, indexing takes on the order of 100 nanoseconds to 1 microsecond, but for pandas the labeled indexing method .loc[] can take 100 microseconds up to 10 milliseconds. That is about 1000 times faster than pandas indexing! But I am still not sure how to switch to a numpy ndarray, as it only supports integer-location-based indexing. So if I want to perform label-based indexing, i.e. selecting certain time periods and certain assets as I did in pandas with .loc[], I am not sure how to achieve that in numpy.
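One common way to get label-like lookups on a plain ndarray is to keep a sorted datetime64 axis plus a symbol-to-column mapping, and translate labels into integer positions with np.searchsorted before slicing. A rough sketch under that assumption (the symbols and layout here are hypothetical):

```python
import numpy as np

# Hypothetical layout: one row per minute, one column per symbol,
# with a sorted datetime64 array serving as the time axis.
times = np.arange(np.datetime64("2023-01-02T09:30"),
                  np.datetime64("2023-01-02T16:00"),
                  np.timedelta64(1, "m"))
symbols = {"AAPL": 0, "MSFT": 1}
closes = np.random.rand(times.shape[0], len(symbols))

def slice_window(start, end, symbol):
    # Translate datetime labels into integer positions, then slice by position
    i = np.searchsorted(times, np.datetime64(start), side="left")
    j = np.searchsorted(times, np.datetime64(end), side="right")
    return closes[i:j, symbols[symbol]]

window = slice_window("2023-01-02T09:30", "2023-01-02T10:00", "AAPL")
```

Since the time periods in the backtest are monotonically increasing, the searchsorted calls could even be replaced by a running integer cursor that only moves forward.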
I know about the datetime64 data type in numpy, but that is not helping here. I am not trying to perform arithmetic on datetimes, just trying to find a better way to get time-series data during backtesting.
Also, I will never understand why people think that Python is easier than C++ or C#. The amount of effort that you need to get things done in Python would allow you to learn any language.