Beginner ML Trading Question

skinny · Nov 3, 2023

I'm working through a book on ML trading and there's a concept I don't understand.

What does optimal training data for an ML trading model look like? Would it consist of just OHLCV values? Wouldn't you have to train the model on data that shows profit, since that is the ultimate goal of traders? How would you express that profit in a training set?

tiddlywinks · Nov 3, 2023

skinny said:
I'm working through a book on ML trading and there's a concept I don't understand.

What does optimal training data for an ML trading model look like? Would it consist of just OHLCV values? Wouldn't you have to train the model on data that shows profit, since that is the ultimate goal of traders? How would you express that profit in a training set?

Strange question you pose...

I'm not an expert in this topic in any way. I'm even leery to say I know enough to be dangerous, because I don't think I do.

Anyway...
Unless you can reverse engineer the profits/losses back to the criteria used for each trade, you have to build your "model" first. Then you can add in profit/loss and come up with things that work. You cannot see the car in front of you in your side and rearview mirrors.

If the only input you are using is OHLCV (aka market data), then for a start, you can calculate the first derivative, which is usually a slope. Hey, it's a start.

Armed with a slope you can begin to build "something" based on similarity of slopes past. And ascertain other derivatives like duration, range, length, etc.
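
A minimal sketch of the "slope from OHLCV" idea, assuming pandas and NumPy. The closes are made up for illustration, and `rolling_slope` and the 5-bar window are hypothetical choices, not something the thread prescribes. It shows two common readings of "first derivative": the one-bar change and a least-squares slope over a short window.

```python
import numpy as np
import pandas as pd

def rolling_slope(close: pd.Series, window: int = 5) -> pd.Series:
    """Least-squares slope of the last `window` closes, in price units per bar."""
    x = np.arange(window)
    # np.polyfit with degree 1 returns [slope, intercept]; keep the slope
    return close.rolling(window).apply(lambda y: np.polyfit(x, y, 1)[0], raw=True)

# Hypothetical closes, invented for this example only
close = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0, 103.0, 102.0])

features = pd.DataFrame({
    "close": close,
    "diff_1": close.diff(),              # one-bar change (discrete first derivative)
    "slope_5": rolling_slope(close, 5),  # fitted slope over the last 5 bars
})
print(features)
```

The first `window - 1` rows of `slope_5` are NaN because there is not yet enough history to fit a line.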

IN YOUR CASE, three "concepts" come to mind. One being price only, another being volume only, and third, a combination of volume and price.
If price-only, then you need to formulate some form of commonality that exists in all pricing... slope, S&R, range, length, sentiment, etc.
Volume-only is different. Unlike price, every volume measurement starts at 0! Beyond a numeric value, you are looking for geometrics and pace...
Peaks, troughs, shapes, acceleration, deceleration, etc. Again, you have to formulate based on only OHLCV and its derivatives, because those are your only inputs. Then you can add in the profit/loss and come up with things that work.

Speaking of inputs, # of trades and open interest (where applicable) are missing from your "market data" inputs.
And there are many other inputs you can throw into your mix. Here's just a few, definitely non-exhaustive...

Official financial texts (dates, times, actual data vs data expectation, etc)
Geopolitical news and events
Specific sector news
Weather
Fundamentals (earnings, call transcripts, etc)
Cloud data: Google Trends, X, TikTok, etc.

And a few of the obvious... day of week, time of day, even/odd numbered year, political regime, # of ET members/posts, etc.
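
The calendar-style inputs mentioned above are cheap to derive from bar timestamps. A small sketch, assuming pandas and a timestamp-indexed frame; the timestamps and prices are hypothetical.

```python
import pandas as pd

# Hypothetical bar timestamps and closes, for illustration only
ts = pd.to_datetime(["2023-11-03 09:30", "2023-11-03 15:30", "2023-11-06 09:30"])
bars = pd.DataFrame({"close": [100.0, 101.0, 100.5]}, index=ts)

bars["day_of_week"] = bars.index.dayofweek    # Monday=0 ... Sunday=6
bars["hour"] = bars.index.hour                # time of day, by hour
bars["even_year"] = bars.index.year % 2 == 0  # even/odd numbered year
print(bars)
```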

Hope that helps just a little bit.
Good luck

2rosy · Nov 4, 2023

Time of day. Other markets. Order book. Activity. ...

EIDSTER · Nov 4, 2023

skinny said:
I'm working through a book on ML trading and there's a concept I don't understand.

What does optimal training data for an ML trading model look like? Would it consist of just OHLCV values? Wouldn't you have to train the model on data that shows profit, since that is the ultimate goal of traders? How would you express that profit in a training set?

When it comes to exploring ML models with market data / metrics, I have yet to find something as effective as the models I've developed "by hand" over many years. In other words, nothing ML-based has been as good as all the exploring, back-testing and optimization I have done with back-testing software and data analysis tools like Excel or pandas.

However, I keep trying because I can't seem to stop... Lol.

skinny · Nov 4, 2023

Thanks for the responses, I will continue to work through Machine Learning for Algorithmic Trading by Stefan Jansen to uncover more insights.

smacdtrader · Nov 20, 2023

It feels a little like you're fishing to over-fit your model to your training data, which will lead to very poor performance out of sample. "Optimal" data for training is just whatever you need to explore your model, and lots and lots of it. Usually the more the better, even though sometimes we do things to shrink the data to make it more manageable.

Training data is the data you use to fit your model and discover parameters, so it needs to be clean, cohesive, complete and representative of the process. That holds whatever you use: klines, volume, order book, fundamentals, etc. No garbage data or missing values. You use this data to explore and see what kinds of models work. Only once you have finished your exploration, fit your model and validated it do you test the model on a separate set of data that the model has not seen yet.
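
One way to sketch the train / validate / test workflow described above, assuming pandas and time-ordered rows. The helper `chronological_split`, the 60/20/20 fractions and the sample data are hypothetical, and dropping missing rows is only the simplest possible cleaning choice. The important property is that the split follows time order, so the test set is strictly newer than anything the model was fit or tuned on.

```python
import numpy as np
import pandas as pd

def chronological_split(df, train_frac=0.6, valid_frac=0.2):
    """Split rows in time order: oldest -> train, middle -> validation, newest -> test."""
    n = len(df)
    train_end = int(n * train_frac)
    valid_end = int(n * (train_frac + valid_frac))
    return df.iloc[:train_end], df.iloc[train_end:valid_end], df.iloc[valid_end:]

# Hypothetical daily closes with two missing values
idx = pd.date_range("2023-01-02", periods=10, freq="B")
df = pd.DataFrame(
    {"close": [100, 101, np.nan, 103, 104, 105, 106, np.nan, 108, 109]},
    index=idx,
)

clean = df.dropna()  # drop incomplete rows before any fitting
train, valid, test = chronological_split(clean)
print(len(train), len(valid), len(test))
```

Shuffling rows before splitting would mix future bars into the training set, which is why the sketch slices by position instead.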

metalztrader · Nov 26, 2023

It is an inherently confusing topic because market data is so different than most other time series data. Then on top of that ML itself is a toolbox to accomplish different tasks.
The simplest project would be to take a single time series based on close or open, engineer features and then try to predict T+1. The major issue, though, is that your features most likely have no predictive power, so the ML algorithm will figure out that the least wrong prediction for T+1 is just T. Predicting Monday's ES close as Friday's close is totally useless to trade on, but if you are stepping through the time series, that is the least wrong prediction it can make of all possible predictions.
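
The "T+1 is just T" trap can be made visible by scoring any model against that persistence forecast. A sketch assuming NumPy; the prices are a simulated random walk rather than real ES data, and `skill_vs_persistence` is a hypothetical helper name.

```python
import numpy as np

def skill_vs_persistence(prices, predictions):
    """1 - MAE(model) / MAE(persistence). Positive only if the model beats 'T+1 = T'."""
    actual = prices[1:]
    persistence = prices[:-1]
    mae_model = np.mean(np.abs(actual - predictions))
    mae_naive = np.mean(np.abs(actual - persistence))
    return 1.0 - mae_model / mae_naive

rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(0.0, 1.0, 1000))  # simulated random walk, not market data

persistence = prices[:-1]  # forecast for each T+1 is the value at T
level_corr = np.corrcoef(persistence, prices[1:])[0, 1]
baseline_skill = skill_vs_persistence(prices, persistence)

print("correlation between persistence forecast and actual level:", level_corr)
print("skill of persistence against itself:", baseline_skill)
```

The level correlation looks impressive even though the implied predicted return is always zero. A model whose skill score is not clearly above zero has learned nothing beyond repeating the last price.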
The optimal data for ML is not much different than the optimal data for discretionary trading. If you are trading ES there is no shortage of data to look at that won't help you and make things more wrong than if you didn't use it.
You need to be really careful with using closing prices too. It is easy to predict the close of XOM on Monday if we wait for the close of USO or of a stock that is highly correlated with XOM. Of course, that is totally useless for actually trading XOM.
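
A sketch of the closing-price leak described above, assuming pandas. The XOM and USO prices are invented for illustration. The same-day return of the correlated instrument is only known once that day's close has printed, so only a lagged version is available when deciding on a trade.

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=5, freq="B")
# Hypothetical closes, not real quotes
xom = pd.Series([100.0, 101.0, 100.0, 102.0, 103.0], index=idx)
uso = pd.Series([70.0, 71.0, 70.5, 72.0, 72.5], index=idx)

df = pd.DataFrame({"xom_ret": xom.pct_change(), "uso_ret": uso.pct_change()})

# Leaky feature: needs today's USO close to "predict" today's XOM close
df["uso_ret_same_day"] = df["uso_ret"]

# Usable feature: yesterday's USO return, already known before today's session
df["uso_ret_lag1"] = df["uso_ret"].shift(1)
print(df)
```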
Personally, I have found the whole idea to be a waste of time.

murray t turtle · Dec 5, 2023

skinny said:
Thanks for the responses, I will continue to work through Machine Learning for ........ uncover more insights.

%%
Like Dr Stanly noted today = ''time machine''[AKA= clock] LOL i dont use a watch ]
THAT's what I'm lookin' for\
the whisky ain't workin' anymore/M S / Travis Tritt/RINO Records.
Price > time machine , use both.

maciejz · Feb 21, 2024

When creating an ML model, assuming it is a supervised learning regression model (which is what you’re looking for in this particular case), you model the relationship between independent and dependent variables. Think back to algebra … y=f(x) … this is literally what ML is doing, it is estimating the function f() which is the relationship between x and y. Y is your dependent variable and X is your independent variable(s).

When you’re building a trading system, you will probably want y to represent the return over some time-frame of whatever market you’re modeling. So, if you’re implementing a daily SPX system, then y would be the daily SPX returns. X would be the independent variables or “features.” Raw OHLC values don’t really sound like great explanatory features. Instead, we “know” that equity markets trend and display momentum, so having some sort of trend/momentum features would make sense. But make sure that you align your x and y appropriately; you are trying to predict tomorrow’s y given today’s x. So, on any given row, the y on that row should be from one day ahead of all the x’s, if you’re trying to predict one day ahead. Of course, there are many other ways to do this; you may want the y to be the average forward 5 day return or something like that. The point is that you’re predicting something in the future.
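
The x/y alignment can be sketched with pandas `shift`. This is a minimal illustration under assumptions: the closes are made up, and the column names and the 3-day momentum feature are hypothetical choices. On each row the features use only data up to that day, while the targets come from later days.

```python
import pandas as pd

# Hypothetical daily closes, for illustration only
close = pd.Series(
    [100.0, 102.0, 101.0, 103.0, 104.0, 106.0, 105.0, 107.0],
    index=pd.date_range("2024-01-01", periods=8, freq="B"),
)

df = pd.DataFrame({"close": close})
df["ret_1d"] = df["close"].pct_change()   # today's return
df["mom_3d"] = df["close"].pct_change(3)  # x: trailing 3-day momentum, known today

# y: tomorrow's return, pulled back onto today's row
df["y_next_1d"] = df["ret_1d"].shift(-1)

# Alternative y: average of the next 5 daily returns (days t+1 .. t+5)
df["y_avg_fwd_5d"] = df["ret_1d"].rolling(5).mean().shift(-5)

# Rows usable for a one-day-ahead model: both feature and target present
dataset = df.dropna(subset=["mom_3d", "y_next_1d"])
print(dataset[["mom_3d", "y_next_1d"]])
```

The last row has no `y_next_1d` because tomorrow has not happened yet, which is exactly the row you would feed the fitted model to get a live prediction.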

Hope this helps. You have not embarked on an easy road. Developing ML models is challenging on its own, and developing ML trading models is even more challenging for many reasons. Good luck.