Data mining

Discussion in 'Data Sets and Feeds' started by Indrionas, Oct 9, 2007.

  1. Let's suppose price patterns are mined from price data (data sample).

    So we come up with a set of patterns that conforms to our preset requirements.

    These requirements could be:
    1) support - how many times the pattern showed up in our data sample: s(A)=50 would mean that pattern A showed up 50 times.

    2) confidence - % hit rate: the percentage of times the pattern predicted the target correctly, i.e. A->B (pattern A led to target B), so it's basically s(A,B)/s(A). An example could be 80% accuracy.

    3) interest - how much better than random a pattern's confidence must be to matter. I'll try to explain it with a simple example. Suppose we're analysing a data sample containing 1000 elements and mark 400 of them as our targets.
    Now, if you tried simple random prediction (guessing), you would expect an accuracy of 40%.
    Suppose we mine the data and get three patterns that meet our minimum support requirement: pattern A with a confidence of 48%, pattern B with 65% and pattern C with 32%.
    How do you know if these patterns are significant? They should be better than random by some preset threshold. Random guess accuracy is 40%, so pattern A has an advantage of 48%-40%=8%, pattern B has an advantage of 65%-40%=25% and pattern C has a negative advantage of 32%-40%=-8%, so we automatically reject pattern C.
    If we had preset the interest threshold to 10%, pattern A is also rejected (8%<10%) and pattern B is accepted (25%>10%).
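    The three requirements above can be sketched as a simple filter. This is a minimal illustration of the support / confidence / interest checks described; the function name and occurrence counts are made up for the example.

```python
def evaluate(pattern_hits, pattern_and_target_hits, n_samples, n_targets,
             min_support=30, min_interest=0.10):
    """Return (support, confidence, interest, accepted) for one pattern."""
    support = pattern_hits
    confidence = pattern_and_target_hits / pattern_hits   # s(A,B) / s(A)
    baseline = n_targets / n_samples                      # random-guess accuracy
    interest = confidence - baseline                      # edge over random
    accepted = support >= min_support and interest >= min_interest
    return support, confidence, interest, accepted

# The three patterns from the example: 1000 samples, 400 targets (40% baseline).
for name, hits, hits_with_target in [("A", 50, 24), ("B", 60, 39), ("C", 50, 16)]:
    s, c, i, ok = evaluate(hits, hits_with_target, 1000, 400)
    print(name, f"confidence={c:.0%} interest={i:+.0%} accepted={ok}")
```

    With a 10% interest threshold, only pattern B (65% confidence, +25% interest) passes.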


    The problem I see here is that validating patterns this way is not enough, because data mining for patterns produces a large amount of garbage. So the subject I would like to discuss is PATTERN VALIDATION.
    One well-known technique is out-of-sample testing: test the patterns on unseen data and see if they still conform to our preset requirements.
    Even here it's still unclear how much data there should be in the training sample (where we mine) versus the testing (out-of-sample) sample. What ratio? In our example we used 1000 data elements to mine patterns, but we could have 3000 data elements in total, so the out-of-sample data set size would be 2000 and the training:testing ratio would be 1:2. It's clear that the smaller this ratio, the better; but on the other hand the training sample has to be big enough that you can actually mine something meaningful out of it. So what's the optimal ratio? And of course, the training sample should be wide enough to cover different market conditions (uptrend, downtrend, ranging, low volatility, high volatility, etc.).
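    A minimal chronological split along the lines of the example, assuming the data is a time-ordered list. The 1000/2000 numbers mirror the example above; what the "right" ratio is remains exactly the open question.

```python
def split_train_test(series, train_size):
    # keep chronological order: never shuffle time-series data before splitting
    return series[:train_size], series[train_size:]

data = list(range(3000))                 # stand-in for 3000 price bars
train, test = split_train_test(data, 1000)
print(len(train), len(test))             # 1000 2000
```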

    This technique is widely known and used. But are there any other pattern validation techniques out there? Anyone experienced in statistical data analysis and/or data mining care to share their knowledge? :)
     
  2. Following is a little snip from my files. One problem I have with pattern recognition techniques is that they are too specific. The method misses lots of price changes that do not show the required pattern. Anyway here is an example of my approach:

    I identify different types of two-interval patterns. For example, a pattern is the high-to-low price range of one daily session followed by the high-to-low price range of the next daily session.

      |
    | |
    | |
    |

    # 1
    Next day high price greater than prior day high price.
    Next day low price greater than prior day low price.

    =====

    |
    | |
    | |
      |

    # 2
    Next day high price less than prior day high price.
    Next day low price less than prior day low price.

    ====

    My computer program first codes the sequence of daily data as integer codes, so the series looks like this: 12253426322172... Then my program searches for a specific series, say the three-value series 221. I recall testing exits using patterns, but they are so specific that few trades actually exit. Instead I use a trend following exit rule, say exit when the low price is less than the least low price of the prior 20 daily sessions.

    So I buy when the pattern is 221 and sell when daily low price is less than the least low price value of the prior 20 daily sessions.

    Position size is (2 * account equity) / (10 * opening price). I plan to change this calculation in my next version of the program. This program is still experimental.
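    A hedged sketch of these mechanics: code each day as an integer relative to the prior day, scan the code string for a target sequence, apply the 20-day-low exit, and size positions by the quoted rule. The specific code values (1, 2, 3) and the catch-all case are my own assumptions; the post only defines the two cases drawn above.

```python
def encode_day(prev_high, prev_low, high, low):
    if high > prev_high and low > prev_low:
        return 1          # pattern #1: higher high, higher low
    if high < prev_high and low < prev_low:
        return 2          # pattern #2: lower high, lower low
    return 3              # everything else (inside/outside bars, ties) - assumed

def encode_series(highs, lows):
    return "".join(str(encode_day(highs[i - 1], lows[i - 1], highs[i], lows[i]))
                   for i in range(1, len(highs)))

def exit_signal(lows, i, lookback=20):
    # trend-following exit: today's low below the least low of the prior N days
    return lows[i] < min(lows[i - lookback:i])

def position_size(equity, open_price):
    # the sizing rule quoted above: (2 * account equity) / (10 * opening price)
    return int((2 * equity) / (10 * open_price))

highs = [10, 11, 12, 13, 12, 11]
lows  = [ 9, 10, 11, 12, 11, 10]
codes = encode_series(highs, lows)
print(codes)                              # "11122"
print(exit_signal(lows, 5, lookback=3))   # True
print(position_size(100000, 59.63))       # 335
```

    A buy signal would then be a substring test such as `"221" in codes`, checked bar by bar as new codes are appended.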

    Following are some of the results of applying this method to 22.38 years of Apple Inc. stock price data:

    ===

    Preceded by about 150 trades.

    27-Mar-06 buy 59.63 size 693
    28-Mar-06 OHLC:[ 59.63 60.14 58.25 58.71 ] sell 59
    Position Net Gain Or Loss is -423
    Subtotal profit $ 108581

    23-May-06 buy 62.99 size 660
    24-May-06 OHLC:[ 62.99 63.65 61.56 63.34 ] sell 62
    Position Net Gain Or Loss is -152
    Subtotal profit $ 108429

    1-Jun-06 buy 62.99 size 661
    7-Jun-06 OHLC:[ 60.10 60.40 58.35 58.56 ] sell 58
    Position Net Gain Or Loss is -2849
    Subtotal profit $ 105580

    17-Jul-06 buy 53.16 size 762
    18-Dec-06 OHLC:[ 87.63 88.00 84.59 85.47 ] sell 84
    Position Net Gain Or Loss is 24254
    Subtotal profit $ 129835

    Total profit or loss is 129835
    Initial capital is $ 100000.

    ===

    I am still working on it. This pattern matching method appears to be very complicated, and that is the problem with it: it gets complex. If I add more securities to trade a portfolio, the method might become a mess.
     
  3. Hi Hook N. Sinker,

    I forgot to post the 4th requirement for mined patterns.
    4) complexity - the number of atomic rules that make up the pattern. I will give an example of how I look at patterns:
    I use binary atomic rules (rules that cannot be decomposed into smaller sub-rules) to define a pattern. A few examples of rules:
    today's high < yesterday's high; today's range > the range of the day before yesterday; today's close > yesterday's close; etc. I have a set of about 200 such rules applied to every day of my data sample. So I have a huge table of 1s and 0s where rows represent days and columns represent rules (or you could call them conditions).
    Now, to construct a pattern you take some of the rules and put them in a conjunction. Example: pattern A,B,C means the pattern is found on days where rules A, B and C were all true.
    The complexity of the pattern is the number of rules you used to construct it: in this example, 3.
    I think anything above 5 is too complex and too specific; you won't find many such patterns in your data sample, so they will be rare, won't pass the minimum support requirement and will have no statistical significance.
    In my opinion a max complexity of 4 is enough. With that complexity and 200 rules you can construct about 66 million patterns. That's more than enough. The problem is how to separate the garbage from the real thing.
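    The rule-table idea can be sketched as follows. The three rules and the toy bars are stand-ins for the ~200 real rules; the pattern count at the end is the sum of C(200,k) for k=1..4, which is where the "about 66 million" figure comes from.

```python
import itertools
import math

def day_features(bars, i):
    """One row of the 0/1 table: each entry is one binary atomic rule."""
    today, yday = bars[i], bars[i - 1]
    return (
        today["high"] < yday["high"],                                    # rule 0
        today["close"] > yday["close"],                                  # rule 1
        (today["high"] - today["low"]) > (yday["high"] - yday["low"]),   # rule 2
    )

def pattern_support(table, rule_idxs):
    """Days on which every rule in the conjunction is true."""
    return [i for i, row in enumerate(table) if all(row[j] for j in rule_idxs)]

bars = [
    {"high": 10.0, "low": 9.0, "close": 9.5},
    {"high": 11.0, "low": 9.0, "close": 10.5},
    {"high": 10.5, "low": 8.0, "close": 9.0},
]
table = [day_features(bars, i) for i in range(1, len(bars))]

# enumerate all conjunctions up to complexity 2 (the post caps complexity at 4)
for k in (1, 2):
    for combo in itertools.combinations(range(3), k):
        print(combo, pattern_support(table, combo))

# with 200 rules and max complexity 4:
n_patterns = sum(math.comb(200, k) for k in range(1, 5))
print(n_patterns)   # 66018450
```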


    About testing patterns: I've already explained that I mark some days as my targets and see how well the patterns predict them. No dollar, point or pip amounts are calculated at this stage of system development.
    You can also mine patterns using the size of the price move, which is a slightly different approach. You then need to volatility-normalize your data before doing any data mining with it. And the best approach is to not use stops or specific exits: just take some sort of breakout you wish to investigate and exit at market at the day's close (or after 2, 3, 4... days if you look at a bigger timeframe). Take a look at Toby Crabel's work.
     
  4. What you are doing... is "data dredging"...
    Which, by definition, is statistically fallacious.

    Your mindset and the design of your analysis... inevitably leads to this.

    http://en.wikipedia.org/wiki/Data_dredging
     
  5. JackR


    The Wikipedia article you reference says:
    A key point is that every hypothesis must be tested with evidence that was not used in constructing the hypothesis. This is because every data set must contain some chance patterns which are not present in the population under study, or which simply disappear with a sufficiently large sample size. If the hypothesis is not tested on a different data set from the same population, it is likely that the patterns found are chance patterns.

    In asking ET for advice Indrionas says:
    One well-known technique is out-of-sample testing: test the patterns on unseen data and see if they still conform to our preset requirements.
    Even here it's still unclear how much data there should be in the training sample (where we mine) versus the testing (out-of-sample) sample. What ratio? In our example we used 1000 data elements to mine patterns, but we could have 3000 data elements in total, so the out-of-sample data set size would be 2000 and the training:testing ratio would be 1:2. It's clear that the smaller this ratio, the better; but on the other hand the training sample has to be big enough that you can actually mine something meaningful out of it. So what's the optimal ratio? And of course, the training sample should be wide enough to cover different market conditions (uptrend, downtrend, ranging, low volatility, high volatility, etc.).



    It seems to me Indrionas fully understands the difference between mining and dredging and does not want statistically inadequate results. What am I missing here? Have I misunderstood your comment?

    Jack
     
  6. Solutions to this problem need not be arbitrary. This subject has been well-studied and documented in the literature. Other resampling methods include k-fold cross validation and bootstrapping. I recommend Weiss and Kulikowski's Computer Systems That Learn.
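    One of the mentioned resampling methods, k-fold cross-validation, can be sketched in a few lines: split the outcomes into k folds and estimate the hit rate on each held-out fold. The toy 0/1 outcome series is illustrative only.

```python
import random

def kfold_confidence(outcomes, k=5, seed=0):
    """Return the per-fold hit rate of a 0/1 outcome series."""
    idx = list(range(len(outcomes)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [sum(outcomes[i] for i in fold) / len(fold) for fold in folds]

outcomes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0] * 10   # 60% hit rate overall
per_fold = kfold_confidence(outcomes)
print([round(c, 2) for c in per_fold])
```

    For market data, contiguous (blocked) folds are usually preferred over shuffled ones, since shuffling leaks information across time.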

    -Will


    Data Mining in MATLAB
     
  7. The out-of-sample testing (and cross-validation in general) is only a partial solution.

    It works nicely if your out-of-sample results are similar to your in-sample results every time.

    More likely they are not. If a rule looks bad out-of-sample, you reject your hypothesis (= trading idea) and start over again with a new idea.

    If you do this many, many times you are vulnerable to the same data dredging fallacy, because eventually you will find a trading rule that looks good both in-sample and out-of-sample. Unfortunately, in that case it looks good just by chance.

    In other words, you should impose a certain discipline on yourself: Work as hard as possible on a trading rule using in-sample data. When you are 100% convinced you have something really robust, test it out-of-sample. If it looks bad, reject the rule and start over. If you have to reject your rules frequently and don't see any improvement over time, terminate your trading career and choose another profession.
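    The multiple-comparisons trap described here can be put in numbers: if each out-of-sample test has, say, a 5% chance of passing by luck alone, repeated idea-testing makes a lucky pass almost inevitable. The 5% figure is an illustrative assumption.

```python
def prob_false_discovery(alpha, n_tests):
    # probability that at least one of n independent tests passes by chance
    return 1 - (1 - alpha) ** n_tests

for n in (1, 10, 50, 100):
    print(n, round(prob_false_discovery(0.05, n), 3))
```

    After 100 independent tries, the chance of at least one spurious "good" rule exceeds 99%.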
     
    What one really needs to do is determine the expectancy of a given pattern over time. Reject all patterns with negative expectancy, and trade the pattern with the highest expectancy.

    You should also subtract slippage and commission from that number. A high trading frequency can take a big chunk out of your final tally, even when expectancy is at a maximum.
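    The expectancy calculation, net of costs, can be sketched as below. The win rate, average win/loss and cost figures are illustrative only.

```python
def expectancy(win_rate, avg_win, avg_loss, cost_per_trade=0.0):
    """Expected profit per trade, minus per-trade slippage and commission."""
    return win_rate * avg_win - (1 - win_rate) * avg_loss - cost_per_trade

gross = expectancy(0.40, 300.0, 150.0)                      # before costs
net = expectancy(0.40, 300.0, 150.0, cost_per_trade=12.0)   # after costs
print(gross, net)
```

    At high trading frequency, the `cost_per_trade` term dominates the comparison between two patterns with similar gross expectancy.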
     
  9. I would venture...
    That you could replace "price data" with "snowflakes"...
    And come up with an equivalent number of "statistically meaningful patterns".

    IMO...
    As someone with a successful trading business for 15 years...
    Built entirely on quantitative analysis using proprietary software...

    (1) You have to approach this differently.
    You have to start with rational correlations...
    Such as oil stocks vs oil price, gold stocks vs gold price, various bond market relationships, etc...
    And then look for EXPLOITABLE INEFFICIENCIES in a fairly narrow way.

    (2) Doing #1 requires a significant amount of fundamental knowledge and trading experience.

    What I am also implying...
    If you take a PhD quant and unlimited computer resources...
    BUT the quant has NO trading experience and NO detailed fundamental knowledge...
    It would be impossible to come up with anything more than very marginal trading strategies...
    Using this type of buckshot "pattern matching" approach.
     
    While you are correct, this should never happen if models are being constructed appropriately. All that the test data (by whatever method) provides is an unbiased estimate of model error.

    Simplistically, assume that some test indicates that our model's tested performance is within specified tolerances of some hypothetical performance, with some probability. If this process is repeated enough times, the probabilities of making a mistake accumulate to the point of uselessness.

    As you say, no serious analyst would make this mistake.
     
    #10     Oct 10, 2007