Data mining challenge

Discussion in 'Data Sets and Feeds' started by Indrionas, Dec 18, 2010.

  1. ronblack

    ronblack

    Can you explain what the numbers on the attachment mean?
     
    #31     Dec 25, 2010


  2. Jack, what you wrote makes absolutely no sense.


    The result file contains 10 patterns, one pattern per line.
    A pattern is made up of variable numbers, and the sign before a number denotes whether that variable's value must be true or false (+ for true, - for false). For example, the first pattern:

    -175 -7 +18 +35 -79

    literally means: "if (var #175 is false) and (var #7 is false) and (var #18 is true) and (var #35 is true) and (var #79 is false) then target is true"

    Variables are counted from zero. There are 300 input variables, therefore they are indexed from 0 to 299.

    The model is this:
    if at least one (out of those 10) patterns is true, then target is predicted to be true.
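
    To make this concrete, here is a small Python sketch of the evaluation rule (just an illustration, not the code behind the challenge; the sign is read from the text token so that a "+0" term stays distinguishable from "-0"):

    ```python
    # Illustrative sketch only: how a pattern line is interpreted and how the
    # 10-pattern model makes a prediction.

    def parse_pattern(line):
        """Parse a line like '-175 -7 +18 +35 -79' into (variable index, required value)
        pairs. The sign is taken from the token itself so '+0' and '-0' stay distinct."""
        return [(abs(int(tok)), not tok.startswith('-')) for tok in line.split()]

    def pattern_matches(pattern, row):
        """row is a sequence of 300 booleans, the input variables indexed from 0."""
        return all(row[i] == required for i, required in pattern)

    def predict(patterns, row):
        """The model: target is predicted true if at least one pattern matches."""
        return any(pattern_matches(p, row) for p in patterns)

    # First pattern from the result file:
    p1 = parse_pattern("-175 -7 +18 +35 -79")
    ```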

    The parameters I used when generating the data set:
    number of data points - 3000
    target frequency - 10% (300 targets in 3000 rows)
    pattern accuracy - 70% (this means that 30% of the model's positive predictions are false positives)
    target predictability - 80% (this means that 20% of all targets (60 out of 300) are not predictable by any pattern, i.e. they are unexplainable)
    Also, only 48 out of the 300 input variables are relevant (i.e. used in the patterns)
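
    For what it's worth, a data set with these parameters could be produced roughly like the sketch below. This is only an illustration of what the percentages mean, not the actual generator (it ignores chance matches of the patterns in the random background, so the realized accuracy comes out a little below 70%):

    ```python
    import random

    N_ROWS, N_VARS = 3000, 300
    N_TARGETS = 300        # 10% target frequency
    PREDICTABILITY = 0.80  # 80% of targets get a pattern planted into them
    ACCURACY = 0.70        # 70% of pattern-covered rows are true targets

    def make_pattern(n_attrs=5):
        """A planted pattern: n_attrs distinct variables, each with a required value."""
        return [(v, random.random() < 0.5) for v in random.sample(range(N_VARS), n_attrs)]

    def plant(row, pattern):
        """Force a row to satisfy a pattern."""
        for i, required in pattern:
            row[i] = required

    patterns = [make_pattern() for _ in range(10)]
    rows = [[random.random() < 0.5 for _ in range(N_VARS)] for _ in range(N_ROWS)]
    targets = [False] * N_ROWS

    target_rows = random.sample(range(N_ROWS), N_TARGETS)
    for i in target_rows:
        targets[i] = True

    # 240 of the 300 targets are made explainable by planting a pattern into them.
    for i in random.sample(target_rows, int(PREDICTABILITY * N_TARGETS)):
        plant(rows[i], random.choice(patterns))

    # False positives: plant patterns into enough non-target rows that only ~70%
    # of the pattern-covered rows are true targets (240 / 0.7 - 240, about 103 rows).
    n_false = round(PREDICTABILITY * N_TARGETS * (1 - ACCURACY) / ACCURACY)
    non_target_rows = [i for i in range(N_ROWS) if not targets[i]]
    for i in random.sample(non_target_rows, n_false):
        plant(rows[i], random.choice(patterns))
    ```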
     
    #32     Dec 25, 2010
  3. tim888

    tim888

    In your file you have:

    -175 -7 +18 +35 -79 +145 -171 -213 +15 -78 +109 -288 +0 -107 -293 +208 +161 +157 +13 -238 -180 +195 +270 -179 -205 -186 -204 +109 +36 -69 -75 -179 +285 -106 -216 +138 -127 -200 -271 +47 +37 +12 -147 +17 -94 -191 -114 -125 +284 -16

    Why are you using only the first 5 numbers to form the pattern?

    "if (var #175 is false) and (var #7 is false) and (var #18 is true) and (var #35 is true) and (var #79 is false) then target is true"


    I have another question: if only 48 variables are relevant, why do you have 50 of them in the result above?
     
    #33     Dec 25, 2010

  4. Sorry for the misunderstanding. You are probably viewing the file with a simple Windows text editor. The newlines in the file use the Unix format, so the whole file shows up as a single line. I re-encoded it to be viewable for Windows users (attachment). There are 10 patterns, one pattern per line, five variables per pattern.

    There are 48 relevant variables rather than 50 because variables 109 and 179 each appear twice.
     
    #34     Dec 25, 2010
  5. tim888

    tim888

    Thanks for the clarifications. I still don't see the point. I understand you generated a number of targets from a number of binary inputs using a few patterns known to you, and then you populated the rest of the output with random entries. Am I right so far?

    Then you asked in this forum if people could find the patterns. I think this is also right.

    I still don't see how this can relate to trading. It sounds like something that is more related to communications, and especially to descrambling.
     
    #35     Dec 26, 2010
  6. Thanks for the additional information and for providing a file in your new format.

    Now I can see that all you posted were IDs of data reductions.

    I am binary oriented, and what has always been significant to me is how the granularity of the market can be used to complete a deductive process that sets forth an ongoing evaluation of data which reveals the unfolding order of events in a context where no noise and no anomalies exist.

    In my post I used the numbers (your pattern IDs) as segment values which were "signed" to make the direction of the segment pertinent.

    For me, it was a fun analysis, and for you it did not make sense since you were using a notation that you got from some process of converting raw data into a constraint set.

    You are combining binary with probability.

    For me, fortunately, when I work with binary vectors, probability, noise and anomalies are eliminated. This is a nice orientation when it comes to continually extracting the market's offer fully.
     
    #36     Dec 26, 2010
  7. rdg

    rdg

    OP, in the interest of seeing this thread continue, have you verified that you can get meaningful results from this data set when you know a priori which inputs are relevant?
     
    #37     Dec 29, 2010

  8. Yes, that is the core of the problem. The NN comes up with similar results (i.e. generalization estimates) whether I train it with all 300 variables or with only the 48 relevant ones: sometimes a little better, sometimes a little worse, but no significant difference. And I tried as many training settings as I could think of, with no breakthrough:
    * weights initialization
    * number of hidden units
    * back-propagating RMS error and cross-entropy error
    * learning rate from large to very low values
    * with and without weight decay, turning it on from the beginning and later in the training, etc.

    It just seems the NN is not powerful enough for this kind of problem. But I'm not a neural networks expert, so I may be missing something important here.
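
    For concreteness, the kind of run I mean looks roughly like the sketch below. It is not my actual code - scikit-learn's MLPClassifier is used here only as a stand-in, and the data is random filler - but the shape of the experiment is the same. One thing worth keeping in mind at a 10% base rate: always predicting "false" already gives 90% accuracy, so raw accuracy is a poor yardstick.

    ```python
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Random filler in place of the real data set: 3000 rows, 300 binary inputs, ~10% targets.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(3000, 300))
    y = rng.random(3000) < 0.10

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

    clf = MLPClassifier(
        hidden_layer_sizes=(32,),   # number of hidden units
        activation='logistic',
        alpha=1e-4,                 # L2 weight decay
        learning_rate_init=0.01,
        max_iter=500,
    )                               # MLPClassifier minimizes cross-entropy (log-loss)
    clf.fit(X_tr, y_tr)
    print("test accuracy:", clf.score(X_te, y_te))  # compare against the ~0.90 all-false baseline
    ```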

    I'm currently investigating other mining methods.
     
    #38     Dec 30, 2010
  9. rdg

    rdg

    Well, that doesn't surprise me:

    With only 3000 examples, I can easily come up with sets of 16 variables that encode the entire target, noise and all. And the result is meaningless. With 48 variables, the state space is 2^48 (281,474,976,710,656). It shouldn't be hard to see why it is difficult to tease out the true relationship with only 3000 examples.
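
    To put numbers on it: 2^16 = 65,536 is already far more than 3000 rows, so a lookup table keyed on any 16 variables can memorize nearly the whole target column even when the target is pure noise. A quick sketch (illustrative only, with random filler data):

    ```python
    import random
    from collections import Counter, defaultdict

    random.seed(0)
    N, V = 3000, 300
    X = [[random.random() < 0.5 for _ in range(V)] for _ in range(N)]
    y = [random.random() < 0.10 for _ in range(N)]   # pure-noise target, no signal at all

    key_vars = random.sample(range(V), 16)           # any 16 variables will do

    # Majority label per 16-bit key: nothing but memorization of the training rows.
    buckets = defaultdict(list)
    for row, t in zip(X, y):
        buckets[tuple(row[i] for i in key_vars)].append(t)

    def lookup(row):
        labels = buckets.get(tuple(row[i] for i in key_vars))
        return Counter(labels).most_common(1)[0][0] if labels else False

    train_acc = sum(lookup(row) == t for row, t in zip(X, y)) / N
    print(train_acc)   # nearly 1.0 in-sample; on fresh rows it is no better than always guessing False
    ```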

    That said, other people and I could have told you interesting things about the signal, so all is not lost. Do you know at what point the NNs become useless for you? How about posting the simplest examples you can find that you don't know how to tackle, and seeing what we come up with?
     
    #39     Dec 30, 2010
  10. Yes, the search space is huge. But that doesn't mean there are that many patterns in the data. In fact, the effective search space is greatly reduced, simply because only a very small fraction of all patterns are "interesting". And by an "interesting" pattern I mean one that has at least some support, is statistically significant, is not redundant and survives validation.

    Here's something to ponder:
    there are ((48!/43!) / 5!) * 2^5 = 54'793'728 patterns that have 5 attributes
    3'113'280 - 4 attributes
    138'368 - 3 attributes
    4'512 - 2 attributes
    96 - 1 attribute

    total: 58'049'984 patterns that have 1 to 5 attributes, where each attribute can be required to be either "true" or "false".
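
    These counts are just binomial coefficients times the 2^k sign choices; a couple of lines confirm them:

    ```python
    from math import comb

    counts = {k: comb(48, k) * 2**k for k in range(1, 6)}
    print(counts)                 # {1: 96, 2: 4512, 3: 138368, 4: 3113280, 5: 54793728}
    print(sum(counts.values()))   # 58049984
    ```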

    That's not that big of a search space. Only a very small fraction of patterns will pass the "interestingness" criteria, and only the true patterns will pass the validation tests after that. That's nothing a single CPU cannot do in a few moments; I have run searches that explored spaces with sizes in the hundreds of billions.

    The problem is still tractable with 300 attributes.

    The problem becomes infeasible to solve with a simple brute-force apriori algorithm when you consider more than 300 attributes and patterns more complex than 5 attributes. So clear thinking suggests that some sort of trade-off must be made - i.e. complete results can no longer be guaranteed and the exploration of patterns becomes probabilistic. This is achieved by using some sort of heuristic; genetic/evolutionary/neural algorithms are all based on some heuristic.
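
    As a rough sketch of what such a brute-force pass looks like (illustrative only - "interestingness" is reduced here to a minimum support and a minimum confidence, and a real apriori implementation would prune candidate sets instead of enumerating them naively):

    ```python
    from itertools import combinations, product

    def mine(X, y, max_attrs=2, min_support=20, min_confidence=0.5):
        """Brute-force pass: enumerate every signed combination of up to max_attrs
        variables and keep the ones with enough support and confidence."""
        n_vars = len(X[0])
        found = []
        for k in range(1, max_attrs + 1):
            for vars_ in combinations(range(n_vars), k):
                for signs in product((True, False), repeat=k):
                    pattern = list(zip(vars_, signs))
                    hits = [t for row, t in zip(X, y)
                            if all(row[i] == req for i, req in pattern)]
                    if len(hits) >= min_support:
                        confidence = sum(hits) / len(hits)
                        if confidence >= min_confidence:
                            found.append((pattern, len(hits), confidence))
        return found

    # e.g. mine(rows, targets, max_attrs=2) over a 3000 x 300 boolean data set
    ```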
     
    #40     Dec 30, 2010