Data mining challenge

Discussion in 'Data Sets and Feeds' started by Indrionas, Dec 18, 2010.

  1. ronblack

    ronblack

    Can you explain what the numbers on the attachment mean?
     
    #31     Dec 25, 2010


  2. Jack, what you wrote makes absolutely no sense.


    The result file contains 10 patterns, one pattern per line.
    A pattern is made up of variable numbers, and the sign before a number denotes whether that variable's value must be true or false (+ for true, - for false). For example, the first pattern:

    -175 -7 +18 +35 -79

    literally means: "if (var #175 is false) and (var #7 is false) and (var #18 is true) and (var #35 is true) and (var #79 is false) then target is true"

    Variables are counted from zero. There are 300 input variables, therefore they are indexed from 0 to 299.

    The model is this:
    if at least one (out of those 10) patterns is true, then target is predicted to be true.
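
    To make this concrete, here is a small Python sketch of the evaluation rule (just an illustration, not the code behind the challenge; the sign is read from the text token so that a "+0" term stays distinguishable from "-0"):

    ```python
    # Illustrative sketch only: how a pattern line is interpreted and how the
    # 10-pattern model makes a prediction.

    def parse_pattern(line):
        """Parse a line like '-175 -7 +18 +35 -79' into (variable index, required value)
        pairs. The sign is taken from the token itself so '+0' and '-0' stay distinct."""
        return [(abs(int(tok)), not tok.startswith('-')) for tok in line.split()]

    def pattern_matches(pattern, row):
        """row is a sequence of 300 booleans, the input variables indexed from 0."""
        return all(row[i] == required for i, required in pattern)

    def predict(patterns, row):
        """The model: target is predicted true if at least one pattern matches."""
        return any(pattern_matches(p, row) for p in patterns)

    # First pattern from the result file:
    p1 = parse_pattern("-175 -7 +18 +35 -79")
    ```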

    The parameters I used when generating the data set:
    number of data points - 3000
    target frequency - 10% (300 targets in 3000 rows)
    pattern accuracy - 70% (this means that 30% of the model's positive predictions are false positives)
    target predictability - 80% (this means that 20% of all targets (60 out of 300) are not predictable by any pattern, i.e. they are unexplainable)
    Also, only 48 out of the 300 input variables are relevant (i.e. used in the patterns)
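
    For what it's worth, a data set with these parameters could be produced roughly like the sketch below. This is only an illustration of what the percentages mean, not the actual generator (it ignores chance matches of the patterns in the random background, so the realized accuracy comes out a little below 70%):

    ```python
    import random

    N_ROWS, N_VARS = 3000, 300
    N_TARGETS = 300        # 10% target frequency
    PREDICTABILITY = 0.80  # 80% of targets get a pattern planted into them
    ACCURACY = 0.70        # 70% of pattern-covered rows are true targets

    def make_pattern(n_attrs=5):
        """A planted pattern: n_attrs distinct variables, each with a required value."""
        return [(v, random.random() < 0.5) for v in random.sample(range(N_VARS), n_attrs)]

    def plant(row, pattern):
        """Force a row to satisfy a pattern."""
        for i, required in pattern:
            row[i] = required

    patterns = [make_pattern() for _ in range(10)]
    rows = [[random.random() < 0.5 for _ in range(N_VARS)] for _ in range(N_ROWS)]
    targets = [False] * N_ROWS

    target_rows = random.sample(range(N_ROWS), N_TARGETS)
    for i in target_rows:
        targets[i] = True

    # 240 of the 300 targets are made explainable by planting a pattern into them.
    for i in random.sample(target_rows, int(PREDICTABILITY * N_TARGETS)):
        plant(rows[i], random.choice(patterns))

    # False positives: plant patterns into enough non-target rows that only ~70%
    # of the pattern-covered rows are true targets (240 / 0.7 - 240, about 103 rows).
    n_false = round(PREDICTABILITY * N_TARGETS * (1 - ACCURACY) / ACCURACY)
    non_target_rows = [i for i in range(N_ROWS) if not targets[i]]
    for i in random.sample(non_target_rows, n_false):
        plant(rows[i], random.choice(patterns))
    ```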
     
    #32     Dec 25, 2010
  3. tim888

    tim888

    In your file you have:

    -175 -7 +18 +35 -79 +145 -171 -213 +15 -78 +109 -288 +0 -107 -293 +208 +161 +157 +13 -238 -180 +195 +270 -179 -205 -186 -204 +109 +36 -69 -75 -179 +285 -106 -216 +138 -127 -200 -271 +47 +37 +12 -147 +17 -94 -191 -114 -125 +284 -16

    Why are you using only the first 5 numbers to form the pattern?

    "if (var #175 is false) and (var #7 is false) and (var #18 is true) and (var #35 is true) and (var #79 is false) then target is true"


    I have another question: if only 48 variables are relevant, why do you have 50 of them in the result above?
     
    #33     Dec 25, 2010

  4. Sorry for the misunderstanding. You are probably viewing the file with a simple Windows text editor. The newlines in the file use the Unix format, so the whole file shows up as a single line. I re-encoded it to be viewable for Windows users (attachment). There are 10 patterns, one pattern per line, five variables per pattern.

    There are 48 relevant variables rather than 50 because variables 109 and 179 each appear twice.
     
    #34     Dec 25, 2010
  5. tim888

    tim888

    Thanks for the clarifications. I still don't see the point. I understand you generated a number of targets from a number of binary inputs using a few patterns known to you, and then you populated the rest of the output with random entries. Am I right so far?

    Then you asked in this forum if people could find the patterns. I think this is also right.

    I still don't see how this can relate to trading. It sounds like something that is more related to communications, and especially to descrambling.
     
    #35     Dec 26, 2010
  6. Thanks for the additional information and for providing a file in your new format.

    Now I can see that all you posted were IDs of data reductions.

    I am binary oriented, and what has always been significant to me is how the granularity of the market can be used to complete a deductive process that sets forth an ongoing evaluation of data which reveals the unfolding order of events in a context where no noise and no anomalies exist.

    In my post I used the numbers (your pattern IDs) as segment values which were "signed" to make the direction of the segment pertinent.

    For me, it was a fun analysis, and for you it did not make sense since you were using a notation that you got from some process of converting raw data into a constraint set.

    You are combining binary with probability.

    For me, fortunately, when I work with binary vectors, probability, noise and anomalies are eliminated. This is a nice orientation when it comes to continually extracting the market's offer fully.
     
    #36     Dec 26, 2010
  7. rdg

    rdg

    OP, in the interest of seeing this thread continue, have you verified that you can get meaningful results from this data set when you know a priori which inputs are relevant?
     
    #37     Dec 29, 2010

  8. Yes, that is the core of the problem. The NN comes up with similar results (i.e. generalization estimates) whether I train it with all 300 variables or with only the 48 relevant ones: sometimes a little better, sometimes a little worse, but no significant difference. And I tried as many training settings as I could think of, with no breakthrough:
    * weights initialization
    * number of hidden units
    * back-propagating RMS error and cross-entropy error
    * learning rate from large to very low values
    * with and without weight decay, turning it on from the beginning and later in the training, etc.

    It just seems the NN is not powerful enough for this kind of problem. But I'm not a neural networks expert, so I may be missing something important here.
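
    For concreteness, the kind of run I mean looks roughly like the sketch below. It is not my actual code - scikit-learn's MLPClassifier is used here only as a stand-in, and the data is random filler - but the shape of the experiment is the same. One thing worth keeping in mind at a 10% base rate: always predicting "false" already gives 90% accuracy, so raw accuracy is a poor yardstick.

    ```python
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Random filler in place of the real data set: 3000 rows, 300 binary inputs, ~10% targets.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(3000, 300))
    y = rng.random(3000) < 0.10

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

    clf = MLPClassifier(
        hidden_layer_sizes=(32,),   # number of hidden units
        activation='logistic',
        alpha=1e-4,                 # L2 weight decay
        learning_rate_init=0.01,
        max_iter=500,
    )                               # MLPClassifier minimizes cross-entropy (log-loss)
    clf.fit(X_tr, y_tr)
    print("test accuracy:", clf.score(X_te, y_te))  # compare against the ~0.90 all-false baseline
    ```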

    I'm currently investigating other mining methods.
     
    #38     Dec 30, 2010
  9. rdg

    rdg

    Well, that doesn't surprise me:

    With only 3000 examples, I can easily come up with sets of 16 variables that encode the entire target, noise and all. And the result is meaningless. With 48 variables, the state space is 2^48 (281,474,976,710,656). It shouldn't be hard to see why it is difficult to tease out the true relationship with only 3000 examples.
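
    To put numbers on it: 2^16 = 65,536 is already far more than 3000 rows, so a lookup table keyed on any 16 variables can memorize nearly the whole target column even when the target is pure noise. A quick sketch (illustrative only, with random filler data):

    ```python
    import random
    from collections import Counter, defaultdict

    random.seed(0)
    N, V = 3000, 300
    X = [[random.random() < 0.5 for _ in range(V)] for _ in range(N)]
    y = [random.random() < 0.10 for _ in range(N)]   # pure-noise target, no signal at all

    key_vars = random.sample(range(V), 16)           # any 16 variables will do

    # Majority label per 16-bit key: nothing but memorization of the training rows.
    buckets = defaultdict(list)
    for row, t in zip(X, y):
        buckets[tuple(row[i] for i in key_vars)].append(t)

    def lookup(row):
        labels = buckets.get(tuple(row[i] for i in key_vars))
        return Counter(labels).most_common(1)[0][0] if labels else False

    train_acc = sum(lookup(row) == t for row, t in zip(X, y)) / N
    print(train_acc)   # nearly 1.0 in-sample; on fresh rows it is no better than always guessing False
    ```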

    That said, other people and I could have told you interesting things about the signal, so all is not lost. Do you know at what point the NNs become useless for you? How about posting the simplest examples you can find that you don't know how to tackle, and seeing what we come up with?
     
    #39     Dec 30, 2010
  10. Yes, the search space is huge. But that doesn't mean there are that many patterns in the data. In fact, the effective search space is greatly reduced, simply because only a very small fraction of all patterns are "interesting". And by an "interesting" pattern I mean one that has at least some support, is statistically significant, is not redundant and survives validation.

    Here's something to ponder:
    there are ((48!/43!) / 5!) * 2^5 = 54'793'728 patterns that have 5 attributes
    3'113'280 - 4 attributes
    138'368 - 3 attributes
    4'512 - 2 attributes
    96 - 1 attribute

    total: 58'049'984 patterns that have 1 to 5 attributes, where each attribute can be required to be either "true" or "false".
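
    These counts are just binomial coefficients times the 2^k sign choices; a couple of lines confirm them:

    ```python
    from math import comb

    counts = {k: comb(48, k) * 2**k for k in range(1, 6)}
    print(counts)                 # {1: 96, 2: 4512, 3: 138368, 4: 3113280, 5: 54793728}
    print(sum(counts.values()))   # 58049984
    ```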

    That's not that big of a search space. Only a very small fraction of patterns will pass the "interestingness" criteria, and only the true patterns will pass the validation tests after that. That's nothing a single CPU cannot do in a few moments; I have run searches that explored spaces with sizes in the hundreds of billions.

    The problem is still tractable with 300 attributes.

    The problem becomes infeasible to solve with a simple brute-force apriori algorithm when you consider more than 300 attributes and patterns more complex than 5 attributes. So clear thinking suggests that some sort of trade-off must be made - i.e. complete results can no longer be guaranteed and the exploration of patterns becomes probabilistic. This is achieved by using some sort of heuristic; genetic/evolutionary/neural algorithms are all based on some heuristic.
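
    As a rough sketch of what such a brute-force pass looks like (illustrative only - "interestingness" is reduced here to a minimum support and a minimum confidence, and a real apriori implementation would prune candidate sets instead of enumerating them naively):

    ```python
    from itertools import combinations, product

    def mine(X, y, max_attrs=2, min_support=20, min_confidence=0.5):
        """Brute-force pass: enumerate every signed combination of up to max_attrs
        variables and keep the ones with enough support and confidence."""
        n_vars = len(X[0])
        found = []
        for k in range(1, max_attrs + 1):
            for vars_ in combinations(range(n_vars), k):
                for signs in product((True, False), repeat=k):
                    pattern = list(zip(vars_, signs))
                    hits = [t for row, t in zip(X, y)
                            if all(row[i] == req for i, req in pattern)]
                    if len(hits) >= min_support:
                        confidence = sum(hits) / len(hits)
                        if confidence >= min_confidence:
                            found.append((pattern, len(hits), confidence))
        return found

    # e.g. mine(rows, targets, max_attrs=2) over a 3000 x 300 boolean data set
    ```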
     
    #40     Dec 30, 2010