Log in or Sign up

ET News & Sponsor Info

General Topics

Markets

Technical Topics

Brokerage Firms

Company Specific

Interactive Brokers

Tools of the Trade

Trading for a Living

Community Lounge

Site Support

Feedback

Data mining challenge

Discussion in 'Data Sets and Feeds' started by Indrionas, Dec 18, 2010.

Indrionas
- 250
  Posts
- 0
  Likes
Quote from goodgoing:

(1) Is there a random process in the pattern that generated them

or

(2) You put them there yourself to make the problem harder for people?

More...

Sorry you worked up so much over my response. Didn't mean to attack you. I just don't want you to waste time over a problem you don't understand.

I will reiterate: the problem is simple. There are 300 input variables and 1 target. All variables are binary. There are patterns hidden in the data. They predict target better than random. Not all targets are predictable by those patterns. And not all inputs are relevant (used in the said patterns).

Once again about the data format. It is CSV and can be viewed with Excel. I'm very sorry if this sounds too complicated and/or impossible for you. There is a very basic reason why it's not .xls. I have Excel 2003 and it limits number of columns to 256. The data table has 301 column.

I'm not interested in flame wars and I will not lower myself to your level.

#11 Dec 19, 2010

Share
kut2k2
- 5,045
  Posts
- 455
  Likes
Indrionas, have you tried applying a regression analysis directly to the data?

#12 Dec 19, 2010

Share
goodgoing
- 478
  Posts
- 38
  Likes
Quote from Indrionas:

I just don't want you to waste time over a problem you don't understand.
More...

Really? Let us know see who is the one who doesn't understand his own problem and he is basically a phaken troll or a malicious poster..

Quote from Indrionas:

The data is synthetic. Means I have generated it and I know the rules (real model) that generated it.
More...

Quote from Indrionas:

Not all targets are predictable. Some part of them were added as random noise.
More...

Quote from Indrionas:

I will reveal the real model that generated the data to compare with your results.
More...

Quote from Indrionas:

I can understand this can be proprietary (and very expensive) information. My first priority is to find out if it's possible to crack the problem. The method itself is secondary.
More...

As it is proven beyond reasonable doubt from the above statements of yours you

(1) know the model that generated the output. At least that is what you claim

(2) You have injected the output with random noise. Question: why would you want to do that?

(3) You are fishing here for someone who can tackle this very hard program, which is possibly either a Ph.D thesis or a problem from a contest that carries a very high prize.

This is a stochastic data mining problem where the output is contaminated with noise, either for security or other purposes, and you want to see if someone will be able to isolate the noise and find the deterministic pattern that generated the output.

Either you do not understand the problem and you are a troll, or you are a malicious poster with an agenda.

Since you know the pattern that generated the data, why are you asking people to find it in the first place? Are you testing people for their data mining skills?

Don't avoid to answer the key questions once more.

#13 Dec 19, 2010

Share
Indrionas
- 250
  Posts
- 0
  Likes
Quote from goodgoing:

(1) know the model that generated the output. At least that is what you claim
More...

True.

(2) You have injected the output with random noise. Question: why would you want to do that?
More...

True. This is part of the data generation process. The reason is very simple: to simulate data that occurs in reality and to test data mining techniques that are powerful enough to deal with this issue.

(3) You are fishing here for someone who can tackle this very hard program, which is possibly either a Ph.D thesis or a problem from a contest that carries a very high prize.
More...

This is just plain stupidity. Nice conspiracy theory you've got there

The data is generated by me. I wrote a small program that does that. I can generate any data set I want to test mining techniques against. I specify how many data points to generate, how many input variables, what percentage of data points must have target = true, what percentage of data points are predictable by the patterns, how many patterns to generate, how complex the patterns should be and what's the accuracy (target hit %) of the patterns.

There's nothing special about this particular data set that I attached in the second post. Here's another example (attaching). It has 30 input variables and only two of them are relevant in predicting the target. Trains very well with a simple backpropagation NN.

This is a stochastic data mining problem where the output is contaminated with noise, either for security or other purposes, and you want to see if someone will be able to isolate the noise and find the deterministic pattern that generated the output.

Either you do not understand the problem and you are a troll, or you are a malicious poster with an agenda.
More...

Here we go again. Another conspiracy theory. I will not comment these attacks in the future.

Since you know the pattern that generated the data, why are you asking people to find it in the first place? Are you testing people for their data mining skills?
More...

It's written in the very first post: "My first priority is to find out if it's possible to crack the problem."

I don't understand why are you trolling my thread. You do not contribute anything constructive and if you continue this way then I will simply ignore you. Not to mention your disrespectful tone which I find very immature.
- test04.txt
  
  File size:
  
  124 KB
  
  Views:
  
  91
#14 Dec 19, 2010

Share
rosy2
- 1,398
  Posts
- 3
  Likes
can't you just plug this data into a neural net library and get your results? there are many free ones. whats the challenge here?

#15 Dec 19, 2010

Share
jack hershey
- 8,705
  Posts
- 402
  Likes
Quote from goodgoing:

Yep, it is not for me. You said it. Listen you imbecile, I aksed you a specififc question and you came back attacking me. I asked you what the purpose of the random entries

(1) Is there a random process in the pattern that generated them

or

(2) You put them there yourself to make the problem harder for people?

Excel files, you imbecile, help people to see the data in an orderly way. FYI, you can output the data in .txt format from excel.

Now, I am not going to work on your homework problem. I know which class you are taking you imbecile. You are soooooooooooooooooo stupid you cannot solve a trivial problem.

"Introduction to Data Mining"
More...

You are having trouble getting up to speed here.

Apparently, you are not used to dealing with binary systems not binanry math.

Hardly anyone syeps out of the box in the financial indurtry. there is no reason for you to step out either.

#16 Dec 19, 2010

Share
jack hershey
- 8,705
  Posts
- 402
  Likes
Quote from rosy2:

can't you just plug this data into a neural net library and get your results? there are many free ones. whats the challenge here?
More...

He said the process is relativley unimportant, so he could follow your suggestion.

After that, he gets to where Lo got in his oft cited paper. As Lo did, he selects a noise filter and then surmises the utility of various non random patterns.

Lo actualy got his results published in the Journal of Finance and anyone who accepts them (Like NSF and the financial industry, gets the consequences).

Look at the sample charts in Lo. There are 17 OB's in 77 bars. The project was destroyed right then and there.

So far the OP has swung to each side (300 to 30) of the date indepentent variable set possible sizing

This means he has failed to recognize How the process of getting high utility works.

I mentioned that he undervalued the result he is seeking.

Lo didn't make it; the literature is absent of the solution; but the riddle of induction has been proven by using paradigm theory.

At least, as a person and researcher, the OP he has grasp that there is a powerful result simply because data can be put in a "natural" indicator form as a consequence of NOT being able to use continuous functions because of raw market's data being granular.

Doesn't reason say that if a system exits, then it can be observed and, through reason, it can be absolutely and wholly defined?

The tools set for doing this, creates the process that makes an abosolute and wholistic definition perfectly possible.

Noise is NOT going to be present. Anomalies are NOT going to be present. Both appear in the work of those who are beginning.

What is wrong with successively changing a problem to a smaller problem?

One person adds noise (the OP here); another person (Lo) arbitrarily "subtracts noise" improperly.

The OP may be getting ready to depart from a given constraint; LO did and he did it immediately by dropping half of the market variables out of the problem. lol. See Lo speaking in the movie "Inside Job". lol

Very early on in TA convergence, divergnece and stochastics were introduced. The primary effoct was to attain a non solution to an opportunity. Its like Edison or Greenspan at work. Tesla replaced Edison and no one replaced Greenspan.

Bolivia has all the Lithium but they can't get it an deliver it. lol; Its a 15 minute problem.

#17 Dec 19, 2010

Share
kut2k2
- 5,045
  Posts
- 455
  Likes
Quote from goodgoing:

From the naive website you linked to:

"However, Chande and Kroll in "The New Technical Trader" (page 9) have shown that common indicators like momentum, RSI, stochastics, ADX, etc. have correlations with each other ranging from 0.77 to 0.93 (r-squared values). Consequently, these forecast models are being fed an empty diet of numbers mostly saying the same thing. "

Can I call the authors (burning) Candle and Troll?
More...

Why not? You're immature enough.

Quote from goodgoing:

Why do people who do not understand TA write books? For one, what is important when using indicators is not so much their values and their corellations but the patterns of divergences. These do not corellate that high, actually they may not corellate at all.
More...

But they do correlate (note the correct spelling). And since when are divergences guaranteed to be true signals? I've seen a multitude of false-alarm divergences.

Quote from goodgoing:

Then, the main objective of DDR is to de-corellate the inputs.
More...

Not for those of us who can read well. De-correlation is just one of the objectives. The other one is dimension reduction.

See what happens when you post with an angry agenda and don't comprehend the situation fully?

#18 Dec 19, 2010

Share
jack hershey
- 8,705
  Posts
- 402
  Likes
Quote from kut2k2:

Why not? You're immature enough.
But they do correlate (note the correct spelling). And since when are divergences guaranteed to be true signals? I've seen a multitude of false-alarm divergences.
Not for those of us who can read well. De-correlation is just one of the objectives. The other one is dimension reduction.

See what happens when you post with an angry agenda and don't comprehend the situation fully?
More...

small addition.

Encription of financial real time trading data (including signals) is not uncommon.

A carrier approahc could be used.

Additional comment.

Slower fractal are defacto carriers of faster fractals. This synthesis of information can be systematically extracted. what could be better than using a noise and annomaly free approach?

Markets are NOT rocket science in any way.

Does it seems logical the use a binary target for making money. There is no other; look at the order system for participating in markets.

#19 Dec 20, 2010

Share
intradaybill
- 2,961
  Posts
- 11
  Likes
The fact that most of the time indicators correlate high is irrelevant. It is when they correlate low that they have some value.

But to understand this one must have real trading experience.

#20 Dec 21, 2010

Share

(You must log in or sign up to reply here.)

Search