Data mining challenge

Discussion in 'Data Sets and Feeds' started by Indrionas, Dec 18, 2010.


  1. Sorry you got so worked up over my response. :D I didn't mean to attack you; I just don't want you to waste time on a problem you don't understand.

    I will reiterate: the problem is simple. There are 300 input variables and 1 target. All variables are binary. There are patterns hidden in the data, and they predict the target better than random. Not all targets are predictable from those patterns, and not all inputs are relevant (i.e., used in the said patterns).

    Once again about the data format: it is CSV and can be viewed with Excel. I'm very sorry if this sounds too complicated and/or impossible for you. There is a very basic reason why it's not .xls: I have Excel 2003, which limits the number of columns to 256, and the data table has 301 columns.
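
    For anyone who wants to load the attachment programmatically rather than in Excel, here is a minimal Python sketch. It assumes the file is named data.csv, has no header row, and keeps the target in the last column; adjust as needed.

    Code:
    import csv

    # Load the 301-column CSV as integers: 300 binary inputs
    # plus the binary target in the last column (assumed layout).
    with open("data.csv", newline="") as f:
        rows = [[int(v) for v in line] for line in csv.reader(f)]

    X = [row[:-1] for row in rows]  # 300 input variables per data point
    y = [row[-1] for row in rows]   # binary target
    print(len(X), "rows,", len(X[0]), "inputs")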

    I'm not interested in flame wars and I will not lower myself to your level.
     
    #11     Dec 19, 2010
  2. kut2k2

    kut2k2

    Indrionas, have you tried applying a regression analysis directly to the data?
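
    A minimal sketch of what such a direct regression could look like, assuming scikit-learn and the X/y arrays from the CSV-loading snippet above. The L1 penalty is one (hypothetical) way to surface the irrelevant inputs, since their coefficients shrink to zero.

    Code:
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Regress the binary target on all 300 binary inputs and check
    # out-of-sample accuracy via 5-fold cross-validation.
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    scores = cross_val_score(model, X, y, cv=5)
    print("mean CV accuracy:", scores.mean())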
     
    #12     Dec 19, 2010
  3. Really? Let us see who the one is who doesn't understand his own problem and who is basically a troll or a malicious poster.

    As is proven beyond reasonable doubt by your statements above, you:

    (1) know the model that generated the output, or at least that is what you claim;

    (2) have injected the output with random noise (question: why would you want to do that?); and

    (3) are fishing here for someone who can tackle this very hard problem, which is possibly either a Ph.D. thesis or a problem from a contest that carries a very high prize.

    This is a stochastic data mining problem where the output is contaminated with noise, either for security or other purposes, and you want to see if someone will be able to isolate the noise and find the deterministic pattern that generated the output.

    Either you do not understand the problem and you are a troll, or you are a malicious poster with an agenda.

    Since you know the pattern that generated the data, why are you asking people to find it in the first place? Are you testing people for their data mining skills?

    Don't avoid answering the key questions once more.
     
    #13     Dec 19, 2010
  4. True.

    True. This is part of the data generation process. The reason is very simple: to simulate data that occurs in reality and to test data mining techniques that are powerful enough to deal with this issue.



    This is just plain stupidity. Nice conspiracy theory you've got there :D

    The data is generated by me; I wrote a small program that does it. I can generate any data set I want to test mining techniques against. I specify how many data points to generate, how many input variables to use, what percentage of data points must have target = true, what percentage of data points are predictable by the patterns, how many patterns to generate, how complex the patterns should be, and what the accuracy (target hit %) of the patterns is.
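
    For illustration only, a rough Python sketch of how such a generator might be structured. The OP's actual program is not shown anywhere in the thread, so every parameter name and design choice below is a guess; the hidden "patterns" are modeled as small conjunctions over a few input columns.

    Code:
    import random

    def generate(n_rows=10000, n_inputs=300, n_patterns=5, pattern_len=3,
                 frac_predictable=0.5, frac_true=0.5, accuracy=0.9, seed=42):
        """Hypothetical re-creation of the described generator: random binary
        rows, a fraction of which get their target set by one of a few hidden
        conjunctive patterns (with a given hit accuracy); the rest are noise."""
        rng = random.Random(seed)
        # Each hidden pattern fixes a few (column, value) pairs,
        # e.g. (x17 == 1) and (x203 == 0)  ->  target = 1.
        patterns = [[(rng.randrange(n_inputs), rng.randrange(2))
                     for _ in range(pattern_len)] for _ in range(n_patterns)]
        rows = []
        for _ in range(n_rows):
            x = [rng.randrange(2) for _ in range(n_inputs)]
            if rng.random() < frac_predictable:
                # Force one pattern to match, then let it predict the
                # target with the specified accuracy (injected noise).
                for col, val in rng.choice(patterns):
                    x[col] = val
                t = 1 if rng.random() < accuracy else 0
            else:
                # Unpredictable row: the target is just a biased coin flip.
                t = 1 if rng.random() < frac_true else 0
            rows.append(x + [t])
        return rows, patterns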

    There's nothing special about this particular data set that I attached in the second post. Here's another example (attached). It has 30 input variables and only two of them are relevant to predicting the target. It trains very well with a simple backpropagation NN.
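
    Something along these lines reproduces that experiment, with scikit-learn's MLPClassifier standing in for whatever backpropagation implementation the OP actually used, and assuming the 30-variable attachment has been loaded into X and y the same way as the first one. The held-out split matters: with 28 of the 30 inputs irrelevant, training accuracy alone would be misleading.

    Code:
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Train a small backpropagation net and score it on unseen rows.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    nn = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
    nn.fit(X_train, y_train)
    print("held-out accuracy:", nn.score(X_test, y_test))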



    Here we go again. Another conspiracy theory. I will not comment on these attacks in the future.



    It's written in the very first post: "My first priority is to find out if it's possible to crack the problem."

    I don't understand why you are trolling my thread. You do not contribute anything constructive, and if you continue this way I will simply ignore you. Not to mention your disrespectful tone, which I find very immature.
     
    #14     Dec 19, 2010
  5. rosy2

    rosy2

    Can't you just plug this data into a neural net library and get your results? There are many free ones. What's the challenge here?
     
    #15     Dec 19, 2010
  6. You are having trouble getting up to speed here.

    Apparently, you are not used to dealing with binary systems or binary math.

    Hardly anyone steps out of the box in the financial industry. There is no reason for you to step out either.
     
    #16     Dec 19, 2010
  7. He said the process is relatively unimportant, so he could follow your suggestion.

    After that, he gets to where Lo got in his oft-cited paper. As Lo did, he selects a noise filter and then surmises the utility of various non-random patterns.

    Lo actually got his results published in the Journal of Finance, and anyone who accepts them (like the NSF and the financial industry) gets the consequences.

    Look at the sample charts in Lo. There are 17 OBs in 77 bars. The project was destroyed right then and there.

    So far the OP has swung to each side (300 to 30) of the possible sizing of the independent variable set.

    This means he has failed to recognize how the process of getting high utility works.

    I mentioned that he undervalued the result he is seeking.

    Lo didn't make it; the solution is absent from the literature; but the riddle of induction has been solved by using paradigm theory.

    At least, as a person and researcher, the OP has grasped that there is a powerful result simply because data can be put in a "natural" indicator form, a consequence of NOT being able to use continuous functions since raw market data is granular.

    Doesn't reason say that if a system exists, then it can be observed and, through reason, it can be absolutely and wholly defined?

    The tool set for doing this creates the process that makes an absolute and holistic definition perfectly possible.

    Noise is NOT going to be present. Anomalies are NOT going to be present. Both appear in the work of those who are beginning.

    What is wrong with successively changing a problem to a smaller problem?

    One person adds noise (the OP here); another person (Lo) arbitrarily "subtracts noise" improperly.

    The OP may be getting ready to depart from a given constraint; Lo did, and he did it immediately by dropping half of the market variables out of the problem. lol. See Lo speaking in the movie "Inside Job". lol

    Very early on in TA, convergence, divergence, and stochastics were introduced. The primary effect was to attain a non-solution to an opportunity. It's like Edison or Greenspan at work. Tesla replaced Edison, and no one replaced Greenspan.

    Bolivia has all the lithium, but they can't extract it and deliver it. lol; it's a 15-minute problem.
     
    #17     Dec 19, 2010
  8. kut2k2

    kut2k2

    Why not? You're immature enough.
    But they do correlate (note the correct spelling). And since when are divergences guaranteed to be true signals? I've seen a multitude of false-alarm divergences.
    Not for those of us who can read well. De-correlation is just one of the objectives. The other one is dimension reduction.
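
    To make the two objectives concrete, PCA is one standard technique that delivers both at once (chosen here purely as an illustration, since the quoted post this replies to is not shown; X is the 300-input matrix loaded earlier). The components are uncorrelated by construction, and truncating them reduces dimension.

    Code:
    from sklearn.decomposition import PCA

    # Project the 300 binary inputs onto 10 uncorrelated components.
    pca = PCA(n_components=10)
    Z = pca.fit_transform(X)  # de-correlated, lower-dimensional features
    print("variance retained:", pca.explained_variance_ratio_.sum())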

    See what happens when you post with an angry agenda and don't comprehend the situation fully?
     
    #18     Dec 19, 2010
  9. A small addition.

    Encryption of financial real-time trading data (including signals) is not uncommon.

    A carrier approach could be used.

    Additional comment.

    Slower fractals are de facto carriers of faster fractals. This synthesis of information can be systematically extracted. What could be better than using a noise- and anomaly-free approach?

    Markets are NOT rocket science in any way.

    Does it seem logical to use a binary target for making money? There is no other; look at the order system for participating in markets.
     
    #19     Dec 20, 2010
  10. The fact that indicators correlate highly most of the time is irrelevant. It is when they correlate weakly that they have some value.

    But to understand this, one must have real trading experience.
     
    #20     Dec 21, 2010