Data mining challenge

Discussion in 'Data Sets and Feeds' started by Indrionas, Dec 18, 2010.

  1. ronblack

    ronblack

    Why just 1 to 5 attributes? Did you pull that out of nowhere or it is related to the problem in a way you only knew? If you know in advance how many attributes are significant in a mining problem yes you can reduce the search space but in reality you don't know.
     
    #41     Dec 31, 2010

  2. Yes, that's what I am talking about. In this example problem I know in advance that there are 5 attributes max.

    The problem is what to do when it is not known in advance.
    One approach could be to use this "max number of attributes" as a training parameter. Then run training gradually increasing this parameter until you either reach satisfactory result or computation becomes too expensive.

    And even if that's the case, there's usually not that many patterns in the data as the theoretical size of search space. It all depends on the density of data.
    For example, you don't have to explore patterns with attributes (A=1 AND B=1 AND C=1 AND X=x) if (A=1 AND B=1 AND C=1) occurs only a few times and adding complexity will reduce occurrence below the desired level of significance.
     
    #42     Dec 31, 2010
  3. Several years back (7, I think) I was working on an AI project where I built a genetic algorithms tool kit. I know this will sound like the maxim, "If the tool you know is a hammer, every problem looks like a nail." From my recollection, this would be a good means of attach.

    By now there are lots of these kits around. I found some free ones on SourceForge.net. Check them out.

    I wish I had the time to pursue your challenge.
     
    #43     Dec 31, 2010

  4. Evolutionary algorithms are not suitable for this kind of problem. They rely on a fitness function. In order for an evolutionary algorithm (such as GA) to work productively, the fitness function must be monotonic. This is not the case with searching for exact patterns. A simple example:

    Pattern ABCDE is significant. Meanwhile patterns ABCD, BCDE, DEF, ABDE, BDEFG are insignificant, therefore their fitness function values will be low and they won't be picked as parents to produce a new generation of patterns, even though they would have been good candidates.
     
    #44     Jan 7, 2011
  5. Good points about evolutionary algorithms. What do you mean by exact patterns?
     
    #45     Jan 7, 2011
  6. It's been a while since I was intimate with this field so my memory could be faulty. However, I don't believe the fitness function had to be monotonic with proper gene pool management.

    GA's lack of fitness for pattern recognition is beyond my recollection.
     
    #46     Jan 7, 2011
  7. rdg

    rdg

    What makes you say this? I can only see that being the case when your patterns are constructed with nonlinearities (ie, xor).
     
    #47     Jan 7, 2011
  8. Hugin

    Hugin

    The problem is not constrained to evolutionary algorithms (and generally there is no requirement on the fitness function being a monotone function).

    I think all optimization/classification algorithms require some structural information to guide the search. Traditional algorithms often use the gradient (e.g. in back-propagation training of neural networks).

    Very simplified, genetic algorithms relies on some similarity (using some measure) between strings corresponding to good solutions. All that is required is a fitness function that can tell whether a solution is better than another. But, and I guess this is what you refer to, in order for the search to be efficient the fitness functions must be able to order the different solutions. In a ”sparse” solution space, e.g. where only a few solutions have a non zero fitness values, the search can degenerate to a (slow) random search.

    You could argue that if there is no structure present then random search is the best method.

    What method to use depends on how you know about the underlying structure and in academic examples, like this one, how much of this knowledge you are allowed to use when creating the search algorithm. I can definitely see GA/GP being used to solve this kind of problem provided that there are enough examples to explain the underlying generating algorithm. Whether this is the most efficient method is different question (probably not).
     
    #48     Jan 10, 2011
  9. Why don't you read his post carefully before pulling the trigger? He said for the GA's to work productively the FF must be monotic. It doesn't have to be.
     
    #49     Jan 10, 2011
  10. One of us needs remedial English classes.

    I agree that it doesn't have to be.
     
    #50     Jan 10, 2011