Document classification using genetic algorithms

I have a bit of a problem with my university project.

I need to implement document classification using a genetic algorithm.

I looked at this example and (let's say) understood the principles of genetic algorithms, but I'm not sure how they can be applied to document classification. In particular, I can't figure out the fitness function.

Here is what I have come up with so far (it may be completely wrong):

Assume that I have categories, and each category is described by some keywords.
Split the file into words.
Create the initial population of arrays (for example, 100 arrays, though it depends on the file size), each filled with random words from the file.
1:
Choose the best category for each child in the population (by counting the keywords in it).
Cross over every 2 children in the population (a new array containing half of each child): "crossover".
Fill the remaining children left over from the crossover with random unused words from the file: "evolution??".
Replace random words in a random child of the new population with a random word from the file (used or not): "mutation".
Copy the best results to the new population.
Go to 1 until a population limit is reached or a category is found enough times.

I'm not sure whether this is correct, and I would be glad to get some advice.
I really appreciate it!

2 answers

Ivane, to apply a GA to document classification properly:

  • You need to reduce the problem to a system of components that can be evolved.
  • You cannot train a GA for document classification on a single document.
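
One way to make the classifier itself something a GA can evolve is to let each individual be a table of per-category keyword weights, and to define fitness as the fraction of already-labelled documents it classifies correctly. A minimal Python sketch under that assumption (the names `classify` and `fitness`, the weight-table layout, and the toy data are hypothetical, not taken from the question):

```python
from collections import Counter

def classify(weights, document):
    """Return the category whose weighted keyword count is highest."""
    counts = Counter(document.lower().split())
    scores = {
        category: sum(w * counts[word] for word, w in keyword_weights.items())
        for category, keyword_weights in weights.items()
    }
    return max(scores, key=scores.get)

def fitness(weights, labelled_docs):
    """Fraction of (text, category) pairs that are classified correctly."""
    correct = sum(classify(weights, text) == label for text, label in labelled_docs)
    return correct / len(labelled_docs)

# Hypothetical individual: one weight per (category, keyword) pair.
# These weights are the "genes" that crossover and mutation would change.
individual = {
    "sport": {"goal": 1.0, "match": 0.5},
    "finance": {"stock": 1.0, "market": 0.8},
}
docs = [
    ("the match ended with a late goal", "sport"),
    ("the stock market fell", "finance"),
]
print(classify(individual, docs[0][0]), fitness(individual, docs))
```

With this kind of representation the fitness function needs many labelled documents, not the words of one file.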

So, the steps you described are on the right track, but I will give you some improvements:

  • Have enough training data: you need a set of documents that are already classified and diverse enough to cover the range of documents that you are likely to encounter.
  • Train your GA to correctly classify a subset of these documents, i.e. the training data set.
  • At each generation, test your best specimen against a validation data set and stop training if the validation accuracy starts to decline.
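
The split into a training set and a validation set could be done roughly like this. This is only a sketch: the function name `split_dataset`, the 20% validation share, and the fixed seed are my assumptions, and the documents are placeholders.

```python
import random

def split_dataset(labelled_docs, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the labelled documents and split it in two."""
    docs = list(labelled_docs)
    random.Random(seed).shuffle(docs)
    # Always hold out at least one document for validation.
    n_validation = max(1, int(len(docs) * validation_fraction))
    return docs[n_validation:], docs[:n_validation]

# Placeholder data: (text, category) pairs that are already classified.
labelled = [(f"document {i}", "sport" if i % 2 else "finance") for i in range(10)]
training_set, validation_set = split_dataset(labelled)
print(len(training_set), len(validation_set))
```

The GA only ever sees the training set during evolution; the validation set is used solely to decide when to stop.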

So here is what you want to do:

    // Randomly generate the initial population once, before the loop
    population[] = randomlyGenerateGAs();

    bestGA = default;
    prevValidationFitness = default;
    currentValidationFitness = default;

    do
    {
        prevValidationFitness = currentValidationFitness;

        // Train your population on the training data set
        candidate = Train(population);

        // Get the validation fitness of the best specimen
        currentValidationFitness = Validate(candidate);

        // Keep the specimen only if it validates better than the previous one
        if (currentValidationFitness.IsBetterThan(prevValidationFitness))
            bestGA = candidate;

        // Make your selection (e.g. half of the population, roulette wheel selection, or random selection)
        selection[] = makeSelection(population);

        // Mate the specimens in the selection (each mating involves a crossover and possibly a mutation)
        population = mate(selection);
    } while (currentValidationFitness.IsBetterThan(prevValidationFitness));
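
For something you can actually run, here is a rough Python sketch of the same idea. It is not a definitive implementation. The assumptions are mine: individuals are plain lists of numbers, the population is created once before the loop, selection is truncation (top half) with elitism, and a `patience` parameter allows a few non-improving generations before stopping (`patience=1` stops at the first one). All function names are hypothetical.

```python
import random

def evolve(population, train_fitness, validation_fitness, crossover, mutate,
           max_generations=200, patience=5, rng=None):
    """Evolve `population`; stop when validation fitness stops improving."""
    if rng is None:
        rng = random.Random(0)
    best, best_validation, stale = None, float("-inf"), 0
    for _ in range(max_generations):
        # Rank individuals by fitness on the training set, best first.
        ranked = sorted(population, key=train_fitness, reverse=True)
        # Score the current champion on the held-out validation set.
        score = validation_fitness(ranked[0])
        if score > best_validation:
            best, best_validation, stale = ranked[0], score, 0
        else:
            stale += 1
            if stale >= patience:
                break
        # Truncation selection: the top half become parents (needs >= 2).
        parents = ranked[: max(2, len(ranked) // 2)]
        # Elitism: keep the champion unchanged, breed the rest.
        population = [ranked[0]] + [
            mutate(crossover(*rng.sample(parents, 2), rng), rng)
            for _ in range(len(population) - 1)
        ]
    return best, best_validation

def crossover(a, b, rng):
    """One-point crossover: head of `a` joined to the tail of `b`."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(individual, rng, rate=0.1):
    """Add Gaussian noise to each gene with probability `rate`."""
    return [g + rng.gauss(0, 0.3) if rng.random() < rate else g
            for g in individual]

# Toy usage. Real fitness functions would measure classification accuracy
# on the training and validation sets; here one toy function stands in for both.
target = [1.0, -2.0, 0.5]

def toy_fitness(individual):
    return -sum((g - t) ** 2 for g, t in zip(individual, target))

rng = random.Random(1)
start = [[rng.uniform(-3, 3) for _ in target] for _ in range(20)]
best, best_score = evolve(start, toy_fitness, toy_fitness, crossover, mutate, rng=rng)
```

The patience counter is a design choice on my part: GA fitness tends to be noisy from one generation to the next, so stopping at the very first dip can end training too early.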

Whenever you receive a new document (one that has not been previously classified), you can now classify it with your best GA:

 category = bestGA.Classify(document); 

So, this is not an end-all-be-all solution, but it should give you a decent start.

Pozdravi, Cyril


Source: https://habr.com/ru/post/1335877/