Neural network input optimization

I am trying to build an application that detects which images on a web page are advertisements. Once an image is identified as an ad, I will block it from being displayed on the client side.

I am mainly using the backpropagation algorithm to train the neural network, with the dataset available here: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements.

But this dataset does not contain much data, and the number of attributes is very high. In fact, one of the project mentors told me that training a neural network with so many attributes would take a very long time. So, is there a way to optimize the input dataset? Or do I just have to use all of these attributes?

+1
3 answers

1558 is actually a small number of features/attributes, and the number of instances (3279) is also small. The problem is not the dataset but the learning algorithm.

ANNs are slow to train; I suggest you use logistic regression or an SVM instead. Both train very quickly, and for SVMs in particular there are many fast training algorithms.

In this dataset you are really analyzing text, not images. I think a classifier from the linear family, i.e. logistic regression or a linear SVM, is better suited to your task.

If this is for production and you cannot use open-source libraries, logistic regression is much easier to implement yourself than a good ANN or SVM.

If you decide to go with logistic regression or an SVM, I can recommend some papers or source code for you later.
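
For illustration, here is a minimal sketch of this approach using scikit-learn (the library choice and the local file name `ad.data` are assumptions; the data itself is the UCI file linked in the question, whose last column holds the "ad."/"nonad." label):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Missing values appear as "?" in the raw file; they are coerced to 0
# here purely for simplicity.
df = pd.read_csv("ad.data", header=None)
X = df.iloc[:, :-1].apply(pd.to_numeric, errors="coerce").fillna(0).values
y = (df.iloc[:, -1].str.strip() == "ad.").astype(int).values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Both linear models train in seconds on a dataset of this size.
for model in (LogisticRegression(max_iter=1000), LinearSVC()):
    model.fit(X_train, y_train)
    print(type(model).__name__, "accuracy:", model.score(X_test, y_test))
```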

+5

If you really use a backpropagation network with 1558 input nodes and only 3279 samples, then training time is the least of your problems: even with a very small network with one hidden layer of 10 neurons, you have 1558 * 10 weights between the input layer and the hidden layer. How can you expect a good estimate of 15,580 degrees of freedom from only 3279 samples? (And this simple calculation does not even take the "curse of dimensionality" into account.)
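
To make the arithmetic concrete, here is that parameter count spelled out (a toy calculation; it also includes the bias terms, which the estimate above omits):

```python
# Degrees-of-freedom count for the network described above:
# 1558 inputs, one hidden layer of 10 neurons, 1 output.
n_in, n_hidden, n_out = 1558, 10, 1
weights = n_in * n_hidden + n_hidden * n_out  # 15580 + 10 = 15590
biases = n_hidden + n_out                     # 11
params = weights + biases                     # 15601
samples = 3279
print(f"{params} free parameters, {samples} samples "
      f"-> {samples / params:.2f} samples per parameter")
```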

You need to analyze your data to find out how to optimize it. Try to understand your raw data: which features (or tuples of features) are (jointly) statistically significant? (Use standard statistical methods for this.) Are some features redundant? (Principal component analysis is a good guide here, as in the sketch below.) Don't expect the artificial neural network to do this analysis for you.
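
A minimal sketch of the suggested dimensionality reduction with PCA, assuming scikit-learn and the same `ad.data` file name as in the earlier sketch (the 95% variance threshold is an arbitrary illustrative choice, not a recommendation):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("ad.data", header=None)
X = df.iloc[:, :-1].apply(pd.to_numeric, errors="coerce").fillna(0).values

# PCA is scale-sensitive, so standardize the columns first.
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X.shape[1], "features reduced to", X_reduced.shape[1], "components")
```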

Also: remember the famous "no free lunch" theorem (see Duda & Hart): no classification algorithm works for every problem. For any classification algorithm X, there is a problem on which flipping a coin gives better results than X. Given this, deciding which algorithm to use before analyzing your data may not be a smart idea. You may have chosen an algorithm that actually performs worse than blind guessing on your specific problem! (By the way: Duda, Hart & Stork's book on pattern classification is a great starting point if you haven't read it yet.)

+1

Apply a separate ANN to each feature category, e.g. 457 inputs / 1 output for the url terms (ANN1), 495 inputs / 1 output for the origurl terms (ANN2), ...

then train them all and use another master ANN to join their results, as in the sketch below.
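
A minimal sketch of this scheme with scikit-learn MLPs (the column ranges for the url and origurl term groups follow the dataset description but are assumptions here, as is the use of predicted probabilities as the joining features):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("ad.data", header=None)
X = df.iloc[:, :-1].apply(pd.to_numeric, errors="coerce").fillna(0).values
y = (df.iloc[:, -1].str.strip() == "ad.").astype(int).values

# Assumed column ranges: 457 url-term columns, then 495 origurl-term columns.
groups = {"url": slice(4, 461), "origurl": slice(461, 956)}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# One small ANN per feature group, each reduced to a single output.
sub_nets = {}
for name, cols in groups.items():
    net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=0)
    net.fit(X_train[:, cols], y_train)
    sub_nets[name] = net

def joined(X_part):
    """Stack each sub-network's P(ad) output into a new feature matrix."""
    return np.column_stack([sub_nets[name].predict_proba(X_part[:, cols])[:, 1]
                            for name, cols in groups.items()])

# A master ANN joins the sub-network outputs. (Note: training the master on
# the same data the sub-nets saw risks leakage; a held-out split is safer.)
master = MLPClassifier(hidden_layer_sizes=(5,), max_iter=500, random_state=0)
master.fit(joined(X_train), y_train)
print("stacked accuracy:", master.score(joined(X_test), y_test))
```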

0

Source: https://habr.com/ru/post/1301624/

