Should I keep or delete identical training examples that represent different objects?

I prepared a data set for recognizing a certain type of object (about 2240 examples of negative objects and only about 90 examples of positive objects). However, after calculating 10 attributes for each object in the data set, the number of unique training instances dropped to about 130 and 30, respectively.

Since the identical training instances actually represent different objects, can I say that this duplication carries relevant information (for example, the distribution of object values) that could be useful in one way or another?

+5
1 answer

If you omit duplicates, you distort the base rate of each distinct object. If the training data is a representative sample of the real world, you do not want that, because you would effectively be training for a slightly different world (one with different base rates).

To clarify this point, consider a scenario in which there are only two distinct objects. Your source data contains 99 instances of object A and 1 instance of object B. After discarding duplicates, you have 1 instance of A and 1 instance of B, so A and B now look equally common. A classifier trained on the deduplicated data will differ significantly from one trained on the source data.
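
The shift in base rates can be checked with a few lines of Python. This is only a minimal sketch using the counts from the scenario above; the helper name `base_rates` is made up for illustration.

```python
from collections import Counter


def base_rates(instances):
    """Return the empirical frequency of each distinct instance."""
    counts = Counter(instances)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


# Counts taken from the scenario: 99 objects A and 1 object B.
source = ["A"] * 99 + ["B"]

# Deduplication keeps exactly one copy of each distinct instance.
deduplicated = sorted(set(source))

print(base_rates(source))        # {'A': 0.99, 'B': 0.01}
print(base_rates(deduplicated))  # {'A': 0.5, 'B': 0.5}
```

Any learner that estimates how common A and B are sees a 99:1 split in the first case and a 1:1 split in the second.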

My advice is to leave duplicates in the data.
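
If the repeated rows are inconvenient to store, one possible alternative (an assumption on my part, not something the question asked about) is to collapse identical rows but keep their counts as weights, which preserves the same information. A rough sketch on hypothetical toy data:

```python
from collections import Counter

# Hypothetical toy data: (feature_vector, label) pairs, with repeats.
data = (
    [((1, 0), "neg")] * 3
    + [((1, 0), "pos")]
    + [((0, 1), "pos")] * 2
)

# Collapse identical pairs, remembering how often each one occurred.
counts = Counter(data)
unique_rows = list(counts)
weights = [counts[row] for row in unique_rows]

for row, weight in zip(unique_rows, weights):
    print(row, weight)
```

The weights sum to the original number of examples, so the base rates are unchanged. They could then be passed to a learner that accepts per-example weights (for instance, the `sample_weight` argument that many scikit-learn estimators take in `fit`). Whether this gives the same model as training on the raw duplicates depends on the learner.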

+10

Source: https://habr.com/ru/post/1204025/

