Detect hidden tags by comparing foreground and background color, as you already do. To judge size, calculate the average text size with outliers removed. Also set predefined limits on the text size (which you have already done as well).
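As a rough sketch of the size check, assuming you can already extract a list of font sizes from the document: the function names, the median-based outlier rule, and the ratio limits below are illustrative choices, not something the approach prescribes.

```python
from statistics import mean, median

def average_text_size(sizes, tolerance=0.5):
    """Mean font size after dropping values far from the median.

    A size counts as an outlier when it differs from the median by more
    than `tolerance * median` (an assumed rule; tune it for your data).
    """
    if not sizes:
        raise ValueError("sizes must not be empty")
    mid = median(sizes)
    kept = [s for s in sizes if abs(s - mid) <= tolerance * mid]
    # If everything was dropped (e.g. two wildly different sizes),
    # fall back to the median itself.
    return mean(kept) if kept else mid

def is_suspicious_size(size, sizes, min_ratio=0.6, max_ratio=2.5):
    """True if `size` falls outside the predefined limits around the average."""
    avg = average_text_size(sizes)
    return size < min_ratio * avg or size > max_ratio * avg

sizes = [12, 12, 11, 12, 13, 1, 72]
print(average_text_size(sizes))      # the 1pt and 72pt runs are ignored
print(is_suspicious_size(1, sizes))  # tiny text is flagged
```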
Next, look at the structure of the tag blobs. As a first pass, you can simply count the words: if one occurs far too often (perhaps 5 times more often than the second most frequent word), you can mark it as a repeating tag.
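A minimal sketch of that word count might look like the following. The stop-word list, the tokenizing regex, and the function name are assumptions for illustration; the 5x ratio is the rough figure suggested above.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; without one, words
# like "the" would dominate the counts in ordinary prose.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def find_repeating_tag(text, ratio=5.0, stopwords=STOPWORDS):
    """Return the most frequent word if it occurs at least `ratio` times
    as often as the second most frequent word, else None."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if w not in stopwords]
    ranked = Counter(words).most_common(2)
    if len(ranked) < 2:
        return None  # not enough distinct words to compare
    (top_word, top_count), (_, second_count) = ranked
    return top_word if top_count >= ratio * second_count else None

text = "cheap " * 10 + "flights to paris and cheap hotels in paris"
print(find_repeating_tag(text))
```

This only compares the top two words, so it is a first filter rather than a full detector.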
When adding tags en masse, users often put them all in one place, so you will see similar “fraud tags” appearing close to each other (possibly with one or two words in between).
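One way to look for such clusters, sketched under the assumption that you already have a set of suspect words (for example from a word count): `tag_clusters`, `max_gap`, and `min_tags` are hypothetical names and thresholds.

```python
def tag_clusters(words, known_tags, max_gap=2, min_tags=3):
    """Find runs of suspect words separated by at most `max_gap` other words.

    Returns a list of (start_index, end_index) pairs, one per run that
    contains at least `min_tags` suspect words.
    """
    clusters = []
    current = []  # indices of suspect words in the run being built

    def close_run():
        if len(current) >= min_tags:
            clusters.append((current[0], current[-1]))

    for i, word in enumerate(words):
        if word not in known_tags:
            continue
        # Too many ordinary words since the last suspect word: start a new run.
        if current and i - current[-1] - 1 > max_gap:
            close_run()
            current = []
        current.append(i)
    close_run()
    return clusters

words = ("great article about cats buy cheap pills now "
         "cheap pills online thanks").split()
print(tag_clusters(words, {"buy", "cheap", "pills"}))
```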
If you can identify at least some common “fraud tags” and want to get a little more advanced, you can do the following:
- Divide the document into parts that share the same font / text size and analyze each part separately. For more accurate results, group parts that use almost the same font / size, not just those that have EXACTLY the same font / size.
- Count the occurrences of each known tag in a part. When a count exceeds the limit you have defined, either delete that part of the document or mark the document as “bad” (as in “uses excess tags”).
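The two steps above could be sketched roughly as follows. The `Run` type, the size tolerance, and the per-tag limit are assumptions for illustration; here "almost the same" is simplified to an identical font name plus a size within `size_tolerance` points, and only consecutive runs are grouped.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Run:
    """A stretch of text with uniform formatting (hypothetical type)."""
    font: str
    size: float
    text: str

def group_runs(runs, size_tolerance=1.0):
    """Merge consecutive runs with the same font and nearly the same size."""
    parts = []
    for run in runs:
        first = parts[-1][0] if parts else None
        if (first is not None and first.font == run.font
                and abs(first.size - run.size) <= size_tolerance):
            parts[-1].append(run)
        else:
            parts.append([run])
    return parts

def flag_parts(runs, known_tags, limit=3, size_tolerance=1.0):
    """Return indices of parts where any known tag occurs more than `limit` times."""
    flagged = []
    for index, part in enumerate(group_runs(runs, size_tolerance)):
        words = " ".join(r.text for r in part).lower().split()
        counts = Counter(w for w in words if w in known_tags)
        if any(n > limit for n in counts.values()):
            flagged.append(index)
    return flagged

runs = [
    Run("Arial", 12.0, "A normal article about travel"),
    Run("Arial", 12.5, "with more normal text"),
    Run("Arial", 2.0, "cheap cheap cheap cheap pills"),
    Run("Arial", 2.2, "cheap pills"),
]
print(flag_parts(runs, {"cheap", "pills"}))
```

Whether a flagged part is deleted or the whole document is marked as “bad” is left to the caller.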
No matter how advanced your detection is, once people know it is there and more or less how it works, they will find ways to get around it.
When that happens, all you can do is flag the offending documents and review them yourself. Then, whenever you notice that your detection algorithm has produced a false positive, improve it.