How to check for unreadable OCRed text with NLTK

I am using NLTK to analyze a corpus that has been OCRed. I am new to NLTK. Most of the OCR is good, but sometimes I come across lines that are clearly garbage. For instance: oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5

I want to identify (and filter out) such lines from my analysis.

How do NLP practitioners deal with this situation? Something like: if 70% of the words in a sentence are not in WordNet, discard it. Or if NLTK cannot identify the part of speech for 80% of the words, discard it? What algorithms work for this? Is there a gold-standard method? A sketch of the kind of filter I have in mind follows.
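For concreteness, here is a rough sketch of that dictionary-lookup heuristic (I use nltk's words wordlist rather than WordNet, since WordNet omits function words like "the", which would otherwise count as unknown; the 0.7 cutoff and the function name are just placeholders):

from nltk.corpus import words

# Plain English wordlist that ships with nltk (requires the 'words' corpus)
ENGLISH = set(w.lower() for w in words.words())

def looks_like_garbage(line, cutoff=0.7):
    # Strip surrounding punctuation and lowercase each token
    tokens = [t.strip('.,:;!?"()').lower() for t in line.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return True
    # Flag the line if too large a fraction of tokens is not in the wordlist;
    # the 0.7 cutoff is illustrative, not an established value
    unknown = sum(1 for t in tokens if t not in ENGLISH)
    return unknown / float(len(tokens)) > cutoff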

1 answer

Train an n-gram language model and use it to score each line. You can get n-gram counts from Google, or train your own n-gram model on one of the corpora bundled with nltk. Real English lines will get a reasonable probability under the model, while OCR garbage will get a very low one. Discard the lines that score badly.

As far as I know there is no gold standard; the right threshold depends on your corpus and on the order of the n-grams.

EDIT: Here is an example using nltk:

from nltk import NgramModel  # available in NLTK 2.x; NgramModel was removed in NLTK 3
from nltk.corpus import brown
from nltk.util import ngrams
from nltk.probability import LidstoneProbDist

n = 2
# Lidstone smoothing gives unseen n-grams a small nonzero probability
est = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
# Train a bigram model on the news section of the Brown corpus
lm = NgramModel(n, brown.words(categories='news'), estimator=est)

def sentenceprob(sentence):
    # Tokens keep their original case, matching the case-sensitive
    # Brown training data
    bigrams = ngrams(sentence.split(), n)
    # logprob() returns the negative log probability, so a HIGHER
    # total means a LESS probable sentence
    tot = 0
    for grams in bigrams:
        score = lm.logprob(grams[-1], grams[:-1])
        tot += score
    return tot

sentence1 = "This is a standard English sentence"
sentence2 = "oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5"

print(sentenceprob(sentence1))
print(sentenceprob(sentence2))

The output:

$ python lmtest.py
  42.7436688972
  158.850086668

So the OCR garbage gets a much higher score, i.e. a much lower probability, under the model. You would still have to decide on the cutoff yourself (experimentally, for your own data).
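Continuing from the script above, one simple way to turn those scores into a filter is to normalize by the number of bigrams, so long lines are not penalized just for being long, and compare against a hand-picked cutoff (the 10.0 here is purely illustrative, chosen only to separate the two example scores):

def is_garbage(sentence, cutoff=10.0):
    # Average negative log probability per bigram; the 10.0 cutoff
    # is illustrative and has to be tuned on real data
    num_bigrams = len(list(ngrams(sentence.split(), n)))
    if num_bigrams == 0:
        return True
    return sentenceprob(sentence) / num_bigrams > cutoff

print(is_garbage(sentence1))  # False (42.74 / 5 bigrams  ~  8.5)
print(is_garbage(sentence2))  # True  (158.85 / 12 bigrams ~ 13.2)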


Source: https://habr.com/ru/post/1537815/

