NLP - When to lowercase text during preprocessing

I want to build a language model that predicts the next words in a sentence, given the previous word(s) and/or the previous sentence.

Use case: I want to automate report writing, so the model should automatically complete the sentence I am typing. It is therefore important that nouns and words at the beginning of a sentence are capitalized.

Data: The data is in German and contains a lot of technical jargon.

I am currently working on preprocessing my text corpus. Since my model should predict grammatically correct sentences, I decided to use / not use the following preprocessing steps:

  • no stop word removal
  • no lemmatization
  • replace all numeric expressions with NUMBER
  • normalization of synonyms and abbreviations
  • replace rare words with RARE
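For illustration, the NUMBER and RARE steps above could be sketched roughly as follows. This is only a minimal sketch: the regular expressions, the `min_count` threshold, the function names, and the two sample sentences are all assumptions made for the example, not part of my actual pipeline. Case is left untouched on purpose.

```python
import re
from collections import Counter

# Assumed pattern: integers and decimals with "." or "," separators (e.g. 12, 3,5).
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
# Assumed tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Replace numeric expressions with NUMBER, then split into tokens (case preserved)."""
    text = NUMBER_RE.sub("NUMBER", text)
    return TOKEN_RE.findall(text)


def replace_rare(token_lists, min_count=2):
    """Replace tokens seen fewer than min_count times in the corpus with RARE."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [
        [tok if counts[tok] >= min_count else "RARE" for tok in toks]
        for toks in token_lists
    ]


# Hypothetical sample sentences standing in for the real reports.
reports = [
    "Die Pumpe läuft seit 12 Stunden.",
    "Die Pumpe wurde nach 3,5 Stunden gestoppt.",
]
tokenized = [tokenize(r) for r in reports]
cleaned = replace_rare(tokenized)
```

On a real corpus the threshold would be much higher than 2, and the stop words and inflected forms stay in because the model has to generate them.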

However, I'm not sure whether the text should be converted to lowercase. Searching the Internet, I found differing opinions. Although lowercasing is fairly common, it would keep my model from predicting the correct capitalization of nouns, sentence beginnings, etc.
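To make the concern concrete, here is a small sketch that lists the tokens whose distinct surface forms would be merged by lowercasing. The function name `find_case_collisions` and the sample sentence are hypothetical and only serve the example.

```python
from collections import defaultdict


def find_case_collisions(tokens):
    """Map each lowercased form to its surface variants, keeping only forms with more than one variant."""
    forms = defaultdict(set)
    for tok in tokens:
        forms[tok.lower()].add(tok)
    return {low: variants for low, variants in forms.items() if len(variants) > 1}


# Hypothetical example: the noun "Essen" (meal) and the verb "essen" (to eat).
tokens = "Das Essen ist fertig . Wir essen jetzt .".split()
collisions = find_case_collisions(tokens)
```

Here the noun and the verb end up under the single key "essen", which is exactly the distinction a model trained on lowercased text could no longer reproduce in its output.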



  • MIT, MIT
  • J. A. Snow
  • (I), (II), (III), APPENDIX A

<RARE>, <RARE>

spaCy


Source: https://habr.com/ru/post/1684351/