Check if every word exists in the database

Question

I need to check whether each word of a string is spelled correctly by looking each word up in a MongoDB collection. The requirements are:

  • Use as few database queries as possible.
  • The first word of each sentence must be capitalized, but that word may be stored in either upper or lower case in the dictionary. Therefore, I need a case-sensitive match for each word; only the first word of each sentence should be matched case-insensitively.

Example string

This is a simple example. Example. This is another example. 

Dictionary Structure

Assume there is a dictionary collection like this:

 { word: 'this' }, { word: 'is' }, { word: 'a' }, { word: 'example' }, { word: 'Name' } 

In my case, the dictionary contains 100,000 words. Names are stored capitalized, verbs are stored in lower case, and so on.

Expected Result

The words simple and another should be recognized as misspelled because they do not exist in the database.

In this case, the array of existing words should be ['This', 'is', 'a', 'example']. This is capitalized because it is the first word of a sentence; in the database it is stored in lower case as this.

My attempt so far (updated)

    const sentences = string.replace(/([.?!])\s*(?=[A-Z])/g, '$1|').split('|');
    let search = [], words = [], existing, missing;

    sentences.forEach(sentence => {
        const w = sentence.trim().replace(/[^a-zA-Z0-9äöüÄÖÜß ]/gi, '').split(' ');
        w.forEach((word, index) => {
            // first word of a sentence: case-insensitive; all other words: case-sensitive
            const regex = new RegExp(['^', word, '$'].join(''), index === 0 ? 'i' : '');
            search.push(regex);
            words.push(word);
        });
    });

    existing = Dictionary.find({ word: { $in: search } }).map(obj => obj.word);
    missing = _.difference(words, existing);

Problem

  • Case-insensitive matching does not work as expected: /^Example$/i does return a result, but existing then contains the original lower-case example, which means Example ends up in the missing array. So the case-insensitive search itself works, but the result arrays do not match up. I do not know how to solve this.
  • Can the code be optimized? I am using two forEach loops and a difference ...
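
The mismatch from the first point can be reproduced without a database. The snippet below is only an illustration: the two arrays are stand-ins for the query result and the input words, and the filter mimics an exact-match difference on strings.

```javascript
// Stand-in for the query result: the dictionary stores the word in lower case.
const existing = ['example'];
// Stand-in for the input: the sentence-initial word is capitalized.
const words = ['Example'];

// An exact-match difference treats 'Example' and 'example' as different strings.
const missing = words.filter(w => !existing.includes(w));

// Comparing lower-cased forms removes the mismatch for sentence-initial words.
const existingLower = existing.map(w => w.toLowerCase());
const missingIgnoringCase = words.filter(
  w => !existingLower.includes(w.toLowerCase())
);
```

So the query finds the word, but the comparison afterwards has to apply the same case rule as the query did.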
1 answer

Here is how I would approach this problem:

  • Use a regex to split the text into an array of tokens: each word, followed by the separator after it (spaces, including ".").

     var words = para.match(/(.+?)(\b)/g); // this expression is not perfect, but it will work
  • Now load all the words from your collection into an array using find(). Let's call this array wordsOfColl.

  • Now check whether each word is the way you want it:

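     // lower-cased copy of the collection words for the case-insensitive lookup
     var lowerColl = wordsOfColl.map(function (w) { return w.toLowerCase(); });
     var prevWord = '.'; // previous token; starts as '.' so the very first word counts as a sentence start

     words.forEach(function (word) {
         if (/^\W+$/.test(word)) { // separator token (spaces, punctuation): remember it and move on
             prevWord = word;
             return;
         }
         if (lowerColl.indexOf(word.toLowerCase()) !== -1) {
             if (prevWord.replace(/\s/g, '') === '.') { // this is the first word of a sentence
                 if (word[0] !== word[0].toUpperCase()) {
                     // not capitalized, so generate an error
                 }
             }
         } else {
             // not in the collection, so generate an error
         }
         prevWord = word;
     });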
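
The steps above could also be sketched with Set lookups, so that only sentence-initial words are compared case-insensitively, as the question asks. This is a minimal sketch, not a tested solution: checkWords is a hypothetical name, wordsOfColl is assumed to be a plain array of strings, and the tokenizer is the same imperfect regex as above (it does not treat umlauts as word characters).

```javascript
// Hypothetical helper: returns the words of `para` that are not in `wordsOfColl`.
function checkWords(para, wordsOfColl) {
  const exact = new Set(wordsOfColl);
  const lower = new Set(wordsOfColl.map(w => w.toLowerCase()));
  // Tokens alternate between words and separators such as ' ' or '. '.
  const tokens = para.match(/(.+?)(\b)/g) || [];
  const missing = [];
  let sentenceStart = true; // the very first word starts a sentence

  tokens.forEach(token => {
    if (/^\W+$/.test(token)) {
      // Separator token: a '.', '?' or '!' starts a new sentence.
      if (/[.?!]/.test(token)) sentenceStart = true;
      return;
    }
    const known = sentenceStart
      ? lower.has(token.toLowerCase()) // first word: ignore case
      : exact.has(token);              // other words: exact match
    if (!known) missing.push(token);
    sentenceStart = false;
  });
  return missing;
}
```

With the dictionary from the question, the example string should yield simple and another as the only missing words.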

I have not tested this, so please let me know in the comments if there is a problem or if I missed one of your requirements.

Update

Since the author of the question pointed out that he does not want to download the whole collection to the client, you can create a method on the server that returns an array of words instead of giving the client access to the collection.
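
One way such a server-side function might keep the number of queries at one is to send plain strings in a single $in query and apply the case rule afterwards: for a sentence-initial word, both the word as written and its lower-case form are looked up. This is a sketch under assumptions, not the asker's method. checkAgainstDictionary and findWords are hypothetical names, and findWords stands for whatever runs the query and returns the matching word strings.

```javascript
// Hypothetical helper: `findWords(query)` must return an array of matching word strings.
function checkAgainstDictionary(text, findWords) {
  const entries = []; // one entry per input word, with the spellings that count as a match
  text.split(/[.?!]+/).forEach(sentence => {
    const words = sentence
      .split(/\s+/)
      .map(w => w.replace(/[^a-zA-Z0-9äöüÄÖÜß]/g, '')) // same character set as in the question
      .filter(Boolean);
    words.forEach((word, index) => {
      // First word of a sentence: accept the lower-case spelling as well.
      const candidates = index === 0 ? [word, word.toLowerCase()] : [word];
      entries.push({ word, candidates });
    });
  });

  // One query for all distinct candidate spellings.
  const lookup = [...new Set(entries.flatMap(e => e.candidates))];
  const found = new Set(findWords({ word: { $in: lookup } }));

  const isKnown = e => e.candidates.some(c => found.has(c));
  return {
    existing: entries.filter(isKnown).map(e => e.word),
    missing: entries.filter(e => !isKnown(e)).map(e => e.word),
  };
}
```

With the collection from the question, findWords could presumably be query => Dictionary.find(query).map(obj => obj.word), reusing the call from the question's own attempt. Because the query contains plain strings rather than regular expressions, an ordinary index on word can serve it.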


Source: https://habr.com/ru/post/1012930/

