Maximum reasonable size for stemCompletion in tm?

I have a corpus of 26 text files, each between 12 and 148 KB, about 1.2 MB in total. I am using R on a Windows 7 laptop.

I did all the usual cleaning (stop words, custom stop words, lowercasing, removing numbers) and now I want to stem-complete the terms. I am using the original corpus as the dictionary, as shown in the examples. I first tried a couple of simple vectors (about 5 terms) to make sure it would work at all, and it ran very quickly.
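Roughly, the preprocessing looked like this (a minimal sketch of the steps described above; "texts/" and custom_stops are placeholders, not my actual path or custom stop-word list):

    library(tm)
    budget.orig <- VCorpus(DirSource("texts/"))                  # the 26 plain-text files
    budget <- tm_map(budget.orig, content_transformer(tolower))  # lowercase
    budget <- tm_map(budget, removeNumbers)                      # drop numbers
    budget <- tm_map(budget, removeWords, stopwords("english"))  # standard stop words
    budget <- tm_map(budget, removeWords, custom_stops)          # custom stop words
    budget <- tm_map(budget, stemDocument)                       # stem the terms

Then the completion step: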

    exchanger <- function(x) stemCompletion(x, budget.orig)
    budget <- tm_map(budget, exchanger)

It has been running since 4 PM yesterday! In RStudio, under diagnostics, the request log shows new requests with different request numbers, and Task Manager shows it using some memory, but not a crazy amount. I don't want to stop it, because what if it's almost done? Any other ideas on how to check its progress (it's a volatile corpus, unfortunately)? Any ideas on how long this should take? I have thought about using the DTM's term-name vector as the dictionary, cut down to the most frequent terms (or the highest tf-idf), but I don't want to kill the running process.
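For reference, the cut-down dictionary idea would look something like this (a sketch only; the 5000-term cutoff is arbitrary, and it assumes budget.orig, the unstemmed corpus, is still available):

    # build a smaller dictionary from the most frequent original terms
    dtm.orig <- DocumentTermMatrix(budget.orig)
    term_freqs <- sort(slam::col_sums(dtm.orig), decreasing = TRUE)
    small_dict <- names(term_freqs)[seq_len(min(5000, length(term_freqs)))]
    # small_dict could then be passed to stemCompletion() instead of budget.orig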

This is an ordinary Windows 7 laptop with lots of other things running.

Is this corpus just too big for stemCompletion? Short of moving to Python, is there a better way to do stemCompletion, or to lemmatize instead of stemming? My web searches haven't turned up any answers.

2 answers

I can't give you a definitive answer without data that reproduces your problem, but I would guess the bottleneck is the following line of the stemCompletion source code:

    possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))

After that, assuming you kept the default completion heuristic "prevalent", this happens:

    possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), decreasing = TRUE))
    structure(names(sapply(possibleCompletions, "[", 1)), names = x)

The first line loops over every word in your corpus and greps it against your dictionary for possible completions. I am guessing you have many words that appear many times in your corpus, which means the function is called again and again just to return the same answer. A possibly much faster version (depending on how many words are repeated and how often) would look something like this:

    y <- unique(x)
    possibleCompletions <- lapply(y, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
    possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), decreasing = TRUE))
    z <- structure(names(sapply(possibleCompletions, "[", 1)), names = y)
    z[match(x, names(z))]

This way it only loops over the unique values of x rather than over every value of x. To build this revised version, you would need to download the package source from CRAN and modify the function (it is in completion.R in the R folder).
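If you would rather not patch the package, a rough alternative sketch is to wrap stemCompletion() itself so it is only called on the unique words, then map the results back; like your own exchanger, this assumes x is a character vector of stems:

    # wrapper that completes only the unique stems, then expands the result
    stemCompletionFast <- function(x, dictionary) {
      u <- unique(x)
      completed <- stemCompletion(u, dictionary = dictionary)
      completed[match(x, u)]
    }

    exchanger <- function(x) stemCompletionFast(x, budget.orig)
    budget <- tm_map(budget, exchanger)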

Or you can just use Python for this.


Christine, following Shawn's answer, I recommend running stemCompletion on the unique words only. I mean, it is much easier for your PC to complete the unique words than to complete every word in your corpus (with all the repetitions).

  • First of all, take the unique words from your corpus. For example:

    unique$text <- unique(budget)

  • Then get the unique words from your original source text:

    unique_budget.orig <- unique(budget.orig)

  • Now you can apply stemCompletion to your unique words:

    unique$completition <- unique$text %>% stemCompletion(dictionary = unique_budget.orig)

  • Now you have an object with all the unique words from your corpus and their completions. You just need to join your corpus back against this unique object (see the sketch below). Make sure both objects use the same variable name for the uncompleted words: that is going to be the join key.
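A minimal sketch of that final join, assuming the corpus tokens sit in a hypothetical data frame corpus_df with the stemmed word in a column named "text" (the same key column as in "unique"):

    library(dplyr)   # also provides the %>% pipe used above
    # join the completions back onto every (repeated) word in the corpus
    corpus_df <- corpus_df %>% left_join(unique, by = "text")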

This will reduce the number of operations that your PC must perform.


Source: https://habr.com/ru/post/946793/

