I am trying to use koRpus in R to perform lemmatization on a Linux server running RHEL 6. Last week, when I had MRO (Microsoft R Open) 3.2.3 installed, the code below worked just fine:
library(koRpus)
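lw <- c("dancing", "flying", "flew")
# Tag the words in memory (format = "obj") with a manually configured TreeTagger
res <- treetag(lw, treetagger = "manual", format = "obj", TT.tknz = FALSE, lang = "en",
               TT.options = list(path = "/usr/local/bin/TreeTagger", preset = "en"))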
Now when I run MRO 3.3.0, I get the following error:
Error in grepl("(^\\p{P}*\\p{L}\\p{M}*\\.)", tkn, perl = TRUE) :
invalid regular expression '(^\p{P}*\p{L}\p{M}*\.)'
In addition: Warning message:
In grepl("(^\\p{P}*\\p{L}\\p{M}*\\.)", tkn, perl = TRUE) :
PCRE pattern compilation error
'support for \P, \p, and \X has not been compiled'
at 'p{P}*\p{L}\p{M}*\.)'
OK, so my PCRE needs to be recompiled with Unicode support. In fact, when I run the code below, I see that this is the exact problem. I also see that I am running version 8.37.
pcre_config()
#>              UTF-8 Unicode properties                JIT
#>               TRUE              FALSE              FALSE
extSoftVersion()
#>                      zlib                     bzlib                        xz
#>                   "1.2.8"      "1.0.6, 6-Sept-2010"                   "5.2.2"
#>                      PCRE                       ICU                       TRE
#>         "8.37 2015-04-28"                    "57.1" "TRE 0.8.0 R_fixes (BSD)"
#>                     iconv
#>              "glibc 2.12"
I then went ahead and installed PCRE 8.39, (hopefully) with the necessary flags set:
./configure --enable-utf8 --enable-unicode-properties
make
make install
So now, when I run pcretest -C, I get
PCRE version 8.39 2016-06-14
Compiled with
8-bit support
UTF-8 support
Unicode properties support
No just-in-time compiler support
Newline sequence is LF
\R matches all Unicode newlines
Internal link size = 2
POSIX malloc threshold = 10
Parentheses nest limit = 250
Default match limit = 10000000
Default recursion depth limit = 10000000
Match recursion uses stack
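To check that output in a script rather than by eye, something like the following might work. This is only a sketch: has_unicode_props is a made-up helper name (not part of PCRE), and it assumes the "Unicode properties support" line looks exactly as in the listing above, with a build that lacks the feature prefixing the line with "No" the way the JIT line does.

```shell
# has_unicode_props: hypothetical helper, not part of PCRE.
# Reads `pcretest -C` output on stdin and succeeds only when the
# "Unicode properties support" line appears without a leading "No".
has_unicode_props() {
  grep -Eq '^[[:space:]]*Unicode properties support[[:space:]]*$'
}

# Only call pcretest when it is actually on the PATH.
if command -v pcretest >/dev/null 2>&1; then
  if pcretest -C | has_unicode_props; then
    echo "pcretest: Unicode properties enabled"
  else
    echo "pcretest: Unicode properties NOT enabled"
  fi
fi
```

Note that this only describes the PCRE build that pcretest belongs to, which is not necessarily the one R uses.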
But when I run R again, pcre_config() gives the same results, the treetag call still fails, and extSoftVersion() still reports 8.37.
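One check that might show which PCRE library R actually loads is sketched below. It assumes R is on the PATH and was built as a shared library, so that lib/libR.so exists under the directory printed by R RHOME (I have not confirmed this for MRO). pcre_paths is a hypothetical helper name. If nothing is printed, R is presumably not picking up a shared libpcre at all and may be using a copy compiled into it, in which case a newly installed system PCRE would not change what extSoftVersion() reports.

```shell
# pcre_paths: hypothetical helper. Reads `ldd` output on stdin and prints
# where each libpcre dependency resolves to ("not found" if unresolved).
pcre_paths() {
  awk '$1 ~ /^libpcre/ && $2 == "=>" { print ($3 == "not" ? "not found" : $3) }'
}

# Assumption: R is on the PATH and ships a shared libR.so under R RHOME.
if command -v R >/dev/null 2>&1; then
  libr="$(R RHOME)/lib/libR.so"
  if [ -f "$libr" ]; then
    ldd "$libr" | pcre_paths
  else
    echo "no libR.so at $libr; R may not have been built as a shared library"
  fi
fi
```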
How do I get R to use the new PCRE?
...
R, -, PCRE ( https://mran.microsoft.com/news/#r330) 3.3.0, () PCRE . PCRE, ( ), R PCRE 8.37 2015-04-28, , MRO 3.3.0 RHEL 6 PCRE, , . , grepl , extSoftVersion.