OpenRefine: Split multi-valued cells by token / word count?

I have a large amount of textual data that I am preprocessing in OpenRefine so that I can classify the documents with MALLET.

Some of the cells are long (> 150,000 characters), and I'm trying to break them into segments of 1,000 words / tokens.

I can split long cells into 6,000-character chunks using "Split multi-valued cells" by field length, which roughly corresponds to 1,000 words / tokens, but this cuts words in two at the chunk boundaries, so I lose some of my data.

Is there a function I could use to split long cells at the first space (" ") after every 6,000th character or, even better, after every 1,000 words?

2 answers

Here is my simple solution:

Go to Edit cells → Transform and enter:

value.replace(/((\s+\S+?){999})\s+/,"$1@@@")

This replaces the whitespace after every 1,000th word with @@@. Consecutive whitespace characters count as a single separator and are replaced together when they fall on a split boundary. You can choose any marker you like, as long as it does not appear in the original text.

Then go to Edit cells → Split multi-valued cells and use @@@ as the separator.
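
To try the pattern outside OpenRefine before running it on a large project, here is a minimal Python sketch of the same idea. The function name mark_every_n_words and its parameters are made up for this illustration, and it uses Python's re module rather than GREL, so treat it as an approximation of what the expression above does, not as OpenRefine's own behaviour.

```python
import re

def mark_every_n_words(text, n=1000, marker="@@@"):
    """Replace the whitespace that follows every n-th word with `marker`.

    Hypothetical helper for illustration only (not part of OpenRefine).
    """
    # Same shape as the GREL pattern, with 999 generalised to n - 1:
    # (n - 1) repetitions of "whitespace + word", then the whitespace to replace.
    pattern = r"((\s+\S+?){%d})\s+" % (n - 1)
    return re.sub(pattern, lambda m: m.group(1) + marker, text)

# Small n so the result is easy to inspect; note the double space after "b".
sample = "a b  c d e f g"
marked = mark_every_n_words(sample, n=3)
print(marked)
print(marked.split("@@@"))
```

With n=3 every chunk except possibly the last holds three words, and whitespace inside a chunk is left untouched.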


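The simplest way is probably to split the text on spaces, insert a very rare character (or group of characters) after every 1000 elements, join everything back together, and then use "Split multi-valued cells" with that rare marker as the separator.

You can do this in GREL, but it is much simpler with a "Python/Jython" script.

So: Edit cells → Transform → Python/Jython: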

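my_list = value.split(' ')  # split the cell text into words

n = 1000  # number of words per chunk
i = n
while i < len(my_list):
    my_list.insert(i, '|||')  # marker after each group of n words
    i += n + 1  # skip the next n words plus the marker just inserted

return " ".join(my_list)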


Here is a more compact version:

text = value.split(' ')
n = 1000
return "|||".join([' '.join(text[i:i+n]) for i in range(0,len(text),n)])

You can then split the cells using ||| as the separator.
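
To check the chunk sizes before running this on real data, the compact version can be wrapped in an ordinary function and tried on a short string. This is only a sketch: chunk_words is a hypothetical name, and inside OpenRefine the text would come from value and the result would be handed back with return rather than printed.

```python
def chunk_words(text, n=1000, marker="|||"):
    """Join groups of n space-separated tokens with `marker`.

    Hypothetical helper mirroring the compact script; not an OpenRefine API.
    """
    tokens = text.split(" ")
    groups = [" ".join(tokens[i:i + n]) for i in range(0, len(tokens), n)]
    return marker.join(groups)

# Seven tokens, three per chunk, so the last chunk is shorter.
sample = " ".join("w%d" % i for i in range(1, 8))
marked = chunk_words(sample, n=3)
print(marked)
for part in marked.split("|||"):
    print(len(part.split(" ")), part)
```

Because the text is split on a single space, runs of several spaces produce empty tokens that still count towards n, so real chunks may contain slightly fewer than n visible words.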


If you would rather split by character length, since Python is available, you can also use textwrap:

import textwrap

return "|||".join(textwrap.wrap(value, 6000))
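
As a quick illustration of how textwrap behaves, here is a small sketch with a width of 10 instead of 6000 so the result is easy to read. The sample sentence is made up for the example.

```python
import textwrap

sample = "the quick brown fox jumps over the lazy dog"

# wrap() breaks at whitespace, so no chunk is longer than the width
# and ordinary words are not cut in two.
chunks = textwrap.wrap(sample, 10)
print(chunks)

# Same joining step as in the OpenRefine script.
marked = "|||".join(chunks)
print(marked)
```

Two defaults are worth knowing: wrap() converts newlines and tabs to spaces (replace_whitespace=True), and it cuts any single word that is longer than the width (break_long_words=True). With a width of 6000 the second case should be rare in normal text.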

Source: https://habr.com/ru/post/1695789/

