UTF-8 string array using graphemes and split

Question

Is there any advantage to using graphemes over split to create an array from a UTF-8 string?

For example, consider the following:

# Define a UTF-8 string with a bunch of multibyte characters
s = "{(-n↑⍵÷⊃⊖⍵),⍨⍉1↓⍉∘.=⍨⍳n←1-⍨≢⍵}"

# Create an array using split
split(s, "")

# Create an array using graphemes (v0.4+; on Julia 0.7 and later, run "using Unicode" first)
collect(graphemes(s))

Both approaches give the expected result. And indeed

split(s, "") == collect(graphemes(s))

returns true.

Both approaches seem to consistently produce equivalent results. Is one approach usually preferable to the other, whether for performance, style, or otherwise?

(Note that graphemes returns an iterator, not an array, hence the collect.)

string arrays utf-8 julia-lang

Alex A. Oct 7 '15 at 10:03

1 answer

Josh Durham · Accepted Answer · 2015-10-07T22:25:09+0000

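It depends on what the string contains. graphemes() splits the string into user-perceived characters, so a base character stays together with any combining marks that follow it; for strings with no such sequences, the result is the same as with split().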

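Take a followed by a combining mark (a + ◌), for example. split() returns them as two separate elements, while graphemes() returns them as a single element.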
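
A minimal sketch of the combining-mark case, assuming Julia 0.7 or later (where graphemes is loaded from the Unicode standard library). The answer does not name a specific mark, so U+0301 (combining acute accent) is used here purely as an illustration:

```julia
using Unicode  # graphemes lives in the Unicode stdlib on Julia 0.7 and later

# "a" followed by U+0301 (combining acute accent, chosen for illustration):
# two code points that render as one visible character
s = "a\u0301"

by_split = split(s, "")                # one element per character (code point)
by_graphemes = collect(graphemes(s))   # one element per grapheme cluster

println(length(by_split))           # 2
println(length(by_graphemes))       # 1
println(by_split == by_graphemes)   # false
```

For a string like the one in the question, which contains multibyte characters but no combining sequences, the two arrays compare equal, as the question observes.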
