String.replace returns a binary representation of a string

I'm learning Elixir and stumbled upon something that doesn't make sense to me...

I'm trying to remove punctuation marks:

    "Freude schöner Götterfunken" |> String.replace(~r/[^\s\w]/, "")
    #=> <<70, 114, 101, 117, 100, 101, 32, 115, 99, 104, 195, 110, 101, 114, 32, 71, 195, 116, 116, 101, 114, 102, 117, 110, 107, 101, 110>>

    "Freude schöner Götterfunken" |> String.replace(~r/[^\w]/, "")
    #=> <<70, 114, 101, 117, 100, 101, 32, 115, 99, 104, 195, 110, 101, 114, 32, 71, 195, 116, 116, 101, 114, 102, 117, 110, 107, 101, 110>>

    "Freude schöner Götterfunken" |> String.replace(~r/\p{P}/, "")
    #=> <<70, 114, 101, 117, 100, 101, 32, 115, 99, 104, 195, 110, 101, 114, 32, 71, 195, 116, 116, 101, 114, 102, 117, 110, 107, 101, 110>>

    "Freude schöner Götterfunken" |> String.replace(~r/\s/, "")
    #=> "FreudeschönerGötterfunken"

    "Hi my name is bob" |> String.replace(~r/\w/, "")
    #=> "    "

    Regex.run(~r/[^\w]/, "Freude schöner Götterfunken")
    #=> [<<182>>]

This looks like a bug, but as a newbie I assume I'm missing something. Why doesn't the replacement return a string?

+5
2 answers

You are correct that String.replace/3 is not returning a string here, since Elixir defines strings as UTF-8 encoded binaries and this result is not valid UTF-8. However, this is not a bug: Elixir expects you to pass valid arguments and perform valid operations on them, because it does not validate every result (doing so would be too costly).

For example, if you pass any of the binaries above to String.downcase/1, Elixir will downcase the parts it knows about and ignore the rest. This works because UTF-8 is self-synchronizing, so when we see an unexpected byte we can skip it and carry on with the operation.

In other words, Elixir's philosophy for string handling is to validate at the boundaries (for example, when opening files, doing I/O, or reading from a database) and from then on to assume that we are working with valid strings and performing valid operations.
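A minimal sketch of what checking at the boundary could look like, using String.valid?/1 from the standard library. The module name Boundary and the function ensure_utf8/1 are hypothetical, introduced here only for illustration:

```elixir
defmodule Boundary do
  # Hypothetical helper: validate external data once, when it enters the system.
  def ensure_utf8(data) when is_binary(data) do
    if String.valid?(data), do: {:ok, data}, else: {:error, :invalid_utf8}
  end
end

# A valid UTF-8 string passes through unchanged.
{:ok, "Freude schöner Götterfunken"} =
  Boundary.ensure_utf8("Freude schöner Götterfunken")

# A fragment of the broken result from the question is rejected:
# 195 starts a two-byte sequence but is followed by 110 ("n").
{:error, :invalid_utf8} =
  Boundary.ensure_utf8(<<115, 99, 104, 195, 110, 101, 114>>)
```

After such a check, the rest of the code can treat the value as a string without validating it again.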

OK, with all that said, why is your code not working? The reason is that Unicode support is not enabled in your regex. Let's add the u modifier:

    iex> "Freude schöner Götterfunken" |> String.replace(~r/[^\s\w]/u, "")
    "Freude schöner Götterfunken"

Well, this does not solve your problem, but at least the result is valid. From reading about Unicode categories, it seems we cannot really solve this problem with Unicode properties, because ö in your example is a single code point that matches the \p{L} property.
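Both points can be checked with a few standard library calls. This is only an illustrative sketch; the variable names are made up, and the ö literal is assumed to be the precomposed code point U+00F6:

```elixir
input = "Freude schöner Götterfunken"

# With the u modifier the regex works on code points, so no character is cut in half.
result = String.replace(input, ~r/[^\s\w]/u, "")
IO.inspect(String.valid?(result))            # true
IO.inspect(result == input)                  # true, nothing matched the pattern

# ö is one code point (two bytes in UTF-8) and it counts as a letter.
IO.inspect(String.length("ö"))               # 1
IO.inspect(byte_size("ö"))                   # 2
IO.inspect(Regex.match?(~r/^\p{L}$/u, "ö"))  # true
```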

Perhaps the simplest solution in this case, if you only need to handle German, is to traverse the binary and keep only the bytes <= 127. Something like:

    iex> for <<x <- "Freude schöner Götterfunken">>, x <= 127, into: "", do: <<x>>
    "Freude schner Gtterfunken"

If you want a more complete solution, you should probably look into Unicode transliteration.
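Full transliteration usually needs a dedicated library, which is not shown here. As a rough stand-in, and not the technique named above, one can decompose the string with String.normalize/2 and drop the combining marks. This is only a sketch: it turns ö into o, but leaves characters without a decomposition, such as ß, untouched. The module name Ascii and the function strip_marks/1 are hypothetical:

```elixir
defmodule Ascii do
  # Hypothetical helper: split letters from their accents (NFD),
  # then remove the combining marks (Unicode category Mn).
  def strip_marks(string) do
    string
    |> String.normalize(:nfd)
    |> String.replace(~r/\p{Mn}/u, "")
  end
end

IO.puts(Ascii.strip_marks("Freude schöner Götterfunken"))
# prints: Freude schoner Gotterfunken
```

Unlike the byte filter, this keeps the base letter instead of dropping the whole character.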

+17

String.replace is returning a "string", but double-quoted strings are really stored as binaries in Elixir. For some reason the output cannot be displayed as a normal string, so it falls back to showing the binary representation.
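A short sketch of that fallback, using only inspect/2 from the standard library. The byte values are picked for illustration:

```elixir
# A double-quoted string is a binary.
IO.inspect(is_binary("hi"))                     # true
IO.inspect("hi" == <<104, 105>>)                # true

# inspect/2 prints a printable UTF-8 binary as a string...
IO.puts(inspect(<<104, 105>>))                  # "hi"
# ...and falls back to the list of bytes when the binary is not valid UTF-8.
IO.puts(inspect(<<104, 195, 110>>))             # <<104, 195, 110>>

# The byte view can also be requested explicitly for a valid string.
IO.puts(inspect("hö", binaries: :as_binaries))  # <<104, 195, 182>>
```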

0

Source: https://habr.com/ru/post/1205831/

