How to speed up UTF-8 string processing

I am parsing values separated by tabs:

    pub fn parse_tsv(line: &str) -> MyType {
        for (i, value) in line.split('\t').enumerate() {
            // ...
        }
        // ...
    }

perf top shows str.find near the top. When I look at the generated assembly, a lot of the work goes into decoding the UTF-8 characters of the &str.

This is bad: the decoding accounts for 99% of the execution time.

But to find a \t, shouldn't I be able to just search for that single byte in the UTF-8 string?

What am I doing wrong? Or is this a weakness of the Rust standard library?

Or does Rust have some string library that can represent strings as plain u8 bytes, but still provides split(), find(), and the other methods?

1 answer

As long as your string is ASCII, or you don't need to match against UTF-8 scalar values (as in your case, where you are only looking for tab characters), you can simply treat it as bytes using the as_bytes() method and then work with u8 values (bytes) instead of char values (UTF-8 scalar values). That should be much faster. On &[u8], which is a slice, you still have methods analogous to those of &str, such as split(), plus iterator adapters like position() for searching.
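This is safe because UTF-8 is self-synchronizing: bytes in the ASCII range (0x00–0x7F, which includes b'\t') never appear inside a multi-byte sequence, so a raw byte scan cannot match in the middle of a character. A small illustrative check (the example string is made up):

    fn main() {
        let s = "naïve\tcafé"; // contains multi-byte UTF-8 characters
        // In UTF-8 the byte 0x09 can only ever be the tab character itself,
        // so searching the raw bytes is reliable.
        let pos = s.as_bytes().iter().position(|&b| b == b'\t');
        assert_eq!(pos, Some(6)); // "naïve" is 6 bytes: 'ï' takes 2 bytes
        println!("tab found at byte offset {:?}", pos);
    }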

    pub fn parse_tsv(line: &[u8]) {
        for (i, value) in line.split(|&b| b == b'\t').enumerate() {
            // ...
        }
    }

    fn main() {
        let line = String::new();
        let bytes = line.as_bytes();
        parse_tsv(bytes);
    }
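If you need the individual fields back as &str (say, to parse numbers out of them), note that splitting on an ASCII byte can never cut a multi-byte sequence apart, so every field of a valid UTF-8 input is itself valid UTF-8. A hypothetical extension of the sketch above (parse_tsv_str is an illustrative name):

    use std::str;

    pub fn parse_tsv_str(line: &str) {
        for (i, field) in line.as_bytes().split(|&b| b == b'\t').enumerate() {
            // The input came from a &str and we split on an ASCII byte,
            // so this validation always succeeds.
            let field = str::from_utf8(field).expect("fields remain valid UTF-8");
            println!("field {}: {}", i, field);
        }
    }

    fn main() {
        parse_tsv_str("foo\tbar\tbaz");
    }

In a hot path you could skip the re-validation with the unsafe str::from_utf8_unchecked, but only if you are certain the invariant actually holds.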
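As for a library that represents strings as plain bytes while keeping str-like methods: the bstr crate provides that kind of API on &[u8] via its ByteSlice extension trait. A minimal sketch, assuming bstr (1.x) is added as a dependency:

    use bstr::ByteSlice; // extension trait: str-like methods on [u8]

    fn main() {
        let line = b"foo\tbar\tbaz";
        // split_str and find operate on raw bytes; no UTF-8 decoding happens.
        for (i, field) in line.split_str("\t").enumerate() {
            println!("{}: {}", i, field.as_bstr());
        }
        assert_eq!(line.find("\t"), Some(3));
    }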

Source: https://habr.com/ru/post/1262680/

