Comparing a character in a Rust string using indexing

Question

I want to read lines from "input.txt" and leave only those that do not have the # character (comment) at the beginning of the line. I wrote this code:

    use std::io::{BufRead, BufReader};
    use std::fs::File;

    fn main() {
        let file = BufReader::new(File::open("input.txt").unwrap());
        let lines: Vec<String> = file.lines().map(|x| x.unwrap()).collect();
        let mut iter = lines.iter().filter(|&x| x.chars().next() != "#".chars().next());
        println!("{}", iter.next().unwrap());
    }

But this line

 |&x| x.chars().next() != "#".chars().next()

smells bad to me, because it could probably be written as something like |x| x[0] == "#", and as it stands I cannot check the second character in the string.

So how can I reorganize this code?

+5

iterator string rust

Pavlo Razumovskyi Oct 13 '14 at 18:05

1 answer

Vladimir Matveev · Accepted Answer · 2014-10-13T18:37:59+0000

Rust strings are stored as a sequence of bytes representing UTF-8 encoded characters. UTF-8 is a variable-width encoding, so indexing by byte can leave you in the middle of a character, which is clearly unsafe. Getting a code point by index, on the other hand, is an O(n) operation. Moreover, indexing code points is not what you really want either, because some code points do not even have an associated glyph, such as diacritics and other modifiers. Indexing grapheme clusters is closer to the correct approach, but it is usually needed only for text rendering or, possibly, natural language processing.
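To make the difference between bytes and code points concrete, here is a minimal sketch using only the standard library. The helper name `nth_char` and the sample string are made up for illustration.

```rust
/// Returns the `n`-th code point of `s` by walking the string from the start.
/// This is O(n): UTF-8 offers no way to jump straight to a given code point.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    // U+00E9 (e with acute accent) takes two bytes in UTF-8, '#' takes one.
    let s = "\u{e9}#";

    // The byte length and the number of code points differ.
    println!("bytes: {}, chars: {}", s.len(), s.chars().count());

    // Byte offset at which each code point starts.
    for (offset, c) in s.char_indices() {
        println!("{:?} starts at byte {}", c, offset);
    }

    // Byte index 1 falls inside the first character.
    println!("is_char_boundary(1): {}", s.is_char_boundary(1));

    println!("second char: {:?}", nth_char(s, 1));
}
```

Here the `#` is the second character but sits at byte offset 2, which is why a plain numeric index on a string is ambiguous.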

What I mean is that string indexing is hard to define correctly, and what most people usually want is wrong. Consequently, Rust does not provide a generic indexing operation on strings.

Sometimes, however, you do need to index strings, for example if you know in advance that your string contains only ASCII characters, or if you work with binary data. In this case Rust, of course, provides all the necessary tools.

First, you can always get a view of the underlying sequence of bytes. &str has an as_bytes() method that returns &[u8], a slice of the bytes the string consists of. Then you can use the usual indexing operation:

 x.as_bytes()[0] != b'#'

Note the special notation: b'#' means "the ASCII character # as a u8", that is, it is a byte literal. (Also note that you do not need to write "#".chars().next() to get the # character; you can simply write '#', a plain character literal. Since chars().next() returns an Option<char>, the original comparison becomes x.chars().next() != Some('#').) Indexing bytes is somewhat risky, however, since &str is a UTF-8 encoded string and the first character can consist of more than one byte.
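Applied to the filter from the question, these two variants could look like the sketch below. The function names `is_comment`, `is_comment_bytes` and `second_char` are hypothetical. `str::starts_with` is a standard library method that this answer does not discuss; it is shown as an alternative that avoids indexing altogether. The byte version uses `first()` rather than `[0]` on purpose, because indexing an empty line would panic.

```rust
/// Character-based check: compare the first `char` with a plain char literal.
fn is_comment(line: &str) -> bool {
    line.chars().next() == Some('#')
}

/// Byte-based check: look at the first byte of the underlying UTF-8 data.
/// `first()` returns `None` for an empty line, where `[0]` would panic.
fn is_comment_bytes(line: &str) -> bool {
    line.as_bytes().first() == Some(&b'#')
}

/// Looks at the second character, which the question also asks about.
fn second_char(line: &str) -> Option<char> {
    line.chars().nth(1)
}

fn main() {
    let lines = vec!["# a comment", "127.0.0.1 localhost", "", "#another"];

    // Keep only the lines that do not start with '#'.
    let kept: Vec<&str> = lines
        .iter()
        .copied()
        .filter(|line| !is_comment(line))
        .collect();
    println!("{:?}", kept);

    // The same test without touching indices or iterators.
    println!("{}", "#!shebang".starts_with('#'));
    println!("{:?}", second_char("#!shebang"));
}
```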

The right way to process ASCII data in Rust is to use the ascii crate. You can go from &str to &AsciiStr using the as_ascii_str() method. Then you can use it as follows:

    extern crate ascii;

    use ascii::{AsAsciiStr, AsciiChar};

    // ...

    x.as_ascii_str().unwrap()[0] != AsciiChar::Hash

This way you need a bit more typing, but you get much more safety in return, because as_ascii_str() checks that you are working with ASCII data only.
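If adding an extra crate is not an option, a rough standard-library approximation of the same idea is to validate the input with `str::is_ascii` before indexing its bytes. Note that this is a swapped-in technique and not the API of the ascii crate, and that the helper name `ascii_byte_at` is hypothetical.

```rust
/// Returns the byte at `index` only if the whole string is ASCII,
/// in which case one byte corresponds to exactly one character.
/// Returns `None` for non-ASCII input or an out-of-range index.
fn ascii_byte_at(s: &str, index: usize) -> Option<u8> {
    if s.is_ascii() {
        s.as_bytes().get(index).copied()
    } else {
        None
    }
}

fn main() {
    // ASCII input: byte 0 is the first character.
    println!("{}", ascii_byte_at("#comment", 0) == Some(b'#'));

    // Non-ASCII input is rejected instead of being indexed blindly.
    println!("{:?}", ascii_byte_at("\u{e9}#", 1));

    // An empty line yields `None` rather than a panic.
    println!("{:?}", ascii_byte_at("", 0));
}
```

Unlike the `unwrap()` call in the snippet above, this version reports invalid input as `None`, so the caller decides how to handle it.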

Sometimes, however, you simply want to work with binary data without interpreting it as characters, even if the source contains some ASCII characters. This can happen, for example, when you write a parser for some markup language such as Markdown. In this case, you can consider the entire input as a sequence of bytes:

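    use std::fs::File;
    use std::io::{BufReader, Read};

    fn main() {
        let mut file = BufReader::new(File::open("/etc/hosts").unwrap());

        // Read the whole file as raw bytes, without interpreting it as text.
        let mut buf = Vec::new();
        file.read_to_end(&mut buf).unwrap();

        // Split on newline bytes and skip lines whose first byte is `#`.
        // `first()` is used instead of `line[0]` so that empty lines do not panic.
        let mut iter = buf
            .split(|&c| c == b'\n')
            .filter(|line| line.first() != Some(&b'#'));
        println!("{:?}", iter.next().unwrap());
    }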
