Tokenizing a data string into a vector of structs?

I have the following data string, which is received over a TCP Winsock connection, and I want to tokenize it further into a vector of structs, where each struct represents one record.

    std::string buf = "44:william:adama:commander:stuff\n33:luara:roslin:president:data\n";

    struct table_t
    {
        std::string key;
        std::string first;
        std::string last;
        std::string rank;
        std::string additional;
    };

Each record in the string is delimited by a newline (\n). Here is my attempt at splitting the records; it does not yet split the fields:

    void tokenize(const std::string& str, std::vector<std::string>& records)
    {
        // Skip delimiters at the beginning.
        std::string::size_type lastPos = str.find_first_not_of("\n", 0);
        // Find the first delimiter after the start of the token.
        std::string::size_type pos = str.find_first_of("\n", lastPos);

        while (std::string::npos != pos || std::string::npos != lastPos)
        {
            // Found a token, add it to the vector.
            records.push_back(str.substr(lastPos, pos - lastPos));
            // Skip delimiters. Note the "not_of".
            lastPos = str.find_first_not_of("\n", pos);
            // Find the next delimiter.
            pos = str.find_first_of("\n", lastPos);
        }
    }

It seems completely unnecessary to repeat all of this code just to tokenize each record on the colon (the internal field separator), fill a struct, and push each struct into a vector. I'm sure there is a better way to do this, or perhaps the design itself is wrong.

Thanks for any help.

2 answers

To split the input into records, I would use an istringstream, if only because it will make the change simpler later when I want to read from a file. For tokenizing the fields, the most obvious solution is boost::regex, so:

    #include <boost/regex.hpp>
    #include <istream>
    #include <string>
    #include <vector>

    std::vector<table_t> parse( std::istream& input )
    {
        std::vector<table_t> retval;
        std::string line;
        while ( std::getline( input, line ) ) {
            // Five colon-separated fields, one capture group per field.
            static boost::regex const pattern(
                "([^:]*):([^:]*):([^:]*):([^:]*):([^:]*)" );
            boost::smatch matched;
            if ( !regex_match( line, matched, pattern ) ) {
                // Error handling...
            } else {
                retval.push_back(
                    table_t( matched[1], matched[2], matched[3],
                             matched[4], matched[5] ) );
            }
        }
        return retval;
    }

(I've assumed a logical constructor for table_t. Also, there is a very long tradition in C that names ending in _t are typedefs, so you would probably be better off finding a different convention.)
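If Boost is not available, roughly the same shape can be sketched with std::regex from C++11 swapped in for boost::regex. Everything named here is an assumption made for illustration: the Record type (used instead of table_t to avoid the _t suffix), its five-argument constructor standing in for the "logical constructor" mentioned above, and the function name parse_records. Malformed lines are simply skipped in this sketch; real code would likely report them.

```cpp
#include <istream>
#include <regex>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type with the assumed five-field constructor.
struct Record {
    std::string key, first, last, rank, additional;

    Record(std::string k, std::string f, std::string l,
           std::string r, std::string a)
        : key(std::move(k)), first(std::move(f)), last(std::move(l)),
          rank(std::move(r)), additional(std::move(a)) {}
};

// Reads one record per line; each line must consist of exactly five
// colon-separated fields. Lines that do not match are skipped.
std::vector<Record> parse_records(std::istream& input) {
    static const std::regex pattern(
        "([^:]*):([^:]*):([^:]*):([^:]*):([^:]*)");
    std::vector<Record> result;
    std::string line;
    while (std::getline(input, line)) {
        std::smatch m;
        if (std::regex_match(line, m, pattern)) {
            result.emplace_back(m[1].str(), m[2].str(), m[3].str(),
                                m[4].str(), m[5].str());
        }
    }
    return result;
}
```

To use it on the buffer from the question, wrap the string in a std::istringstream and pass that stream to parse_records; the same function then works unchanged on a std::ifstream.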


My solution:

    #include <cstring>
    #include <iostream>
    #include <locale>
    #include <sstream>
    #include <string>
    #include <vector>

    // A ctype facet that additionally classifies ':' as whitespace,
    // so that operator>> treats it as a field separator.
    struct colon_separated_only: std::ctype<char>
    {
        colon_separated_only(): std::ctype<char>(get_table()) {}

        static std::ctype_base::mask const* get_table()
        {
            typedef std::ctype<char> cctype;
            static const cctype::mask *const_rc = cctype::classic_table();

            // Copy the classic table, then mark ':' as a space character.
            static cctype::mask rc[cctype::table_size];
            std::memcpy(rc, const_rc, cctype::table_size * sizeof(cctype::mask));

            rc[':'] = std::ctype_base::space;
            return &rc[0];
        }
    };

    struct table_t
    {
        std::string key;
        std::string first;
        std::string last;
        std::string rank;
        std::string additional;
    };

    int main()
    {
        std::string buf = "44:william:adama:commander:stuff\n33:luara:roslin:president:data\n";
        std::stringstream s(buf);
        // The locale takes ownership of the facet.
        s.imbue(std::locale(std::locale(), new colon_separated_only()));

        table_t t;
        std::vector<table_t> data;
        while ( s >> t.key >> t.first >> t.last >> t.rank >> t.additional )
        {
            data.push_back(t);
        }

        for (size_t i = 0; i < data.size(); ++i)
        {
            std::cout << data[i].key << " ";
            std::cout << data[i].first << " " << data[i].last << " ";
            std::cout << data[i].rank << " " << data[i].additional << std::endl;
        }
        return 0;
    }

Output:

    44 william adama commander stuff
    33 luara roslin president data

Online demo: http://ideone.com/JwZuk


The technique I used here is described in my answer to another question:

Elegant ways to count the frequency of words in a file
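One caveat with classifying ':' as whitespace: operator>> skips runs of whitespace, so an empty field such as the one in "44::adama" is silently dropped and the later fields shift left. For comparison, here is a minimal sketch of a plain std::getline alternative that reuses one helper for both delimiters and keeps empty fields. The names split, Row, and parse_rows are hypothetical, and records without exactly five fields are skipped rather than reported.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits text on a single delimiter character.
// Empty fields in the middle are kept; a trailing delimiter does not
// produce an extra empty field at the end.
std::vector<std::string> split(const std::string& text, char delim) {
    std::vector<std::string> parts;
    std::istringstream stream(text);
    std::string part;
    while (std::getline(stream, part, delim)) {
        parts.push_back(part);
    }
    return parts;
}

// Hypothetical record type mirroring the fields of table_t.
struct Row {
    std::string key, first, last, rank, additional;
};

// Splits the buffer into lines, then each line into fields.
// Lines that do not yield exactly five fields are skipped.
std::vector<Row> parse_rows(const std::string& buf) {
    std::vector<Row> rows;
    for (const std::string& line : split(buf, '\n')) {
        const std::vector<std::string> f = split(line, ':');
        if (f.size() == 5) {
            rows.push_back(Row{f[0], f[1], f[2], f[3], f[4]});
        }
    }
    return rows;
}
```

Calling parse_rows on the buf string from the question would yield two Row values, one per line. Because the trailing empty field is not reported by std::getline, a record whose last field is empty would be rejected by the size check in this sketch.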


Source: https://habr.com/ru/post/1346082/

