Parsing a complex CSV file in C++

I have a large CSV file that looks like this:

23456, The End is Near, a silly description that makes no sense, http://www.example.com , 45332, July 5, 1998 Sunday, 45.332

This is only one line of the CSV file; there are about 500 thousand lines like it.

I want to parse this file using C++. The code I started with is:

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>

using namespace std;

int main()
{
    // open the input csv file containing training data
    ifstream inputFile("my.csv");
    string line;
    while (getline(inputFile, line, ','))
    {
        istringstream ss(line);
        // declaring appropriate variables present in csv file
        long unsigned id;
        string url, title, description, datetaken;
        float val1, val2;
        ss >> id >> url >> title >> datetaken >> description >> val1 >> val2;
        cout << url << endl;
    }
    inputFile.close();
}

The problem is that it does not print the correct values.

I suspect it is not able to handle the spaces inside the fields. What do you suggest I do?

Thanks.

+4
5 answers

In this example you need two calls to getline to parse each record. The first, getline(cin, line), uses the default newline delimiter to read one full line of the CSV text. The second, getline(ss, field, ','), uses a comma as the delimiter to split that line into its fields.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

float get_float(const std::string& s)
{
    std::stringstream ss(s);
    float ret;
    ss >> ret;
    return ret;
}

int get_int(const std::string& s)
{
    std::stringstream ss(s);
    int ret;
    ss >> ret;
    return ret;
}

int main()
{
    std::string line;
    while (std::getline(std::cin, line))
    {
        std::stringstream ss(line);
        std::vector<std::string> v;
        std::string field;
        while (std::getline(ss, field, ','))
        {
            std::cout << " " << field;
            v.push_back(field);
        }
        int id = get_int(v[0]);
        float f = get_float(v[6]);
        std::cout << v[3] << std::endl;
    }
}

+4

Using std::istream to read into std::strings with the overloaded extraction operator will not work here: operator>> stops at the first whitespace, and since the whole record is one line, the default whitespace-delimited extraction will break fields apart at spaces rather than at commas. A quick fix would be to split the line on commas yourself and assign the values to the appropriate fields (instead of extracting them with std::istringstream and >>).

NOTE: see also jrok's point about using std::getline.

+1

Within the stated constraints, I think I would do something like this:

#include <locale>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>
#include <algorithm>

// A ctype that classifies only comma and new-line as "white space":
struct field_reader : std::ctype<char>
{
    field_reader() : std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        static std::vector<std::ctype_base::mask> rc(table_size, std::ctype_base::mask());
        rc[','] = std::ctype_base::space;
        rc['\n'] = std::ctype_base::space;
        return &rc[0];
    }
};

// A struct to hold one record from the file:
struct record
{
    std::string key, name, desc, url, zip, date, number;

    friend std::istream& operator>>(std::istream& is, record& r)
    {
        return is >> r.key >> r.name >> r.desc >> r.url
                  >> r.zip >> r.date >> r.number;
    }

    friend std::ostream& operator<<(std::ostream& os, record const& r)
    {
        return os << "key: "      << r.key
                  << "\nname: "   << r.name
                  << "\ndesc: "   << r.desc
                  << "\nurl: "    << r.url
                  << "\nzip: "    << r.zip
                  << "\ndate: "   << r.date
                  << "\nnumber: " << r.number;
    }
};

int main()
{
    std::stringstream input("23456, The End is Near, A silly description that makes no sense, http://www.example.com, 45332, 5th July 1998 Sunday, 45.332");

    // use our ctype facet with the stream:
    input.imbue(std::locale(std::locale(), new field_reader()));

    // read in all our records:
    std::istream_iterator<record> in(input), end;
    std::vector<record> records{ in, end };

    // show what we read:
    std::copy(records.begin(), records.end(),
              std::ostream_iterator<record>(std::cout, "\n"));
}

It is, without a doubt, longer than most of the others, but it is all broken into small, mostly reusable pieces. Once the other pieces are in place, the code that actually reads the data is trivial:

  std::vector<record> records{ in, end }; 

Another point I find compelling: the code worked correctly the first time I compiled it (and I find that fairly routine with this style of programming).

+1

I just solved this problem for myself and wanted to share!
It may be a bit of overkill, but it shows a working example of how the Boost Tokenizer and vectors handle a bigger version of the problem.

/*
 * Alfred Haines Copyleft 2013
 * convert csv to sql file
 * csv2sql requires that each line is a unique record
 *
 * This is an example of file reading and the Boost tokenizer
 *
 * In the spirit of COBOL I do not output until the end,
 * when all the print lines are output at once
 * Special thanks to SBHacker for the code to handle linefeeds
 */
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <boost/tokenizer.hpp>
#include <boost/iostreams/device/file.hpp>
#include <boost/iostreams/stream.hpp>
#include <boost/algorithm/string/replace.hpp>

namespace io = boost::iostreams;
using boost::tokenizer;
using boost::escaped_list_separator;
typedef tokenizer<escaped_list_separator<char> > so_tokenizer;
using namespace std;
using namespace boost;

vector<string> parser(string);

int main()
{
    vector<string> stuff;   // this is the data in a vector
    string filename;        // this is the input file
    string c = "";          // this holds the print line
    string sr;

    cout << "Enter filename: ";
    cin >> filename;
    //filename = "drwho.csv";
    int lastindex = filename.find_last_of(".");       // find where the extension begins
    string rawname = filename.substr(0, lastindex);   // extract the raw name

    stuff = parser(filename);   // this gets the data from the file

    /** I ask if the user wants a new_index to be created */
    cout << "\n\nMySql requires a unique ID field as a Primary Key \n";
    cout << "If the first field is not unique (no duplicate entries) \nthen you should create a ";
    cout << "New index field for this data.\n";
    cout << "Not sure? Try no first, to maintain data integrity.\n";

    string ni;
    bool invalid_data = true;
    bool new_index = false;
    do {
        cout << "Should I create a New Index now? (y/n)" << endl;
        cin >> ni;
        if (ni == "y" || ni == "n") { invalid_data = false; }
    } while (invalid_data);
    cout << "\n";
    if (ni == "y") {
        new_index = true;
        sr = rawname.c_str();
        sr.append("_id");   // new_index field
    }

    // now make the sql code from the vector stuff
    // Create table section
    c.append("DROP TABLE IF EXISTS `");
    c.append(rawname.c_str());
    c.append("`;");
    c.append("\nCREATE TABLE IF NOT EXISTS `");
    c.append(rawname.c_str());
    c.append("` (");
    c.append("\n");
    if (new_index) {
        c.append("`");
        c.append(sr);
        c.append("` int(10) unsigned NOT NULL,");
        c.append("\n");
    }

    string s = stuff[0];   // it is assumed that line zero has fieldnames
    int x = 0;             // keeps the number of fields, used later to eliminate the comma on the last entry

    // boost tokenizer code from the Boost website -- tok holds the token
    so_tokenizer tok(s, escaped_list_separator<char>('\\', ',', '\"'));
    for (so_tokenizer::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
        x++;
        if (x == 1 && new_index == false) sr = static_cast<string>(*beg);
        c.append("`");
        c.append(*beg);
        if (x == 1 && new_index == false) {
            c.append("` int(10) unsigned NOT NULL,");
        } else {
            c.append("` text ,");
        }
        c.append("\n");
    }
    c.append("PRIMARY KEY (`");
    c.append(sr);
    c.append("`)");
    c.append("\n");
    c.append(") ENGINE=InnoDB DEFAULT CHARSET=latin1;");
    c.append("\n");
    c.append("\n");
    // The Create table section is done

    // Now make the Insert lines; one per line is safer in case you need to split the sql file
    for (int w = 1; w < stuff.size(); ++w) {
        c.append("INSERT INTO `");
        c.append(rawname.c_str());
        c.append("` VALUES ( ");
        if (new_index) {
            string String = static_cast<ostringstream*>(&(ostringstream() << w))->str();
            c.append(String);
            c.append(" , ");
        }
        int p = 1;   // used to eliminate the comma on the last entry
        // tokenizer code needs a unique name -- stok holds this token
        so_tokenizer stok(stuff[w], escaped_list_separator<char>('\\', ',', '\"'));
        for (so_tokenizer::iterator beg = stok.begin(); beg != stok.end(); ++beg) {
            c.append(" '");
            string str = static_cast<string>(*beg);
            boost::replace_all(str, "'", "\\'");
            // boost::replace_all(str, "\n", " -- ");
            c.append(str);
            c.append("' ");
            if (p < x) c.append(",");   // we don't want a comma on the last entry
            p++;
        }
        c.append(");\n");
    }

    // now print the whole thing to an output file
    string out_file = rawname.c_str();
    out_file.append(".sql");
    io::stream_buffer<io::file_sink> buf(out_file);
    std::ostream out(&buf);
    out << c;

    // let the user know that they are done
    cout << "Well if you got here then the data should be in the file " << out_file << "\n";
    return 0;
}

vector<string> parser(string filename)
{
    typedef tokenizer<escaped_list_separator<char> > Tokenizer;
    escaped_list_separator<char> sep('\\', ',', '\"');
    vector<string> stuff;
    string data(filename);
    ifstream in(filename.c_str());
    string li;
    string buffer;
    bool inside_quotes(false);
    size_t last_quote(0);
    while (getline(in, buffer)) {
        // --- deal with line breaks in quoted strings
        last_quote = buffer.find_first_of('"');
        while (last_quote != string::npos) {
            inside_quotes = !inside_quotes;
            last_quote = buffer.find_first_of('"', last_quote + 1);
        }
        li.append(buffer);
        if (inside_quotes) {
            li.append("\n");
            continue;
        }
        // ---
        stuff.push_back(li);
        li.clear();   // clear here, next check could fail
    }
    in.close();
    //cout << stuff.size() << endl;
    return stuff;
}

0

You are right to suspect that your code is not behaving as you intend because there are spaces inside the field values.

If you really have a "simple" CSV, where no field can contain a comma inside its value, I would step back from the stream operators and possibly from C++ altogether. The sample program in the question simply reorders the fields; there is no need to actually interpret or convert the values to their respective types (unless validation was also a goal). Reordering by itself is very easy to do with awk. For example, the following command will reverse the first three fields of a simple CSV file.

 cat infile | awk -F, '{ print $3","$2","$1 }' > outfile 

If the goal really is to use this piece of code as a launch pad for bigger and better things, then I would start by searching for commas. The std::string class has built-in methods for finding the offsets of particular characters. You can make this approach as elegant or inelegant as you like; the most elegant versions end up looking much like the Boost tokenizer code.

A quick and dirty approach is to rely on the program knowing that there are N fields and to look for the positions of the corresponding N-1 commas. Once you have those positions, it is pretty simple to call std::string::substr to extract the fields of interest.

0

Source: https://habr.com/ru/post/1497437/

