How to cut one CSV file into several smaller ones grouped by field?

I have a large dataset of the World Bank Millennium Development Goals indicators as CSV. The data looks like this:

 Country Code   Country Name   Indicator
 ABW            Aruba          % Forest coverage
 ADO            Andorra        % Forest coverage
 AFG            Afghanistan    % Forest coverage
 ...
 ABW            Aruba          % Literacy rate
 ADO            Andorra        % Literacy rate
 AFG            Afghanistan    % Literacy rate
 ...
 ABW            Aruba          % Another indicator
 ADO            Andorra        % Another indicator
 AFG            Afghanistan    % Another indicator

The file is currently 8.2 MB. I am going to build a web interface for this data, and I would like to slice it by country so that an ajax request can download a separate CSV for each country.

I'm not sure how to do this programmatically or with any tool. It doesn't have to be Python, but that's what I know best. I don't need a complete script; a general pointer on how to approach this problem is appreciated.

The source data I'm working with is here:

http://duopixel.com/stack/data.csv

4 answers

One-liner:

 awk -F "," 'NF>1 && NR>1 {print $0 >> ("data_" $1 ".csv"); close("data_" $1 ".csv")}' data.csv 

This creates new files called data_ABW.csv , etc., containing the relevant rows. The NR>1 part skips the header row. Then, for each row, it appends the whole row ( $0 ) to a file named data_$1.csv , where $1 is replaced by the text in the first column of that row. Finally, the close call ensures that too many files aren't left open at once. If you don't have many countries, you can drop it and speed the command up significantly.
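As the last sentence suggests, dropping the close() call is a small change; a sketch, with a tiny inline sample (hypothetical rows) standing in for the real data.csv:

```shell
# Tiny sample standing in for the real data.csv (hypothetical rows)
printf 'Country Code,Country Name,Indicator\nABW,Aruba,%% Forest coverage\nADO,Andorra,%% Forest coverage\n' > data.csv

# Same split as above, but without close(): faster, at the cost of one
# open file handle per country (fine for ~200 countries on most systems)
awk -F "," 'NF>1 && NR>1 {print $0 >> ("data_" $1 ".csv")}' data.csv
```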

In response to @Lenwood's comment below, to include the header in each output file, you can do this:

 awk -F "," 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$1]) {print header >> ("data_" $1 ".csv"); files[$1]=1}; print $0 >> ("data_" $1 ".csv"); close("data_" $1 ".csv")}' data.csv 

(You may need to escape the exclamation mark ...) The first new part, NR==1 {header=$0}; , just saves the first line of the input file in the header variable. The other new part, if(! files[$1]) ... files[$1]=1}; , uses the associative array files to track whether the header has already been written to a given file, and writes it there if not.

Note that this appends to the files, so if they already exist, new rows are simply added to them. Therefore, if new data arrives in your main file, you will probably want to delete the output files before running this command again.
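For example (per-country file names assumed as above):

```shell
# Simulate stale output left over from a previous run (hypothetical names)
touch data_ABW.csv data_ADO.csv

# Remove the per-country files before re-running the awk split, otherwise
# the >> redirection appends the new rows to the stale copies
rm -f data_*.csv
```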

(In case it's not obvious: if you want the files to be named like data_Aruba , change $1 to $2 .)


You can use Python's csv module together with itertools.groupby .
The following example was tested in Python 2.7.1.
Edit: updated in response to the new information added to the question.

 import csv, itertools as it, operator as op

 csv_contents = []
 with open('yourfile.csv', 'rb') as fin:
     dict_reader = csv.DictReader(fin)    # default delimiter is comma
     fieldnames = dict_reader.fieldnames  # save for writing
     for line in dict_reader:             # read in all of your data
         csv_contents.append(line)        # gather data into a list (of dicts)

 # input to itertools.groupby must be sorted by the grouping value
 sorted_csv_contents = sorted(csv_contents, key=op.itemgetter('Country Name'))

 for groupkey, groupdata in it.groupby(sorted_csv_contents,
                                       key=op.itemgetter('Country Name')):
     with open('slice_{:s}.csv'.format(groupkey), 'wb') as fou:
         dict_writer = csv.DictWriter(fou, fieldnames=fieldnames)
         dict_writer.writeheader()  # new method in 2.7; use writerow() in 2.6-
         dict_writer.writerows(groupdata)

Other notes:

  • You can use a regular csv reader and writer, but DictReader and DictWriter are good because you can refer to columns by name.
  • Always use the 'b' flag when reading or writing .csv files, because on Windows it affects how line endings are handled.
  • If something is unclear, let me know and I will explain further!
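The 'b'-flag note above applies to Python 2; in Python 3 the csv module expects text-mode files opened with newline='' instead. A minimal Python 3 sketch of the same approach, with a tiny inline sample (hypothetical rows) standing in for the real file:

```python
import csv, itertools, operator

# Tiny inline sample standing in for the real data (hypothetical rows)
with open('yourfile.csv', 'w', newline='') as f:
    f.write('Country Code,Country Name,Indicator\n'
            'ABW,Aruba,% Forest coverage\n'
            'ADO,Andorra,% Forest coverage\n'
            'ABW,Aruba,% Literacy rate\n')

with open('yourfile.csv', newline='') as fin:
    reader = csv.DictReader(fin)
    fieldnames = reader.fieldnames  # save the header for writing
    # groupby needs its input sorted by the grouping key
    rows = sorted(reader, key=operator.itemgetter('Country Name'))

for key, group in itertools.groupby(rows, key=operator.itemgetter('Country Name')):
    with open('slice_{}.csv'.format(key), 'w', newline='') as fou:
        writer = csv.DictWriter(fou, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(group)
```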

This is very simple with the pandas Python data analysis library:

 from pandas.io.parsers import read_csv

 df = read_csv(input_file, header=1, sep='\t', index_col=[0,1,2])
 for (country_code, country_name), group in df.groupby(level=[0,1]):
     group.to_csv(country_code + '.csv')

Result

 $ for f in *.csv ; do echo $f; cat $f; echo; done
 ABW.csv
 Country Code,Country Name,Indicator
 ABW,Aruba,% Forest coverage
 ABW,Aruba,% Literacy rate
 ABW,Aruba,% Another indicator

 ADO.csv
 Country Code,Country Name,Indicator
 ADO,Andorra,% Forest coverage
 ADO,Andorra,% Literacy rate
 ADO,Andorra,% Another indicator

 AFG.csv
 Country Code,Country Name,Indicator
 AFG,Afghanistan,% Forest coverage
 AFG,Afghanistan,% Literacy rate
 AFG,Afghanistan,% Another indicator

With shell scripting:

First, awk -F "," '{print $1}' youfilename.csv | sort | uniq > code.lst will give you a list of country codes. Then you can loop over the country codes and use grep to select all the lines in youfilename.csv that match each code:

 for c in `cat code.lst`
 do
   grep "^$c" youfilename.csv > youfilename_$c.csv
 done

Source: https://habr.com/ru/post/1400071/

