Parsing a CSV file in bash

Question

I have a file formatted as follows:

string1,string2,string3,...
...

I need to analyze the second column, counting the occurrences of each distinct value, and create a file formatted as follows:

"number of occurrences of x",x
"number of occurrences of y",y
...

I managed to write the following script that works fine:

#!/bin/bash

> output
regExp='^\s*([0-9]+) (.+)$'
while IFS= read -r line
do
    if [[ "$line" =~ $regExp ]]
    then
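        # Pass the data as arguments, not as the format string, so that
        # a '%' or '\' in the value is printed literally.
        printf '%s,%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" >> output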
    fi
done <<< "$(gawk -F , '!/^$/ {print $2}' "$1" | sort | uniq -c)"

My question is: Is there a better and easier way to do this job?

In particular, I do not know how to fix this one-liner:

gawk -F , '!/^$/ {print $2}' miocsv.csv | sort | uniq -c | gawk '{print $1","$2}'

The problem is that string2 can contain spaces, and if it does, the second call to gawk truncates the string at the first space. I also don't know how to print everything from field 2 to field NF as a single string when the separator can occur several times in a row.
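
A minimal way to reproduce the truncation, assuming `uniq -c` pads the count with leading spaces and puts a single space before the value (as the GNU and BSD versions do). The sample line is made up for illustration, and plain `awk` stands in for `gawk`:

```shell
# Simulated `uniq -c` output line: padded count, one space, then the value.
line='      4 da vinci'

# With the default field separator, awk splits on every run of whitespace,
# so the value "da vinci" is spread over $2 and $3 and only "da" is printed.
truncated=$(printf '%s\n' "$line" | awk '{ print $1 "," $2 }')
printf '%s\n' "$truncated"
```

This prints `4,da` instead of `4,da vinci`.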

Thanks a lot, bye

EDIT:

As requested, here is some example data:

(This is an exercise, so apologies for the made-up data.)

Input:

*,*,*
test,  test  ,test
prova, * , prova
test,test,test
prova,  prova   ,prova
leonardo,da vinci,leonardo
in,o    u   t   ,pr
, spaces ,
, spaces ,
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
in,o    u   t   ,pr
test,  test  ,test
,   tabs    ,
,   tabs    ,
po,po,po
po,po,po
po,po,po
prova, * , prova
prova, * , prova
*,*,*
*,*,*
*,*,*
, spaces ,
,   tabs    ,

Output:

3, * 
4,*
4,da vinci
2,o u   t   
3,po
1,  prova   
3, spaces 
3,  tabs    
1,test
2,  test

+4

bash regex awk csv gawk

Luca, Sep 08 '15 at 18:11

3

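To fix the last step of your pipeline, strip the leading spaces and then turn only the first remaining space (the one after the count) into a comma. In awk: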

gawk '{ sub(" *","",$0); sub(" ",",",$0); print }'

Or in sed:

sed 's/ *\([0-9]*\) /\1,/'
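
Put together with the first half of the pipeline from the question, this could look like the sketch below. The function name `count_col2` is hypothetical, `LC_ALL=C` is an added assumption to make the sort order reproducible, and the sed pattern is anchored and requires at least one digit, which is slightly stricter than the one-liner above:

```shell
# count_col2 FILE: print "count,value" for each distinct second column.
count_col2() {
    awk -F, '!/^$/ { print $2 }' "$1" |    # second column, skipping empty lines
        LC_ALL=C sort |                     # group identical values together
        uniq -c |                           # prefix each value with its count
        sed 's/^ *\([0-9][0-9]*\) /\1,/'    # "   3 value" becomes "3,value"
}
```

Calling `count_col2 miocsv.csv > output` should then write the file described in the question. Because only the single space after the count is replaced, spaces inside the value, including leading and trailing ones, are left alone.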

+1

meuh, Sep 08 '15 at 18:25

In Perl, along the same lines as Filipe's awk solution:

perl -F, -lane '$x{$F[1]}++; END{ for $i (sort keys %x) { print "$x{$i},$i" } }' input.csv

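Explanation:
-a autosplits each line into the array @F, whose indices start at $F[0], unlike awk's fields, which start at $1. The second column is therefore $F[1].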

0

Chris Koknat, Sep 08 '15 at 21:37

Filipe Gonçalves · Accepted Answer · 2015-09-08T18:25:47+0000

This can be done in a single awk call:

awk -F, '{ x[$2]++ } END { for (i in x) print x[i] "," i }' input.csv

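Here x is an associative array indexed by the second field; each element holds the number of times that value has been seen.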

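The order in which for (i in x) visits the elements is unspecified, so if you need the output ordered by the second column, pipe it through sort(1):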

awk -F, '{ x[$2]++ } END { for (i in x) print x[i] "," i }' input.csv | sort -t, -k2,2

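Here -t, tells sort to use a comma as the field separator, and -k2,2 restricts the sort key to the second field.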
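
As a sketch, the two commands can be wrapped in a function that also skips empty lines, the way the script in the question does. The function name `count_col2_awk` and the array name `count` are hypothetical, and `LC_ALL=C` is an added assumption to get a byte-wise, locale-independent order:

```shell
# count_col2_awk FILE: count second-column values in one awk pass.
count_col2_awk() {
    awk -F, '
        !/^$/ { count[$2]++ }                            # tally non-empty lines by field 2
        END   { for (v in count) print count[v] "," v }  # iteration order is unspecified
    ' "$1" | LC_ALL=C sort -t, -k2,2                     # order the result by value
}
```

For example, `count_col2_awk input.csv > output`. Since awk does the counting itself, no `uniq -c` is needed and there is no padded count to clean up afterwards.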
