Perl script to process a CSV file, aggregate properties distributed across multiple records

Sorry for the vague question; I'm doing my best to work out how to phrase it!

I have a CSV file that is a bit like this, only much longer:

 550672,1
 656372,1
 766153,1
 550672,2
 656372,2
 868194,2
 766151,2
 550672,3
 868179,3
 868194,3
 550672,4
 766153,4

The values in the first column are identification numbers, and the second column could be described as a property (for lack of a better word...). Identification number 550672 has properties 1, 2, 3 and 4. Can someone point me towards how to start building strings like that for all of the ID numbers? My ideal output would be a new CSV file that looks something like this:

 550672,1;2;3;4
 656372,1;2
 766153,1;4

and so on.

I am a complete Perl beginner (only 3 days in!), so I would really appreciate direction rather than a direct solution; I intend to study this material even if it takes me the rest of my days! I have tried to research this myself as best I can, though I think I got stuck not knowing what to actually search for. I can read and parse CSV files (I even got as far as removing duplicate values!), but this is where it really falls apart for me. Any help would be greatly appreciated!

+4
5 answers

I think it is better if I offer you a working program rather than just a few hints. The hints may still be of interest, and if you take the time to understand this code it will give you good practice.

It is best to use Text::CSV whenever you process CSV data, as all the awkward edge cases have already been debugged for you.

 use strict;
 use warnings;

 use Text::CSV;

 my $csv = Text::CSV->new;

 open my $fh, '<', 'data.txt' or die $!;

 my %data;

 while (my $line = <$fh>) {
     $csv->parse($line) or die "Invalid data line";
     my ($key, $val) = $csv->fields;
     push @{ $data{$key} }, $val;
 }

 for my $id (sort keys %data) {
     printf "%s,%s\n", $id, join ';', @{ $data{$id} };
 }

Output

 550672,1;2;3;4
 656372,1;2
 766151,2
 766153,1;4
 868179,3
 868194,2;3
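
As a side note for readers following along: Text::CSV can also read and parse in a single step with its getline method. Below is a minimal sketch of the same program using that approach (same assumed file name data.txt as above); it produces the same output.

 use strict;
 use warnings;

 use Text::CSV;

 my $csv = Text::CSV->new;

 open my $fh, '<', 'data.txt' or die $!;

 my %data;

 # getline reads the next record and returns its fields as an array ref,
 # or undef at end of file
 while (my $row = $csv->getline($fh)) {
     my ($key, $val) = @$row;
     push @{ $data{$key} }, $val;    # autovivifies the array for a new key
 }

 for my $id (sort keys %data) {
     printf "%s,%s\n", $id, join ';', @{ $data{$id} };
 }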
+4

First, props for asking for an approach rather than a solution. As you have probably already discovered with Perl, there is more than one way to do it.

The approach I would take would be:

 use strict;   # will save you big time in the long run
 my %ids;      # use a hash table with the id as the key to accumulate the properties

 open a file handle on the csv or die

 while (read another line from the file handle) {
     split the line into ID and property variables    # google the split function
     append the new property to the existing properties for this id in the hash table
         # if it doesn't exist already, it will be created
 }

 foreach my $key (keys %ids) {
     deduplicate the properties
     print/display/do whatever you need to do with the result
 }

This approach means you will iterate over the entire set twice (one of those passes in memory), so depending on the size of the data set that may be a problem. A more sophisticated approach would be to use a hash table of hash tables so that deduplication happens during the initial pass, but depending on how fast you want or need it to run, it may not be worth it at first; a sketch of that variant follows at the end of this answer.

Check out this question for a discussion of how to do the deduplication.
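
A minimal sketch of the hash-of-hashes variant mentioned above, offered only as an illustration since this answer deliberately leaves the real work to you (the file name data.csv is an assumption):

 use strict;
 use warnings;

 # The inner hash acts as a set, so duplicate properties are discarded as
 # they are read and no separate deduplication pass is needed.
 my %ids;

 open my $fh, '<', 'data.csv' or die "Cannot open data.csv: $!";

 while (my $line = <$fh>) {
     chomp $line;
     my ($id, $prop) = split /,/, $line;
     $ids{$id}{$prop} = 1;    # storing into the inner hash drops duplicates
 }

 for my $id (sort keys %ids) {
     print $id, ',', join(';', sort { $a <=> $b } keys %{ $ids{$id} }), "\n";
 }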

+3

OK: open the file as stdin in Perl, assume each row has two columns, then iterate over all the rows, using the left column as a hash key and collecting the right column into the array referenced by that key. At the end of the input file you have a hash of arrays; iterate over it, print each hash key, and join the elements of its array with ";" or whatever other separator you wish.

and here you go

 dtpwmbp:~ pwadas$ cat input.txt
 550672,1
 656372,1
 766153,1
 550672,2
 656372,2
 868194,2
 766151,2
 550672,3
 868179,3
 868194,3
 550672,4
 766153,4
 dtpwmbp:~ pwadas$ cat bb2.pl
 #!/opt/local/bin/perl

 my %hash;

 while (<>) {
     chomp;
     my ($key, $value) = split /,/;
     push @{ $hash{$key} }, $value;
 }

 foreach my $key (sort keys %hash) {
     print $key . "," . join(";", @{ $hash{$key} }) . "\n";
 }
 dtpwmbp:~ pwadas$ cat input.txt | perl -f bb2.pl
 550672,1;2;3;4
 656372,1;2
 766151,2
 766153,1;4
 868179,3
 868194,2;3
 dtpwmbp:~ pwadas$
+2
 perl -F"," -ane 'chomp($F[1]);$X{$F[0]}=$X{$F[0]}.";".$F[1];if(eof){for(keys %X){$X{$_}=~s/;//;print $_.",".$X{$_}."\n"}}' 
+2

Another (non-Perl) way, which is incidentally shorter and arguably more elegant:

 #!/opt/local/bin/gawk -f

 BEGIN { FS = OFS = ","; }

 NF > 0 {
     IDs[$1] = IDs[$1] ";" $2;
 }

 END {
     for (i in IDs) print i, substr(IDs[i], 2);
 }

The first line (after the interpreter line) sets both the input field separator (FS) and the output field separator (OFS) to a comma. The second block runs for every line that has more than zero fields; it uses the identifier ($1) as the key and appends the value ($2) to the string accumulated for it. This is done for every row.

The END block prints the resulting pairs in an unspecified order. If you want them sorted, either use gawk's asorti function or pipe the output of this snippet through sort -t, -k1n,1n.

+1

Source: https://habr.com/ru/post/1434497/

