Efficient data processing in a text file

Suppose I have a text file with the following structure (name, score):

    a 0
    a 1
    b 0
    c 0
    d 3
    b 2

And so on. My goal is to sum the scores for each name and order the names from the highest total to the lowest. So in this case, I need the following output:

    d 3
    b 2
    a 1
    c 0

I don’t know in advance what names will be in the file.

I was wondering if there is an efficient way to do this. My text file can contain up to 50,000 entries.

The only way I can think of is to start at line 1, remember that name, and then iterate over the entire file looking for every other occurrence of that name to add up its scores. That seems terribly inefficient, so I was wondering if there is a better way to do this.

3 answers

Read all the data into a dictionary:

    from collections import defaultdict
    from operator import itemgetter

    # sum up all scores per name in a single pass over the file
    scores = defaultdict(int)
    with open('my_file.txt') as fobj:
        for line in fobj:
            name, score = line.split()
            scores[name] += int(score)

and sort it:

    # sort by score (the second item of each pair), highest first
    for name, score in sorted(scores.items(), key=itemgetter(1), reverse=True):
        print(name, score)

prints:

    d 3
    b 2
    a 1
    c 0
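For reference, a minimal sketch of creating the my_file.txt these snippets read, using the sample pairs from the question (one name/score pair per line):

    # hypothetical setup: write the question's sample data to the file
    with open('my_file.txt', 'w') as fobj:
        fobj.write('a 0\na 1\nb 0\nc 0\nd 3\nb 2\n')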

Performance

To compare the performance of this answer with the collections.Counter version from @SvenMarnach, I wrapped both approaches in functions. Here fobj is a file opened for reading. I use io.StringIO, so I/O delays should hopefully not be part of the measurement:

    from collections import Counter
    from collections import defaultdict
    from operator import itemgetter

    def counter(fobj):
        # Counter-based version
        scores = Counter()
        fobj.seek(0)
        for line in fobj:
            key, score = line.split()
            scores.update({key: int(score)})
        return scores.most_common()

    def default(fobj):
        # defaultdict-based version
        scores = defaultdict(int)
        fobj.seek(0)
        for line in fobj:
            name, score = line.split()
            scores[name] += int(score)
        return sorted(scores.items(), key=itemgetter(1), reverse=True)
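For example, fobj can be built from the question's sample data (a minimal sketch; the actual test input is not shown above):

    import io

    # hypothetical test fixture: the six sample pairs as an in-memory file
    fobj = io.StringIO('a 0\na 1\nb 0\nc 0\nd 3\nb 2\n')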

Results for collections.Counter:

    %timeit counter(fobj)
    10000 loops, best of 3: 59.1 µs per loop

Results for collections.defaultdict:

    %timeit default(fobj)
    10000 loops, best of 3: 15.8 µs per loop

It looks like defaultdict is about four times faster. I would not have guessed that. But when it comes to performance, you need to measure.
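The %timeit magic above is IPython-specific; a plain-Python sketch using the standard timeit module would look like this (assuming counter, default, and fobj are defined as above):

    import timeit

    # time 10000 calls of each function; smaller totals are faster
    print(timeit.timeit('counter(fobj)', globals=globals(), number=10000))
    print(timeit.timeit('default(fobj)', globals=globals(), number=10000))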


This is a good use case for collections.Counter:

    from collections import Counter

    # Counter sums the scores: update() adds counts for existing keys
    scores = Counter()
    with open('my_file') as f:
        for line in f:
            key, score = line.split()
            scores.update({key: int(score)})

    # most_common() returns the pairs sorted by score, highest first
    for key, score in scores.most_common():
        print(key, score)

Pandas can do this quite easily:

    import pandas as pd

    # read the space-separated name/score pairs
    data = pd.read_csv('filename.txt', sep=' ', names=['Name', 'Score'])
    # sum per name, then sort by the summed score, highest first
    result = data.groupby('Name').sum().sort_values('Score', ascending=False)
    print(result)
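To get the same plain "name score" output as the other answers, a small follow-up sketch iterating over the summed column (result is the sorted DataFrame from above):

    # Series.items() yields (index, value), i.e. (name, summed score) pairs
    for name, score in result['Score'].items():
        print(name, score)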
