Assuming your input fits into memory, iterate over it line by line and keep a dictionary that maps each distinct value of the first field to the identifier assigned to it. If the input does not fit into memory, sort it by the first field and then group adjacent lines to achieve the same thing.
In Python:
    import sys

    next_id = 0
    str_to_id = {}  # first-field string -> assigned numeric id

    for line in sys.stdin:
        fields = line.strip().split(',')
        this_id = str_to_id.get(fields[0])
        if this_id is None:
            # First time this string is seen: assign the next id.
            next_id += 1
            this_id = next_id
            str_to_id[fields[0]] = this_id
        fields[0] = str(this_id)
        print(','.join(fields))
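For input too large for the dictionary to fit in memory, the sort-and-group idea could look roughly like the sketch below. It assumes the lines have already been sorted by their first comma-separated field (for example with an external sort), so that equal keys are adjacent. The function name renumber_sorted is made up for illustration. Note that, unlike the dictionary version, identifiers are assigned in sorted order rather than in order of first appearance, and the output comes out in sorted order too.

```python
import sys
from itertools import groupby


def renumber_sorted(lines):
    """Yield lines with the first field replaced by a sequential id.

    Assumption: `lines` is already sorted by its first comma-separated
    field, so all lines sharing a key are adjacent.
    """
    rows = (line.rstrip('\n').split(',') for line in lines)
    # groupby only merges *adjacent* equal keys, hence the sorting requirement.
    groups = groupby(rows, key=lambda fields: fields[0])
    for next_id, (_, group) in enumerate(groups, start=1):
        for fields in group:
            yield ','.join([str(next_id)] + fields[1:])


if __name__ == '__main__':
    for out in renumber_sorted(sys.stdin):
        print(out)
```

With the script saved under a hypothetical name such as renumber.py, it could be fed from an external sort, e.g. sort -t, -k1,1 input.csv | python renumber.py. Only one group is held at a time, so memory use does not grow with the number of distinct keys.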