I have a file containing identifiers (one per line) that I want to run through an analysis pipeline. If everything works, the analysis prints another list with the same identifiers (not necessarily in the same order).
However, it turns out that the analysis was not performed for some of the input identifiers, so they are missing from the output file. The cause was a dropped Internet connection: the program skipped some identifiers and then carried on successfully through the rest of the list once the connection returned. As a result, the skipped identifiers are scattered throughout the input file rather than grouped together.
Thus, the input file contains lines that are not in the output file, and I would like to extract them so I can restart my analysis on just those identifiers. This sounds like a job for a simple grep -vf, but this is where it gets confusing.
I know that my input file has 100,000 lines, and my output file has 90,000 lines. The difference should therefore be 100000 - 90000 = 10000 identifiers. But when I run
fgrep -vf output input | wc -l
I get 9990 instead of 10000, so 10 identifiers are going missing somewhere. I checked whether the problem was Windows line endings or tabs (those have caused other kinds of unexpected grep behavior for me before), but that is not the issue here. My identifiers contain only uppercase and lowercase letters, digits, and underscores, and no other characters, e.g.
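For reference, the line-ending and tab checks I ran looked roughly like this (a sketch on a hypothetical sample file, not my real data):

```shell
# Hypothetical sample: one clean line, one with a stray CR, one with a tab
printf 'id_one\nid_two\r\nid\tthree\n' > sample.txt

# Count lines containing a carriage return (Windows-style CRLF endings)
grep -c $'\r' sample.txt     # prints 1

# Count lines containing a tab character
grep -c $'\t' sample.txt     # prints 1

# List every character outside [A-Za-z0-9_] occurring in the file
grep -o '[^A-Za-z0-9_]' sample.txt | sort -u
```

On my real input and output files, both counts came back zero.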
Si_d14LSK_TRRt_Pmkk_comp10_c0_seq2
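As an additional sanity check (my own sketch, not part of the original pipeline), the two lists can also be compared line-by-line with comm, which works on whole sorted lines rather than grep-style substring patterns:

```shell
# Two hypothetical identifier lists; 'beta' is missing from the second
printf 'alpha\nbeta\ngamma\n' > in.txt
printf 'gamma\nalpha\n'       > out.txt

# comm -23 prints lines unique to the first (sorted) file,
# i.e. identifiers that never made it into the output
comm -23 <(sort in.txt) <(sort out.txt)    # prints: beta
```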
The output list of identifiers was generated using
ls -lh dir | sed "1d" | sed "s/.* //" | sed "s/.xml//" > output
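In case it matters, parsing the columns of ls -lh could be replaced by a plain listing; a simpler equivalent might be (a sketch, assuming the directory holds one .xml file per identifier):

```shell
# Hypothetical directory with one .xml file per identifier
mkdir -p dir
touch dir/alpha.xml dir/beta.xml

# List file names and strip the .xml suffix, one identifier per line
ls dir | sed 's/\.xml$//' > output
```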
What, apart from line-ending issues, could cause this discrepancy? Is it something about how grep, or specifically fgrep -vf, matches the patterns?
I am using Ubuntu 12.04.4 LTS with GNU grep 2.10.