Merge two columns of a text file on Linux

I have a text file with several columns of text and values. It has this structure:

CAR 38 DOG 42 CAT 89 CAR 23 APE 18 

If column 1 holds a string, column 2 does not (or rather, it holds an empty string). And vice versa: if column 1 is empty, column 2 holds a string. In other words, an "object" (CAR, CAT, DOG, etc.) occurs in either column 1 or column 2, but never in both.

I'm looking for an efficient way to consolidate columns 1 and 2 so that the file looks like this instead:

 CAR 38
 DOG 42
 CAT 89
 CAR 23
 APE 18

I can do this in a Bash script using while and if, but I'm sure there is an easier way to do this. Can anyone help?

Cheers! Z

+6
2 answers

Try the following:

 column -t file 

Output:

 CAR  38
 DOG  42
 CAT  89
 CAR  23
 APE  18
+17

Note. If:

  • you want output with automatically sized, left-aligned, fixed-width columns (the longest value in a column determines its width, and shorter values are padded on the right with spaces)
  • and are happy with two spaces as the column separator
  • and your files are small enough to be read into memory as a whole,

use Cyrus' simpler, column-based answer.

See below for how the column-based approach compares to the awk-based approach in terms of performance and resource consumption.


awk is your friend here:

 awk -v OFS='  ' '{ print $1, $2 }' file 
  • awk splits each line into fields on whitespace by default, ignoring leading whitespace, so with your input, lines like CAR 38 and DOG 42 are parsed the same way ( CAR and DOG become field 1, $1 , and 38 and 42 become field 2, $2 ).
  • -v OFS='  ' sets the output field separator to two spaces (the default is a single space); note that the output values are not padded, so this alone does not produce aligned output.
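
To see the field splitting in action, here is a minimal sketch that feeds the command a few sample lines. The exact indentation of the original file is not known, so the whitespace in the sample is an assumption made for illustration:

```shell
# Sample input: the object sits either in column 1 or, after leading
# whitespace, in column 2 (the indentation is assumed, not taken from the question).
input='CAR   38
      DOG   42
      CAT   89
CAR   23
      APE   18'

# Default field splitting ignores leading whitespace, so $1 is always the
# object and $2 is always the value, wherever they sit on the line.
merged=$(printf '%s\n' "$input" | awk -v OFS='  ' '{ print $1, $2 }')
printf '%s\n' "$merged"
```

Every output line starts with the object name, followed by two spaces and the value.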

To create aligned output with fields of varying widths, use Awk's printf function, which gives you more control over the output; for example, the following outputs a 10-character-wide, left-aligned 1st column and a 2-character-wide, right-aligned 2nd column:

 awk '{ printf "%-10s %2s\n", $1, $2 }' file 
  • Note that column widths must be known in advance.
  • In contrast, column -t conveniently determines the column widths automatically by analyzing all the data first, but that has implications for performance and resource consumption; see below.
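
As a quick illustration of the fixed-width format, the following sketch runs the printf command on a few sample lines (again with assumed indentation). Values shorter than the stated width are padded; values longer than it are not truncated and would push the next column out of alignment:

```shell
# Sample lines; the indentation is assumed for illustration.
aligned=$(printf '%s\n' 'CAR   38' '      DOG   42' '      APE   7' |
  awk '{ printf "%-10s %2s\n", $1, $2 }')

# Each line is 13 characters wide: 10 (left-aligned name) + 1 (space)
# + 2 (right-aligned value).
printf '%s\n' "$aligned"
```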

Performance / resource comparison between column -t and Awk:

  • column -t must analyze all the input data up front, in a first pass, in order to determine the maximum width of each input column; from what I can tell, it does so by first reading the entire input into memory, which can be problematic with large input files.
  • In contrast, the Awk solution reads lines one by one, but relies on knowing the column widths ahead of time.

As a result,

  • column -t will consume memory in proportion to the size of the input, while awk will use a constant amount of memory.
  • column -t is typically slower, depending on the Awk implementation used: mawk is much faster, gawk is slightly faster, and BSD awk is slower (!); results are based on a 10-million-line input file, with commands run on OS X 10.10.2 and Ubuntu 14.04.
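
If the widths are not known in advance and the input is a regular file (not a pipe), one possible middle ground is to let Awk read the file twice: once to find the widest value in each column, and once to print. This is a sketch that is not part of the answers above and was not included in the timings; the temporary sample file and its contents are hypothetical:

```shell
# Hypothetical sample file; the indentation is assumed for illustration.
file=$(mktemp)
printf '%s\n' 'CAR   38' '      DOG   42' '      CAT   89' \
  'CAR   23' '      GIRAFFE   7' > "$file"

# The same file is passed twice. While NR == FNR, awk is still in the
# first pass and only records the widest value seen in each column.
aligned=$(awk '
  NR == FNR {
    if (length($1) > w1) w1 = length($1)
    if (length($2) > w2) w2 = length($2)
    next
  }
  FNR == 1 { fmt = "%-" w1 "s  %" w2 "s\n" }   # build the format once, at the start of pass 2
  { printf fmt, $1, $2 }
' "$file" "$file")

printf '%s\n' "$aligned"
rm -f "$file"
```

Memory use stays constant because only the two widths are kept between passes, at the cost of reading the input twice.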
+8

Source: https://habr.com/ru/post/984916/

