Replace all spaces with a line break / paragraph break to make a list of words

I am trying to build a vocabulary list for the Greek text that we are translating in class. I want to replace each space or tab character with a paragraph mark so that each word appears on its own line. Can someone give me a sed command and explain what I'm doing? I'm still trying to figure sed out.

+48
regex sed
Dec 05 '09 at 18:31
8 answers

For reasonably modern versions of sed, edit standard input to get standard output with

 $ echo 'τέχνη βιβλίο γη κήπος' | sed -E -e 's/[[:blank:]]+/\n/g'
 τέχνη
 βιβλίο
 γη
 κήπος

If your vocabulary words are in files named lesson1 and lesson2, redirect sed's standard output to the file all-vocab with

 sed -E -e 's/[[:blank:]]+/\n/g' lesson1 lesson2 > all-vocab 

What it means:

  • The character class [[:blank:]] matches either a single space or a single tab character.
    • Use [[:space:]] instead to match any single whitespace character (commonly space, tab, newline, carriage return, form feed, and vertical tab).
    • The + quantifier means match one or more of the preceding pattern.
    • So [[:blank:]]+ is a sequence of one or more characters that are each a space or a tab.
  • \n in the replacement is the newline you want.
  • The /g modifier at the end means perform the substitution as many times as possible on each line, rather than just once.
  • The -E option tells sed to use POSIX extended regular expression syntax, and in particular the + quantifier used here. Without -E, your sed command becomes sed -e 's/[[:blank:]]\+/\n/g'. (Note the use of \+ rather than a plain +.)
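
To see what the + quantifier buys you, here is a small sketch that runs the substitution with and without it on a made-up line containing a run of two spaces and a tab. It assumes a sed that accepts -E and understands \n in the replacement, as GNU sed does; the sample words and variable names are only for illustration.

```shell
# Made-up input: two spaces between the first pair of words, a tab before the last.
input=$(printf 'alpha  beta\tgamma')

# With +, each run of blanks collapses into exactly one newline.
with_plus=$(printf '%s\n' "$input" | sed -E -e 's/[[:blank:]]+/\n/g')

# Without +, every single blank becomes its own newline,
# so the double space leaves an empty line behind.
without_plus=$(printf '%s\n' "$input" | sed -E -e 's/[[:blank:]]/\n/g')

printf '%s\n' "$with_plus"
printf -- '---\n'
printf '%s\n' "$without_plus"
```

The first result has three lines, the second has four because of the empty line where the double space used to be.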

Perl Compatible Regexes

For those familiar with Perl-compatible regular expressions and a sed with PCRE support, use \s+ to match runs of at least one whitespace character, as in

 sed -E -e 's/\s+/\n/g' old > new 

or

 sed -e 's/\s\+/\n/g' old > new 

These commands read the input from the old file and write the result to a file named new in the current directory.

Maximum portability, maximum toughness

Going back to almost any version of sed since Version 7 Unix, the command invocation is a bit more baroque.

$ echo 'τέχνη βιβλίο γη κήπος' | sed -e 's/[ \t][ \t]*/\
/g'
τέχνη
βιβλίο
γη
κήπος

Notes:

  • Here we do not even assume the existence of the humble + quantifier and simulate it with a single space-or-tab ( [ \t] ) followed by zero or more of them ( [ \t]* ).
  • Similarly, if sed does not understand \n for a newline, we have to include the newline on the command line verbatim.
    • The \ at the end of the first line of the command is a continuation marker that escapes the newline immediately following it, and the rest of the command is on the next line.
      • Note: there must be no spaces before the escaped newline. That is, the first line must end with exactly a backslash followed by the end of the line.
    • This error-prone process helps one appreciate why the world moved to visible characters, and you will want to be careful when trying the command out by copy and paste.
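
One further wrinkle: POSIX treats a backslash inside a bracket expression as an ordinary character, so a sed that follows the standard strictly may read [ \t] as "space, backslash, or the letter t" rather than "space or tab". A possible workaround, sketched below under that assumption, is to put a real tab into a shell variable and splice it into the script. The variable names are hypothetical, and the script is double-quoted here only so that $tab expands, which is why the backslash before the newline has to be doubled.

```shell
# Hold a literal tab in a variable; printf understands \t everywhere.
tab=$(printf '\t')

# Inside double quotes the shell turns \\ into a single backslash,
# so sed still sees "backslash, newline" in the replacement.
words=$(printf 'one two\tthree\n' | sed -e "s/[ $tab][ $tab]*/\\
/g")

printf '%s\n' "$words"
```

This uses nothing beyond basic regular expressions and a literal newline, so it should behave the same on GNU and BSD sed.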

Backslashes and quoting

The commands above all used single quotes ( '' ), not double quotes ( "" ). Consider:

 $ echo '\\\\' "\\\\"
 \\\\ \\

That is, the shell applies different escaping rules to single-quoted strings than to double-quoted strings. You will usually want to protect all the backslashes that are common in regular expressions with single quotes.
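
The same comparison can be made with printf, which sidesteps the fact that some shells' built-in echo does its own backslash processing. This is only a sketch; the variable names are made up for the example.

```shell
# Single quotes: the shell passes all four backslashes through untouched.
single=$(printf '%s' '\\\\')

# Double quotes: the shell reads each \\ pair as one backslash, leaving two.
double=$(printf '%s' "\\\\")

printf 'single-quoted: %s (%d characters)\n' "$single" "${#single}"
printf 'double-quoted: %s (%d characters)\n' "$double" "${#double}"
```

The single-quoted string keeps four characters and the double-quoted one keeps two, which is exactly the kind of silent change that breaks a regular expression.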

+72
Dec 05 '09 at 18:40

Portable way to do this:

sed -e 's/[ \t][ \t]*/\
/g'

That is an actual newline between the backslash and the slash. Many sed implementations do not know about \n, so you need a literal newline. The backslash before the newline keeps sed from getting upset about it. (In sed scripts, commands are normally terminated by newlines.)

With GNU sed, you can use \n in the substitution and \s in the regular expression:

 sed -e 's/\s\s*/\n/g' 

GNU sed also supports "extended" regular expressions (egrep style, not perl-style) if you give it the -r flag, so you can use + :

 sed -r -e 's/\s+/\n/g' 

If this is only for Linux, you can probably go with the GNU command, but if you want it to work on systems with a non-GNU sed (for example BSD or Mac OS X), you might want the more portable option.

+50
Dec 05 '09 at 19:13

All of the sed examples above break on one platform or another. None of them work with the version of sed shipped on Macs.

However, the Perl regular expression works on any computer with Perl installed:

 perl -pe 's/\s+/\n/g' file.txt 

If you want to save the output:

 perl -pe 's/\s+/\n/g' file.txt > newfile.txt 

If you need only unique occurrences of words:

 perl -pe 's/\s+/\n/g' file.txt | sort -u > newfile.txt 
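
The sort -u step is not tied to Perl; it can follow any of the splitting commands in this thread. As a rough sketch, here it is after the sed command from the first answer, run on two throwaway lesson files created on the fly. The file contents and the temporary directory are made up for the example, mktemp is assumed to be available, and sed is assumed to accept -E and \n as GNU sed does. LC_ALL=C makes sort compare raw bytes, so the order is predictable but not proper Greek collation.

```shell
# Create two small sample files in a scratch directory (hypothetical data).
tmpdir=$(mktemp -d)
printf 'γη κήπος γη\n' > "$tmpdir/lesson1"
printf 'κήπος\tβιβλίο\n' > "$tmpdir/lesson2"

# Split both files into one word per line, then drop duplicate words.
unique=$(sed -E -e 's/[[:blank:]]+/\n/g' "$tmpdir/lesson1" "$tmpdir/lesson2" | LC_ALL=C sort -u)

printf '%s\n' "$unique"
rm -r "$tmpdir"
```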
+7
Dec 18 '14 at 19:02

This should do the job:

 sed -E -e 's/[ \t]+/\n/g' 

[ \t] means space OR tab. If you want any whitespace, you can also use \s.

[ \t]+ means as many spaces OR tabs as possible (but at least one).

s/x/y/ means replace the pattern x with y (here \n is a newline).

The g at the end means the substitution is repeated for every match on each line.
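
A quick way to see the effect of the trailing g is to run the same substitution with and without it. The sketch below replaces spaces with underscores instead of newlines so that it does not depend on how a particular sed handles \n; the sample input and variable names are invented for the example.

```shell
# Without g, only the first match on the line is replaced.
first_only=$(printf 'a b c\n' | sed -e 's/ /_/')

# With g, every match on the line is replaced.
every=$(printf 'a b c\n' | sed -e 's/ /_/g')

printf '%s\n' "$first_only"   # a_b c
printf '%s\n' "$every"        # a_b_c
```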

+4
Dec 05 '09 at 18:42
  • Option 1

     echo $(cat testfile) 
  • Option 2

     tr ' ' '\n' < testfile 
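
Note that tr ' ' '\n' converts spaces only and turns each one into its own newline, so tabs survive and double spaces leave empty lines. A variant that also handles tabs and squeezes runs might look like the sketch below; the sample input is made up, and the second set is written as two newlines so that both sets have the same length.

```shell
# Map space and tab to newline, then squeeze repeated newlines (-s).
words=$(printf 'one  two\tthree\n' | tr -s ' \t' '\n\n')

printf '%s\n' "$words"
```

Because -s squeezes every run of newlines in the output, blank lines already present in the input are dropped as well.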
+4
Jan 31 '12 at 6:59

You can use the POSIX class [[:blank:]] to match a horizontal whitespace character.

 sed 's/[[:blank:]]\+/\n/g' file 

or you can use [[:space:]] instead of [[:blank:]] .

Example:

 $ echo 'this is a sentence' | sed 's/[[:blank:]]\+/\n/g'
 this
 is
 a
 sentence
+3
Mar 27 '15 at 14:13

Using gawk :

 gawk '{$1=$1}1' OFS="\n" file 
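
For anyone wondering why this works: assigning to any field, even $1=$1, makes awk rebuild the record by joining the fields with OFS, and the trailing 1 is an always-true pattern whose default action is to print. The sketch below tries it on a made-up line; it calls plain awk rather than gawk on the assumption that any POSIX awk behaves the same way here.

```shell
# Leading and trailing blanks vanish because awk's default field
# splitting ignores them; runs of spaces and tabs count as one separator.
words=$(printf '  one two\tthree  \n' | awk '{$1=$1}1' OFS="\n")

printf '%s\n' "$words"
```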
0
Dec 06 '09 at 5:03

You can also do this with xargs :

 cat old | xargs -n1 > new 

or

 xargs -n1 < old > new 
0
Apr 30 '17 at 13:36