Why does sed not replace overlapping patterns?

I have a database load file with fields separated by the <TAB> character. I run this file through sed to replace every occurrence of <TAB><TAB> with <TAB>\N<TAB>, so that when the file is loaded into MySQL, \N is interpreted as NULL.

The command sed 's/\t\t/\t\\N\t/g;' almost works, except that it replaces only the first instance of overlapping pairs; for example, "...<TAB><TAB><TAB>..." becomes "...<TAB>\N<TAB><TAB>...".

If I use 's/\t\t/\t\\N\t/g; s/\t\t/\t\\N\t/g;' it replaces more instances.

I suspect that, despite the /g modifier, this happens because the end of one match is the beginning of the next.
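The behaviour can be reproduced with a few lines of shell. This is only a sketch: it assumes GNU sed (which understands \t in patterns), the sample data is made up, and the show helper is a hypothetical name used just to make tabs visible.

```shell
# Sample data (hypothetical): three consecutive tabs between "a" and "b".
input=$(printf 'a\t\t\tb')

# Helper for display only: print a string with each tab shown as <TAB>.
show() { printf '%s\n' "$1" | sed 's/\t/<TAB>/g'; }

# One global substitution: the second pair shares a tab with the first
# match, so it is left alone.
one_pass=$(printf '%s\n' "$input" | sed 's/\t\t/\t\\N\t/g')

# Running the same substitution twice catches the pair that was skipped.
two_pass=$(printf '%s\n' "$input" | sed 's/\t\t/\t\\N\t/g; s/\t\t/\t\\N\t/g')

show "$one_pass"   # a<TAB>\N<TAB><TAB>b
show "$two_pass"   # a<TAB>\N<TAB>\N<TAB>b
```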

Can someone explain what is happening and suggest a sed command that will work, or do I need a loop?

I know I could probably switch to awk, perl, or python, but I want to understand what happens in sed.

+6
5 answers

I know you want sed, but sed does not like this at all; it seems that it specifically (see here) will not do what you want. However, perl will do it (AFAIK):

 perl -pe 'while (s#\t\t#\t\\N\t#) {}' <filename>
+2

As a workaround, replace every tab with tab + \N; then remove every \N that is not immediately followed by a tab.

 sed -e 's/\t/\t\\N/g' -e 's/\\N\([^\t]\)/\1/g' 

... if your sed uses a backslash before grouping parentheses (there are sed dialects that don't want the backslash; try without it if this doesn't work for you).
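A quick way to check the two-step workaround on sample input is sketched below. It assumes GNU sed (for \t, including inside the bracket expression), and the sample row is made up.

```shell
# Sample row (hypothetical): two empty fields between "a" and "b".
row=$(printf 'a\t\t\tb\tc')

# Step 1 appends \N after every tab; step 2 drops each \N that is
# followed by a non-tab character, putting that character back via \1.
fixed=$(printf '%s\n' "$row" |
  sed -e 's/\t/\t\\N/g' -e 's/\\N\([^\t]\)/\1/g')

# Show tabs as <TAB> so the result is readable.
printf '%s\n' "$fixed" | sed 's/\t/<TAB>/g'   # a<TAB>\N<TAB>\N<TAB>b<TAB>c
```

Note that a line ending in a tab keeps the \N after it, because there is no following character for the second expression to match.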

+2

That's right: even with /g, sed will not rematch text it has already replaced. It reads <TAB><TAB>, outputs <TAB>\N<TAB>, and then carries on with the rest of the input. See http://www.grymoire.com/Unix/Sed.html#uh-7

In a regex language that supports lookaheads, you can get around this with lookahead.

+1

Well, sed simply works as designed. The input line is scanned once, not several times. It may help to consider the consequences if sed rescanned the input line by default to handle overlapping patterns: even simple substitutions would behave quite differently, and some would say counterintuitively. For example:

  • s/^/ / (insert a space at the beginning of a line) would never terminate
  • s/$/foo/ (append foo to each line) likewise
  • s/[A-Z][A-Z]*/CENSORED/ (replace uppercase words with CENSORED) likewise

There are probably many other situations. Of course, all this could be fixed, say, with a substitution modifier, but at the time sed was designed, the current behavior was chosen.
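The single-scan behaviour can be observed directly. In the sketch below (sample strings made up, LC_ALL=C set only so that [A-Z] means plain ASCII uppercase), each command finishes at once because the inserted text is never scanned again, even with /g.

```shell
# The replacement text is not rescanned, so none of these loop.
indented=$(printf 'x\n' | sed 's/^/ /g')
suffixed=$(printf 'x\n' | sed 's/$/foo/g')
censored=$(printf 'HELLO big WORLD\n' | LC_ALL=C sed 's/[A-Z][A-Z]*/CENSORED/g')

printf '[%s]\n' "$indented" "$suffixed" "$censored"
# [ x]
# [xfoo]
# [CENSORED big CENSORED]
```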

+1

Not unlike the perl solution, this works for me using pure sed:

 sed ':repeat; /\t\t/{ s|\t\t|\t\n\t|g; b repeat }' 

Description

  • :repeat is a label used by branching commands, similar to labels in batch files
  • /\t\t/ is an address that matches two consecutive tabs. If the pattern matches, the command following the second / is executed.
  • {} - here the command following the address is a group, so every command in the group is executed if the pattern matches.
  • s|\t\t|\t\n\t|g; - a standard substitution that replaces two tabs with tab, newline, tab (write \\N in place of \n to get the \N marker from the question). I still use the global flag because, with say 15 tabs, you only need to loop twice rather than 14 times.
  • b repeat means always go to (branch to) the label repeat

And so it goes: keep repeating (goto repeat) as long as there is a match for the two-tab pattern.

While it can be argued that you could just run two identical global substitutions and call it good, the same method also works in more complex scenarios.

As @thorn-blake notes, sed just doesn't support advanced features like lookahead, so you need a loop like this.
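For the \N marker from the question, the same loop might look like the sketch below. It assumes GNU sed (for \t), uses a made-up sample row, and puts each command in its own -e expression purely for readability.

```shell
# Sample row (hypothetical): four consecutive tabs, i.e. three empty fields.
row=$(printf 'a\t\t\t\tb')

# Same loop structure as above, writing \\N in the replacement so that
# sed emits a literal backslash followed by N.
fixed=$(printf '%s\n' "$row" |
  sed -e ':repeat' -e '/\t\t/{' -e 's|\t\t|\t\\N\t|g' -e 'b repeat' -e '}')

# Show tabs as <TAB> so the result is readable.
printf '%s\n' "$fixed" | sed 's/\t/<TAB>/g'   # a<TAB>\N<TAB>\N<TAB>\N<TAB>b
```

With four tabs the global substitution runs twice: the first pass handles the two disjoint pairs, the second handles the pair left between them.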

Short version

This can be shortened to:

 sed ':r;/\t\t/{s|\t\t|\t\n\t|g; br}' 

macOS

And the Mac version (still compatible with Linux/Windows):

 sed $':r\n/\t\t/{ s|\t\t|\t\\\n\t|g; br\n}' 
  • In BSD sed, tabs must be literal, which is why the script is wrapped in $'...' so that the shell expands \t before sed sees it
  • A newline in the replacement must be both literal and escaped, hence a single backslash (written \\ so that $'...' turns it into one literal backslash) followed by \n, which becomes an actual newline
  • Label (:r) and branch (br) names must end with a newline. In BSD sed, semicolons and spaces are taken as part of the label/branch name, which makes the resulting errors very confusing.
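An alternative sketch that avoids both $'...' and \t is shown below. It is not the command above: it emits the \N marker from the question rather than a newline, keeps a literal tab in a shell variable (tab is a hypothetical name), and loops with t instead of an address plus b. The sample row is made up.

```shell
# A literal tab in a shell variable avoids relying on sed understanding \t.
tab=$(printf '\t')

# Sample row (hypothetical): three consecutive tabs.
row=$(printf 'a\t\t\tb')

# ':r' and 't r' each get their own -e, so the label name ends with the
# expression. 't' branches only if the s command changed something.
# Inside double quotes, \\\\N reaches sed as \\N, i.e. a literal \N.
fixed=$(printf '%s\n' "$row" |
  sed -e ':r' -e "s/$tab$tab/$tab\\\\N$tab/g" -e 't r')

# Show tabs as | so the result is readable.
printf '%s\n' "$fixed" | tr '\t' '|'   # a|\N|\N|b
```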
+1

Source: https://habr.com/ru/post/897341/

