Bash: remove headers from HTTP response

If I have text containing HTTP headers and body, for example:

 HTTP/1.1 200 OK
 Cache-Control: public, max-age=38
 Content-Type: text/html; charset=utf-8
 Expires: Fri, 22 Nov 2013 06:15:01 GMT
 Last-Modified: Fri, 22 Nov 2013 06:14:01 GMT
 Vary: *
 X-Frame-Options: SAMEORIGIN
 Date: Fri, 22 Nov 2013 06:14:22 GMT

 <!DOCTYPE html>
 <html>
 <head>
 <title>My website</title>
 </head>
 <body>
 Hello world!
 </body>
 </html>

and this text comes from the output of a command, how can I remove the headers to leave only the body?

(The headers use \r\n as line breaks, and \r\n\r\n marks the end of the headers and the beginning of the body.)

Here is what I tried ( ... stands for any command, for example cat or curl, that writes some HTTP headers and a body to stdout):

Sed

My first idea was to do a substitution with sed to remove everything up to the first occurrence of \r\n\r\n :

 ... | sed 's|^.*?\r\n\r\n||' 

But this does not work, mainly because sed processes its input one line at a time, so the pattern never sees a \r\n\r\n that spans two lines. (Also, sed does not support the non-greedy operator ? .)

Grep

I also thought about using grep with a positive lookbehind for \r\n\r\n :

 ... | grep -oP '(?<=\r\n\r\n).*' 

But this does not work either (mainly because grep only works on separate lines).

pcregrep has a multi-line mode ( -M ), but pcregrep is often not available (it is not installed by default in Ubuntu 12.04, Mac OS X 10.7, etc.), and I need a solution that does not require any non-standard tools.

Perl

Then I thought about doing a substitution with perl, using the /s modifier so that . also matches line breaks:

 ... | perl -pe 's/^.*?\r\n\r\n//s' 

I think this is closer to a working solution. However, I think Perl's input record separator ( $/ ) defaults to \n and needs to be changed to \r\n so that . can match \r\n . The -0 option can set $/ to a single character, but not to several characters. I tried this, but I do not think it is correct:

 ... | perl -pe '$/ = "\r\n"; s/^.*?\r\n\r\n//s' 

Also, I think ^ matches “start of line”, but should match “start of file”.

Offset and substring

I had the idea of getting the offset of \r\n\r\n using:

 BodyOffset=$(expr index "$MyHttpText" "\r\n\r\n") 

and then extracting the body as a substring using:

 HttpBody=${MyHttpText:BodyOffset} 

Unfortunately, the Mac OS X version of expr does not support index . In addition, if possible, I would prefer a solution that does not require creating variables.
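
For illustration, the offset can also be computed with bash parameter expansion alone, which avoids expr entirely. This is only a sketch: the variable names response, sep, headers, body_offset and body are made up for the example, and $'...' quoting is needed so that \r\n really becomes CR LF (inside ordinary double quotes it stays a literal backslash sequence). Note also that expr index, where it exists, looks for the first character out of a set rather than for a substring.

```shell
#!/bin/bash
# Sample response held in a variable (hypothetical data for the sketch).
response=$'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\nHello world!'
sep=$'\r\n\r\n'

# Strip the longest suffix that starts at the first separator,
# leaving only the header block.
headers=${response%%"$sep"*}

# The body starts right after the headers plus the separator itself.
body_offset=$(( ${#headers} + ${#sep} ))
body=${response:body_offset}

printf '%s\n' "$body"
```

If the separator is missing, headers ends up equal to the whole response and body is empty, so a real script would probably want to check for that case.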

Parameter Substitution

Another idea I had was to use parameter substitution, where # means "remove from $MyHttpText the shortest part matching *\r\n\r\n at the front of $MyHttpText ":

 HttpBody=${MyHttpText#*\r\n\r\n} 

But I'm not sure how to use this in a sequence of commands, and I would prefer a solution that does not require variables.
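
One way this idea might be used inside a pipeline is to wrap it in a function that reads stdin. The sketch below assumes bash; strip_headers_bash is a hypothetical name, and the separator is written as $'\r\n\r\n' because an unquoted \r in a pattern just means the letter r. The variables are still there, but they are local to the function.

```shell
#!/bin/bash
# Hypothetical helper: read a whole HTTP response from stdin and
# print only what follows the first CR LF CR LF.
strip_headers_bash() {
  local response sep=$'\r\n\r\n'
  # Command substitution drops trailing newlines, so append a sentinel
  # character and remove it again afterwards.
  response=$(cat; printf x)
  response=${response%x}
  # Remove the shortest prefix that ends with the separator.
  printf '%s' "${response#*"$sep"}"
}

printf 'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\nHello\r\nworld\r\n' | strip_headers_bash
```

Bash variables cannot hold NUL bytes, so this is only suitable for text bodies. If the separator is not found, the input is printed unchanged.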

5 answers

You can do this:

 sed '1,/^$/d' data.txt 

This command deletes everything from line 1 through the first empty line ( ^$ ). This works if the newline is \n . If the newline is \r\n , you can use dos2unix and unix2dos to convert back and forth, or you can add the \r character to the regex:

 sed '1,/^\r$/d' data.txt 

However, that last command only works if the newline is \r\n . To make it work with both types of newline, you can use:

 sed '1,/^\r\{0,1\}$/d' data.txt 

Here we look for an empty line containing 0 or 1 \r characters.
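
To try this in a pipeline, a test response can be generated with printf. The sketch below is a variation rather than the exact command above: it passes a literal carriage return into the pattern instead of relying on sed understanding \r, which is an assumption made for sed implementations that may not support that escape. The function name strip_headers is hypothetical.

```shell
# Hypothetical helper: delete everything from line 1 through the first
# line that is empty or contains only a carriage return.
strip_headers() {
  cr=$(printf '\r')                 # a literal CR character
  sed "1,/^${cr}\{0,1\}\$/d"
}

# Sample response with CR LF line endings.
printf 'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>\r\nHello world!\r\n</html>\r\n' | strip_headers
```

The body keeps its \r\n line endings; pipe the result through tr -d '\r' if plain \n is wanted.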


Your Perl one-liner cannot delete the headers because it reads only one line of input at a time. You need to unset the input record separator so that the whole input is read as a single string.

 perl -0777 ... 

It is also interesting to do this in bash (builtins only):

 #!/bin/bash
 # Print only the lines that follow the first line starting with \r.
 FLAG=
 # IFS= and -r keep each line intact; the || test also handles a
 # final line that has no trailing newline.
 while IFS= read -r LINE || [ -n "$LINE" ]
 do
   if [ -n "$FLAG" ]                 # flag set: we are in the body
   then
     printf '%s\n' "$LINE"           # copy the input line to the output
   elif [ "${LINE:0:1}" = $'\r' ]    # line starts with \r: end of headers
   then
     FLAG=true                       # raise the flag
   fi
 done

 ... | perl -ne 'print if $after_header; $after_header = 1 if /^\r$/' 

By default, curl does not output the headers unless you specify the -i, -I (capital i) or -D (dump headers) option. So make sure that none of them is present in your curl call!


Source: https://habr.com/ru/post/958733/