If I have text containing HTTP headers and a body, for example:
HTTP/1.1 200 OK
Cache-Control: public, max-age=38
Content-Type: text/html; charset=utf-8
Expires: Fri, 22 Nov 2013 06:15:01 GMT
Last-Modified: Fri, 22 Nov 2013 06:14:01 GMT
Vary: *
X-Frame-Options: SAMEORIGIN
Date: Fri, 22 Nov 2013 06:14:22 GMT

<!DOCTYPE html>
<html>
<head>
<title>My website</title>
</head>
<body>
Hello world!
</body>
</html>
and this text is piped in from a command, how can I remove the headers so that only the body remains?
(The headers use \r\n as line breaks, and \r\n\r\n marks the end of the headers and the beginning of the body.)
Here is what I have tried (... stands for any command, such as cat or curl, that writes some HTTP headers and a body to stdout):
Sed
My first idea was to use a sed substitution to remove everything up to and including the first occurrence of \r\n\r\n:
... | sed 's|^.*?\r\n\r\n||'
But this does not work, mainly because sed processes its input one line at a time, so a pattern that spans several line breaks can never match. (Also, does sed not support the non-greedy ? operator?)
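Since sed sees one line at a time, a line-oriented formulation may fit it better: delete every line from the first one through the first line that consists only of a carriage return (the "blank" line that ends the headers). The following is a minimal sketch rather than a definitive answer. The function name strip_headers_sed is made up for illustration, and the carriage return is produced with printf on the assumption that \r inside a sed pattern is not portable across sed implementations.

```shell
# Hypothetical helper: drop everything up to and including the
# CRLF-only line that separates the HTTP headers from the body.
strip_headers_sed() {
  cr=$(printf '\r')   # a literal carriage return character
  # Delete from line 1 through the first line that is exactly "\r".
  sed "1,/^${cr}\$/d"
}

# Usage: ... | strip_headers_sed
printf 'HTTP/1.1 200 OK\r\nVary: *\r\n\r\n<html>\nHello world!\n</html>\n' |
  strip_headers_sed
```

If the body itself uses \r\n line endings, those carriage returns are left untouched.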
Grep
I also thought about using grep with a positive lookbehind for \r\n\r\n :
... | grep -oP '(?<=\r\n\r\n).*'
But this does not work either (mainly because grep also matches one line at a time).
pcregrep has a multi-line mode (-M), but pcregrep is often not available (it is not installed by default on Ubuntu 12.04 or Mac OS X 10.7, for example), and I need a solution that does not require any non-standard tools.
Perl
Then I thought about doing the substitution in perl, using the /s modifier so that . also matches line breaks:
... | perl -pe 's/^.*?\r\n\r\n//s'
I think this is closer to a working solution. However, I believe Perl's input record separator ($/) defaults to \n and would need to be changed to \r\n so that the pattern can match across \r\n. The -0 option can set $/ to a single character, but not to a multi-character string. I tried the following, but I do not think it is correct:
... | perl -pe '$/ = "\r\n"; s/^.*?\r\n\r\n//s'
Also, I think ^ matches the start of a line here, whereas I need it to match the start of the whole input.
Offset and substring
I had the idea of finding the offset of \r\n\r\n using:
BodyOffset=$(expr index "$MyHttpText" "\r\n\r\n")
and then extracting the body as a substring using:
HttpBody=${MyHttpText:BodyOffset}
Unfortunately, the version of expr on Mac OS X does not support index. In addition, if possible, I would prefer a solution that does not require creating variables.
Parameter Substitution
Another idea I had was to use parameter substitution, where # means "remove from $MyHttpText the shortest match of the pattern *\r\n\r\n at the front of $MyHttpText":
HttpBody=${MyHttpText#*\r\n\r\n}
But I'm not sure how to use this in a pipeline, and I would prefer a solution that does not require variables.
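For completeness, the parameter-substitution idea can be wrapped in a function so that it reads from a pipe, although it still uses variables internally. This is only a sketch: the name strip_headers_expansion is invented here, and it assumes the input contains no NUL bytes. Note that a backslash inside the ${...#...} pattern merely quotes the next character, so \r\n\r\n there matches the literal text rnrn; the separator is therefore built with printf first.

```shell
# Hypothetical helper: strip HTTP headers using shell parameter expansion.
strip_headers_expansion() {
  # Build a real CR LF CR LF. The trailing "x" keeps command substitution
  # from stripping the final newline; it is removed right afterwards.
  sep=$(printf '\r\n\r\nx'); sep=${sep%x}
  # Read all of stdin, guarding trailing newlines with "x" in the same way.
  text=$(cat; printf x); text=${text%x}
  # Remove the shortest prefix that ends with the separator.
  body=${text#*"$sep"}
  printf '%s' "$body"
}

# Usage: ... | strip_headers_expansion
```

If the separator does not occur in the input, the text is passed through unchanged.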