Problems parsing UTF8 characters in the request body?

When implementing HTTP services in node.js, there are many code examples like the one below for collecting the entire request body (data uploaded by the client, for example a POST with JSON data):

var http = require('http');

var server = http.createServer(function(req, res) {
  var data = '';
  req.setEncoding('utf8');
  req.on('data', function(chunk) {
    data += chunk;
  });
  req.on('end', function() {
    // parse data
  });
});

Using req.setEncoding('utf8') automatically decodes the incoming bytes into a string, assuming the input is UTF8-encoded. But I worry this might break: what if a 'data' chunk ends in the middle of a multibyte UTF8 character? We can simulate this:

 > new Buffer("café") <Buffer 63 61 66 c3 a9> > new Buffer("café").slice(0,4) <Buffer 63 61 66 c3> > new Buffer("café").slice(0,4).toString('utf8') 'caf?' 

So we get a bogus replacement character instead of the last character being decoded correctly once the remaining bytes arrive.

Therefore, unless the request object takes care of this and makes sure that only completely decoded characters are pushed into the chunks, this ubiquitous code sample is broken.

An alternative would be to accumulate the raw buffers instead, which raises the problem of limiting the buffer size:

var http = require('http');
var MAX_REQUEST_BODY_SIZE = 16 * 1024 * 1024;

var server = http.createServer(function(req, res) {
  // A better way to do this could be to start with a small buffer
  // and grow it geometrically until the limit is reached.
  var requestBody = new Buffer(MAX_REQUEST_BODY_SIZE);
  var requestBodyLength = 0;

  req.on('data', function(chunk) {
    if (requestBodyLength + chunk.length >= MAX_REQUEST_BODY_SIZE) {
      res.statusCode = 413; // Request Entity Too Large
      return;
    }
    chunk.copy(requestBody, requestBodyLength, 0, chunk.length);
    requestBodyLength += chunk.length;
  });

  req.on('end', function() {
    if (res.statusCode == 413) {
      // handle 413 error
      return;
    }
    requestBody = requestBody.toString('utf8', 0, requestBodyLength);
    // process requestBody as a string
  });
});
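For comparison, a sketch of the same idea that just collects the chunks and concatenates once at the end, which avoids preallocating the full 16 MB per request (this assumes Buffer.concat is available, node >= 0.8; the 413 handling here is simplified):

var http = require('http');
var MAX_REQUEST_BODY_SIZE = 16 * 1024 * 1024;

var server = http.createServer(function(req, res) {
  var chunks = [];
  var length = 0;

  req.on('data', function(chunk) {
    length += chunk.length;
    if (length > MAX_REQUEST_BODY_SIZE) {
      res.statusCode = 413; // Request Entity Too Large
      res.end();
      req.removeAllListeners('data'); // stop accumulating
      return;
    }
    chunks.push(chunk);
  });

  req.on('end', function() {
    if (res.statusCode === 413) return;
    // Concatenate once at the end and decode the complete body,
    // so no multibyte character can be split across chunks.
    var body = Buffer.concat(chunks, length).toString('utf8');
    // process body as a string
  });
});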

Am I right, or is this already taken care of by the http request class?

+6
3 answers

This is taken care of automatically. Node has a string_decoder module that is loaded when setEncoding is called. The decoder checks the last bytes received and buffers them between 'data' emits if they are not complete characters, so 'data' always gets correct strings. If you don't call setEncoding and don't use string_decoder yourself, then the emitted buffer can indeed have the problem you mention, though.

The docs don't help much ( http://nodejs.org/docs/latest/api/string_decoder.html ), but you can see the module source here: https://github.com/joyent/node/blob/master/lib/string_decoder.js

The implementation of setEncoding and the emit logic also make it clearer.
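To see that buffering behaviour directly, here is a minimal sketch driving string_decoder by hand, using the same 'café' bytes from the question:

var StringDecoder = require('string_decoder').StringDecoder;
var decoder = new StringDecoder('utf8');

var buf = new Buffer('café'); // <Buffer 63 61 66 c3 a9>

// Feed the bytes split in the middle of the multibyte character:
decoder.write(buf.slice(0, 4)); // 'caf' -- the lone c3 byte is held back
decoder.write(buf.slice(4));    // 'é'   -- completed once a9 arrives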

+7

Just add response.setEncoding('utf8'); to the request.on('response') callback. In my case, that was enough.
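A sketch of where that call goes on the client side (host and path here are just placeholders):

var http = require('http');

var req = http.request({ host: 'example.com', path: '/' });

req.on('response', function(response) {
  response.setEncoding('utf8'); // chunks now arrive as complete UTF-8 strings
  var body = '';
  response.on('data', function(chunk) { body += chunk; });
  response.on('end', function() { /* use body */ });
});

req.end();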

+1
// POST: 'tèéïst3 ùél'
// Node returns: 't%C3%A8%C3%A9%C3%AFst3+%C3%B9%C3%A9l'
decodeURI('t%C3%A8%C3%A9%C3%AFst3+%C3%B9%C3%A9l');
// Returns 'tèéïst3+ùél'
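For form-encoded bodies like this one, the querystring module also handles the '+' signs that decodeURI leaves in place; a small sketch:

var querystring = require('querystring');

// '+' becomes a space and the UTF-8 percent escapes are decoded:
querystring.parse('name=t%C3%A8%C3%A9%C3%AFst3+%C3%B9%C3%A9l');
// => { name: 'tèéïst3 ùél' }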
0
