Just to chime in: I ran into this exact problem (I have a very heavy AJAX/JS site) and found something that might be of interest:
crawlme
I have yet to try it, but it looks like it will make the whole process a piece of cake if it works as advertised! It's a piece of Connect/Express middleware that you simply plug in before any of your page routes, and it seems to take care of everything else.
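For reference, dropping that kind of middleware in typically looks something like this (a minimal sketch; I'm assuming crawlme exports a middleware factory the way its README describes, so double-check against the real docs):

var express = require('express');
var crawlme = require('crawlme');

var app = express();

// Mount the crawler middleware before any routes or static handlers,
// so it can intercept crawler requests and serve a rendered snapshot
// while normal visitors fall through to the regular app.
app.use(crawlme());
app.use(express.static(__dirname + '/public'));

app.listen(3000);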
Edit:
Having tried crawlme, I had some success, but the headless browser it uses (zombie.js) was failing on some of my JavaScript content, probably because it works by emulating the DOM and therefore won't be perfect.
Sooo, instead I got hold of a full headless browser, phantomjs, and a set of node bindings for it:
npm install phantomjs node-phantom
Then I wrote my own script, similar to crawlme, but using phantomjs instead of zombie.js. This approach seems to work just fine and renders every one of my AJAX-based pages perfectly. The script I wrote to do this can be found here. Using it is simple:
var googlebot = require("./path-to-file");
and then, before any other routes/middleware in your application (this is using express, but should also work with connect):
app.use(googlebot());
The source is pretty simple, minus a couple of regular expressions, so have a gander :)
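In case you just want the general shape of it, here's a rough sketch of that kind of middleware using node-phantom. It's simplified, and the node-phantom callback signatures here are from memory rather than copied from my actual source, so treat the details as approximate:

var phantom = require('node-phantom');

module.exports = function googlebot() {
  return function(req, res, next) {
    // Google's AJAX crawling scheme: a crawler requests
    // ?_escaped_fragment_=... when it wants a pre-rendered snapshot
    // of a #! page. Normal visitors fall straight through to the app.
    var fragment = req.query._escaped_fragment_;
    if (fragment === undefined) return next();

    // Rebuild the URL the crawler is really after, e.g. /#!/some/page
    var url = 'http://' + req.headers.host + req.path +
              (fragment ? '#!' + fragment : '');

    phantom.create(function(err, ph) {
      if (err) return next(err);
      ph.createPage(function(err, page) {
        if (err) return next(err);
        page.open(url, function(err, status) {
          if (err || status !== 'success') { ph.exit(); return next(err); }
          // Give the client-side JS a moment to render, then pull out the DOM.
          setTimeout(function() {
            page.evaluate(function() {
              return document.documentElement.outerHTML;
            }, function(err, html) {
              ph.exit();
              if (err) return next(err);
              res.send(html);
            });
          }, 1000);
        });
      });
    });
  };
};

A real version would want to reuse or pool the phantom process and cache the rendered snapshots rather than spinning up a fresh browser for every crawler request.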
Result: an AJAX-heavy node.js / connect / express website becomes crawlable by googlebot.