I am trying to scrape the HTML from this NCBI.gov page. I need to include the #see-all URL fragment so that I am guaranteed to get the search page instead of retrieving the HTML from the wrong gene page, https://www.ncbi.nlm.nih.gov/gene/119016.
URL fragments are not transmitted to the server. Instead, client-side JavaScript uses the fragment to (in this case) create completely different HTML, which is what you get when you go to the page in the browser and choose "View" > "Page Source", i.e. the HTML code I want to get. R's readLines() ignores the part of the URL that follows the #.
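To make the fragment behaviour concrete, here is a minimal sketch using the URL class built into Node.js. The helper name `requestTarget` is hypothetical (it is not part of any library); it only splits a URL into the part an HTTP client puts in the request and the fragment that stays on the client side.

```javascript
// Split a URL into what is sent in the HTTP request and what stays in the client.
// `requestTarget` is a hypothetical helper name used only for illustration.
function requestTarget(rawUrl) {
  const u = new URL(rawUrl); // WHATWG URL class, available globally in Node.js
  return {
    host: u.host,
    sentToServer: u.pathname + u.search, // path + query string go in the request
    keptInClient: u.hash,                // fragment is only visible to page scripts
  };
}

const parts = requestTarget('https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');
console.log(parts);
```

For the search URL above, `sentToServer` is `/gene/?term=AGAP8` and `keptInClient` is `#see-all`, so anything that only fetches the server response never sees the effect of the fragment.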
At first I tried using PhantomJS, but it just returned the error described here, ReferenceError: Can't find variable: Map. This seems to be because PhantomJS does not support some feature that NCBI uses, which rules out this route to a solution.
I had more success with Puppeteer, using the following JavaScript run with Node.js:
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');
    // Serialize the page's current DOM to an HTML string
    const HTML = await page.content();
    // Write the HTML to a temporary file
    const ws = fs.createWriteStream('TempInterfaceWithChrome.js');
    ws.write(HTML);
    ws.end();
    // Create an empty flag file to signal that the run has finished
    const ws2 = fs.createWriteStream('finishedFlag');
    ws2.end();
    await browser.close();
})();
However, this returned what seemed to be the pre-rendered HTML. How do I (programmatically) get the final HTML that I see in the browser?