Extract HTML using Puppeteer

Question

I am trying to scrape HTML from this NCBI.gov page. I need to include the #see-all URL fragment so that I am guaranteed to get the search page instead of retrieving the HTML from the wrong gene page, https://www.ncbi.nlm.nih.gov/gene/119016.

URL fragments are not transmitted to the server; instead, client-side JavaScript uses them to (in this case) build completely different HTML. That is what you get when you go to the page in the browser and choose "View" > "Page Source", and it is the HTML I want to retrieve. R's readLines() ignores everything in the URL from the # onward.
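As a small illustration of that point (a sketch only, with illustrative variable names), Node's built-in URL class shows which parts of the address go into the HTTP request and which part stays in the browser:

```javascript
// The URL from the question; the fragment is everything from '#' onward.
const target = new URL('https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');

// Only the path and query string are sent to the server in the request.
const requestTarget = target.pathname + target.search;

// The fragment is kept client-side and is only visible to page scripts.
const fragment = target.hash;

console.log(requestTarget); // /gene/?term=AGAP8
console.log(fragment);      // #see-all
```

This is why any tool that only issues the HTTP request, without running the page's scripts, cannot see the effect of #see-all.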

At first I tried using PhantomJS, but it just returned the error described here, ReferenceError: Can't find variable: Map. It seems this is because PhantomJS does not support some feature that NCBI uses, which rules out that route to a solution.

I had more success with Puppeteer, using the following JavaScript run with Node.js:

const puppeteer = require('puppeteer');
(async() => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(
    'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');
  var HTML = await page.content()
  const fs = require('fs');
  var ws = fs.createWriteStream(
    'TempInterfaceWithChrome.js'
  );
  ws.write(HTML);
  ws.end();
  var ws2 = fs.createWriteStream(
    'finishedFlag'
  );
  ws2.end();
  browser.close();
})();

However, this returned what seemed to be the pre-rendered HTML. How do I (programmatically) get the final HTML that I get in the browser?

javascript node.js web-scraping google-chrome-headless puppeteer

Sir_Zorg Aug 24 '17 at 21:29

4 answers

Carol-Theodor Pelu · Answer 1 · 2017-08-29T14:37:46+0000

You can try changing this:

await page.goto(
  'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');

to this:

// Newer Puppeteer releases spell this option 'networkidle0' or 'networkidle2'.
await page.goto(
  'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all', {waitUntil: 'networkidle'});

You can also create a listenFor() function to listen for a custom event on page load:

function listenFor(type) {
  return page.evaluateOnNewDocument(type => {
    document.addEventListener(type, e => {
      window.onCustomEvent({type, detail: e.detail});
    });
  }, type);
}

await listenFor('custom-event-ready'); // Listen for "custom-event-ready" custom event on page load.

Later edit: you can also wait for a specific selector to appear:

await page.waitForSelector('h3'); // replace h3 with your selector
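Putting the selector wait together with page.content() could look like the sketch below. The helper name getRenderedHtml, the default timeout, and the selector in the usage comment are assumptions for illustration; only goto, waitForSelector and content are Puppeteer methods.

```javascript
/**
 * Navigate to `url`, wait until `selector` exists in the DOM, then return
 * the serialized HTML of the page as it stands after client-side rendering.
 * `page` is expected to be a Puppeteer Page (or an object with the same methods).
 */
async function getRenderedHtml(page, url, selector, timeoutMs = 30000) {
  await page.goto(url);
  // Resolves once an element matching `selector` is attached to the DOM;
  // rejects if that has not happened within `timeoutMs`.
  await page.waitForSelector(selector, { timeout: timeoutMs });
  return page.content();
}

// Usage with Puppeteer (not executed here; '.some-result' is a made-up selector):
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   const html = await getRenderedHtml(page,
//     'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all', '.some-result');
//   await browser.close();
```

The selector should be one that only exists after the page's scripts have finished building the search results, otherwise the wait returns too early.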

Evgeniy Grabelsky · Answer 2 · 2017-08-26T04:39:46+0000

Perhaps try waiting for navigation to finish, and then read the content:

await page.waitForNavigation(5);

let html = await page.content();

Darren Hall · Answer 3 · 2018-06-18T14:29:08+0000

I had success using the following to get HTML content that was generated after the page had loaded.

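const browser = await puppeteer.launch();
try {
  const page = await browser.newPage();
  await page.goto(url);
  // Give client-side scripts a fixed 2 seconds to finish rendering.
  await page.waitFor(2000);
  // Read the inner HTML of the first element matching the selector.
  let html_content = await page.evaluate(el => el.innerHTML, await page.$('.element-class-name'));
  console.log(html_content);
} catch (err) {
  console.log(err);
} finally {
  // Close the browser so the Node process can exit.
  await browser.close();
}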
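A possible variation on the snippet above, assuming the target element eventually appears: wait for the selector instead of sleeping for a fixed time, and read the markup with page.$eval. The helper name getElementHtml and its default timeout are hypothetical; waitForSelector and $eval are Puppeteer Page methods.

```javascript
/**
 * Wait for the first element matching `selector` and return its inner HTML.
 * `page` is expected to be a Puppeteer Page that has already navigated.
 */
async function getElementHtml(page, selector, timeoutMs = 10000) {
  // Replaces the fixed delay: continue as soon as the element exists.
  await page.waitForSelector(selector, { timeout: timeoutMs });
  // The callback runs in the browser context against the matched element.
  return page.$eval(selector, (el) => el.innerHTML);
}

// Usage (not executed here):
//   const html_content = await getElementHtml(page, '.element-class-name');
```

Compared with a fixed delay, this returns earlier on fast loads and fails with a timeout error on slow ones instead of silently reading incomplete markup.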

mflodin · Answer 4 · 2019-05-16T16:34:19+0000

If you want to wait for a custom event, you can do it this way.

const page = await browser.newPage();

/**
  * Attach an event listener to page to capture a custom event on page load/navigation.
  * @param {string} type Event name.
  * @return {!Promise}
  */
function addListener(type) {
  return page.evaluateOnNewDocument(type => {
    // here we are in the browser context
    document.addEventListener(type, e => {
      window.onCustomEvent({ type, detail: e.detail });
    });
  }, type);
}

const evt = await new Promise(async resolve => {
  // Define a window.onCustomEvent function on the page.
  await page.exposeFunction('onCustomEvent', e => {
    // here we are in the node context
    resolve(e); // resolve the outer Promise here so we can await it outside
  });

  await addListener('app-ready'); // setup listener for "app-ready" custom event on page load
  await page.goto('http://example.com');  // N.B! Do not use { waitUntil: 'networkidle0' } as that may cause a race condition
});

console.log(`${evt.type} fired`, evt.detail || '');

This example is adapted from https://github.com/GoogleChrome/puppeteer/blob/master/examples/custom-event.js.
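One caveat with awaiting a custom event is that the promise never settles if the page never fires it. A minimal sketch of a guard, using only standard JavaScript (the helper name withTimeout and the variable eventPromise in the usage comment are hypothetical):

```javascript
/**
 * Reject with an Error if `promise` has not settled within `ms` milliseconds;
 * otherwise settle the same way `promise` does.
 */
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer so Node can exit.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage (not executed here), where eventPromise is the promise that resolves
// inside the exposed onCustomEvent callback:
//   const evt = await withTimeout(eventPromise, 10000, 'app-ready');
```

On timeout the caller gets a rejected promise and can close the browser instead of leaving the script hanging.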
