In this tutorial, we'll take a look at how to scrape a web page with Node.js, and also how to handle dynamically generated content such as elements created on the page by JavaScript.
Originally published here on YouTube
Please give the YouTube video a thumbs up and subscribe to the channel if you found this useful 👍
Introduction To Web Scraping With Node.js
00:00 Introduction
01:06 Setup
03:26 Part 1 - Retrieving web page contents
05:17 Part 2 - Node HTML Parser
09:16 Part 3 - Dynamic content
11:34 --- Using Puppeteer
15:14 --- Screenshots
15:39 --- PDF
Follow Me
Twitter: twitter.com/codebubb
Facebook: facebook.com/juniordevelopercentral
Blog: juniordevelopercentral.com
Thanks!
In this JavaScript tutorial, we're going to learn how to scrape a web page with Node.js.
We'll start by creating a simple static HTML page, then use the Axios library to retrieve the contents of that page and store them in our server-side Node.js script.
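As a minimal sketch of that first step, fetching the page's HTML with Axios might look something like this (the local URL is a hypothetical stand-in for wherever you serve the static page):

```js
const axios = require('axios');

async function fetchPage(url) {
  // axios.get resolves with a response object; the raw HTML is on response.data
  const response = await axios.get(url);
  return response.data;
}

// Hypothetical local URL for the static page created earlier
fetchPage('http://localhost:8080/index.html')
  .then((html) => console.log(html))
  .catch((err) => console.error('Request failed:', err.message));
```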
Once we've retrieved the HTML data, we'll look at how to extract specific information from the page using the node-html-parser library, which essentially lets you use familiar DOM-based functions (think querySelector, querySelectorAll, etc.) to pull elements out of the retrieved page.
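For instance, here's a sketch of extracting elements with node-html-parser; the URL and the h2 selector are assumptions for illustration:

```js
const axios = require('axios');
const { parse } = require('node-html-parser');

async function scrapeHeadings(url) {
  const { data: html } = await axios.get(url);
  // parse() gives us a root node with browser-like query methods
  const root = parse(html);
  // querySelectorAll works much as it does in the browser DOM
  return root.querySelectorAll('h2').map((el) => el.text);
}

scrapeHeadings('http://localhost:8080/index.html')
  .then((headings) => console.log(headings))
  .catch((err) => console.error(err.message));
```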
While you'll see in the tutorial that this works well for static HTML pages, we hit a limit with this approach when the page we're scraping has dynamically generated content; in other words, when JavaScript running on the page creates or updates HTML elements.
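To make that limitation concrete, here's a sketch (with a hypothetical URL and selector) of scraping a page whose list items are inserted by client-side JavaScript. Since that script never runs during a plain HTTP fetch, the parser only ever sees the empty container:

```js
const axios = require('axios');
const { parse } = require('node-html-parser');

axios.get('http://localhost:8080/dynamic.html').then(({ data }) => {
  const root = parse(data);
  // The #items container is in the static HTML, but its <li> children are
  // added by a script that never executes here, so this logs an empty array.
  console.log(root.querySelectorAll('#items li').map((el) => el.text));
});
```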
To that end, we'll take a look in the final part of the tutorial at web scraping with Node.js and Puppeteer. Puppeteer is essentially a headless version of Chrome running within your Node.js code: you can load pages and have them fully rendered, much in the same way a real browser would render a page with JavaScript on it, and then get the updated HTML content back for use within your Node.js script.
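A minimal Puppeteer sketch along those lines (the URL is again hypothetical):

```js
const puppeteer = require('puppeteer');

async function scrapeDynamicPage(url) {
  const browser = await puppeteer.launch();
  // Open a new tab in the headless browser
  const page = await browser.newPage();
  // Wait until network activity settles so client-side JavaScript has run
  await page.goto(url, { waitUntil: 'networkidle0' });
  // page.content() returns the fully rendered HTML
  const html = await page.content();
  await browser.close();
  return html;
}

scrapeDynamicPage('http://localhost:8080/dynamic.html')
  .then((html) => console.log(html))
  .catch((err) => console.error(err.message));
```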
You'll see how Puppeteer lets you launch new tabs, navigate to a specific page, and then scrape it from Node.js. Finally, we'll see how you can use Node.js and Puppeteer to take a screenshot of a page and also generate a PDF.
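And a sketch of the screenshot and PDF steps; the output paths and page format here are illustrative choices, not the tutorial's exact settings:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://localhost:8080/index.html', { waitUntil: 'networkidle0' });

  // Capture the full rendered page as a PNG
  await page.screenshot({ path: 'page.png', fullPage: true });

  // Generate a PDF of the same page
  await page.pdf({ path: 'page.pdf', format: 'A4' });

  await browser.close();
})();
```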