Introduction To Web Scraping With Node.js


In this tutorial, we'll take a look at how to scrape a web page with Node.js and how to handle dynamically generated content, such as content created on the page by JavaScript.

Originally published here on YouTube

Please give me a thumbs up on the YouTube video and subscribe to the channel if you found this useful 🙏

Introduction To Web Scraping With Node.js

00:00 Introduction

01:06 Setup

03:26 Part 1 - Retrieving web page contents

05:17 Part 2 - Node HTML Parser

09:16 Part 3 - Dynamic content

11:34 --- Using Puppeteer

15:14 --- Screenshots

15:39 --- PDF

— Follow Me —

Twitter: twitter.com/codebubb

Facebook: facebook.com/juniordevelopercentral

Blog: juniordevelopercentral.com

— Thanks! —

So in this JavaScript tutorial, we're going to learn how to scrape a web page with Node.js.

We'll start off by creating a simple static HTML page, then use the Axios library to retrieve the contents of that page and store them in our server-side Node.js script.
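As a rough sketch of that first step, the snippet below fetches a page with Axios and logs the raw HTML; the localhost URL is just a placeholder for whatever static page you end up serving.

```js
// Sketch: fetch a page's HTML with Axios (the URL is a placeholder).
const axios = require('axios');

async function fetchPage(url) {
  const response = await axios.get(url);
  return response.data; // the raw HTML string
}

fetchPage('http://localhost:8080/index.html')
  .then(html => console.log(html))
  .catch(err => console.error('Request failed:', err.message));
```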

Once we've retrieved the HTML, we'll look at how to extract specific information from the page using the node-html-parser library, which essentially lets you use familiar DOM-based functions (think querySelector, querySelectorAll, etc.) to pull elements out of the retrieved page.
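A minimal sketch of that parsing step might look like the following; the selectors used here (h1, li) are illustrative only and would need to match the markup of your own test page.

```js
// Sketch: query a fetched HTML string with node-html-parser.
// The selectors below are illustrative, not taken from the tutorial page.
const { parse } = require('node-html-parser');

function extractData(html) {
  const root = parse(html);

  // Familiar DOM-style queries on the parsed tree
  const heading = root.querySelector('h1');
  const listItems = root.querySelectorAll('li');

  return {
    heading: heading ? heading.text : null,
    items: listItems.map(item => item.text.trim()),
  };
}
```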

Whilst you will see in the tutorial that this works well for static HTML pages, we hit a limit with this approach when the web page we are scraping has dynamically generated content. In other words, there is JavaScript running which creates or updates HTML elements on the page.

To this end, the final part of the tutorial looks at web scraping with Node.js and Puppeteer. Puppeteer is essentially a headless version of Chrome running within your Node.js code, into which you can load pages and have them be fully rendered - much in the same way a real browser would render a page that uses JavaScript - and then get the updated HTML content back for use within our Node.js script.
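In sketch form, rendering a page with Puppeteer and reading back the resulting HTML might look like this; the waitUntil option and the placeholder URL are my own assumptions rather than details from the video.

```js
// Sketch: load a page in headless Chrome via Puppeteer and
// return the HTML after its JavaScript has run.
const puppeteer = require('puppeteer');

async function getRenderedHtml(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' }); // wait for JS-driven requests to settle
  const html = await page.content(); // HTML after scripts have updated the DOM
  await browser.close();
  return html;
}
```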

You will see how Puppeteer allows you to launch new tabs, navigate to a specific page, and then scrape it from Node.js. Finally, we'll see how you can use Node.js and Puppeteer to take a screenshot of a page and generate a PDF from it.
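As a closing sketch (the output file names and the URL are placeholders), the screenshot and PDF steps both hang off the same Puppeteer page object:

```js
// Sketch: capture a screenshot and a PDF of a rendered page.
// Output file names and the URL are placeholders.
const puppeteer = require('puppeteer');

async function capturePage(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });

  await page.screenshot({ path: 'page.png', fullPage: true });
  await page.pdf({ path: 'page.pdf', format: 'A4' });

  await browser.close();
}

capturePage('http://localhost:8080/index.html')
  .catch(err => console.error(err));
```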