How to Scrape Web Pages with Node.js and jQuery
Tutorial Details
- Program: Node.js, jQuery
- Difficulty: Intermediate
- Estimated Completion Time: 30 minutes
Node.js is growing rapidly, largely thanks to the developers who create tools that significantly improve productivity with Node. In this article, we will walk through a basic installation of Express, a web development framework, and create a basic project with it.
What We’re Going to Build Today
Node is similar in design to, and influenced by, systems like Ruby’s Event Machine or Python’s Twisted. Node takes the event model a bit further – it presents the event loop as a language construct instead of as a library.
In this tutorial, we will scrape the YouTube home page, collect all the regular-sized thumbnails along with their links and video durations, send those elements to a jQuery Mobile template, and play the videos using the YouTube embed, which does a nice job of detecting device media support (Flash or HTML5 video).
We will also learn how to begin using npm and Express, npm’s module installation process, basic Express routing and the usage of two modules of Node: request and jsdom.
For those of you who aren’t yet familiar with what Node.js is and how to install it, please refer to the node.js home page and the npm GitHub project page.
You should also refer to our “Node.js: Step by Step” series.
Note: This tutorial requires and assumes that you understand what Node.js is and that you already have node.js and npm installed.
Step 1: Setting Up Express
So what exactly is Express? According to its developers, it’s an...
Insanely fast (and small) server-side JavaScript web development framework built on Node and Connect.
Sounds cool, right? Let’s use npm to install express. Open a Terminal window and type the following command:
npm install express -g
By passing -g as a parameter to the install command, we’re telling npm to make a global installation of the module.

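Next, create the project with the express command, run from the directory where you want the project to live:
express nodetube
I’m using /home/node-server/nettuts for this example, but you can use whatever you feel comfortable with.
After creating our Express project, we need to instruct npm to install Express’ dependencies.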
cd nodetube
npm install -d

If it ends with, “ok,” then you’re good to go. You can now run your project:
node app.js
In your browser, go to http://localhost:3000.

Step 2: Installing Needed Modules
JSDOM
A JavaScript implementation of the W3C DOM.
Go back to your Terminal and, after stopping your current server (Ctrl + C), install jsdom:
npm install jsdom
Request
Simplified HTTP request method.
Type the following into the Terminal:
npm install request

Everything should be set up now. It’s time to get into some actual code!
Step 3: Creating a Simple Scraper
app.js
First, let’s include all our dependencies. Open your app.js file and replace the dependency block at the very top with the following code:
/**
* Module dependencies.
*/
var express = require('express')
, jsdom = require('jsdom')
, request = require('request')
, url = require('url')
, app = module.exports = express.createServer();
You will notice that Express has created some code for us. What you see in app.js is the most basic structure for a Node server using Express. In our previous code block, we told Express to include our recently installed modules: jsdom and request. Also, we’re including the URL module, which will help us parse the video URL we will scrape from YouTube later.
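To make the role of the URL module more concrete, here is a minimal sketch of what url.parse returns for a video link. The href value is a made-up example rather than something taken from YouTube's markup; passing true as the second argument asks Node to parse the query string into an object as well.

```javascript
var url = require('url');

// Hypothetical href, shaped like the relative links we expect to scrape.
var href = '/watch?v=abc123XYZ&feature=related';

// The second argument (true) also parses the query string into an object.
var parsed = url.parse(href, true);

console.log(parsed.pathname); // the path without the query string
console.log(parsed.query.v);  // the video id
```

The v value is the piece we care about: it identifies the video, so it is what we will need when linking to an individual video page.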
Scraping Youtube.com
Within app.js, search for the “Routes” section (around line 40) and add the following code (read through the comments to understand what is going on):
app.get('/nodetube', function(req, res){
  //Tell the request that we want to fetch youtube.com, send the results to a callback function
  request({uri: 'http://youtube.com'}, function(err, response, body){
    var self = this;
    self.items = new Array();//I feel like I want to save my results in an array
    //Just a basic error check: stop here if the request failed or did not return 200
    if(err || response.statusCode !== 200){
      console.log('Request error.');
      return res.end('Request error.');
    }
    //Send the body param as the HTML code we will parse in jsdom
    //also tell jsdom to attach jQuery in the scripts and loaded from jQuery.com
    jsdom.env({
      html: body,
      scripts: ['http://code.jquery.com/jquery-1.6.min.js']
    }, function(err, window){
      //Use jQuery just as in a regular HTML page
      var $ = window.jQuery;
      console.log($('title').text());
      res.end($('title').text());
    });
  });
});
In this case, we’re fetching the content from the YouTube home page. Once complete, we’re printing the text contained in the page’s title tag (<title>). Return to the Terminal and run your server again.
node app.js
In your browser, go to: http://localhost:3000/nodetube

You should see, “YouTube – Broadcast Yourself,” which is YouTube’s title.
Now that we have everything set up and running, it is time to get some video URLs. Go to the YouTube homepage and right-click on any thumbnail from the “recommended videos” section. If you have Firebug installed (which is highly recommended), you should see something like the following:

There’s a pattern we can identify here, and it is present in almost all other regular video links:
div.video-entry span.clip
Let’s focus on those elements. Go back to your editor and, in app.js, replace the /nodetube route with the following code:
app.get('/nodetube', function (req, res) {
  //Tell the request that we want to fetch youtube.com, send the results to a callback function
  request({
    uri: 'http://youtube.com'
  }, function (err, response, body) {
    var self = this;
    self.items = new Array(); //I feel like I want to save my results in an array
    //Just a basic error check: stop here if the request failed or did not return 200
    if (err || response.statusCode !== 200) {
      console.log('Request error.');
      return res.end('Request error.');
    }
    //Send the body param as the HTML code we will parse in jsdom
    //also tell jsdom to attach jQuery in the scripts
    jsdom.env({
      html: body,
      scripts: ['http://code.jquery.com/jquery-1.6.min.js']
    }, function (err, window) {
      //Use jQuery just as in any regular HTML page
      var $ = window.jQuery,
        $body = $('body'),
        $videos = $body.find('.video-entry');
      //I know .video-entry elements contain the regular sized thumbnails
      //for each one of the .video-entry elements found
      $videos.each(function (i, item) {
        //I will use regular jQuery selectors
        var $a = $(item).children('a'),
          //first anchor element which is children of our .video-entry item
          $title = $(item).find('.video-title .video-long-title').text(),
          //video title
          $time = $a.find('.video-time').text(),
          //video duration time
          $img = $a.find('span.clip img'); //thumbnail
        //and add all that data to my items array
        self.items[i] = {
          href: $a.attr('href'),
          title: $title.trim(),
          time: $time,
          //there are some things with youtube video thumbnails, those images whose data-thumb attribute
          //is defined use the url in the previously mentioned attribute as src for the thumbnail, otherwise
          //it will use the default served src attribute.
          thumbnail: $img.attr('data-thumb') ? $img.attr('data-thumb') : $img.attr('src'),
          urlObj: url.parse($a.attr('href'), true) //parse our URL and the query string as well
        };
      });
      //let's see what we've got
      console.log(self.items);
      res.end('Done');
    });
  });
});
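The thumbnail line in the route is the least obvious part, so here is the same rule pulled out into a small stand-alone function. This is only an illustrative sketch: pickThumbnail is a made-up helper name, and the image URLs are placeholders, not real YouTube addresses.

```javascript
// Prefer the data-thumb attribute when it is present; otherwise fall back to src.
// Both arguments are the raw attribute values (undefined when the attribute is missing).
function pickThumbnail(dataThumb, src) {
  return dataThumb ? dataThumb : src;
}

// Placeholder values for illustration only.
console.log(pickThumbnail('http://example.com/real.jpg', 'http://example.com/pixel.gif'));
console.log(pickThumbnail(undefined, 'http://example.com/default.jpg'));
```

An image with data-thumb set yields that URL, and an image without it yields its plain src.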
It’s time to restart our server one more time and reload the page in our browser (http://localhost:3000/nodetube). In your Terminal, you should see something like the following:

This looks good, but we need a way to display our results in the browser. For this, I will use the Jade template engine:
Jade is a high performance template engine heavily influenced by Haml, but implemented with JavaScript for Node.
In your editor, open views/layout.jade, which is the basic layout structure used when rendering a page with Express. It is nice but we need to modify it a bit.
views/layout.jade
!!! 5
html(lang='en')
  head
    meta(charset='utf-8')
    meta(name='viewport', content='initial-scale=1, maximum-scale=1')
    title= title
    link(rel='stylesheet', href='http://code.jquery.com/mobile/1.0b3/jquery.mobile-1.0b3.min.css')
    script(src='http://code.jquery.com/jquery-1.6.2.min.js')
    script(src='http://code.jquery.com/mobile/1.0b3/jquery.mobile-1.0b3.min.js')
  body!= body
If you compare the code above with the default code in layout.jade, you will notice that a few things have changed – doctype, the viewport meta tag, the style and script tags served from jquery.com. Let’s create our list view:
views/list.jade
Before we start, please browse through jQuery Mobile’s (JQM from now on) documentation on page layouts and anatomy.
The basic idea is to use a JQM listview, a thumbnail, title and video duration label for each item inside the listview along with a link to a video page for each one of the listed elements.
Note: Be careful with the indentation you use in your Jade documents, as it only accepts spaces or tabs – but not both in the same document.
div(data-role='page')
  header(data-role='header')
    h1= title
  div(data-role='content')
    //just basic check, we will always have items from youtube though
    - if(items.length)
      //create a listview wrapper
      ul(data-role='listview')
        //foreach of the collected elements
        - items.forEach(function(item){
          //create a li
          li
            //and a link using our passed urlObj Object
            a(href='/watch/' + item['urlObj'].query.v, title=item['title'])
              //and a thumbnail
              img(src=item['thumbnail'], alt='Thumbnail')
              //title and time label
              h3= item['title']
              h5= item['time']
        - })
That is all we need to create our listing. Return to app.js and replace the following code:
//let's see what we've got
console.log(self.items);
res.end('Done');
with this:
//We have all we came for, now let's render our view
res.render('list', {
title: 'NodeTube',
items: self.items
});
Restart your server one more time and reload your browser:

Note: Because we’re using jQuery Mobile, I recommend using a WebKit-based browser or an iPhone/Android cellphone (simulator) for better results.
Step 4: Viewing Videos
Let’s create a view for our /watch route. Create views/video.jade and add the following code:
div(data-role='page')
  header(data-role='header')
    h1= title
  div(data-role='content')
    //Our video div
    div#video
      //Iframe from youtube which serves the right media object for the device in use
      iframe(width="100%", height=215, src="http://www.youtube.com/embed/" + vid, frameborder="0", allowfullscreen)
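The list view links to /watch/ followed by a video id, so app.js also needs a route that renders this view and passes it the vid value. Here is a minimal sketch, based on the handler a reader posted in the comments; the function name watchVideo and the stand-in res object used for the quick check are only for illustration.

```javascript
// Handler for GET /watch/:id. Express fills req.params.id from the URL.
function watchVideo(req, res) {
  res.render('video', {
    title: 'Watch',
    vid: req.params.id // video.jade appends this to the embed URL
  });
}

// In app.js, register it next to the /nodetube route:
// app.get('/watch/:id', watchVideo);

// Quick check without a server, using a stand-in for the Express response object.
var rendered = null;
watchVideo({ params: { id: 'abc123' } }, {
  render: function (view, locals) {
    rendered = { view: view, locals: locals };
  }
});
console.log(rendered.view, rendered.locals.vid);
```

Keeping the handler as a named function makes it easy to exercise on its own, but passing an anonymous function straight to app.get works just as well.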
Again, go back to your Terminal, restart your server, reload your page, and click on any of the listed items. This time a video page will be displayed and you will be able to play the embedded video!

Bonus: Using Forever to Run Your Server
There are ways we can keep our server running in the background, but there’s one that I prefer, called Forever, a node module we can easily install using npm:
npm install forever -g
This will globally install Forever. Let’s start our nodeTube application:
forever start app.js

You can also restart your server, use custom log files, pass environment variables among other useful things:
//run your application in production mode
NODE_ENV=production forever start app.js
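Inside the application, a variable set this way is available through process.env. Here is a small sketch of reading it, assuming we fall back to 'development' when nothing is set; the helper name currentEnv is made up for this example.

```javascript
// Return the environment name from an env object such as process.env.
// Falls back to 'development' when NODE_ENV is not set (an assumption of this sketch).
function currentEnv(env) {
  return env.NODE_ENV || 'development';
}

console.log('Running in ' + currentEnv(process.env) + ' mode');
```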
Final Thoughts
I hope I’ve demonstrated how easy it is to begin using Node.js, Express and npm. In addition, you’ve learned how to install Node modules, add routes to Express, fetch remote pages using the Request module, and plenty of other helpful techniques.
If you have any comments or questions, please let me know in the comments section below!

Awesome tutorial Jaime. I got to bookmark this. Thanks!
Great tutorial, very concise and to the point. Thank you.
There’s a typo in your step 3 header, “Creating a Simple Scrapper.”
Very informative and concise.
I think the “/watch” route is missing in this tutorial. Other than that, it looks to be complete.
“If it ends with, “ok,” then you’re good to go. You can now run your project:” … what if it doesn’t … I tried to install the dependencies and got this error:
Sirwans-MacBook-Pro:grid sirwan$ npm install -d
npm info it worked if it ends with ok
npm info using npm@1.0.96
npm info using node@v0.5.11-pre
npm ERR! Couldn’t read dependencies.
npm ERR! Error: ENOENT, No such file or directory ‘/Users/sirwan/Desktop/Code/grid/package.json’
npm ERR! Report this *entire* log at:
npm ERR!
npm ERR! or email it to:
npm ERR!
npm ERR!
npm ERR! System Darwin 11.2.0
npm ERR! command “node” “/usr/local/bin/npm” “install” “-d”
npm ERR! cwd /Users/sirwan/Desktop/Code/grid
npm ERR! node -v v0.5.11-pre
npm ERR! npm -v 1.0.96
npm ERR! path /Users/sirwan/Desktop/Code/grid/package.json
npm ERR! code ENOENT
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /Users/sirwan/Desktop/Code/grid/npm-debug.log
npm not ok
Can you post what /Users/sirwan/Desktop/Code/grid/npm-debug.log contains?
Great tutorial !
I just quickly skimmed through it and it seems pretty legit. I also recommend Google Refine to scrape sites.
Thanks for this great tutorial.
There’s only one problem: Using Nodejs, V8 and jsdom results in massive memory leaks.
I do not have terminal at my disposal, so how would i proceed?
You cant!
Sorry Xander, this tutorial requires you to have access to terminal.
the /watch is not working.
This needs to be added to the tutorial for the watch route to function properly:
//Pass the video id to the video view
app.get('/watch/:id', function(req, res){
  res.render('video', {
    title: 'Watch',
    vid: req.params.id
  });
});
thanks man ! it works now
Thanks for helping others, I will make sure all files are attached to the source files.
Dear God, I hate jQuery fan boys.
WHY U NO UZE JUST SIZZLE BRO.
You can, it does not matter what you use to manipulate DOM.
This is really cool, but how would you scrape a site that requires a login first? Seems like you would need to POST the username and password data first, but I’m having trouble understanding how you do that. Any suggestion you have would be greatly appreciated!
You can use CURL for more complex tasks:
http://www.hacksparrow.com/using-node-js-to-download-files.html
Check that tutorial, it goes through some CURL concepts in Node.js.
It’s really nice to know about forever :)
I’m having a number of problems by step 3, and I believe the tutorial itself already contains errors.
To start with, you’re not explicitly mentioning the ‘express nodetube’ command that needs to be executed in text. It’s only visible in the first screenshot of the tutorial.
Next, your modified block of dependencies seems to leave out a crucial ‘routes = require(‘./routes’)’ var definition, without which the app would fail.
Even after I fixed both of those things, I get an error hitting http://localhost:3000/nodetube (the root / works just fine):
/Users/jschuur/Code/Node/nodetube/node_modules/jsdom/lib/jsdom/browser/index.js:267
Contextify(window);
^
TypeError: undefined is not a function
Full output at http://pastie.org/2785729, and as you can see there, I already encountered warnings installing some of the libraries…
npm WARN jsdom@0.2.8 package.json: bugs['web'] should probably be bugs['url']
npm WARN request@2.1.1 package.json: bugs['web'] should probably be bugs['url']
npm WARN htmlparser@1.7.3 package.json: bugs['web'] should probably be bugs['url']
…and later launching the app:
jschuur@Paige:nodetube node app.js
The “sys” module is now called “util”. It should have a similar interface.
Express server listening on port 3000 in development mode
I’m running node v0.5.11-pre and npm 1.0.103.
Let me check that. I used Node v0.4.10, so there’s probably something different between that version and v0.5.*. Did you install npm as root?
You’re right about the Express command I’m calling at the beginning of the tutorial, I will check with the editors to fix that.
Thanks for reading, I will post back with some updates.
Thanks for double checking. npm is installed into /usr/local/bin under my normal login and not root.
I sense memory leaks.
Ooh very nice. I have been using CURL but this may make me cheat on php
Why is this approach subject to memory leaks? And if so, is there another alternative that doesn’t?
Am getting an error like this :
TypeError: Object # has no method ‘end’
Any help ?
it says : urlObj is not defined
any help ?
Nice tutorial, I had to make a couple tweaks to the walkthrough code but so far so good, not sure if I’m underestimating the sample, but I would like to see a more “real life” example, finally, can someone explain to me what’s the Memory Leak issue we have with this approach? Thanks!
wow that was GREAT! more node tuts please! :) when next one? hope not in 2 months!
This is a Professional Tutorial. Thanks!
You can also use jquery on the server to work with the browser DOM in real-time (rather than for scraping) Check out nQuery https://github.com/tblobaum/nodeQuery
net tuts should write a tutorial on that next
I start node app.js, but end up with the following error when I try to access the nodetube URL
————-error message————
Segmentation fault
Any idea?
An alternative to JSDom (+ jQuery) is the cheerio library, which is significantly faster than the method described here.
The API is the same as jQuery, so it’s a one line change. I’ve posted a follow-up video to this tutorial here: http://vimeo.com/31950192.
The github repo is here: https://github.com/MatthewMueller/cheerio.
What is the best strategy to scrap more than one page simultaneously?
I followed your example and it worked very well. Thank you very much. :)
But in my case when a user submit the form, I need to scrap more than one page.
My code seems like that
app.get('/lookup', function (req, res) {
  var pagesToScrap = [];
  var callbackCounter = 0;
  var items = [];
  var callback = function(){
    //count this response first, then render once every page has reported back
    callbackCounter++;
    if(callbackCounter == pagesToScrap.length){
      res.render('list', {
        title: "Hello World",
        items: items
      });
    }
  }
  var pageAResolver = function() {
    request.get({
      uri: 'http://a.com',
      //…
      items.push(jsonData);
      callback();
    );
  }
  var pageBResolver = function() {
    request.get({
      uri: 'http://b.com',
      //…
      items.push(jsonData);
      callback();
    );
  }
  var pageCResolver = function() {
    request.get({
      uri: 'http://c.com',
      //…
      items.push(jsonData);
      callback();
    );
  }
  pagesToScrap[0] = {url: "http://a.com", resolver: pageAResolver}
  pagesToScrap[1] = {url: "http://b.com", resolver: pageBResolver}
  pagesToScrap[2] = {url: "http://c.com", resolver: pageCResolver}
  for(var i = 0; i < pagesToScrap.length; i++){
    pagesToScrap[i].resolver();
  }
});
When all requests return I send the response to the browser. Sometimes it can take lot of time. What is the best strategy without caching to show this data faster?
I’m thinking about socket.io, maybe I can emit the data simultaneously? Guys, what do you think about it?
Cheers,
Pablo Cantero
Just a heads up… if anyone is encountering a Bad Argument error trying to get the dependencies for express to install, this is a known issue with the latest version of npm (1.1.0-alpha-2). If you revert npm to version(1.0.106) this tutorial works.
Here are instructions from the Google Groups page to revert
*******************
npm uninstall -g npm
cd /usr/local
git clone git://github.com/isaacs/npm.git
cd npm
sudo git checkout v1.0.106
make install
*******************
This was the only way I could get mine to work.
Also the solution to seeing this bug is as follows:
npm WARN jsdom@0.2.8 package.json: bugs['web'] should probably be bugs['url']
npm WARN request@2.1.1 package.json: bugs['web'] should probably be bugs['url']
npm WARN htmlparser@1.7.3 package.json: bugs['web'] should probably be bugs['url']
So I had to run:
npm install request -g
This gets the request module which I wasn’t sure was there or not. You can check if its there by running:
npm view request version
I then had to update my ~/.profile with this line (this was something I messed up in installing node)
export NODE_PATH=/usr/local/lib/node:/usr/local/lib/node_modules
Then refresh the profile with:
. ~/.profile
After that I could run node app.js and hit localhost:3000/nodetube with success. Hope that helps some other noobs out there like myself.
am thinking that video-entry class has been dropped from youtube due to layout update. Could that be true?
Yeah, the HTML on Youtube has been updated so you’ll need to change the CSS selectors for the tutorial to work. Something like this:
var $a = $('.feed-item-thumb a', item)[0], // first anchor element child of item
$title = $('.feed-item-content h4', item).text(), // video title
$time = $('.feed-item-time', item).text(), // video duration time
$img = $('.feed-item-container .video-thumb .clip .clip-inner img', item)[0]; // thumbnail
I received a bunch of error messages when trying to install JSDOM, Request, and trying to use forever…
I see that this article isn’t old (its from last October) I use Windows 7, 64 bit OS. I know I’ve successfully installed Node in the past.
I havent gone far In the tut but Im wondering how long does it take to install JSDOM or does it depend on the internet connection speeds? My terminal output got stuck at
info: it worked if it ends with ok
info: downloading: http://nodejs.org/dist/v0.6.3/node-v0.6.3.tar.gz
That is the last terminal output. Someone help Thanks
UPDATE: SOLVED PREVIOUS ISSUE
It turns out that my internet speeds had dropped make the process slow now continuing with the Tut
$ npm install -d
npm ERR! Error: ENOENT, open ‘/var/www/myNode/WebScraper/package.json’
—-
Should that be in the package download or are there steps missing above which create it??
Oh dang – I see it now. Might want to put the line “express nodetube” in the “text” part of the page instead of just the screenshot.
Nevermind – you really should take this buggy thing down or fix it – it is impossible to follow and/or full of bugs – I can’t tell which, but I’ve wasted 2 hours now …
node.js:201
throw e; // process.nextTick error, or ‘error’ event on first tick
^
Error: listen EADDRINUSE
at errnoException (net.js:646:11)
at Array.0 (net.js:747:26)
at EventEmitter._tickCallback (node.js:192:40)