How to Scrape Web Pages with Node.js and jQuery

Tutorial Details
  • Program: Node.js, jQuery
  • Difficulty: Intermediate
  • Estimated Completion Time: 30 minutes

Node.js is growing rapidly, in large part thanks to the developers who create tools that significantly improve productivity with Node. In this article, we will go through the basic installation of Express, a development framework, and create a basic project with it.


What We’re Going to Build Today

Node is similar in design to, and influenced by, systems like Ruby’s Event Machine or Python’s Twisted. Node takes the event model a bit further – it presents the event loop as a language construct instead of as a library.

In this tutorial, we will scrape the YouTube home page, collect all the regular-sized thumbnails along with each video’s link and duration, send those elements to a jQuery Mobile template, and play the videos using the YouTube embed, which does a nice job of detecting device media support (Flash or HTML5 video).

We will also learn how to begin using npm and Express, npm’s module installation process, basic Express routing and the usage of two modules of Node: request and jsdom.

For those of you who aren’t yet familiar with what Node.js is and how to install it, please refer to the Node.js home page and the npm GitHub project page.

You should also refer to our “Node.js: Step by Step” series.

Note: This tutorial requires and assumes that you understand what Node.js is and that you already have node.js and npm installed.


Step 1: Setting Up Express

So what exactly is Express? According to its developers, it’s an…

Insanely fast (and small) server-side JavaScript web development framework built on Node and Connect.

Sounds cool, right? Let’s use npm to install express. Open a Terminal window and type the following command:

npm install express -g

By passing -g as a parameter to the install command, we’re telling npm to make a global installation of the module.

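Next, create the project. I’m using /home/node-server/nettuts as my working directory for this example, but you can use whatever you feel comfortable with. From that directory, run express nodetube to generate the project skeleton in a new nodetube folder.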

After creating our Express project, we need to instruct npm to install Express’ dependencies.

cd nodetube
npm install -d

If the output ends with “ok,” then you’re good to go. You can now run your project:

node app.js

In your browser, go to http://localhost:3000.


Step 2: Installing Needed Modules

JSDOM

A JavaScript implementation of the W3C DOM.

Go back to your Terminal and, after stopping your current server (Ctrl + C), install jsdom:

npm install jsdom

Request

Simplified HTTP request method.

Type the following into the Terminal:

npm install request

Everything should be set up now. It’s time to get into some actual code!


Step 3: Creating a Simple Scraper

app.js

First, let’s include all our dependencies. Open your app.js file and update the dependency block at the very top so that it reads:

/**
 * Module dependencies.
 */

var express = require('express')
, jsdom = require('jsdom')
, request = require('request')
, url = require('url')
, app = module.exports = express.createServer();

You will notice that Express has created some code for us. What you see in app.js is the most basic structure for a Node server using Express. In our previous code block, we told Express to include our recently installed modules: jsdom and request. Also, we’re including the URL module, which will help us parse the video URL we will scrape from YouTube later.
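
To make the role of the URL module concrete, here is a minimal sketch of pulling a video id out of a link. The helper name videoIdFromHref and the sample href are made up for illustration; only url.parse comes from the code in this tutorial.

```javascript
var url = require('url');

// Hypothetical helper: returns the value of the "v" query parameter, or null.
function videoIdFromHref(href) {
    // Passing true as the second argument also parses the query string into an object
    var parsed = url.parse(href, true);
    return parsed.query.v || null;
}

console.log(videoIdFromHref('/watch?v=abc123&feature=example'));
```

The second argument is what lets us read the id as a property of query instead of splitting the string by hand.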

Scraping Youtube.com

Within app.js, search for the “Routes” section (around line 40) and add the following code (read through the comments to understand what is going on):

app.get('/nodetube', function (req, res) {
    //Tell the request that we want to fetch youtube.com, send the results to a callback function
    request({uri: 'http://youtube.com'}, function (err, response, body) {
        var self = this;
        self.items = new Array(); //I feel like I want to save my results in an array

        //Just a basic error check: stop here if the request failed or did not return a 200
        if (err || response.statusCode !== 200) {
            console.log('Request error.');
            return res.end('Request error.');
        }

        //Send the body param as the HTML code we will parse in jsdom
        //also tell jsdom to attach jQuery in the scripts, loaded from code.jquery.com
        jsdom.env({
            html: body,
            scripts: ['http://code.jquery.com/jquery-1.6.min.js']
        }, function (err, window) {
            //Use jQuery just as in a regular HTML page
            var $ = window.jQuery;

            console.log($('title').text());
            res.end($('title').text());
        });
    });
});

In this case, we’re fetching the content from the YouTube home page. Once complete, we’re printing the text contained in the page’s title tag (<title>). Return to the Terminal and run your server again.

node app.js

In your browser, go to: http://localhost:3000/nodetube

You should see, “YouTube – Broadcast Yourself,” which is YouTube’s title.
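
The error check in the route is easy to get wrong, because response is undefined when the request itself fails. Here is a minimal sketch of that check as a standalone function. The name requestFailed is hypothetical, and the objects in the usage lines are stand-ins for what request passes to its callback.

```javascript
// Hypothetical helper: true when the request errored or returned a non-200 status.
function requestFailed(err, response) {
    // Check err first: when it is set, response may be undefined
    if (err) {
        return true;
    }
    return !response || response.statusCode !== 200;
}

console.log(requestFailed(null, { statusCode: 200 })); // false
console.log(requestFailed(new Error('timeout'), undefined)); // true
console.log(requestFailed(null, { statusCode: 404 })); // true
```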

Now that we have everything set up and running, it is time to get some video URLs. Go to the YouTube homepage and right-click on any thumbnail from the “recommended videos” section. If you have Firebug installed (which is highly recommended), you should see something like the following:

We can identify a pattern that is present in almost all other regular video links:

div.video-entry
span.clip

Let’s focus on those elements. Go back to your editor, and in app.js, update the /nodetube route so that it reads as follows:

app.get('/nodetube', function (req, res) {
    //Tell the request that we want to fetch youtube.com, send the results to a callback function
    request({
        uri: 'http://youtube.com'
    }, function (err, response, body) {
        var self = this;
        self.items = new Array(); //I feel like I want to save my results in an array

        //Just a basic error check: stop here if the request failed or did not return a 200
        if (err || response.statusCode !== 200) {
            console.log('Request error.');
            return res.end('Request error.');
        }

        //Send the body param as the HTML code we will parse in jsdom
        //also tell jsdom to attach jQuery in the scripts
        jsdom.env({
            html: body,
            scripts: ['http://code.jquery.com/jquery-1.6.min.js']
        }, function (err, window) {
            //Use jQuery just as in any regular HTML page
            var $ = window.jQuery,
                $body = $('body'),
                $videos = $body.find('.video-entry');

            //I know .video-entry elements contain the regular sized thumbnails
            //for each one of the .video-entry elements found
            $videos.each(function (i, item) {

                //I will use regular jQuery selectors
                var $a = $(item).children('a'), //first anchor element which is a child of our .video-entry item
                    $title = $(item).find('.video-title .video-long-title').text(), //video title
                    $time = $a.find('.video-time').text(), //video duration time
                    $img = $a.find('span.clip img'); //thumbnail

                //and add all that data to my items array
                self.items[i] = {
                    href: $a.attr('href'),
                    title: $title.trim(),
                    time: $time,

                    //a note on YouTube thumbnails: images whose data-thumb attribute is defined
                    //use the URL in that attribute as the thumbnail source, otherwise
                    //we fall back to the default src attribute.
                    thumbnail: $img.attr('data-thumb') ? $img.attr('data-thumb') : $img.attr('src'),
                    urlObj: url.parse($a.attr('href'), true) //parse our URL and the query string as well
                };
            });

            //let's see what we've got
            console.log(self.items);
            res.end('Done');
        });
    });
});
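
The thumbnail fallback in the route above can be tried in isolation. This is a sketch with a hypothetical pickThumbnail helper and made-up attribute values; in the route, the two values come from $img.attr('data-thumb') and $img.attr('src').

```javascript
// Hypothetical helper mirroring the ternary used in the route:
// prefer the data-thumb value when it is defined, otherwise use src.
function pickThumbnail(dataThumb, src) {
    return dataThumb ? dataThumb : src;
}

// Made-up values for illustration
console.log(pickThumbnail('//example.com/real.jpg', '//example.com/placeholder.gif'));
console.log(pickThumbnail(undefined, '//example.com/default.jpg'));
```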

It’s time to restart our server one more time and reload the page in our browser (http://localhost:3000/nodetube). In your Terminal, you should see something like the following:

This looks good, but we need a way to display our results in the browser. For this, I will use the Jade template engine:

Jade is a high performance template engine heavily influenced by Haml, but implemented with JavaScript for Node.

In your editor, open views/layout.jade, which is the basic layout structure used when rendering a page with Express. It’s a good starting point, but we need to modify it a bit.

views/layout.jade

!!! 5
html(lang='en')
  head
    meta(charset='utf-8')
    meta(name='viewport', content='initial-scale=1, maximum-scale=1')
    title= title
    link(rel='stylesheet', href='http://code.jquery.com/mobile/1.0b3/jquery.mobile-1.0b3.min.css')
    script(src='http://code.jquery.com/jquery-1.6.2.min.js')
    script(src='http://code.jquery.com/mobile/1.0b3/jquery.mobile-1.0b3.min.js')
  body!= body

If you compare the code above with the default code in layout.jade, you will notice that a few things have changed: the doctype, the viewport meta tag, and the style and script tags served from code.jquery.com. Let’s create our list view:

views/list.jade

Before we start, please browse through jQuery Mobile’s (JQM from now on) documentation on page layouts and anatomy.

The basic idea is to use a JQM listview, a thumbnail, title and video duration label for each item inside the listview along with a link to a video page for each one of the listed elements.

Note: Be careful with the indentation you use in your Jade documents, as it only accepts spaces or tabs – but not both in the same document.

div(data-role='page')
    header(data-role='header')
        h1= title
    div(data-role='content')
        //just a basic check, we will always have items from YouTube though
        - if(items.length)
            //create a listview wrapper
            ul(data-role='listview')
                //foreach of the collected elements
                - items.forEach(function(item){
                    //create a li
                    li
                        //and a link using our passed urlObj Object
                        a(href='/watch/' + item['urlObj'].query.v, title=item['title'])
                            //and a thumbnail
                            img(src=item['thumbnail'], alt='Thumbnail')
                            //title and time label
                            h3= item['title']
                            h5= item['time']
                - })
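
The link in the template assumes that every scraped href carries a v query parameter. As a hedged sketch, a small guard could skip the items that do not. The watchPath helper is hypothetical, and the sample items are invented to match the shape built in app.js.

```javascript
var url = require('url');

// Hypothetical helper: builds the /watch/<id> path used by the list view,
// or returns null when the scraped link has no "v" parameter.
function watchPath(item) {
    var id = item.urlObj && item.urlObj.query ? item.urlObj.query.v : undefined;
    return id ? '/watch/' + id : null;
}

// Invented sample items
var items = [
    { title: 'A video', urlObj: url.parse('/watch?v=abc123', true) },
    { title: 'Not a video', urlObj: url.parse('/channel/example', true) }
];

// Keep only the items that can be linked to a video page
var playable = items.filter(function (item) {
    return watchPath(item) !== null;
});

console.log(playable.length);
```

Passing a filtered array like playable to the view, instead of the raw results, would be one way to apply it.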

That is all we need to create our listing. Return to app.js and replace the following code:

                        //let's see what we've got
                        console.log(self.items);
                        res.end('Done');

with this:

                        //We have all we came for, now let's render our view
                        res.render('list', {
                            title: 'NodeTube',
                            items: self.items
                        });

Restart your server one more time and reload your browser:

Note: Because we’re using jQuery Mobile, I recommend using a WebKit-based browser or an iPhone/Android cellphone (or simulator) for better results.


Step 4: Viewing Videos

Let’s create a view for our /watch route. Create views/video.jade and add the following code:

div(data-role='page')
    header(data-role='header')
        h1= title
    div(data-role='content')
        //Our video div
        div#video
            //Iframe from YouTube which serves the right media object for the device in use
            iframe(width="100%", height=215, src="http://www.youtube.com/embed/" + vid, frameborder="0", allowfullscreen)
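
The video view expects a vid variable, so the app also needs a route that supplies it. The handler below is a sketch based on the route a reader posted in the comments. It is written as a standalone function so that it can be exercised with a stand-in response object; in app.js it would be registered with app.get('/watch/:id', watchHandler).

```javascript
// Sketch of the /watch handler: passes the video id from the URL to the video view.
function watchHandler(req, res) {
    res.render('video', {
        title: 'Watch',
        vid: req.params.id
    });
}

// In app.js (not run here): app.get('/watch/:id', watchHandler);

// Stand-in request and response objects, for illustration only
var rendered = null;
var fakeReq = { params: { id: 'abc123' } };
var fakeRes = {
    render: function (view, locals) {
        rendered = { view: view, locals: locals };
    }
};

watchHandler(fakeReq, fakeRes);
console.log(rendered.view, rendered.locals.vid);
```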

Again, go back to your Terminal, restart your server, reload your page, and click on any of the listed items. This time a video page will be displayed and you will be able to play the embedded video!


Bonus: Using Forever to Run Your Server

There are ways we can keep our server running in the background, but there’s one that I prefer, called Forever, a node module we can easily install using npm:

npm install forever -g

This will globally install Forever. Let’s start our NodeTube application:

forever start app.js

You can also restart your server, use custom log files, and pass environment variables, among other useful things:

# run your application in production mode
NODE_ENV=production forever start app.js
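
Inside the application, a variable set on the command line is available through process.env. Here is a minimal sketch, assuming the app falls back to development mode when NODE_ENV is unset. The resolveMode helper is hypothetical and is not part of Express or Forever.

```javascript
// Hypothetical helper: picks the run mode from an environment object.
function resolveMode(env) {
    return env.NODE_ENV || 'development';
}

console.log('Running in ' + resolveMode(process.env) + ' mode');
```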

Final Thoughts

I hope I’ve demonstrated how easy it is to begin using Node.js, Express and npm. In addition, you’ve learned how to install Node modules, add routes to Express, fetch remote pages using the Request module, and plenty of other helpful techniques.

If you have any comments or questions, please let me know in the comments section below!

Discussion 49 Comments

  1. Ozgur Corulu says:

    Awesome tutorial Jaime. I got to bookmark this. Thanks!

  2. Nico says:

    Great tutorial, very concise and to the point. Thank you.

  3. Sean says:

    There’s a typo in your step 3 header, “Creating a Simple Scrapper.”

  4. coderbay says:

    Very informative and concise.

  5. th says:

    I think the “/watch” route is missing in this tutorial. Other than that, it looks to be complete.

  6. Sirwan says:

    “If it ends with, “ok,” then you’re good to go. You can now run your project:” … what if it doesnt … I tried to install the dependancies and got this error:

    Sirwans-MacBook-Pro:grid sirwan$ npm install -d
    npm info it worked if it ends with ok
    npm info using npm@1.0.96
    npm info using node@v0.5.11-pre
    npm ERR! Couldn’t read dependencies.
    npm ERR! Error: ENOENT, No such file or directory ‘/Users/sirwan/Desktop/Code/grid/package.json’
    npm ERR! Report this *entire* log at:
    npm ERR!
    npm ERR! or email it to:
    npm ERR!
    npm ERR!
    npm ERR! System Darwin 11.2.0
    npm ERR! command “node” “/usr/local/bin/npm” “install” “-d”
    npm ERR! cwd /Users/sirwan/Desktop/Code/grid
    npm ERR! node -v v0.5.11-pre
    npm ERR! npm -v 1.0.96
    npm ERR! path /Users/sirwan/Desktop/Code/grid/package.json
    npm ERR! code ENOENT
    npm ERR!
    npm ERR! Additional logging details can be found in:
    npm ERR! /Users/sirwan/Desktop/Code/grid/npm-debug.log
    npm not ok

  7. vineeth says:

    Great tutorial !

  8. hiceram says:

    I just quickly skimmed through it and it seems pretty legit. I also reccomend google refine to scrape sites.

  9. Muhammed K K says:

    Thanks for this great tutorial.

  10. Javawerks says:

    There’s only one problem: Using Nodejs, V8 and jsdom results in massive memory leaks.

  11. Xander says:

    I do not have terminal at my disposal, so how would i proceed?

  12. Hakkai says:

    the /watch is not working.

  13. Zach says:

    This needs to be added to the tutorial for the watch route to function properly:

    //Pass the video id to the video view
    app.get('/watch/:id', function(req, res){
        res.render('video', {
            title: 'Watch',
            vid: req.params.id
        });
    });

  14. John says:

    Dear God, I hate jQuery fan boys.

    WHY U NO UZE JUST SIZZLE BRO.

  15. Leo Shmuylovich says:

    This is really cool, but how would you scrape a site that requires a login first? Seems like you would need to POST the username and password data first, but I’m having trouble understanding how you do that. Any suggestion you have would be greatly appreciated!

  16. nXqd says:

    It’s really nice to know about forever :)

  17. Joost Schuur says:

    I’m having a number of problems by step 3, and I believe the tutorial itself already contains errors.

    To start with, you’re not explicitly mentioning the ‘express nodetube’ command that needs to be executed in text. It’s only visible in the first screenshot of the tutorial.

    Next, your modified block of dependencies seems to leave out a crucial ‘routes = require(‘./routes’)’ var definition, without which the app would fail.

    Even after I fixed both of those things, I get an error hitting http://localhost:3000/nodetube (the root / works just fine):

    /Users/jschuur/Code/Node/nodetube/node_modules/jsdom/lib/jsdom/browser/index.js:267
    Contextify(window);
    ^
    TypeError: undefined is not a function

    Full output at http://pastie.org/2785729, and as you can see there, I already encountered warnings installing some of the libraries…

    npm WARN jsdom@0.2.8 package.json: bugs['web'] should probably be bugs['url']
    npm WARN request@2.1.1 package.json: bugs['web'] should probably be bugs['url']
    npm WARN htmlparser@1.7.3 package.json: bugs['web'] should probably be bugs['url']

    …and later launching the app:

    jschuur@Paige:nodetube node app.js
    The “sys” module is now called “util”. It should have a similar interface.
    Express server listening on port 3000 in development mode

    I’m running node v0.5.11-pre and npm 1.0.103.

    • Jaime says:
      Author

      Let me check that, I used Node v0.4.10 so probably there’s something different from that version to v0.5.*, did you installed npm as root?

      You’re right about the Express command I’m calling at the beginning of the tutorial, I will check with the editors to fix that.

      Thanks for reading, I will post back with some updates.

  18. Richard says:

    I sense memory leaks.

  19. Joseph says:

    Ooh very nice. I have been using CURL but this may make me cheat on php

  20. Anthony says:

    Why is this approach subject to memory leaks? And if so, is there another alternative that doesn’t?

  21. Rajkumar Jegannathan says:

    Am getting an error like this :

    TypeError: Object # has no method ‘end’

    Any help ?

  22. Rajkumar Jegannathan says:

    it says : urlObj is not defined

    any help ?

  23. Alberto Cole says:

    Nice tutorial, I had to make a couple tweaks to the walkthrough code but so far so good, not sure if I’m underestimating the sample, but I would like to see a more “real life” example, finally, can someone explain to me what’s the Memory Leak issue we have with this approach? Thanks!

  24. erminio ottone says:

    wow that was GREAT! more node tuts please! :) when next one? hope not in 2 months!

  25. Oscar says:

    This is a Professional Tutorial. Thanks!

  26. tom says:

    You can also use jquery on the server to work with the browser DOM in real-time (rather than for scraping) Check out nQuery https://github.com/tblobaum/nodeQuery

    net tuts should write a tutorial on that next

  27. ck says:

    Start node app.js, but end up with following error when try to access notetube url

    ————-error message————
    Segmentation fault

    Any idea?

  28. Matt says:

    An alternative to JSDom (+ jQuery) is the cheerio library, which is significantly faster than the method described here.

    The API is the same as jQuery, so it’s a one line change. I’ve posted a follow-up video to this tutorial here: http://vimeo.com/31950192.

    The github repo is here: https://github.com/MatthewMueller/cheerio.

  29. What is the best strategy to scrap more than one page simultaneously?

    I followed your example and it worked very well. Thank you very much. :)

    But in my case when a user submit the form, I need to scrap more than one page.

    My code seems like that

    app.get(‘/lookup’, function (req, res) {

    var pagesToScrap = [];
    var callbackCounter = 0;
    var items = [];

    var callback = function(){
    if(pagesToScrap == callbackCounter){
    res.render(‘list’, {
    title: “Hello World”,
    items: items
    });
    }
    callbackCounter++;
    }

    var pageAResolver = function() {
    request.get({
    uri: ‘http://a.com’,
    //…
    items.push[jsonData];
    callback();
    );
    }
    var pageBResolver = function() {
    request.get({
    uri: ‘http://b.com’,
    //…
    items.push[jsonData];
    callback();
    );
    }
    var pageCResolver = function() {
    request.get({
    uri: ‘http://c.com’,
    //…
    items.push[jsonData];
    callback();
    );
    }
    pagesToScrap[0] = {url: “http://a.com”, resolver: pageAResolver}
    pagesToScrap[1] = {url: “http://b.com”, resolver: pageBResolver}
    pagesToScrap[2] = {url: “http://c.com”, resolver: pageCResolver}

    for(var i = 0; i < pagesToScrap.length; i++){
    pagesToScrap[i].resolver();
    }
    });

    When all requests return I send the response to the browser. Sometimes it can take lot of time. What is the best strategy without caching to show this data faster?

    I’m thinking about socket.io, maybe I can emit the data simultaneously? Guys, what do you think about it?

    Cheers,
    Pablo Cantero

  30. gmills82 says:

    Just a heads up… if anyone is encountering a Bad Argument error trying to get the dependencies for express to install, this is a known issue with the latest version of npm (1.1.0-alpha-2). If you revert npm to version(1.0.106) this tutorial works.

    Here are instructions from the Google Groups page to revert
    *******************
    npm uninstall -g npm
    cd /usr/local
    git clone git://github.com/isaacs/npm.git
    cd npm
    sudo git checkout v1.0.106
    make install
    *******************

    This was the only way I could get mine to work.

    • gmills82 says:

      Also the solution to seeing this bug is as follows:
      npm WARN jsdom@0.2.8 package.json: bugs['web'] should probably be bugs['url']
      npm WARN request@2.1.1 package.json: bugs['web'] should probably be bugs['url']
      npm WARN htmlparser@1.7.3 package.json: bugs['web'] should probably be bugs['url']

      So I had to run:
      npm request -g

      This gets the request module which I wasn’t sure was there or not. You can check if its there by running:
      npm view request version

      I then had to update my ~/.profile with this line (this was something I messed up in installing node)
      export NODE_PATH=/usr/local/lib/node:/usr/local/lib/node_modules
      Then refresh the profile with:
      . ~/.profile

      After that I could run node app.js and hit localhost:3000/nodetube with success. Hope that helps some other noobs out there like myself.

  32. Geoff says:

    am thinking that video-entry class has been dropped from youtube due to layout update. Could that be true?

    • Gabriel says:

      Yeah, the HTML on Youtube has been updated so you’ll need to change the CSS selectors for the tutorial to work. Something like this:

      var $a = $('.feed-item-thumb a', item)[0], // first anchor element child of item
      $title = $('.feed-item-content h4', item).text(), // video title
      $time = $('.feed-item-time', item).text(), // video duration time
      $img = $('.feed-item-container .video-thumb .clip .clip-inner img', item)[0]; // thumbnail

  33. I received a bunch of error messages when trying to install JSDOM, Request, and trying to use forever…
    I see that this article isn’t old (its from last October) I use Windows 7, 64 bit OS. I know I’ve successfully installed Node in the past.

  34. mojo706 says:

    I havent gone far In the tut but Im wondering how long does it take to install JSDOM or does it depend on the internet connection speeds? My terminal output got stuck at

    info: it worked if it ends with ok
    info: downloading: http://nodejs.org/dist/v0.6.3/node-v0.6.3.tar.gz

    That is the last terminal output. Someone help Thanks

  35. Sam Smitter says:

    $ npm install -d

    npm ERR! Error: ENOENT, open ‘/var/www/myNode/WebScraper/package.json’

    —-
    Should that be in the package download or are there steps missing above which create it??

  36. Sam Smitter says:

    Oh dang – I see it now. Might want to put the line “express nodetube” in the “text” part of the page instead of just the screenshot.

  37. Sam Smitter says:

    Nevermind – you really should take this buggy thing down or fix it – it is impossible to follow and/or full of bugs – I can’t tell which, but I’ve wasted 2 hours now …

    node.js:201
    throw e; // process.nextTick error, or ‘error’ event on first tick
    ^
    Error: listen EADDRINUSE
    at errnoException (net.js:646:11)
    at Array.0 (net.js:747:26)
    at EventEmitter._tickCallback (node.js:192:40)
