Screen Scraping with Node.js

Tutorial Details
  • Difficulty: Intermediate
  • Estimated Completion Time: 45 minutes

You may have used NodeJS as a web server, but did you know that you can also use it for web scraping? In this tutorial, we’ll review how to scrape static web pages – and those pesky ones with dynamic content – with the help of NodeJS and a few helpful NPM modules.



A Bit About Web Scraping

Web scraping has always had a negative connotation in the world of web development – and for good reason. In modern development, APIs are present for most popular services and they should be used to retrieve data rather than scraping. The inherent problem with scraping is that it relies on the HTML structure of the page being scraped. Whenever that HTML changes – no matter how small the change may be – it can completely break your code.

Despite these flaws, it's important to learn a bit about web scraping and some of the tools available to help with this task. When a site does not reveal an API or any syndication feed (RSS/Atom, etc), the only option we're left with to get that content… is scraping.

Note: If you can't get the information you require through an API or a feed, it's a good sign that the owner does not want that information to be accessible. However, there are exceptions.


Why use NodeJS?

Scrapers can be written in any language, really. The reason I enjoy using Node is its asynchronous nature, which means that my code is not blocked at any point in the process. I'm quite familiar with JavaScript, so that's an added bonus. Finally, there are some new modules written for NodeJS that make it easy to scrape websites in a reliable manner (well, as reliable as scraping can get!). Let's get started!


Simple Scraping With YQL

Let's start with the simple use-case: static web pages. These are your standard run-of-the-mill web pages. For these, Yahoo! Query Language (YQL) should do the job very well. For those unfamiliar with YQL, it's a SQL-like syntax that can be used to work with different APIs in a consistent manner.

YQL has some great tables to help developers get HTML off a page. The ones I want to highlight are:

  • html
  • data.html.cssselect
  • htmlstring

Let's go through each of them, and review how to implement them in NodeJS.

html table

The html table is the most basic way of scraping HTML from a URL. A regular query using this table looks like this:

select * from html where url="http://finance.yahoo.com/q?s=yhoo" and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li/a'

This query consists of two parameters: the "url" and the "xpath". The url is self-explanatory. The XPath consists of an XPath string telling YQL what section of the HTML should be returned. Try this query here.

Additional parameters that you can use include browser (boolean), charset (string), and compat (string). I have not had to use these parameters, but refer to the documentation if you have specific needs.

Not comfortable with XPath?

Unfortunately, XPath is not a very popular way of traversing the HTML tree structure. It can be complicated to read and write for beginners.

Let's look at the next table, which does the same thing but lets you use CSS instead.

data.html.cssselect table

The data.html.cssselect table is my preferred way of scraping HTML off a page. It works the same way as the html table but allows you to use CSS selectors instead of XPath. In practice, this table converts the CSS to XPath under the hood and then calls the html table, so it is a little slower. The difference should be negligible for scraping needs.

A regular query using this table looks like:

select * from data.html.cssselect where url="www.yahoo.com" and css="#news a"

As you can see, it is much cleaner. I recommend you try this method first when you're attempting to scrape HTML using YQL. Try this query here.
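
Both query shapes shown so far can be assembled from plain values instead of being typed by hand. The sketch below is only an illustration: the function names htmlQuery, cssQuery and quoted are hypothetical, and because this article does not cover how YQL escapes quotes inside string literals, the helper simply rejects values that contain the surrounding quote character rather than guessing at an escaping rule.

```javascript
// Wrap a value in the given quote character. Escaping rules are not
// covered here, so values containing that character are rejected (assumption).
function quoted(value, quoteChar) {
  if (String(value).indexOf(quoteChar) !== -1) {
    throw new Error('Value must not contain ' + quoteChar + ': ' + value);
  }
  return quoteChar + value + quoteChar;
}

// Query against the html table: the XPath goes in single quotes
// because XPath expressions usually contain double quotes.
function htmlQuery(url, xpath) {
  return 'select * from html where url=' + quoted(url, '"') +
    ' and xpath=' + quoted(xpath, "'");
}

// Query against the data.html.cssselect table.
function cssQuery(url, css) {
  return 'select * from data.html.cssselect where url=' + quoted(url, '"') +
    ' and css=' + quoted(css, '"');
}

console.log(htmlQuery('http://finance.yahoo.com/q?s=yhoo',
  '//div[@id="yfi_headlines"]/div[2]/ul/li/a'));
console.log(cssQuery('www.yahoo.com', '#news a'));
```

The two calls print the same query strings used in the examples above, so the output can be pasted straight into the YQL console.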

htmlstring table

The htmlstring table is useful for cases where you are trying to scrape a large chunk of formatted text from a webpage.

Using this table allows you to retrieve the entire HTML content of that page in a single string, rather than as JSON that is split based on the DOM structure.

For example, a regular JSON response that scrapes an <a> tag looks like this:

"results": {
   "a": {
     "href": "...",
     "target": "_blank",
     "content": "Apple Chief Executive Cook To Climb on a New Stage"
    }
 }

See how the attributes are defined as properties? Instead, the response from the htmlstring table would look like this:

"results": {
  "result": "<a href=\"…\" target=\"_blank\">Apple Chief Executive Cook To Climb on a New Stage</a>"
}

So, why would you use this? Well, from my experience, this comes in great use when you're trying to scrape a large amount of formatted text. For example, consider the following snippet:

<p>Lorem ipsum <strong>dolor sit amet</strong>, consectetur adipiscing elit.</p>
<p>Proin nec diam magna. Sed non lorem a nisi porttitor pharetra et non arcu.</p>

By using the htmlstring table, you are able to get this HTML as a string, and use regex to remove the HTML tags, which leaves you with just the text. This is an easier task than iterating through JSON that has been split into properties and child objects based on the DOM structure of the page.
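
As a rough illustration of the regex approach just described, the sketch below strips tags from the sample snippet. The function name stripTags is hypothetical, and the pattern is deliberately naive: it assumes simple, well-formed markup and will not cope with a ">" inside an attribute value, HTML comments, script blocks, or entities such as &amp;amp;.

```javascript
// Remove anything that looks like an HTML tag, then tidy the whitespace.
function stripTags(html) {
  return String(html)
    .replace(/<[^>]*>/g, '') // drop every <...> sequence
    .replace(/\s+/g, ' ')    // collapse runs of whitespace and newlines
    .trim();
}

var snippet =
  '<p>Lorem ipsum <strong>dolor sit amet</strong>, consectetur adipiscing elit.</p>\n' +
  '<p>Proin nec diam magna. Sed non lorem a nisi porttitor pharetra et non arcu.</p>';

console.log(stripTags(snippet));
```

For anything beyond short, predictable fragments, a real HTML parser is the safer choice.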


Using YQL with NodeJS

Now that we know a little bit about some of the tables available to us in YQL, let's implement a web scraper using YQL and NodeJS. Fortunately, this is really simple, thanks to the node-yql module by Derek Gathright.

We can install the module using npm:

npm install yql

The module is extremely simple, consisting of only one method: the YQL.exec() method. It is defined as the following:

function exec (string query [, function callback] [, object params] [, object httpOptions])

We can use it by requiring it and calling YQL.exec(). For example, let's say we want to scrape the headlines from all the posts on the Nettuts main page:

var YQL = require("yql");

new YQL.exec('select * from data.html.cssselect where url="http://net.tutsplus.com/" and css=".post_title a"', function(response) {

    //response consists of JSON that you can parse

});
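
The comment in the callback above leaves the parsing step open. Here is one possible sketch of it, assuming the "results" object has the same shape as the sample shown earlier (an "a" property holding "href" and "content"). It also assumes that a single match arrives as an object while several matches arrive as an array, so it normalises both cases. The function name collectAnchors is hypothetical, and the exact path to this object inside the response may differ depending on the module version.

```javascript
// Turn a YQL-style "results" object into a flat list of links.
function collectAnchors(results) {
  if (!results || !results.a) {
    return []; // nothing matched the selector
  }
  // Normalise: one match is assumed to be an object, several an array.
  var anchors = Array.isArray(results.a) ? results.a : [results.a];
  return anchors.map(function (a) {
    return { href: a.href, text: a.content };
  });
}

// Sample data mirroring the JSON shown earlier in this tutorial.
var sample = {
  a: {
    href: '...',
    target: '_blank',
    content: 'Apple Chief Executive Cook To Climb on a New Stage'
  }
};

console.log(collectAnchors(sample));
```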

The great thing about YQL is its ability to test your queries and determine what JSON you are getting back in real-time. Go to the console to try this query out, or click here to see the raw JSON.

The params and httpOptions objects are optional. Parameters can contain properties such as env (whether you are using a specific environment for the tables) and format (xml or json). All properties passed into params are URI-encoded and appended to the query string. The httpOptions object is passed into the header of the request. Here, you can specify whether you want to enable SSL, for instance.
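
To picture what "URI-encoded and appended to the query string" means, here is a small sketch of that step. It is not the module's actual source, just an illustration of the idea; the function name toQueryString is hypothetical and the env value is only an example.

```javascript
// Encode each key/value pair and join them into a query string.
function toQueryString(params) {
  return Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
  }).join('&');
}

console.log(toQueryString({
  format: 'json',
  env: 'store://datatables.org/alltableswithkeys' // example value only
}));
```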

The JavaScript file, named yqlServer.js, contains the minimal code required to scrape using YQL. You can run it by issuing the following command in your terminal:

node yqlServer.js

Exceptions and other notable tools

YQL is my preferred choice for scraping content off static web pages, because it's easy to read and easy to use. However, YQL will fail if the web page in question has a robots.txt file that denies a response to it. In this case, you can look at some of the utilities mentioned below, or use PhantomJS, which we’ll cover in the following section.

Node.io is a useful Node utility that is specifically designed for data scraping. You can create jobs that take input, process it and return some output. Node.io is well-watched on GitHub, and has some helpful examples to get you started.

JSDOM is a very popular project that implements the W3C DOM in JavaScript. When supplied HTML, it can construct a DOM that you can interact with. Check out the documentation to see how you can use JSDOM and any JS library (such as jQuery) together to scrape data from web pages.


Scraping Pages With Dynamic Content

So far, we've looked at some tools that can help us scrape web pages with static content. With YQL, it's relatively easy. Unfortunately, we are often presented with pages that have content which is loaded dynamically with JavaScript. In these cases, the page is often empty initially, and then the content is appended afterwards. How can we deal with this issue?

An Example

Let me provide an example of what I mean; I have uploaded a simple HTML file to my own website, which appends some content, via JavaScript, two seconds after the document.ready() function is called. You can check out the page here. Here's what the source looks like:

<!DOCTYPE html>
<html>
    <head>
        <title>Test Page with content appended after page load</title>
    </head>

    <body>
        Content on this page is appended to the DOM after the page is loaded.

        <div id="content">

        </div>

    <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js"></script>
    <script>
        $(document).ready(function() {

            setTimeout(function() {
                $('#content').append("<h2>Article 1</h2><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p><h2>Article 2</h2><p>Ut sed nulla turpis, in faucibus ante. Vivamus ut malesuada est. Curabitur vel enim eget purus pharetra tempor id in tellus.</p><h2>Article 3</h2><p>Curabitur euismod hendrerit quam ut euismod. Ut leo sem, viverra nec gravida nec, tristique nec arcu.</p>");
            }, 2000);

        });
    </script>
    </body>
</html>

Now, let's try scraping the text inside the <div id="content"> using YQL.

var YQL = require("yql");

new YQL.exec('select * from data.html.cssselect where url="http://tilomitra.com/repository/screenscrape/ajax.html" and css="#content"', function(response) {

    //This will return undefined! The scraping was unsuccessful!
    console.log(response.results);

});

You'll notice that YQL returns undefined because, when the page is loaded, the <div id="content"> is empty. The content has not been appended yet. You can try the query out for yourself here.

Let's look at how we can get around this issue!

Enter PhantomJS

My preferred method for scraping information from these sites is to use PhantomJS. PhantomJS describes itself as a "headless WebKit with a JavaScript API." In simple terms, this means that PhantomJS can load web pages and mimic a WebKit-based browser without the GUI. As developers, we can call on specific methods that PhantomJS provides to execute code on the page. Since it behaves like a browser, scripts on the webpage run as they would in a regular browser.

To get data off our page, we are going to use PhantomJS-Node, a great little open-source project that bridges PhantomJS with NodeJS. Under the hood, this module runs PhantomJS as a child process.

Installing PhantomJS

Before you can install the PhantomJS-Node NPM module, you must install PhantomJS. Installing and building PhantomJS can be a little tricky, though.

First, head over to PhantomJS.org and download the appropriate version for your operating system. In my case, it was Mac OS X.

After downloading, unzip it to somewhere such as /Applications/. Next, you want to add it to your PATH:

sudo ln -s /Applications/phantomjs-1.5.0/bin/phantomjs /usr/local/bin/

Replace 1.5.0 with your downloaded version of PhantomJS. Be advised that not all systems will have /usr/local/bin/. Some systems will have /usr/bin/, /bin/, or /usr/X11/bin instead.

For Windows users, check the short tutorial here. You'll know you're all set up when you open your Terminal and write phantomjs, and you don't get any errors.

If you are uncomfortable editing your PATH, make a note of where you unzipped PhantomJS and I'll show another way of setting it up in the next section, although I recommend you edit your PATH.
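
If you want to confirm from Node itself that the binary is reachable, a small check along these lines can help. This is only a sketch: the function name findOnPath is hypothetical, it merely tests whether a file with that name exists in one of the PATH directories, and it does not check execute permissions or Windows extensions such as .exe.

```javascript
var fs = require('fs');
var path = require('path');

// Look for a file called `name` in each directory listed in PATH.
// `pathVar` is optional and defaults to the current process's PATH.
function findOnPath(name, pathVar) {
  var dirs = (pathVar || process.env.PATH || '').split(path.delimiter);
  for (var i = 0; i < dirs.length; i++) {
    if (!dirs[i]) {
      continue; // skip empty entries such as a trailing separator
    }
    var candidate = path.join(dirs[i], name);
    if (fs.existsSync(candidate)) {
      return candidate;
    }
  }
  return null;
}

console.log(findOnPath('phantomjs') || 'phantomjs was not found on PATH');
```

If this prints a full path, a child process spawned as 'phantomjs' should be able to find the binary as well.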

Installing PhantomJS-Node

Setting up PhantomJS-Node is much easier. Provided you have NodeJS installed, you can install via npm:

npm install phantom

If you did not edit your PATH in the previous step when installing PhantomJS, you can go into the phantom/ directory pulled down by npm and edit this line in phantom.js.

ps = child.spawn('phantomjs', args.concat([__dirname + '/shim.js', port]));

Change the path to:

ps = child.spawn('/path/to/phantomjs-1.5.0/bin/phantomjs', args.concat([__dirname + '/shim.js', port]));

Once that is done, you can test it out by running this code:

var phantom = require('phantom');
phantom.create(function(ph) {
  return ph.createPage(function(page) {
    return page.open("http://www.google.com", function(status) {
      console.log("opened google? ", status);
      return page.evaluate((function() {
        return document.title;
      }), function(result) {
        console.log('Page title is ' + result);
        return ph.exit();
      });
    });
  });
});

Running this on the command-line should bring up the following:

opened google?  success
Page title is Google

If you got this, you're all set and ready to go. If not, post a comment and I'll try to help you out!

Using PhantomJS-Node

To make it easier for you, I've included a JS file, called phantomServer.js in the download that uses some of PhantomJS' API to load a webpage. It waits for 5 seconds before executing JavaScript that scrapes the page. You can run it by navigating to the directory and issuing the following command in your terminal:

node phantomServer.js

I'll give an overview of how it works here. First, we require PhantomJS:

var phantom = require('phantom');

Next, we implement some methods from the API. Namely, we create a page instance and then call the open() method:

phantom.create(function(ph) {
  return ph.createPage(function(page) {

    //From here on in, we can use PhantomJS' API methods
    return page.open("http://tilomitra.com/repository/screenscrape/ajax.html", function(status) {

      //The page is now open
      console.log("opened site? ", status);

    });
  });
});

Once the page is open, we can inject some JavaScript into the page. Let's load jQuery from its URL via the page.includeJs() method (page.injectJs() expects a local file path instead):

phantom.create(function(ph) {
  return ph.createPage(function(page) {
    return page.open("http://tilomitra.com/repository/screenscrape/ajax.html", function(status) {
      console.log("opened site? ", status);

      page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function() {
        //jQuery Loaded
        //We can use things like $("body").html() inside page.evaluate() from here on.

      });
    });
  });
});

jQuery is now loaded, but we don't know whether the dynamic content on the page has loaded yet. To account for this, I usually put my scraping code inside a setTimeout() function that executes after a certain time interval. If you want a more dynamic solution, the PhantomJS API lets you listen for and emulate certain events. Let's go with the simple case:

setTimeout(function() {
    return page.evaluate(function() {

        //Get what you want from the page using jQuery. 
        //A good way is to populate an object with all the jQuery commands that you need and then return the object.

        var h2Arr = [], //array that holds all html for h2 elements
        pArr = []; //array that holds all html for p elements

        //Populate the two arrays
        $('h2').each(function() {
            h2Arr.push($(this).html());
        });

        $('p').each(function() {
            pArr.push($(this).html());
        });

        //Return this data
        return {
            h2: h2Arr,
            p: pArr
        }
    }, function(result) {
        console.log(result); //Log out the data.
        ph.exit();
    });
}, 5000);
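
A fixed five-second delay is simple, but it either wastes time or is too short on a slow connection. One alternative is to poll until a condition becomes true. The helper below is a generic sketch, not part of PhantomJS or PhantomJS-Node; the name pollUntil and its options are hypothetical. The check function receives a callback and reports true or false, which fits asynchronous checks such as a page.evaluate() call that tests whether $('#content h2').length is greater than zero.

```javascript
// Call `check` repeatedly until it reports true or the timeout expires.
// check(cb): must call cb(true) when ready, cb(false) otherwise.
// done(err): called once, with null on success or an Error on timeout.
function pollUntil(check, done, options) {
  var interval = (options && options.interval) || 250; // ms between checks
  var timeout = (options && options.timeout) || 5000;  // ms before giving up
  var started = Date.now();

  (function attempt() {
    check(function (ready) {
      if (ready) {
        return done(null);
      }
      if (Date.now() - started >= timeout) {
        return done(new Error('Timed out after ' + timeout + 'ms'));
      }
      setTimeout(attempt, interval);
    });
  })();
}

// Stand-in condition for demonstration: becomes true on the third check.
var attempts = 0;
pollUntil(function (cb) {
  attempts += 1;
  cb(attempts >= 3);
}, function (err) {
  console.log(err ? err.message : 'ready after ' + attempts + ' checks');
}, { interval: 10, timeout: 1000 });
```

In the scraper, the stand-in condition would be replaced by a check that runs inside the page, and the scraping code would go in the done callback.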

Putting it all together, our phantomServer.js file looks like this:

var phantom = require('phantom');

phantom.create(function(ph) {
  return ph.createPage(function(page) {
    return page.open("http://tilomitra.com/repository/screenscrape/ajax.html", function(status) {
      console.log("opened site? ", status);

      //Load jQuery from its URL. includeJs() takes a remote URL; injectJs() expects a local file.
      page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function() {
        //jQuery Loaded.
        //Wait for a bit for AJAX content to load on the page. Here, we are waiting 5 seconds.
        setTimeout(function() {
          return page.evaluate(function() {

            //Get what you want from the page using jQuery. A good way is to populate an object with all the jQuery commands that you need and then return the object.
            var h2Arr = [],
                pArr = [];
            $('h2').each(function() {
              h2Arr.push($(this).html());
            });
            $('p').each(function() {
              pArr.push($(this).html());
            });

            return {
              h2: h2Arr,
              p: pArr
            };
          }, function(result) {
            console.log(result);
            ph.exit();
          });
        }, 5000);

      });
    });
  });
});

This implementation is a little crude and disorganized, but it makes the point. Using PhantomJS, we are able to scrape a page that has dynamic content! Your console should output the following:

→ node phantomServer.js
opened site?  success
{ h2: [ 'Article 1', 'Article 2', 'Article 3' ],
  p: 
   [ 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
     'Ut sed nulla turpis, in faucibus ante. Vivamus ut malesuada est. Curabitur vel enim eget purus pharetra tempor id in tellus.',
     'Curabitur euismod hendrerit quam ut euismod. Ut leo sem, viverra nec gravida nec, tristique nec arcu.' ] }
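
The two parallel arrays are easy to log, but a list of article objects is usually nicer to store. As a small sketch, assuming every heading is followed by exactly one paragraph (true for this test page, not for pages in general), the result can be zipped together like this. The function name toArticles is hypothetical.

```javascript
// Pair each heading with the paragraph at the same index.
function toArticles(result) {
  return result.h2.map(function (title, i) {
    return { title: title, body: result.p[i] };
  });
}

// Shortened stand-in for the object returned by page.evaluate().
var scraped = {
  h2: ['Article 1', 'Article 2'],
  p: ['Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
      'Ut sed nulla turpis, in faucibus ante.']
};

console.log(toArticles(scraped));
```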

Conclusion

In this tutorial, we reviewed two different ways of performing web scraping. If scraping from a static web page, we can take advantage of YQL, which is easy to set up and use. On the other hand, for dynamic sites, we can leverage PhantomJS. It's a little harder to set up, but provides more capabilities. Remember: you can use PhantomJS for static sites too!

If you have any questions on this topic, feel free to ask below and I'll do my best to help you out.

Tags: node js
  • http://www.garysieling.com/blog Gary

    Nice example. I don’t know that xpath is that hard to use though, because it’s easily available in Firebug.

    • http://www.tilomitra.com Tilo
      Author

      Yeah, you’re right. I use Chrome for most things, but my point is that we aren’t used to writing in XPath, rather than it being hard to use. I’m sorry if I didn’t explain that part well enough.

      In my view, I’d rather go “#content”, than //div[@id="content"] and I think most people would agree.

  • Vernaldi Metayer

    Great tutorial! I noticed your quote about the api, but is there some guide on when web scraping is illegal? Thanks!

    • http://www.tilomitra.com Tilo
      Author

      It’s a very blurry line. Stuff on the web has copyright protection, so if you scrape someone else’s content and then put that inside a paywall and charge for it, I’m pretty sure it would be illegal, unless they gave you permission.

  • sirfilip

    Great tut bro, Love nodejs and getting to like phantomjs :)

  • http://twitter.com/aforavi Avinash

    node-scraper : https://github.com/mape/node-scraper this look Good too .. :)

    Thanks for the tut!

  • sh4n3

    Great tut! I have recently discovered Phantomjs but found running the Casper API a lot simpler to implement over the top of Phantomjs. One question is, can Phantom be run on a remote server or VPS or on (for example) nodester? I ask because I have a web application that needs to poll info from a password-secured site. I can do this on my localhost but would really need to package phantom up to be incorporated with my web app. Thanks in advance for any ideas you may have.

    • http://www.tilomitra.com Tilo
      Author

      Yes, this can be done. Let’s take Heroku as an example since it’s similar to Nodester or a VPS. If you took your PhantomServer.js script and wrapped it all up in a setTimeout(), Heroku could keep running it as a process. You’re entitled to 1 background process in the Free plan in Heroku, AFAIK.

  • http://fabryz.com Fabryz

    I got sneak peek on crawling around with Node, http-agent and JSDOM months ago and this was the result: https://gist.github.com/2493656

    • http://www.tilomitra.com Tilo
      Author

      This looks good. From my experience, scraping regular sites is no problem. There are tons of tools out there, some of which you have used. The problem comes when you try to scrape dynamic sites, and that’s where PhantomJS really shines for me (Casper works well too, as sh4n3 mentioned).

  • John

    Gonna feel really dumb here, but I just can’t get this thing to work :-(. PhantomJS installs fine, I go to run the first snippet of code via “node test.js” in terminal and I keep getting this:

    node_modules/phantom/node_modules/dnode/index.js:118
    throw new Error('no port or path provided');
    ^
    Error: no port or path provided

    Something to do with dnode? I can run your source file fine. (Running OSX).

    • Gabriel

      I have the same problem :/

    • http://www.tilomitra.com Tilo
      Author

      Paste line 118 from dnode/index.js, and let’s see what’s going on in there.

      • John

        That is line 118:

        throw new Error('no port or path provided');

        Putting it into context:

        if (params.port) {
            server.listen(params.port, params.host);
        }
        else if (params.path) {
            server.listen(params.path);
        }
        else {
            throw new Error('no port or path provided');
        }

      • http://www.tilomitra.com Tilo
        Author

        Here’s a solution from another reader who was having the same problem as you:

        Saw on tutsplus lots of comments about “No path or port provided”. I had this problem too and found a solution to fix it. As i have no account at Tutsplus i will post it here:

        - Go to the phantomjs version (the node.js version).
        - Remove node_modules/dnode + node_modules/dnode-protocol
        - open package.json
        - change :

        "dependencies": {
        "dnode-protocol": "*",
        "dnode": "*",
        "express": "*"
        },

        to

        "dependencies": {
        "dnode-protocol": "~0.2.2",
        "dnode": "~0.9.12",
        "express": "*"
        },
        - in the phantomjs folder execute: "npm install phantom"
        - The error is gone

  • alFReD-NSH

    Nice article, but you forgot to mention cheerio ( https://github.com/MatthewMueller/cheerio ), which is an implementation of jQuery for node that is much faster because it doesn’t create the whole DOM with the full specifications. Here, using phantomjs creates the DOM and then loads jQuery, so there would be a big hit on performance compared to cheerio.

    About YQL, this is nice solution, since we don’t do any parsing, nor writing much code. Which just write a query and then it’s yahoo who has to parse the code and return parts we want. As long as yahoo does it’s service with low latency it would be an awesome solution.

    • http://www.tilomitra.com Tilo
      Author

      YQL is very fast and the best part is if you hit a website a lot of times, you may get blocked, but since Yahoo’s server’s are hitting it, you are fine.

      I hadn’t come across Cheerio, thanks for bringing that one up! I’ll take a look at it. However, when screen scraping, I don’t think performance is a factor. I don’t recommend real-time screen scraping. I usually scrape first, keep that in a database, and pull it up when needed so it’s quick.

      • http://twitter.com/ekanna ekanna

        Its really surprising to know that you are not aware of cheerio. It is one of the best tool available for web scrapping using nodeJS. Anyway i like your article.

      • alFReD-NSH

        Yes, you have a good point. In this case performance doesn’t matter as long it’s not real time.

  • http://www.amazing-web-design.co.uk/ Joe Elliott

    Hi there,

    Great post, I had heard of Node.js but never used it, Ill use this and give it a go, thanks :)

    Joe

  • Sh4n3

    Thanks for your answer Tilo, I’ll give it a go :)

  • http://my.opera.com/BS-Harou/ BS-Harou

    Hi, is there any way to get response headers with YQL? E.g. to get cookies on site when I need to first log in before reading the content.

  • angel

    nice article…I didn’t used yql before..by the way…would be interesting an article about coco (coffeescript fork) roy or livescript (coco fork)…they have a very nice syntax for callbacks than many people doesn’t know…

    var phantom = require('phantom');
    phantom.create(function(ph) {
      return ph.createPage(function(page) {
        return page.open("http://www.google.com", function(status) {
          console.log("opened google? ", status);
          return page.evaluate((function() {
            return document.title;
          }), function(result) {
            console.log('Page title is ' + result);
            return ph.exit();
          });
        });
      });
    });

    in livescript is:

    phantom = require ‘phantom’
    ph <-phantom.create
    page <- ph.createpage
    status document.title),( ->
    console.log (“page title is” + it)
    ph.exit()
    ))

    I think than is the same code (I compile this code in http://gkz.github.com/LiveScript/ and I get the same javascript) but is much more readable .D

  • angel

    it’s me again…I made a mistake when I copied the code..the code must be it:

    phantom = require ‘phantom’
    ph <-phantom.create
    page <- ph.createpage
    status document.title),
    ( ->
    console.log (“page title is” + it)
    ph.exit()
    ))

    no callbacks!!..:D

    sorry if I divert the article’ subject but when I saw that callbacks seemed like a good option for use livescript…

    • http://www.tilomitra.com Tilo
      Author

      You’re right, it is more readable! I considered presenting it in CoffeeScript, but not everyone knows it, so I thought presenting it in vanilla JS was the best choice. Thanks for your input though :)

  • http://webdesignpluscode.blogspot.com/ waqas

    Nice tutorial on node.js…I want to learn node.js from scratch and be expert in it…can you please provide me some guide to start with… and some useful link for easy learning … Thanks :)

  • Sumit Negi

    Hi ,
    Is it possible to scrap css of web page as html. For example I am scraping a piece of html code form web page and I want to scrap its CSS also.

  • Jason

    Any thoughts on how to spider a site structure?

    I mean, how to queue up the discovered links and process them in a timely way without spamming the site with hundreds of concurrent requests.

  • JonnyMitts

    I actually had to do something like this recently. I may have used this if I knew about it. I ended up writing a ruby script using Nokogiri. I highly recommend it.

  • lacks

    The website which loads the images such as when i send the an http request i ll get the response for only the few images which loaded first but i am not getting the remaining images which loads after. I am using node js. Please suggest me solution..

  • Bart

    It does not work. I have phantomJS installed. Why would this be?

    Warning: express.createServer() is deprecated, express
    applications no longer inherit from http.Server,
    please use:

    var express = require("express");
    var app = express();

    phantom stdout: TypeError: 'undefined' is not a function (evaluating 'phantom.loadModuleSource('webpage')')

    • Patrick

      I second Bart’s comment

      Warning: express.createServer() is deprecated, express
      applications no longer inherit from http.Server,
      please use:

      var express = require("express");
      var app = express();

      phantom stdout: TypeError: 'undefined' is not a function (evaluating 'phantom.loadModuleSource('webpage')')

      phantom stdout: /Users/mclenithan/Sites/node_modules/phantom/shim.js:1573

      phantom stdout: /Users/mclenithan/Sites/node_modules/phantom/shim.js:1691

      phantom stdout: /Users/mclenithan/Sites/node_modules/phantom/shim.js:156

      phantom stdout: /Users/mclenithan/Sites/node_modules/phantom/shim.js:7
      /Users/mclenithan/Sites/node_modules/phantom/shim.js:1694

  • dunn

    hi, your tutorial shows how to scrape the data loaded by js after a few second delay time. What about in the case you have to click on a link for data to load/insert by the link’s js such as:

    My question is: How to scrape data from this website http://vtis.vn/index.aspx But the data is not shown until you click on for example “Danh sách chậm”. I have tried very hard and carefully, when you click on “Danh sách chậm” this is onclick event which triggers some javascript functions one of the js functions is to get the data from the server and insert it to a tag/place holder and at this point you can use something like firefox to examine the data and yes, the data is display to users/viewers on the webpage. So again, how can we scrap this data programmatically?

    i wrote a scrapping function but ofcourse it does not get the data i want because the data is not available until i click on the button “Danh sách chậm”

    loadHTML($Page);
    $dom_xpath_admin = new DOMXpath($dom_document_admin);
    $elements = $dom_xpath->query(“*//td[@class='IconMenuColumn']“);
    //
    foreach ($elements as $element) {
    $nodes = $element->childNodes;
    foreach ($nodes as $node) {
    echo (mb_convert_encoding($node->c14n(), ‘iso-8859-1′, mb_detect_encoding($content, ‘UTF-8′, true)));
    }
    }
    }
    ?>

    • Narek

      Use CasperJS for that .

  • Giancarlo

    I have a big problem about express applications and phantom :(

    Warning: express.createServer() is deprecated, express
    applications no longer inherit from http.Server,
    please use:

    var express = require("express");
    var app = express();

    phantom stderr: execvp(): No such file or directory

    help me pls.

  • Bob

    haha, u should have a look at casperjs and spookyjs :)

  • carlo

    How to fake a plugin presence to PhantomJS?
    Impossible with newer versions and some sites won’t render flash content if Flash plugin is missing from navigator.plugins :(
    No need for plugin to be really present…
    Any ideas

  • ramsich

    Just wanted to drop some words about scraping even though i see that this post is about node.js.
    There are even easier way to scrape if the main goal is to scrape. one of them is using python with pyquery.
    its actually a replication of jquery traversal and manipulation methods in python. Its pretty easy and cool.

  • http://www.facebook.com/atal.shukla.58 Atal Shukla

    Your post is nice. Query consists of two parametre the url & xpath and coding is effective..
    Web scraping is dynamic because it has a life of its own; it keeps growing strong; it continues to branch out; and it is progressing constantly..

