HTML Parsing and Screen Scraping with the Simple HTML DOM Library

Tutorial Details
  • Technology: PHP
  • Difficulty: Moderate
  • Estimated Completion Time: 40 Minutes

If you need to parse HTML, regular expressions aren’t the way to go. In this tutorial, you’ll learn how to use an open source, easy-to-learn parser to read, modify, and output HTML from external sources. Using Nettuts+ as an example, you’ll learn how to get a list of all the articles published on the site and display them.


Step 1. Preparation

The first thing you’ll need to do is download a copy of the Simple HTML DOM library, freely available from SourceForge.

There are several files in the download, but the only one you need is the simple_html_dom.php file; the rest are examples and documentation.

Download from Sourceforge

Step 2. Parsing Basics

This library is very easy to use, but there are some basics you should review before putting it into action.

Loading HTML

include('simple_html_dom.php');
$html = new simple_html_dom();

// Load from a string (double quotes, because the text contains an apostrophe)
$html->load("<html><body><p>Hello World!</p><p>We're here</p></body></html>");

// Load a file
$html->load_file('http://net.tutsplus.com/');

You can create your initial object either by loading HTML from a string, or from a file. Loading a file can be done either via URL, or via your local file system.

A note of caution: the load_file() method delegates its job to PHP’s file_get_contents(). If allow_url_fopen is not enabled in your php.ini file, you may not be able to open a remote file this way. In that case, you can fall back on the cURL library to fetch remote pages, then read them in with the load() method.
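The cURL fallback mentioned above might look something like the following. This is only a minimal sketch: fetch_page() is a hypothetical helper name, not part of the library, and the timeout and redirect settings are assumptions you may want to change.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): fetch a page with cURL.
// Returns the response body as a string, or null if the request failed.
function fetch_page($url, $timeoutSeconds = 10)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ));
    $body = curl_exec($ch);

    // curl_exec() returns false on failure
    return $body === false ? null : $body;
}

// Usage with the library (assumes simple_html_dom.php has been included):
// $markup = fetch_page('http://net.tutsplus.com/');
// if ($markup !== null) {
//     $html = new simple_html_dom();
//     $html->load($markup);
// }
```

Because the markup arrives as a plain string, everything after load() works exactly as it would with load_file().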

Accessing Information

Once you have your DOM object, you can start to work with it by using find() and creating collections. A collection is a group of objects found via a selector – the syntax is quite similar to jQuery.

<html>
<body>
    <p>Hello World!</p>
    <p>We're here</p>
</body>
</html>

In this example HTML, we’re going to take a look at how to access the information in the second paragraph, change it, and then output the results.

# create and load the HTML
include('simple_html_dom.php');
$html = new simple_html_dom();
$html->load("<html><body><p>Hello World!</p><p>We're here</p></body></html>");

# get an element representing the second paragraph
$element = $html->find("p");

# modify it
$element[1]->innertext .= " and we're here to stay.";

# output it!
echo $html->save();

The find() method always returns a collection (an array) of elements, unless you pass an index as the second parameter, in which case it returns only the nth match.

Lines 2-4: Load the HTML from a string, as explained previously.

Line 7: This line finds all <p> tags in the HTML, and returns them as an array. The first paragraph will have an index of 0, and subsequent paragraphs will be indexed accordingly.

Line 10: This accesses the second item in our collection of paragraphs (index 1), and appends to its innertext property. innertext represents the contents between the tags, while outertext represents the contents including the tag itself. We could replace the tag entirely by assigning to outertext.

We’re going to add one more line, and modify the class of our second paragraph tag.

$element[1]->class = "class_name";
echo $html->save();

The resulting HTML of the save command would be:

<html>
<body>
    <p>Hello World!</p>
    <p class="class_name">We're here and we're here to stay.</p>
</body>
</html>

Other Selectors

Here are some other examples of selectors. If you’ve used jQuery, these will seem very familiar.

# get the first occurrence of id="foo"
$single = $html->find('#foo', 0);

# get all elements with class="foo"
$collection = $html->find('.foo');

# get all the anchor tags on a page
$collection = $html->find('a');

# get all anchor tags that are inside H1 tags
$collection = $html->find('h1 a');

# get all img tags with a title of 'himom'
$collection = $html->find('img[title=himom]');

The first example isn’t entirely intuitive – all queries return collections by default, even an ID query, which should only match a single element. However, by specifying the second parameter, we are saying “only return the first item of this collection”.

This means $single is a single element, rather than an array of elements with one item.

The rest of the examples are self-explanatory.

Documentation

Complete documentation on the library can be found at the project documentation page.

Step 3. Real World Example

To put this library into action, we’re going to write a quick script that scrapes the Nettuts+ website and produces a list of the articles on the site by title and description. This is only an example: scraping is a tricky area of the web, and shouldn’t be performed without permission.

include('simple_html_dom.php');

$articles = array();
getArticles('http://net.tutsplus.com/page/76/');

We start by including the library, and calling the getArticles function with the page we’d like to start parsing. In this case we’re starting near the end and being kind to Nettuts’ server.

We’re also declaring a global array to make it simple to gather all the article information in one place. Before we begin parsing, let’s take a look at how an article summary is described on Nettuts+.

<div class="preview">
    <!-- Post Taxonomies -->
    <div class="post_taxonomy"> ... </div>
    <!-- Post Title -->
    <h1 class="post_title"><a>Title</a></h1>
    <!-- Post Meta -->
    <div class="post_meta"> ... </div>
    <div class="text"><p>Description</p></div>
</div>

This represents a basic post format on the site, including source code comments. Why are the comments important? They count as nodes to the parser.


Step 4. Starting the Parsing Function

function getArticles($page) {
    global $articles;

    $html = new simple_html_dom();
    $html->load_file($page);

    // ... more ...
}

We begin very simply by declaring our global, creating a new simple_html_dom object, then loading the page we want to parse. This function is going to be calling itself later, so we’re setting it up to accept the URL as a parameter.


Step 5. Finding the Information We Want

$items = $html->find('div[class=preview]');  

foreach($items as $post) {
    # remember comments count as nodes
    $articles[] = array($post->children(3)->outertext,
                        $post->children(6)->first_child()->outertext);
}

This is the meat of the getArticles function. It takes a closer look to really understand what’s happening.

Line 1: Creates an array of elements – divs with the class of preview. We now have a collection of articles stored in $items.

Line 5: $post now refers to a single div of class preview. If we look at the original HTML, we can see that the child at index 3 (the fourth child, since comments count too) is the H1 containing the article title. We take that and assign it to $articles[index][0].

Remember to start at 0 and to count comments when trying to determine the proper index of a child node.

Line 6: The child of $post at index 6 is <div class="text">. We want the description text from within, so we grab the first child’s outertext – this will include the paragraph tag. A single record in articles now looks like this:

$articles[0][0] = '<h1 class="post_title"><a>My Article Name Here</a></h1>';
$articles[0][1] = '<p>This is my article description</p>';

Step 6. Pagination

The first thing we do is determine how to find our next page. On Nettuts+, the URLs are easy to figure out, but we’re going to pretend they aren’t, and get the next link via parsing.

Find the next page to parse

If we look at the HTML, we see the following:

<a href="http://net.tutsplus.com/page/2/" class="nextpostslink">»</a>

If there is a next page (and there won’t always be), we’ll find an anchor with the class of ‘nextpostslink’. Now that information can be put to use.

if($next = $html->find('a[class=nextpostslink]', 0)) {
    $URL = $next->href;

    $html->clear();
    unset($html);

    getArticles($URL);
}

On the first line, we see if we can find an anchor with the class nextpostslink. Take special notice of the second parameter for find(). This specifies we only want the first element (index 0) of the found collection returned. $next will only be holding a single element, rather than a group of elements.

Next, we assign the link’s href to the variable $URL. This is important because we’re about to destroy the HTML object. Due to a circular-reference memory leak in PHP 5, the current simple_html_dom object must be cleared and unset before another one is created. Failing to do so could eat up all your available memory.

Finally, we call getArticles with the URL of the next page. This recursion ends when there are no more pages to parse.
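If you’d rather not recurse, the same walk through the “next” links can be written as a loop with a cap on the number of pages. The sketch below is an alternative to the tutorial’s approach, not a replacement for it: crawl_pages() and the shape of the callback’s return value are made up for illustration, and a fake three-page site stands in for real HTTP requests. In practice, the callback would load the page with simple_html_dom, collect the articles, read the nextpostslink href, and clear the object before returning.

```php
<?php
// Hypothetical helper: follow "next page" links in a loop instead of recursing.
// $parsePage is any callable that takes a URL and returns
// array('items' => array(...), 'next' => next URL or null).
function crawl_pages($startUrl, $parsePage, $maxPages = 10)
{
    $items = array();
    $url = $startUrl;
    $visited = 0;

    // Stop when there is no next link or the page cap is reached
    while ($url !== null && $visited < $maxPages) {
        $result = call_user_func($parsePage, $url);
        $items = array_merge($items, $result['items']);
        $url = $result['next'];
        $visited++;
    }

    return $items;
}

// Stand-in for a real parser: a fake three-page site keyed by URL.
$fakeSite = array(
    '/page/1/' => array('items' => array('A', 'B'), 'next' => '/page/2/'),
    '/page/2/' => array('items' => array('C'), 'next' => '/page/3/'),
    '/page/3/' => array('items' => array('D'), 'next' => null),
);
$parseFake = function ($url) use ($fakeSite) {
    return $fakeSite[$url];
};

$all = crawl_pages('/page/1/', $parseFake);       // visits all three pages
$capped = crawl_pages('/page/1/', $parseFake, 2); // stops after two pages

echo implode(',', $all), "\n";
echo implode(',', $capped), "\n";
```

The page cap also gives you an easy way to stay kind to the server you’re reading from.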


Step 7. Outputting the Results

First we’re going to set up a few basic stylings. This is completely arbitrary – you can make your output look however you wish.

#main {
    margin:80px auto;
    width:500px;
}
h1 {
    font:bold 40px/38px helvetica, verdana, sans-serif;
    margin:0;
}
h1 a {
    color:#600;
    text-decoration:none;
}
p {
    background: #ECECEC;
    font:10px/14px verdana, sans-serif;
    margin:8px 0 15px;
    border: 1px #CCC solid;
    padding: 15px;
}
.item {
    padding:10px;
}

Next we’re going to put a small bit of PHP in the page to output the previously stored information.

<?php
    foreach($articles as $item) {
        echo "<div class='item'>";
        echo $item[0];
        echo $item[1];
        echo "</div>";
    }
?>

The final result is a single HTML page listing all the articles, starting on the page indicated by the first getArticles() call.


Step 8. Conclusion

If you’re parsing a large number of pages (say, the entire site), it may take longer than the max execution time allowed by your server. For example, running from my local machine, it takes about one second per page (including time to fetch).

On a site like Nettuts+, which currently has 78 pages of tutorials, the script would run for over a minute.
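As a rough sketch of that arithmetic, and one way to ask PHP for more time, you could do something like the following. The doubling for headroom is an arbitrary assumption, and some hosts ignore or disable set_time_limit().

```php
<?php
// Rough estimate using the figures quoted above.
$pages = 78;
$secondsPerPage = 1;
$estimatedSeconds = $pages * $secondsPerPage; // 78 seconds

// PHP's default max_execution_time is 30 seconds, so request some headroom.
// This may have no effect if the host restricts set_time_limit().
set_time_limit($estimatedSeconds * 2);

echo "Estimated run time: {$estimatedSeconds}s\n";
```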

This tutorial should get you started with HTML parsing. There are other ways to work with the DOM, including PHP’s built-in DOM extension, which lets you use powerful XPath selectors to find elements. For ease of use and a quick start, I find this library to be one of the best. As a closing note, always remember to obtain permission before scraping a site; this is important. Thanks for reading!
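For comparison, here is a minimal sketch of the built-in approach using DOMDocument and DOMXPath. The markup is a trimmed copy of the article preview shown earlier, and the XPath expressions are just one way to select the same two elements.

```php
<?php
// A trimmed copy of the article preview markup used earlier in this tutorial.
$markup = <<<HTML
<div class="preview">
    <!-- Post Title -->
    <h1 class="post_title"><a>Title</a></h1>
    <div class="text"><p>Description</p></div>
</div>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them;
// real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$found = array();

foreach ($xpath->query('//div[@class="preview"]') as $post) {
    // Query relative to the current preview div
    $title = $xpath->query('.//h1[@class="post_title"]/a', $post)->item(0);
    $desc = $xpath->query('.//div[@class="text"]/p', $post)->item(0);

    $found[] = array(
        $title ? trim($title->textContent) : '',
        $desc ? trim($desc->textContent) : '',
    );
}

print_r($found);
```

Because the XPath expressions select by element name and class rather than by position, the source code comments no longer affect which node you get.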

Discussion 86 Comments

  1. Elaine says:

    I like it. Thank you.

  2. Rlseu says:

    thanks nice site.

  3. james says:

    dam thx man

  4. Angelo says:

    Very nice.
    Thank you.

  5. LexieGreene says:

    Just wondering what I should do when the “Next Page” isn’t a link but calls a javascript function to load the new content?

  6. WeCode says:

    come on this is supper easy ….thx for sharing

  7. nonyck says:

    Nice tutorial !!
    I try to make a parse of a webimages but dont work when I have to show the images I parsed a corrupted pic appears, how I can store full links to images ???

  8. Methemer says:

    One thing worries me though – how is this doing in Performance ? For example compared to phpQuery or just simple preg_match() ? This of course looks a million times easier…

  9. ganesh says:

    How to read javascript function like html tags using load.function. How to get the contents inside in javascript function.

  10. mario says:

    This is the most helpful class i ever worked with. Thanks guys, you did a great job.

  11. saijin says:

    This is working fine when I set it up on localserver, but I only see a list of “Object id” once I upload it online.

    Can someone give me a hint on how can I fix it?

    It says:
    Object id #254 Object id #256 Object id #258
    Object id #271 Object id #273 Object id #278
    Object id #284 Object id #286 Object id #291
    Object id #297 Object id #299 Object id #304
    Object id #310 Object id #312 Object id #317

    Here is the link ==> http://floridawebexperts.net/All-Fantasy-Football-Experts/test/files/football.php

    It seems that scrapping is not allowed on my source site? Is that even possible?

  12. saijin says:

    I’m using PHP 5.1.6 Server and Simple HTML DOM 1.5. This script scrape or extract data from a football site, its fully working on PHP 5.2.17 Server. I need to know how I can fix it for PHP 5.1.6 server. Can someone give me a hint on how can I fix the error? Thanks in advance.

    My PHP 5.1.6 Server script output shows:
    ++++++++++++++++
    Object id #599 Object id #604 Object id #609 Object id #614 Object id #619
    Object id #627 Object id #632 Object id #637 Object id #642 Object id #647
    Object id #655 Object id #660 Object id #665 Object id #670 Object id #675
    Object id #683 Object id #688 Object id #693 Object id #698 Object id #703
    Object id #711 Object id #716 Object id #721 Object id #726 Object id #731
    ++++++++++++++++

    PHP 5.2.17 Server says
    ++++++++++++++++
    Rk Player Team POS OPPONENT
    1 Aaron Rodgers GB QB at CAR
    2 Tom Brady NE QB vs. SD
    3 Matt Schaub HOU QB at MIA
    4 Michael Vick PHI QB at ATL
    ++++++++++++++++

    I did applied the bug solution listed on https://sourceforge.net/tracker/index.php?func=detail&aid=3107230&group_id=218559&atid=1044037 but it is still not working. It says:
    ++++++++++++++++
    Details:

    I get compiler errors in PHP 5.2 when using this as an object.

    The offending lines are 609 and 940, which both contain this construct:

    if ($this->size>0) $this->char = $this->doc[0];

    This tries to get the first character of $this->doc, but PHP 5.2 sees it as trying to access it as an array. It’s easily fixed by this:

    if ($this->size>0) $this->char = substr($this->doc, 0, 1);

    Or you could probably use chr(ord($this->doc)) as well. Either way solves the compile error without changing functionality.
    ++++++++++++++++

    Here are my output codes:
    load_file($page);

    //$items = $html->find(‘div[class=preview]‘);
    $items = $html->find(‘tbody tr’);

    foreach($items as $post) {
    # remember comments count as nodes
    /*$articles[] = array($post->children(3)->outertext,
    $post->children(6)->first_child()->outertext);*/
    $articles[] = array($post->children(0), $post->children(1), $post->children(2), $post->children(3), $post->children(4));
    }

    # lets see if there’s a next page
    if($next = $html->find(‘a[class=nextpostslink]‘, 0)) {
    $URL = $next->href;
    echo “going on to $URL <<clear();
    unset($html);

    getArticles($URL);
    }
    }

    ?>

    <?php
    foreach($articles as $item) {
    echo "”;
    echo “” . $item[0] . “” . $item[1] . “” . $item[2] . “”;
    echo “” . $item[3] . “” . $item[4] . “”;
    echo “”;
    }
    ?>

  13. yoshi says:

    thanks for this..

  14. Yeasin says:

    This is okay when i scrape a website without login. But if the website is under a login form, than what is the way. Please Someone reply. [Every data under login form are not copyrighted data. Suppose, my facebook status updates are not any1 others copyright. So, don't call the function copyright :p]

    • Bombero says:

      @Yeasin
      You should use another means to login and get html (after login). Personaly I use PEAR::HTTP_Request2 library.
      Then simply pass it to your simple_html_dom object via $obj->load().

  15. acctman says:

    how would I use simple dom to extra C, E, Hazel from in between the span tags? I also want to individually set a variable for each i.e. $1_firstname, $2_firstname, $2_eyes)

    C
    E
    Hazel

  16. Dragonbird says:

    I am facing some problems with this
    my codes are as follows:
    load(‘Dragonbird ‘);
    if(isset($data))
    {
    $name = $data -> find(‘span[id="ctl00_ContentPlaceHolder1_lblName"]‘);
    if(isset($name))
    {
    echo $name -> plaintext;
    unset($name);
    }
    Else
    {
    echo “Name Not found”;
    }
    $data->clear();
    }
    ?>

    but it returns an error
    Trying to get property of non-object in C:\xampp\htdocs\Projects\scripts\getdetails.php on line 10

    any cookie for me ?

  17. Dragonbird says:

    This is filtering my codes and not letting me type any php or html tag here, how do I show you guys my codes ?

  18. Baliniz says:

    I wonder how to do it with PHP’s built it class: DOMDocument and DOMXPath??
    thanks.

  19. Rodrigo says:

    Amazing tips, thanks, the tip about pagination, using “prev/next” class to find next pages really save me.
    It’s funny how sometimes we “reinvented the wheel”. The next/prev link already in the page, and instead use it, I get all pagination links an put in one array, calling again in loops…
    Code is really poetry =)

  20. Aleks says:

    The code doesn’t work well. Can you update it please. It takes almost every tag from the page, not just the title and description. I guess something is wrong with the child element.

  21. Harvey says:

    would there be any way of saving the information crawled to a database?

  22. armand says:

    I need to access a text that is not wrapped in anything. Is there a way to do that with simple html dom without using regular expressions. The html is like this

    0737.326.856

    Email:
    imobiliaretopcasa@yahoo.com

    http://www.imobiliaretopcasa.ro

    I want to extract the phone 0737.326.856
