Tutorial Details
- Technology: PHP
- Difficulty: Moderate
- Estimated Completion Time: 40 Minutes
If you need to parse HTML, regular expressions aren’t the way to go. In this tutorial, you’ll learn how to use an open source, easily learned parser to read, modify, and output HTML from external sources. Using Nettuts as an example, you’ll learn how to get a list of all the articles published on the site and display them.
Step 1. Preparation
The first thing you’ll need to do is download a copy of the Simple HTML DOM library, freely available from SourceForge.
There are several files in the download, but the only one you need is the simple_html_dom.php file; the rest are examples and documentation.

Step 2. Parsing Basics
This library is very easy to use, but there are some basics you should review before putting it into action.
Loading HTML
$html = new simple_html_dom();
// Load from a string
$html->load("<html><body><p>Hello World!</p><p>We're here</p></body></html>");
// Load a file
$html->load_file('http://net.tutsplus.com/');
You can create your initial object either by loading HTML from a string, or from a file. Loading a file can be done either via URL, or via your local file system.
A note of caution: The load_file() method delegates its job to PHP’s file_get_contents(). If allow_url_fopen is not enabled in your php.ini file, you may not be able to open a remote file this way. In that case, you can fall back on the cURL library to fetch remote pages, then read them in with the load() method.
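If allow_url_fopen is disabled, the fallback described above might look something like the following sketch. The function name fetchWithCurl and the 10-second timeout are assumptions made for illustration and are not part of the library; the code relies only on PHP’s cURL extension.

```php
<?php
// Hypothetical helper: fetch a remote page with cURL instead of file_get_contents().
// Returns the response body as a string, or null if the request failed.
function fetchWithCurl($url, $timeout = 10)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);    // give up after $timeout seconds
    $body = curl_exec($ch);                         // false on failure
    return ($body === false) ? null : $body;
}

// Usage with the library (assumes simple_html_dom.php has been included):
// $markup = fetchWithCurl('http://net.tutsplus.com/');
// if ($markup !== null) {
//     $html = new simple_html_dom();
//     $html->load($markup);
// }
```

Returning null on failure lets the caller decide whether to skip the page or stop, rather than handing an empty string to load().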
Accessing Information

Once you have your DOM object, you can start to work with it by using find() and creating collections. A collection is a group of objects found via a selector – the syntax is quite similar to jQuery.
<html>
<body>
<p>Hello World!</p>
<p>We're Here.</p>
</body>
</html>
In this example HTML, we’re going to take a look at how to access the information in the second paragraph, change it, and then output the results.
# create and load the HTML
include('simple_html_dom.php');
$html = new simple_html_dom();
$html->load("<html><body><p>Hello World!</p><p>We're here</p></body></html>");
# get an element representing the second paragraph
$element = $html->find("p");
# modify it
$element[1]->innertext .= " and we're here to stay.";
# output it!
echo $html->save();
The find() method always returns a collection (array) of tags unless you pass an index as the second parameter, in which case it returns only the element at that position.
The include, new and load() lines: Load the HTML from a string, as explained previously.
The find("p") line: Finds all <p> tags in the HTML and returns them as an array. The first paragraph has an index of 0, and subsequent paragraphs are indexed accordingly.
The innertext line: Accesses the second item in our collection of paragraphs (index 1) and appends to its innertext attribute. innertext represents the contents between the tags, while outertext represents the contents including the tag. We could replace the tag entirely by using outertext.
We’re going to add one more line, and modify the class of our second paragraph tag.
$element[1]->class = "class_name";
echo $html->save();
The resulting HTML of the save command would be:
<html>
<body>
<p>Hello World!</p>
<p class="class_name">We're here and we're here to stay.</p>
</body>
</html>
Other Selectors
Here are some other examples of selectors. If you’ve used jQuery, these will seem very familiar.
# get the first occurrence of id="foo"
$single = $html->find('#foo', 0);
# get all elements with class="foo"
$collection = $html->find('.foo');
# get all the anchor tags on a page
$collection = $html->find('a');
# get all anchor tags that are inside H1 tags
$collection = $html->find('h1 a');
# get all img tags with a title of 'himom'
$collection = $html->find('img[title=himom]');
The first example isn’t entirely intuitive – all queries by default return collections, even an ID query, which should only return a single result. However, by specifying the second parameter, we are saying “only return the first item of this collection”.
This means $single is a single element, rather than an array of elements with one item.
The rest of the examples are self-explanatory.
Documentation
Complete documentation on the library can be found at the project documentation page.

Step 3. Real World Example
To put this library in action, we’re going to write a quick script to scrape the contents of the Nettuts website and produce a list of the articles on the site by title and description. This is only an example: scraping is a tricky area of the web, and shouldn’t be performed without permission.

include('simple_html_dom.php');
$articles = array();
getArticles('http://net.tutsplus.com/page/76/');
We start by including the library, and calling the getArticles function with the page we’d like to start parsing. In this case we’re starting near the end and being kind to Nettuts’ server.
We’re also declaring a global array to make it simple to gather all the article information in one place. Before we begin parsing, let’s take a look at how an article summary is described on Nettuts+.
<div class="preview">
<!-- Post Taxonomies -->
<div class="post_taxonomy"> ... </div>
<!-- Post Title -->
<h1 class="post_title"><a>Title</a></h1>
<!-- Post Meta -->
<div class="post_meta"> ... </div>
<div class="text"><p>Description</p></div>
</div>
This represents a basic post format on the site, including source code comments. Why are the comments important? They count as nodes to the parser.
Step 4. Starting the Parsing Function
function getArticles($page) {
global $articles;
$html = new simple_html_dom();
$html->load_file($page);
// ... more ...
}
We begin very simply by claiming our global, creating a new simple_html_dom object, then loading the page we want to parse. This function is going to be calling itself later, so we’re setting it up to accept the URL as a parameter.
Step 5. Finding the Information We Want

$items = $html->find('div[class=preview]');
foreach($items as $post) {
# remember comments count as nodes
$articles[] = array($post->children(3)->outertext,
$post->children(6)->first_child()->outertext);
}
This is the meat of the getArticles function. It takes a closer look to really understand what’s happening.
The find() line: Creates an array of elements – divs with the class of preview. We now have a collection of articles stored in $items.
The children(3) line: $post now refers to a single div of class preview. If we look at the original HTML, we can see that the child at index 3 is the H1 containing the article title. We take that and assign it to $articles[index][0].
Remember to start at 0 and to count comments when trying to determine the proper index of a child node.
The children(6) line: The child of $post at index 6 is <div class="text">. We want the description text from within, so we grab the first child’s outertext – this will include the paragraph tag. A single record in articles now looks like this (surrounding tags omitted for brevity):
$articles[0][0] = "My Article Name Here";
$articles[0][1] = "This is my article description";
Step 6. Pagination
The first thing we do is determine how to find our next page. On Nettuts+, the URLs are easy to figure out, but we’re going to pretend they aren’t, and get the next link via parsing.

If we look at the HTML, we see the following:
<a href="http://net.tutsplus.com/page/2/" class="nextpostslink">»</a>
If there is a next page (and there won’t always be), we’ll find an anchor with the class of ‘nextpostslink’. Now that information can be put to use.
if($next = $html->find('a[class=nextpostslink]', 0)) {
$URL = $next->href;
$html->clear();
unset($html);
getArticles($URL);
}
On the first line, we see if we can find an anchor with the class nextpostslink. Take special notice of the second parameter for find(). This specifies we only want the first element (index 0) of the found collection returned. $next will only be holding a single element, rather than a group of elements.
Next, we assign the link’s HREF to the variable $URL. This is important because we’re about to destroy the HTML object. Due to a circular reference memory leak in PHP 5, the current simple_html_dom object must be cleared and unset before another one is created. Failure to do so could eat up all your available memory.
Finally, we call getArticles with the URL of the next page. This recursion ends when there are no more pages to parse.
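The same “follow the next link until there isn’t one” pattern can also be written as a loop, which makes it easy to cap the number of pages visited. The sketch below is library-agnostic: crawlPages, $parsePage and $maxPages are hypothetical names, and $parsePage stands in for the load_file(), find() and clear() steps shown above.

```php
<?php
// Hypothetical sketch: follow "next page" links in a loop rather than by recursion.
// $parsePage is any callable that takes a URL and returns
// array($itemsFoundOnPage, $nextUrlOrNull).
function crawlPages($startUrl, callable $parsePage, $maxPages = 100)
{
    $articles = array();
    $url = $startUrl;
    $visited = 0;

    while ($url !== null && $visited < $maxPages) {
        list($items, $next) = $parsePage($url);
        $articles = array_merge($articles, $items);
        $url = $next;   // null when the page has no "next" link
        $visited++;
    }

    return $articles;
}

// Stand-in parser for demonstration only: three fake pages chained by "next" links.
$fakeSite = array(
    'page1' => array(array('A', 'B'), 'page2'),
    'page2' => array(array('C'), 'page3'),
    'page3' => array(array('D'), null),
);
$all = crawlPages('page1', function ($url) use ($fakeSite) {
    return $fakeSite[$url];
});
print_r($all);
```

With a real parser, the callable would create the simple_html_dom object, collect the items, read the next link’s href, and clear the object before returning.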
Step 7. Outputting the Results
First we’re going to set up a few basic stylings. This is completely arbitrary – you can make your output look however you wish.

#main {
margin:80px auto;
width:500px;
}
h1 {
font:bold 40px/38px helvetica, verdana, sans-serif;
margin:0;
}
h1 a {
color:#600;
text-decoration:none;
}
p {
background: #ECECEC;
font:10px/14px verdana, sans-serif;
margin:8px 0 15px;
border: 1px #CCC solid;
padding: 15px;
}
.item {
padding:10px;
}
Next we’re going to put a small bit of PHP in the page to output the previously stored information.
<?php
foreach($articles as $item) {
echo "<div class='item'>";
echo $item[0];
echo $item[1];
echo "</div>";
}
?>
The final result is a single HTML page listing all the articles, starting on the page indicated by the first getArticles() call.
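The loop above echoes the scraped markup as-is. If you store plain text instead (for example, each element’s plaintext rather than its outertext), a variant along these lines could escape the values before printing them. This is only a sketch: renderArticles is a hypothetical helper, and the sample data is made up.

```php
<?php
// Hypothetical variant of the output loop: builds the markup as a string and
// escapes each field, assuming $articles holds plain text rather than HTML.
function renderArticles(array $articles)
{
    $out = '';
    foreach ($articles as $item) {
        $out .= "<div class='item'>";
        $out .= '<h1>' . htmlspecialchars($item[0], ENT_QUOTES, 'UTF-8') . '</h1>';
        $out .= '<p>' . htmlspecialchars($item[1], ENT_QUOTES, 'UTF-8') . '</p>';
        $out .= '</div>';
    }
    return $out;
}

// Made-up sample record: title and description as plain text.
$sample = array(array('Tips & Tricks', 'Use <p> tags wisely'));
echo renderArticles($sample);
```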
Step 8. Conclusion
If you’re parsing a great many pages (say, the entire site), it may take longer than the max execution time allowed by your server. For example, running from my local machine, it takes about one second per page (including time to fetch).
On a site like Nettuts, with a current 78 pages of tutorials, this would run over one minute.
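One way to deal with this, where your host allows it, is to raise the limit for the script before crawling. The sketch below estimates the time needed from the figures above; estimateCrawlSeconds and the 50% headroom factor are assumptions for illustration.

```php
<?php
// Hypothetical helper: estimate how many seconds a crawl needs, with headroom.
function estimateCrawlSeconds($pages, $secondsPerPage = 1.0, $headroom = 1.5)
{
    return (int) ceil($pages * $secondsPerPage * $headroom);
}

$limit = estimateCrawlSeconds(78); // 78 pages at about 1 second each, plus 50% headroom
set_time_limit($limit);            // sets the execution time limit for this script run
echo "Time limit set to {$limit} seconds\n";
```

Some shared hosts disable set_time_limit(); in that case, crawling fewer pages per request is the safer option.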
This tutorial should get you started with HTML parsing. There are other methods to work with the DOM, including PHP’s built-in one, which lets you work with powerful XPath selectors to find elements. For ease of use and quick starts, I find this library to be one of the best. As a closing note, always remember to obtain permission before scraping a site; this is important. Thanks for reading!
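For comparison, here is a rough sketch of the Step 2 paragraph example using PHP’s built-in DOM extension and an XPath query instead of Simple HTML DOM. It assumes the DOM extension is enabled, which is the default in most PHP builds.

```php
<?php
// Sketch using PHP's built-in DOM extension (not Simple HTML DOM).
$markup = "<html><body><p>Hello World!</p><p>We're here</p></body></html>";

$doc = new DOMDocument();
$doc->loadHTML($markup);

$xpath = new DOMXPath($doc);
$paragraphs = $xpath->query('//p');   // DOMNodeList of all <p> elements

$second = $paragraphs->item(1);       // zero-based, like the find() collection
$second->appendChild($doc->createTextNode(" and we're here to stay."));
$second->setAttribute('class', 'class_name');

// Print just the modified paragraph.
echo $doc->saveHTML($second), "\n";
```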

Thank you, now I have a final component to build my empire of zillion of auto-generated sites to provide links for SEO purposes :-)
You’re welcome! Don’t forget to send the 10% royalties my way, of course, after you’ve put that zillion in the bank!
ei buddy 1% royalties my way won’t hurt :D
It’s funny that you say “regular expressions aren’t the way to go” while recommending a library that uses regular expressions.
Using regular expressions where appropriate and attempting to parse HTML solely with regular expressions are two entirely different things.
Read through the source and understand it – you’ll realize the difference.
Thanx. Very nice.
perfect tuts thanks a lot
Very nice tut. Reminds me a lot of the parsing methods and such for jQuery AJAX functions. jQuery seems to use the DOM and shortcuts for different arrangements of parameters and selectors to do a lot of those things outlined above. In conjunction, wonderful!! :)
could have done with this about a year ago at uni! damn!
Thanks @Erik.
Example of the use of proxy
$opts = array(
'http' => array(
'method' => "GET",
'request_fulluri' => true,
'user_agent' => 'youruseragent',
'proxy' => 'yourproxyadress',
)
);
$context = stream_context_create($opts);
$html = file_get_html($url, false, $context);
how do you apply this to coding?
my test coding looks like this
$html = new simple_html_dom();
$html->load_file('test.htm');
$first1 = $html->find('span[id=ctl00_ContentPlaceHolder1_phstats1_firstname]', 0);
$first2 = $html->find('span[id=ctl00_ContentPlaceHolder1_phstats2_firstname]', 0);
echo $first1->plaintext;
echo $first2->plaintext;
this is useful to create auto generated content, such us wp-robot :D
Thanks for your fully step by step tuts :)
Can anyone describe how we can remove the javascript stuff embedded in side html like
” ??
If you’re talking about javascript inside, for example, an inline onclick, you can simply access that particular element, and then change that property. For example:
$element->onclick = '';
Would empty the onclick handler (if it’s inline).
Isn’t this the way hacker sites are built? Don’t know if this is a good tutorial for all to see.
No, you’re being silly. This is a very useful tutorial for people to see. I’ve used this library to scrape information off sites to generate statistics for them.
Eric, You should create a CMS using Simple HTML DOM library and sell it at codecanyon…
A CMS made with a tool for scraping data off websites, yeah sure that’s a great idea
*rolls eyes*
Aside from the screen-scraping aspect (which isn’t recommended except as a very last resort) – this is a great way to some complex manipulation of your CMS output from within the templates, assuming you have access to the HTML buffer. You also could use it to manipulate user-generated content that contains HTML structures.
I can see it being used to create custom mobile versions of sites you visit regularly.
I used this once to pull information from the latest post on the WordPress Development Blog.
Great article by the way.
+1 for last resort. I was blocked for screen-scraping sites with this tool as a junior. The example isnt so hot given the availability of a site-sanctionned RSS feed which exists for the same purpose and is much easier to parse and cache with MagPieRSS or similar. Dont get me wrong, though, I keep this in my library because its useful in a lot of cases, but as a way to grab data from another site its weak.
Good luck accessing historical content with the RSS feed. :)
Nice tutprial.
Thanks for sharing.
using this to create a list of articles on Net Tuts, I already had some of the entries but this makes it easy to finally create my index list of all articles on Net Tuts for reference
Al
pyquery is way cooler, just saying.
This tool is one of most powerful PHP source that I’ve used for working with HTML. The greatest thing of this is that it allows us to select element in jQuery-liked style – that’s very quick and easy. I’ve used it in some of my projects and feel very good with it.
Whilst a very good tutorial in terms that it is useful to many why not use DOM (Document Object Model) for PHP5? Its much more difficult but getting dirty with the core is best
I wouldn’t agree that using the DOM extension is much more difficult, however Simple HTML DOM does provide a simpler approach which may be more familiar to a wider audience (e.g. CSS selectors) or just more convenient to use (I’m struggling for an example on this point, perhaps friendliness to “bad” HTML).
This article is about using Simple HTML DOM; not a generic HTML-parsing and screen-scraping tutorial. If some wants to write a similarly themed article targeted towards using PHP’s DOM extension then I am sure that the powers that be would entertain the idea.
There are always more ways to crack a nut and this is a very neat way! I wrote a tool using jQuery and the DOM which is very easy given jQuery’s parsing capabilities. Maybe i’ll get an article together soon.
Is there an asp.net alternative?
Really, this is the tool that will generate a bunch of new useless scraping sites ;-)
But it could come handy sometimes, when you need to take something quickly.
Nice article
Great tutorial on screen scrapping.
Hi
good article. But i am concerned about performance of script. How do you think HTML Dom parsing library is faster than using Regular expression ? Do we have better performance using this method ??
@arslan: HTML is not a regular language – at least not the way it’s encountered in the wild. And as such, you can’t properly parse it with purely regular expressions. The example I gave in the tutorial was simple and for illustrating a point.
As far as performance goes — try it. I think you’ll find using either Simple HTML DOM or PHP’s DOMDocument will get you better performance results than trying to parse large amounts of data out of HTML via purely regex.
Hello guys I have problem
Its not working on my server
PHP Warning: file_get_contents(http://net.tutsplus.com/page/78/) [function.file-get-contents]: failed to open stream: no suitable wrapper could be found
How i can fix it
It’s possible that this method of getting files is disabled on your server for security reasons. You can either contact your web host, change your php.ini yourself (if allowed), or use another method to fetch the files such as cURL (there’s a cURL tutorial on the site, if I recall correctly).
I am running XAMPP on a Windows XP machine and running into the same issue.
I have edited the ‘php.ini’ file to allow_url_fopen and allow_url_include, restarted the Apache server but still no dice.
Is there something else I am missing?
; Whether to allow the treatment of URLs (like http:// or ftp://) as files.
; http://php.net/allow-url-fopen
allow_url_fopen = On
; Whether to allow include/require to open URLs (like http:// or ftp://) as files.
; http://php.net/allow-url-include
allow_url_include = On
I’ve tested this small lib some days ago and it’s really easy to implement and use.
Great article, an eye opener. Thanks for sharing!
This tutorial shows in few steps how easy to use the simplehtmldom script, great!
I use this script for a long time and i love it more and more.
But .. in some cases, i’m dependent on preg_match(_all) to parse what i need :-/
cheers
Very interesting and relevant (for me) article. As an XML enthusiast this gets over the reason I don’t use simpleXML or phpDOM for content extraction (such a nicer term than scraping). Basically the (X)HTML on so many sites doesn’t validate, or even anywhere near validate (Yeah, I’m looking at you Google!), so when we parse it, the parser just fails.
Yahoo YQL service does a great version of this, and you can run XPath on the returned (now valid) (X)HTML, which is truly fantastic. Plus as it uses cached versions of the site where possible, and their own epic connectivity where not, the return times are good.
However YQLs bot follows robot.txt rules, so if you want data from a noindex page then this will definately help.
The best thing which I liked in this library is the resemblance of its selectors with jQuery :)
Very powerful one and thanks for sharing with us :)
Hi,
Great script ! But doesn’t work on 1and1 …
Does anybody tried to change the script with the CURL method ?
Thanks
I was reading through pages & pages of information related to extracting specific information from a table on manufacturer’s product pathis, finding all sorts of things that almost worked but not quite in the context I wanted – I didn’t exactly want to have to count the number of child objects on each page to pull the data I wanted when it had a specific class name that I could reference, as per the one working example I came across. I kept getting (array) returns when this little sideline on this page:
“Using the find() method always returns a collection (array) of tags unless you specify that you only want the nth child, as a second parameter.”
…made it all click into place. My code is now working 100%. Thank you very much for a great site.
If from the start:
http://net.tutsplus.com/page/50/
we have this error.
Fatal error: Maximum execution time of 30 seconds exceeded
How to fix?
very goood site. thank.
nice tutorial, quite rare tutorials which explain this stuff step by step…or maybe i need more times to googling :(
How to extract the content of the page after click (continue)?
Since information is not complete during extraction, how to solve this problem?
Now i am doing blog scraping.
Great tutorial. Tried it out on a few of my sites, but I couldn’t figure out what I was doing wrong until I read in the post that comments count as nodes! (that’s what I get for being too eager.)
I noticed that on a couple of my wordpress blogs, there were classes on the page number links, but none on the >> (next page) links. Is there a way to dynamically add a class to that link, before the dom parser does its job?
hi dear developer
Can u. give me some advices in parsing a simple site with PHP Simple HTML DOM Parser: See http://schulnetz.nibis.de/db/schulen/schule.php?schulnr=94468&lschb= this is very very simple. Can you give me a starting point to do parse the lables with the corresponding values… Love to hear from you!
First things first, thank you so much for posting this, it was amazingly helpful.
Unfortunately, I’m having trouble with the pagination step. It seems that a new html page isn’t loading, so each time that $getArticles($URL) is called it prints the content from the first page again. Thanks for any help you can give!
nevertheless, this is sometimes a good Tutorial, thank you
I have a question on print_r of those found elements
Say
$element = $html->find("p");
print_r($element);
it will print a recursion array instead of just those find p element
thanks, Great tutorial :)
can we get https page content ?
Thanks you for this great tutorials , it really gave me chills and helped a lot.
Great Work.
Thanks you for sharing this tutorial …
It work very good …
I’ve been using Simple HTML DOM for quite a while and it definitely beats using regular expressions (in most scenarios). Nice to see a full tutorial on it.
Hi great stuff. Any good solution to avoid the execution timeout for larger paged or multiple page scraping?
how get element with proxy?????
hi..
i want to read all div inner HTML like inner html with …how can i do this??
any help will be appreciated.
goog article, nice script. thanks for sharing
I used this class to design a web crawler for my office’s website to identify broken links, it’s pretty cool. I executed it via PHP-CLI with a couple of shell script to reduce memory overload. My company site consists 12,000 pages of content – pretty large. The challenge was how to re-construct the child URL ( Parent path + Child URL ) and parsing it to function to continue next level crawling, wonder if the class can do that for me.
Great!
Thank you