Syndicate Content from Sites without a Newsfeed (Using jQuery + PHP)

How to Syndicate Content Without Utilizing a News Feed

Jun 15th in PHP by Marc Loney

Many websites offer syndication formats such as RSS, JSON, or XML based services to allow for easy content delivery. But what happens when a website doesn’t offer one of these services? How do you syndicate content from a website that doesn’t offer a news feed? This is what I set out to solve.


Author: Marc Loney

Marc Loney is a web developer working from Perth, Western Australia. He enjoys developing online applications using new web technologies and enjoys pondering on how far these technologies can be pushed.

I recently received a project from a client, along with a brief outlining the website and the objectives they wished to accomplish. The notes explained that they were a real estate company that regularly posted properties to a well-known real estate website, and that they wanted to syndicate those listings onto their own website without having to update both sites. The catch: this well-known real estate site did not offer a syndication service or an API for developers to access its listings.

Finished Project

Using jQuery’s load()

Visual Process Map 1

After scouring the internet I found that most solutions to this problem were inelegant, and most of the time they were browser-specific or ineffective. I decided to code my own solution using the popular JavaScript library jQuery.

To access information from another website I needed to use the AJAX functions of the jQuery library.

  <script src="http://code.jquery.com/jquery-latest.js"></script>
  <script type="text/javascript">
    // Once the DOM is ready, load the external page into the #content element
    $(document).ready(function() {
      $("#content").load("http://net.tutsplus.com/");
    });
  </script>
  

If you are familiar with jQuery the above shouldn’t be too difficult to understand. We are using the AJAX load() function to load a web page’s content into the element with id #content. The solution seemed too easy, and alas the problem, as you will soon realize, is that the code will only work in Internet Explorer 6 or 7. The reason soon became apparent: all other browsers block AJAX requests to a different domain for security reasons (the same-origin policy). This meant we could only load pages from our own domain, not URLs on external sites.
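To make the restriction a little more concrete, here is a minimal sketch of the kind of comparison involved: browsers generally treat two URLs as belonging to the same site only when the scheme, host and port all match. The helper name isSameOrigin and the example.com address are my own illustrative assumptions, and the sketch relies on the standard URL constructor being available.

```javascript
// Sketch: would a request from pageUrl to targetUrl stay on the same origin?
// isSameOrigin is a hypothetical helper, not part of jQuery.
function isSameOrigin(pageUrl, targetUrl) {
  var page = new URL(pageUrl);
  // Relative targets are resolved against the page, so they always match.
  var target = new URL(targetUrl, pageUrl);
  // origin combines scheme, host and port.
  return page.origin === target.origin;
}

// A relative URL on our own site: allowed.
console.log(isSameOrigin("http://example.com/index.html", "curl.php")); // true
// A page on another domain: blocked by most browsers.
console.log(isSameOrigin("http://example.com/index.html", "http://net.tutsplus.com/")); // false
```

This is why a relative URL such as a local PHP file can be loaded while an absolute URL on another domain cannot.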

A Server-Side Solution

Visual Process Map 2

I looked around online for a solution to this problem and, to my dismay, most people were under the impression either that it was not possible to get around this browser restriction or that it was too complicated a task to be worth doing. This is when I discovered the cURL library.

cURL is quite useful in that it allows you to communicate with other servers using URLs and standard protocols such as HTTP, HTTPS and FTP. Using cURL I was able to get around the browser restriction by loading the whole external page server-side and serving it from a local URL.

  
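  <?php
  // Fetch the external page on the server, where the browser restriction does not apply
  $ch = curl_init("http://net.tutsplus.com");
  // Return the response as a string instead of printing it directly
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  $html = curl_exec($ch);
  curl_close($ch);
  print $html;
  ?>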

This code initialises a cURL handle for an external URL. The benefit is that the URL is loaded on the server rather than on the client, and server-side PHP is not subject to the cross-domain restrictions that browsers apply. After executing the request we simply print the whole contents of the URL. If we now save this document as ‘curl.php’ on our web server, we have a local file that will load in the entire contents of our external URL.

Let’s go back to our original code and put in our modifications:

  <script src="http://code.jquery.com/jquery-latest.js"></script>
  <script type="text/javascript">
    // Load the output of our local cURL script instead of the external URL
    $(document).ready(function() {
      $("#content").load("curl.php");
    });
  </script>
  

Our script now works in all browsers and doesn’t rely on any unorthodox browser security hacks.

Why use jQuery?


Now you might wonder what the advantages are of working with this document in jQuery compared to just manipulating it using PHP. The main reason for my choice is the ability to use jQuery’s CSS-style selectors to choose what content on the page we actually want to syndicate, like the following:

  <script src="http://code.jquery.com/jquery-latest.js"></script>
  <script type="text/javascript">
    // Load only the #content element from the fetched page
    $(document).ready(function() {
      $("#content").load("curl.php #content");
    });
  </script>

Rather than loading in the whole document we now just load in the contents of an element with id #content. We will get to the benefits of this later on in the article.
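jQuery's documentation describes load() as treating everything after the first space in the string as a selector. As a rough illustration only (parseLoadTarget is a hypothetical helper, not jQuery's actual code), the split can be pictured like this:

```javascript
// Simplified sketch of splitting a "url selector" string at the first space.
// parseLoadTarget is a hypothetical helper written for illustration.
function parseLoadTarget(target) {
  var firstSpace = target.indexOf(" ");
  if (firstSpace === -1) {
    // No space: the whole string is the URL and the full response is used.
    return { url: target, selector: null };
  }
  return {
    url: target.slice(0, firstSpace),
    selector: target.slice(firstSpace).trim()
  };
}

console.log(parseLoadTarget("curl.php #content"));
// { url: 'curl.php', selector: '#content' }
```

The request still fetches the whole page; the selector only decides which part of the response ends up in our element.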

Images and Anchors

After playing around with this for a bit you may notice the next big problem. Although we have managed to syndicate an external site’s content, all relative links and images no longer work. This is another reason for working in jQuery. Using the jQuery each() function we can create a loop that goes through all <a> and <img> elements, grabbing the current href or src attribute and prepending the external domain to it.

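  <script type="text/javascript">
  // Domain of the syndicated site, prepended to relative URLs
  var domain = "http://net.tutsplus.com";
  $(document).ready(function(){
    $("a").each(function (i) {
      var href = $(this).attr('href');
      // Only prepend the domain to relative URLs; absolute http(s) URLs are left alone
      if (href && !/^https?:\/\//i.test(href)) {
        $(this).attr('href', domain + href);
      }
    });
    $("img").each(function (i) {
      var src = $(this).attr('src');
      if (src && !/^https?:\/\//i.test(src)) {
        $(this).attr('src', domain + src);
      }
    });
  });
  </script>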

We first select all <a> elements and cycle through them, extracting the href attribute and prepending our chosen domain to it. If we wanted, we could also add an attribute to open all links in new windows, and so on. Secondly, we select all <img> elements and again cycle through them, this time extracting the src attribute and prepending the domain in the same way.
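Simply prepending the domain assumes every link is relative to the site root. As a hedged alternative, in environments that provide the standard URL constructor we could let it resolve each link against the external site, which leaves absolute URLs untouched. The helper name toAbsoluteUrl and the sample paths are assumptions for illustration:

```javascript
// Sketch: resolve an href or src against the site it was fetched from.
// toAbsoluteUrl and the sample paths below are illustrative assumptions.
function toAbsoluteUrl(link, base) {
  try {
    return new URL(link, base).href;
  } catch (e) {
    // Leave values the URL parser rejects untouched.
    return link;
  }
}

var base = "http://net.tutsplus.com/";
console.log(toAbsoluteUrl("/tutorials/php/", base));          // http://net.tutsplus.com/tutorials/php/
console.log(toAbsoluteUrl("images/logo.png", base));          // http://net.tutsplus.com/images/logo.png
console.log(toAbsoluteUrl("http://example.com/a.png", base)); // http://example.com/a.png
```

Inside the each() loops, the result of such a helper could be passed to attr() in place of the concatenated string.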

The problem we run into at this point is where to integrate our new code into our existing code. The problem I originally came across was that, no matter where I put it, the external markup had not finished loading by the time the code that rewrites the domain ran, so the changes had no effect. The solution involves combining the two into quite an elegant jQuery solution.

  $(document).ready(function() {
    $("#content").load("curl.php #content", {}, function() {
      // Runs only once the external markup has been inserted
      $("a").each(function (i) {
        var href = $(this).attr('href');
        // Only prepend the domain to relative URLs
        if (href && !/^https?:\/\//i.test(href)) {
          $(this).attr('href', domain + href);
        }
      });
      $("img").each(function (i) {
        var src = $(this).attr('src');
        if (src && !/^https?:\/\//i.test(src)) {
          $(this).attr('src', domain + src);
        }
      });
    });
  });

The load function can take two more parameters. The first is data you want to submit to the URL being loaded; for example, you could be trying to retrieve the results of a POST form. The second is a callback function: what to do once the load() function has finished. In our case this is perfect. We place our code in the callback function, which prevents it from running until the external page has been completely loaded.
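To see why the callback matters, here is a small sketch of the timing. fakeLoad is a hypothetical stand-in for load() that uses a timer to imitate the delay of fetching the page; it is not part of jQuery.

```javascript
// fakeLoad is a hypothetical stand-in for an asynchronous load():
// it "fetches" content after a short delay, then runs the callback.
var log = [];

function fakeLoad(url, data, callback) {
  log.push("request sent for " + url);
  setTimeout(function () {
    log.push("content inserted");
    callback();
  }, 10);
}

fakeLoad("curl.php #content", {}, function () {
  // By the time this runs, the content is already on the page.
  log.push("callback: safe to rewrite links");
});

// This line runs immediately, before any content has arrived.
log.push("code placed after the call");
```

Code placed after the call runs before the content arrives; only the callback is guaranteed to run once the content is in place.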


As you can see, we are now able to pull content from another website into any element on our page. This is practical not just for syndicating content like news feeds but for any dynamically updated content.

Styling Our Content

Now that we have pulled in our content, the next step shows the advantage of using this code over, say, an <iframe>. While an <iframe> solves many of the messy issues with links and so on that we went through above, it cannot be seamlessly integrated into a website with a completely different style. The content will essentially always be just a window into another website. As seen earlier, when I first introduced the idea of using CSS-style selectors, we can select any id, class or other selector by just placing it in the load() function:

  $(document).ready(function() {
    $("#content").load("curl.php #content", {}, function() {
  ...

In this case we are only selecting a <div> from the homepage of Net Tuts+, which happens to be the main content <div>. We are now syndicating just an extract of the page, not pulling in any of the styles (as they are contained in the <head>) nor any effects (if they exist). We are only pulling in markup.

We are now going to add some styles to our page using CSS.

body,a {
  font-family: 'Tahoma';
  color: #fff;
  background-color: #000;
  font-size: 12px;
}
#content {
  width: 600px;
}
#content small, #content span, #content .more-link {
  display:none;
}
#content img {
  float: left;
  margin-right: 5px;
}
#content h1 {
  font-size: 14px;
}

This CSS is more about demonstrating a few important features than being aesthetically appealing. There are a few important things to note at this point. First, we have to remember to assign styles only to the tags we actually want to style, i.e. don't style all <small> tags, only the ones in the #content <div>. The second thing to note is what I've done to the <small> and <span> tags and the .more-link class. Rather than displaying all the content we have syndicated, it may be useful to hide some of it. We could even use that content in a dropdown effect or something similar, but here we are instead hiding the tags completely using the display property. We use display rather than visibility for a reason: visibility: hidden still reserves the space where the content was, whereas display: none removes it from the layout completely.


Modify Images using jQuery

Another thing we can do to make our news syndicator take up less space on screen is modify the images. This could be done using CSS, but instead I want to demonstrate using jQuery to modify the source of the image.

We are going to modify our jQuery to use the attr() function to change the source of each image to one on our own server: a nice little link button.

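...
    $("#content img").each(function (i) {
      var src = $(this).attr('src');
      var new_src = domain + src;
      // Keep the original image URL on the element in an href attribute
      $(this).attr('href', new_src);
    });
    // Swap every syndicated image for our own link button
    $("#content img").attr('src', 'link.png');
  });
});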

Now let's modify our CSS slightly to make our image float nicely to the left.

#content img { float:left; margin-right: 5px;}

Now, using only content syndicated from the Net Tuts+ homepage, we have managed to build a news syndicator with completely different styling to the original site.


Preloader

What you may notice when you use this code is that it takes a while for jQuery to load and process the external site. A nice feature to add is a loading bar in the #content <div> while we wait for the content to load.

The easiest way to make our loading bar is to place a loading image inside our #content <div> in our markup. The loading image will appear when the page first loads, and once jQuery has finished loading our external content it will replace the current content, the loading bar, with our new content. A site I use quite often when generating loading bars is http://www.ajaxload.info/. It has a very decent generator for creating a variety of loading images.

...
<h1>My Content Syndication Service</h1>
<div id="content"><img src="ajax-loader.gif" alt="Loading..." /></div>
...

We now have a nice little application which will show a preloading image until our content is ready to show.


While the preloader is a nice feature, it isn't a replacement for optimised code. In this tutorial we are using jQuery to choose which elements to keep, when in fact the fastest solution would be to do that in our PHP code. That, though, is outside the scope of this tutorial.

Conclusion

There we have it: a simple solution using jQuery’s AJAX functions and PHP’s cURL library that allows us to syndicate content from an external website. As I have already stated, although jQuery's easy syntax and CSS selectors give us the convenience of styling and selecting what we want on the client side, this is not optimised for speed. The best thing would be to remove the tags we don't want in PHP, using regular expressions. I would also note that one of the most common mistakes is being too specific when styling. Remember that you have no control over whether the content creator changes the tags and classes they use, so it is always best to style general elements that will be commonly used.

Another thing worth taking into account is that this tutorial is meant to generate a content syndicator - it is not intended for use as a site content 'scraper'. If you are going to implement this in a commercial project, make sure you have the permission of the copyright holder to use the content on your page.


User Comments

  1. BB, June 15th

    useful stuff. thanks for sharing.

  2. Pete, June 15th

    Now that’s cool.

  3. Dario Gutierrez, June 15th

    Very interesting, jQuery rocks! The best.

  4. Zé Miguel, June 15th

    today most of websites already have a convenient feed, but this could be pretty useful for some projects!

  5. Joe, June 15th

    Hi dude, rocks! Good problem solving! While I’m thinking if the external site changes its page code, then we have to change our selector.

  6. Joe, June 15th

    I downloaded the source code, but when I tested it in the WAMP servver, I could not get the content, only black background.

    1. Sumit, June 18th

      you need enable the curl extension in u’r php.ini file..normally it’s disabled by default in WAMP SERVER…

      1. Dustin, September 26th

        Confirmed also disabled in XAMPP

  7. Yoosuf, June 15th

    its a grat one

  8. Muhammad Adnan, June 15th

    nice tut. Jquery Rocks !

  9. Myfacefriends, June 15th

    Thanks for sharing this very useful.

  10. Paul, June 15th

    EXACTLY what I’ve been looking for! Thank you SO much!

    The only issue that I’ve found is that the href isn’t re-writing correctly in IE. It works fine in FF. Any ideas?

  11. Rob, June 15th

    interesting ideas, thanx!

  12. Ed Baxter, June 15th

    Great tutorial! :D

  13. Nathan, June 15th

    Never thought about doing this. It will be useful for a future project

  14. Melissa, June 15th

    Great idea, but it appears to me have a huge potential for abuse. How do you stop people from stealing your content and displaying it on their website? All they need is an url. Do you just rely on other people following the copyright laws?

    1. Stefan, June 15th

      I’m affraid that’s all you can rely on.

      If humans can see the content, scripts can get it.

  15. Jake, June 15th

    Wow, great stuff. Thanks for sharing your work and your discoveries! Only hope I can make as great of a contribution sometime.

  16. Harnish, June 15th

    You need to actually modify your code to say curl_init instead of curl_int. You also need a variable $url where the URL goes in and then invoke $ch = curl_init($url);

  17. bussurfer, June 15th

    Hi Mark, thanks for the great tutorial. Though I’m new to jQuery everything is pretty clear. I was wondering if there’s a way to replace the relative SRC and HREF attributes of the and elements with the absolute ones. Thanks

  18. bussurfer, June 15th

    Hi Mark, thanks for the great tutorial. Though I’m new to jQuery everything is pretty clear. I was wondering if there’s a way to replace the relative SRC and HREF attributes of the SCRIPT and LINK elements with the absolute ones. Thanks

  19. Stephen, June 15th

    If you’re trying to get content off of another site, you could also use a service like Feed43 ( http://www.feed43.com/ ) to scrape to a feed. We’ve done this to get content from third party sites that we use that don’t have the option of XML output.

  20. Dave, June 15th

    Very nice TUT

  21. michael, June 15th

    This tutorial retrieves the data via JQuery pretty nicely. But what would you do if JavaScript was disabled? I would suggest you just put links to the original articles in that situation, but then why not just implement the whole thing server side? In Australia a lot of internet connections are that slow that loading up external articles via JQuery and stripping all additional content just doesn’t seem viable.

    Also I want to reiterate Joe’s point – if the external site changed it’s markup your syndicator would no longer work. But then again theoretically their site should know you’re syndicating their content and should let you know in advance. And it would seem that in that situation they should just provide you with an RSS feed.

  22. Adrian, June 15th

    You could also just use Yahoo! Pipes.

    1. Michael Freeman, June 18th

      That is what I already do and it works great in most cases. I have found that most pages from Google Video are not valid HTML so the Yahoo Pipes parser breaks.

      In those cases I use java with the htmlcleaner.jar to put the junky response into a proper DOM

  23. Mujtaba, June 15th

    Just what i needed for my upcoming site…
    Thnx 4 sharing

    1. Tim, June 16th

      Me too. Thanks!

  24. Arpit Tambi, June 16th

    More performance improvements -

    Perhaps on the PHP side, the page can be very well cached for 12 hours or 24 hours. This way servers don’t load the url on every request and saves bandwidth plus time.

  25. Thomas, June 16th

    Personally I would prefer to do the whole thing server side and then cache it. Seems a lot better to me, and no JS enabled/whatever problems.

  26. David, June 16th

    What about using something like the php explode or implode commands?

  27. shin, June 16th

    Actually you don’t need the following code.
    $("a").each(function (i) {
    var href = $(this).attr('href');
    var new_href = domain + href;
    $(this).attr('href',new_href);
    });
    $("img").each(function (i) {
    var src = $(this).attr('src');
    var new_src = domain + src;
    ….

    This will create the following link. http://net.tutsplus.com/ will be repeated twice.

    http://net.tutsplus.com/http://net.tutsplus.com/tutorials/javascript-ajax/24-javascript-best-practices-for-beginners/

  28. Nat, June 16th

    I’d have to agree with Melissa that this post raises warning flags.

    If the site doesn’t have an RSS feed, it could be that they do not want to and have not licensed that content to be consumed and displayed anywhere BUT on their domain.

    Always check the copyright at the footer of the site you are trying to scrape. I could be wrong but if the site doesn’t have a Creative Common license or similar stated, and it doesn’t offer an RSS or Atom feed, and you don’t have their permission you could be illegally stealing their content?

    Also, they could do a simple blocking of your server where your php cUlr script resides and you no longer have the service.

    Thoughts?

    1. Marc Loney, June 17th

      Hi everyone,

      Thanks for your comments, this is my first tutorial for this site so i’m keen for as much feedback as possible!

      In terms of the copyright issues brought up I briefly mention them in my last paragraph. While something like this, server side or client side can be implimented as a’scrapper’ essentially stealing content from other sites that is not the only method. There are plenty of examples of technology out there that have the potential for abuse and at the end of the day all you can make sure is that you are doing the right thing. Always make sure you have the permission of the copyright owner to publish their content.

      On the other issue of implimenting this server-side, I mention a bit about the performance increase by doing this but in the end I ended up wanting to write it on the client-side purely to show a different application of jQuery for those starting to get into it.

      Cheers!

  29. Goran Juric, June 17th

    You wrote that “The main reason for my choice in using JQuery is the ability to use its CSS-styled selectors to choose what content on our page we actually want to syndicate, ….”.

    Have a look at Zend_Dom_Query. You can use css selectors to grab content on the server site. Add Zend_Http_Client for fetching the page and Zend_Cache for caching the content and you will have a solution that does not depend on the other site being available all the time in no time.

    1. Montana Flynn, June 17th

      That would certainly be the most elegant solution.

  30. Max Stanworth, June 17th

    I think im going to use this in a prjoect im doing, love the pre loader

  31. Matt Fairbrass, June 17th

    Interesting article and shows a nice proof of concept to solve a problem should an external site not provide you with any feeds to syndicate. But as many others have pointed out using this method is not without its flaws, most notably:

    - Loading time
    - Disabling Javascript in the web browser
    - External site changing their mark up.

    But aside from that the tutorial itself was very well written and explained the steps to implement such a method in a clear and concise way. Nice tutorial.

  32. Paul du Long, June 18th

    TYPO!!!!!!!!!!

    $ch = curl_int("http://net.tutsplus.com");

    must be

    $ch = curl_init("http://net.tutsplus.com");

  33. Luke Robinson, June 18th

    This tut just made my day. Many thanks!

  34. Diego SA, June 18th

    Wow, normally I don’t use feed, but this tutorial is awesome! Nice!

  35. John Pitchers, June 22nd

    Hmmm. I can hear the scraping already. :)

  36. Elery, July 3rd

    Hi guys this tutorial is exactly what i’m looking for but it doesn’t seem to be working…how do i get it to work with XAMPP?

    1. Dustin, September 26th

      ;extension=php_curl.dll

      needs to be enabled in php.ini

      extension=php_curl.dll

  37. lysenshi, July 4th

    I just read your article, it is really wonderful.
    I am new to jQuery, and currently exploring its features and what it can do. I can see it can do so much.
    Thanks for the article ;)

  38. moshe, July 7th

    thanks for this!

    my curl script works great (thanks guys!) but seems to print the digit '1' after it brings in the URL content.

    anyone know why?

    thanks!

  39. Gecko, July 11th

    my site get suspended after scrapping 4shared.com
    Reason: high cpu usage or leech

    How to minimize cpu usage when use curl?

  40. Ty Fairclough, July 14th

    Right I have stumbled a problem with the appending of the domain to make relative links work. Simply if the site includes static and relative links those static links end up looking a bit like this:

    http://yourdomain.comhttp://yourdomain.com/externalorrelativelink.html

    and of course this breaks the links.

    Im useless at js, i couldn’t even get the script to select a class instead of an id to suit my needs. But would some kind of if statement that looks for the ‘domain’ in the url first be a good idea? Then you could continue the script as normal I would presume.

    Assistance on this would be of great help folks! Great tutorial here, its going to be really useful!

  41. Andy, July 27th

    What a great list!

  42. sathish, August 28th

    Great solution Thanks

    I have only basic knowledge in php.
    I used in my website i gave my domain name in the curl function

    $ch = curl_init("http://mydomain.com/");
    $html = curl_exec($ch);

    But I dint get anything except Blank page.
    Any solution plz

  43. เพชร, September 22nd

    This is what I’m looking for, thanks.

  44. streetparade, September 23rd

    Javascript sucks. Why not use PHP ?
    There is dom,xpath, and the one i love is simple_html_dom.
    This is more effectiver than jquery.
    http://simplehtmldom.sourceforge.net/