Use curl and PHP Simple HTML DOM Parser to inject WordPress into another page

By David Nash. Tags: css, html, php, wordpress

I was recently asked to create a WordPress theme that would run on its own server but be integrated into a larger e-commerce site that was running in a separate CMS.

The headers and footers for the site change frequently and the WordPress theme needed to insert itself between the two.

The basic outline to tackle this problem is:

  1. Create a template in the CMS with an empty content area, and set up a URL (eg /blog) so that, instead of serving CMS content, it loads content from the WordPress server.
  2. Create a WordPress theme that uses PHP’s built-in curl library to read the above template as a string.
  3. Manipulate that string so that we insert the WordPress content into it, and then output it.
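
To make step 3 concrete, here is a minimal sketch of the idea using plain string functions rather than a DOM parser. The marker comment `<!-- blog-content -->` and the function name `inject_content` are assumptions made up for illustration; the rest of this post targets elements with simplehtmldom selectors instead of a marker.

```php
<?php
// Hypothetical helper: swap a marker in the fetched template for our content.
// The marker text is an assumption; the CMS template would need to contain it.
function inject_content(string $template, string $content, string $marker = '<!-- blog-content -->'): string
{
    // If the marker is missing, return the template untouched rather than guessing.
    if (strpos($template, $marker) === false) {
        return $template;
    }
    return str_replace($marker, $content, $template);
}

// Stand-in for the string curl would return from the CMS template URL.
$template = '<body><div id="header">Shop</div><!-- blog-content --><div id="footer">Footer</div></body>';

echo inject_content($template, '<div class="post">Hello</div>');
```

The header and footer come through unchanged, which is the whole point: when the main site changes them, the blog picks up the change on the next request.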

Step 1.

Use the existing CMS to create a template page. It shouldn’t appear in any menu; it just needs its own URL that the WordPress server can access.

Then use the CMS to create a new URL (eg /blog) that loads external content, and point it at the URL of your WordPress server.

Step 2.

Create a basic theme. It’s easier to create the theme minus the header and footer first so that you don’t have to wait for an external site to load while you’re developing.

Once we have that we can insert it into the CMS template.

Download simplehtmldom. This is an amazing library that allows you to use syntax similar to jQuery to target elements. In my theme directory I created a subdirectory called simplehtmldom/, and in that included just the simple_html_dom.php and the app/ directory from the download. You could just extract the entire zip there, but I’m a minimalist.

Have a quick look at the manual for simplehtmldom.

The theme’s index.php will load the simplehtmldom library and we can now use its functions.

<?php
include_once('simplehtmldom/simple_html_dom.php');
?>

(Skip to the end for a complete theme you can base yours on.)

Now we need to get the header and footer from the main website. Let’s use example.com (have a look, it’s a real site!) and replace their text with our WordPress content. We’ll pretend that they have a header and footer.

We want to tell curl that instead of outputting the data immediately we want to put it into a string. To do this we use CURLOPT_RETURNTRANSFER.

<?php
include_once('simplehtmldom/simple_html_dom.php');

//set up curl
$ch = curl_init('http://example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

$curl_html = curl_exec($ch); //use curl to get data from example.com

//use simplehtmldom to parse the site into a dom-like object
$html = str_get_html($curl_html);

echo $html;
?>

At this point you should see the content from example.com but it’ll appear as if it’s on your own server.
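
The snippet above assumes the request always succeeds. As a hedged sketch, the curl calls could be wrapped in a small function that sets timeouts and reports failure. The name `fetch_template`, the timeout values and the "200 only" rule are all assumptions for illustration, not part of the original theme.

```php
<?php
// Hypothetical wrapper around the curl calls above; returns null on failure.
function fetch_template(string $url, int $timeout_seconds = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);         // seconds to wait for a connection
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout_seconds); // seconds allowed for the whole request

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // curl_exec() returns false on a transport error; also treat non-200 responses as failure.
    if ($body === false || $status !== 200) {
        return null;
    }
    return $body;
}
```

In index.php you would then call `$curl_html = fetch_template('http://example.com');` and skip the parsing step (or show the bare theme) when it returns null, rather than handing an empty value to str_get_html().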

Step 3.

The fun bit!

What we want to do is start replacing HTML elements from that page with our own stuff. Let’s start with the head’s title element:

<?php
...

$wp_title = wp_title('', false); //title of page, no separator, return string instead of display
if( ! empty($wp_title) )
	$wp_title = " | $wp_title";

$page_title = get_bloginfo('name').$wp_title; //$wp_title already carries its own separator

$html->find('head title', 0)->innertext = $page_title;

echo $html;

?>

The second-to-last line uses simplehtmldom’s find() method. ‘head title’ says “look for all the TITLE elements in all the HEAD elements” (there should only be one, but if you’re talking about DIV elements there could be many).

find() returns an array of matches unless you pass it the index of the element you want, in which case it returns that single element (or null if nothing matches). We want the first element, so the second parameter is 0.
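
Because find() with an index gives back null when nothing matches, chaining ->innertext straight onto it will fail on a template that lacks the element. A minimal sketch of a guard follows; the helper name `set_first_innertext` is hypothetical, and `$html` is assumed to be the object returned by str_get_html().

```php
<?php
// Hypothetical helper: set the text of the first element matching $selector.
// $html is expected to be the object returned by simplehtmldom's str_get_html().
function set_first_innertext($html, string $selector, string $text): bool
{
    $element = $html->find($selector, 0); // a single element, or null if nothing matched
    if ($element === null) {
        return false; // e.g. the fetched template has no <title>
    }
    $element->innertext = $text;
    return true;
}
```

Usage would look like `set_first_innertext($html, 'head title', $page_title);`, with the return value telling you whether the template actually contained a title to replace.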

Then the cool bit – we want to set the innertext of that element to our page title.

Now when we echo $html we see example.com content on our server, with our WordPress blog’s page title.
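
The intended title rule (blog name, followed by a separator and the page title only when there is one) is easy to get subtly wrong, so here it is as a small standalone function. This is only a sketch: `build_page_title` is a hypothetical name, and in the theme its arguments would come from get_bloginfo('name') and wp_title().

```php
<?php
// Hypothetical pure function for the title rule: "Blog name" or "Blog name | Page title".
function build_page_title(string $site_name, string $wp_title): string
{
    $wp_title = trim($wp_title);
    if ($wp_title === '') {
        return $site_name; // nothing to append, so no dangling separator
    }
    return $site_name . ' | ' . $wp_title;
}

echo build_page_title('My Blog', ''), "\n";            // My Blog
echo build_page_title('My Blog', 'Hello world'), "\n"; // My Blog | Hello world
```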

Now for something more interesting – we’ll insert a reference to our WordPress theme style.css so that when it comes to outputting the posts we can style them.

<?php
...

$css = '<link href="'.get_bloginfo('stylesheet_url').'" rel="stylesheet" media="screen" type="text/css" />';

$head = $html->find('head', 0); //the head element
$head->outertext = '<head>'.$head->innertext."\n".$css.'</head>'; //the closing tag must be a quoted string

echo $html;
?>

With any luck when we reload we’ll see the example.com source code with our theme’s style.css appended to the HEAD element.

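Note: the source code is probably going to look pretty ugly. If you want to clean it up, replace the call to str_get_html(…) with str_get_html($curl_html, true, true, DEFAULT_TARGET_CHARSET, false); The final false turns off simplehtmldom’s stripping of line breaks, so the original formatting is kept.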

By inserting our CSS at the end, we can easily override whatever comes from the server, if needed.
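
As an alternative sketch, the stylesheet link can be appended through innertext instead of rebuilding outertext, which should leave the original head tag (and any attributes on it) alone. The helper name `append_stylesheet` is hypothetical, and htmlspecialchars() is used here only as a plain-PHP stand-in for WordPress’s own escaping functions.

```php
<?php
// Hypothetical helper: append a stylesheet <link> as the last thing inside <head>.
// $html is expected to be the object returned by simplehtmldom's str_get_html().
function append_stylesheet($html, string $stylesheet_url): bool
{
    $head = $html->find('head', 0);
    if ($head === null) {
        return false; // no <head> in the fetched template
    }
    $link = '<link href="' . htmlspecialchars($stylesheet_url, ENT_QUOTES)
          . '" rel="stylesheet" media="screen" type="text/css" />';
    // Appending last means our rules come after the main site's stylesheets.
    $head->innertext = $head->innertext . "\n" . $link;
    return true;
}
```

In the theme this would be called as `append_stylesheet($html, get_bloginfo('stylesheet_url'));`.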

The next step is to get the content of the posts from WordPress and insert it into the body. Instead of using WordPress output functions, we want to put the data into a string.

<?php
...
$content = ''; // the string we will build
//The Loop
if( have_posts() ) : while( have_posts() ) : the_post();

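$content .= '<div class="post">'."\n";
$content .= 	'<div class="date">'.get_the_date('l, jS F Y').'</div>'."\n";
$content .= 	'<h1><a href="'.get_permalink().'">'.get_the_title().'</a></h1>'."\n";
$content .= 	apply_filters('the_content', get_the_content()); //run the usual the_content filters (paragraphs, shortcodes)
$content .= '</div>'."\n"; //close div.post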

endwhile; endif; //end The Loop

$html->find('body div', 0)->innertext = $content;

echo $html;
?>

Here we’re using the regular old WordPress loop. Each post is in a div with class “post”. We show the date and the title with a link to the permalink, and then the content of the post.
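
If the string building inside the loop gets unwieldy, one option is to move it into a function that takes plain strings, which keeps the opening and closing tags of each post together. This is a sketch only: `render_post` is a hypothetical name, and it assumes the title and body arrive as HTML that WordPress has already prepared.

```php
<?php
// Hypothetical pure function: build the markup for one post from plain strings.
// $title and $body_html are assumed to be HTML already prepared by WordPress.
function render_post(string $date, string $permalink, string $title, string $body_html): string
{
    $out  = '<div class="post">' . "\n";
    $out .= '<div class="date">' . htmlspecialchars($date) . '</div>' . "\n";
    $out .= '<h1><a href="' . htmlspecialchars($permalink, ENT_QUOTES) . '">' . $title . '</a></h1>' . "\n";
    $out .= $body_html . "\n";
    $out .= '</div>' . "\n"; // close the wrapper opened on the first line
    return $out;
}
```

Inside the loop this would be used as `$content .= render_post(get_the_date('l, jS F Y'), get_permalink(), get_the_title(), apply_filters('the_content', get_the_content()));`.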

Because WordPress keeps track of where we are, we don’t really need to worry about anything else. Front page, posts, pages, categories etc should all work as expected.

Here’s the full source of our index.php:

<?php

include_once('simplehtmldom/simple_html_dom.php');

//set up curl
$ch = curl_init('http://example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

$curl_html = curl_exec($ch); //use curl to get data from example.com

//use simplehtmldom to parse the site into a dom-like object
$html = str_get_html($curl_html, true, true, DEFAULT_TARGET_CHARSET, false);

//modify the title
$wp_title = wp_title('', 0); //title of page, no separator, return string instead of display
if( ! empty($wp_title) )
	$wp_title = " | $wp_title";

$page_title = get_bloginfo('name').$wp_title;

$html->find('head title', 0)->innertext = $page_title;

//append our style.css link to head
$css = '<link href="'.get_bloginfo('stylesheet_url').'" rel="stylesheet" media="screen" type="text/css" />';

$head = $html->find('head', 0); //the head element
$head->outertext = '<head>'.$head->innertext."\n".$css.'</head>';

$content = ''; // the string we will build

//The Loop
if( have_posts() ) : while( have_posts() ) : the_post();

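$content .= '<div class="post">'."\n";
$content .= '<div class="date">'.get_the_date('l, jS F Y').'</div>'."\n";
$content .= '<h1><a href="'.get_permalink().'">'.get_the_title().'</a></h1>'."\n";
$content .= apply_filters('the_content', get_the_content()); //run the usual the_content filters (paragraphs, shortcodes)
$content .= '</div>'."\n"; //close div.post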

endwhile; endif; //end The Loop

$html->find('body div', 0)->innertext = $content;

echo $html;

?>

Download the basic theme above to get you started: curl-htmlsimpledom-theme.
