Today we will build a PHP application to automatically query multiple Amazon sites for a product's sales rank. The data will be cached locally in an XML file to avoid unnecessary remote calls. A screenshot of the working program is below. Here is a link to a live demo.
Full Source Listing. A zip file of all the files is available to download at the bottom of the page.
The default product search is for a book that I cowrote, Pro PHP Programming** by Apress, but you can change the value to any valid Amazon Standard Identification Number (ASIN) and it will retrieve that product's information. (**this article is not representative of the book, but they both discuss phpQuery and DomDocument)
As background, every product on Amazon has a unique ASIN value and can be viewed at http://www.amazon.com/dp/ASIN-VALUE-HERE, where "dp" stands for "detail page". Some Amazon domains are:
- Canada: http://www.amazon.ca
- China: http://www.amazon.cn
- France: http://www.amazon.fr
- Germany: http://www.amazon.de
- Italy: http://www.amazon.it
- Japan: http://www.amazon.co.jp
- United Kingdom: http://www.amazon.co.uk
- US (main): http://www.amazon.com
Because of product availability, differences in wording and character encoding issues, in this example we will limit ourselves to the English-speaking domains: 'com', 'ca' and 'co.uk'.
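As a minimal sketch of how these URLs are assembled, the snippet below builds the detail-page address for each of the three domains. The helper name buildProductUrl is hypothetical and is not part of the tutorial's source; the ASIN is the default one used later in this article.

```php
<?php
// Hypothetical helper (an assumption for illustration): builds a
// detail-page URL from a domain suffix and an ASIN.
function buildProductUrl(string $domain, string $asin): string
{
    return 'http://www.amazon.' . $domain . '/dp/' . $asin;
}

// The three English-speaking domains used in this tutorial.
foreach (array('com', 'ca', 'co.uk') as $domain) {
    echo buildProductUrl($domain, '1430235608'), PHP_EOL;
}
```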
Web service API functionality for pulling product ranks from Amazon probably exists, but the purpose of this tutorial is to demonstrate page scraping and XML manipulation. To scrape the data we will use the phpQuery library which is designed to be similar to jQuery. Both phpQuery and jQuery allow easy DOM parsing, but phpQuery works as you would expect, on the server side.
To store our data we will use XML, with DomDocument to manipulate the document. We could instead store our data in JSON format or in a database such as SQLite, or keep every item in one large XML file. In this tutorial, however, we will create a new XML file for every product. XML is generally easier to read with the PHP library SimpleXML, but we will be updating an existing document, and DomDocument is more capable of manipulating existing elements than SimpleXML.
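To make the comparison concrete, here is a small sketch that reads a value with SimpleXML and updates a value with DOMDocument. The XML string is a cut-down stand-in for the product file and is only an assumption for illustration.

```php
<?php
// A tiny stand-in for a product file (illustrative data only).
$xml = '<info><lastcheck>100</lastcheck><title>Example</title></info>';

// Reading with SimpleXML: child elements are exposed as properties.
$info  = simplexml_load_string($xml);
$title = (string) $info->title;

// Updating with DOMDocument: find the node, then change its text.
$dom = new DOMDocument();
$dom->loadXML($xml);
$node = $dom->getElementsByTagName('lastcheck')->item(0);
$node->nodeValue = '200';

// Serialize only the root element (no XML declaration).
$updated = $dom->saveXML($dom->documentElement);
echo $title, PHP_EOL, $updated, PHP_EOL;
```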
This tutorial will split out our code into appropriate classes and reusable functions. Following object-oriented techniques makes the code more generic, easier to modify and debug. Let's get started!
Outline
Here is the general outline of what we want to accomplish:
- Get Data:
- Retrieve HTML from the product page of each Amazon site we are interested in.
- Scrape the full rank description from the HTML.
- Use regex to get the exact rank number. This number will be used in our best/worst comparison.
- Display the data to the screen.
- Local storage, caching, updating:
- Store each product as a separate XML file.
- Create this file if none exists.
- Fetch and write data to the local file if a certain length of time has passed OR
- Simply read the cached file data.
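The "fetch or read from cache" decision in the outline above comes down to one time comparison. A minimal sketch, assuming timestamps in seconds; the function name isCacheStale is hypothetical:

```php
<?php
// Hypothetical helper: returns true when more than $interval seconds
// have elapsed since the last remote check.
function isCacheStale(int $lastCheck, int $now, int $interval): bool
{
    return ($now - $lastCheck) > $interval;
}

$interval = 60 * 60; // one hour, matching the interval used in this tutorial

// A brand new file has an empty lastcheck value, which casts to 0,
// so the first run always fetches remote data.
var_dump(isCacheStale(0, time(), $interval));
```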
Components
Our program needs the following components: a scraper/parser, XML input/output, a view to output our data, a configuration file, a main class to encompass all of these sub components and execute the script.
Splitting out the components here very loosely follows the Model-View-Controller (MVC) design pattern. The XML handling class is our model, the View class is our view, and our main class acts as the controller, tying the model to the view. The scraper/parser acts as a helper class.
The advantage of MVC is it separates our business and presentation logic. Later on, if we want to instead use JSON for our model, we would need to only change the Model class used. Similarly, if we wanted to have a console view instead of HTML, we would just need to swap the View class.
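As a rough sketch of what swapping the view could look like, the snippet below defines a console-style view behind a shared interface. The names RankView, ConsoleView and render are assumptions for illustration; the tutorial's own View class does not implement an interface.

```php
<?php
// Hypothetical interface that both an HTML view and a console view could share.
interface RankView
{
    public function appendToBody(string $input): void;
    public function render(): string;
}

// Console flavour: strips markup and ends each appended chunk with a newline.
class ConsoleView implements RankView
{
    private $body = '';

    public function appendToBody(string $input): void
    {
        $this->body .= strip_tags($input) . PHP_EOL;
    }

    public function render(): string
    {
        return $this->body;
    }
}

$view = new ConsoleView();
$view->appendToBody('<strong>Best Rank:</strong> 4');
echo $view->render();
```

The controller would only call the interface methods, so it would not need to know which view it was given.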
Configuration
First let's go over our configuration file, config.php.
<?php
define('RECHECK_INTERVAL', 60 * 60); //one hour
define('SALES_RANK_DOM_ID', "#SalesRank");
define('IMAGE_DOM_ID', "#prodImage");
define('TITLE_DOM_ID', "#btAsinTitle");
define('AMAZON_URL_PREFIX', 'http://www.amazon.');
define('AMAZON_URL_SUFFIX', '/dp/');
date_default_timezone_set('America/Regina');
?>
The configuration file defines the Amazon URL prefix and suffix, the DOM ids that we are searching for on each Amazon page, how often we will pull in remote data instead of using the cached XML file and our default timezone.
Scraper/Parser
Next let's go over our scraper helper class. This class stores no data and will have two static methods, one for retrieving and parsing select DOM elements, and another for parsing an exact number from our sales rank DOM element, using regex.
Our first function below fetches a page from a specific Amazon domain and looks for the sales rank and, optionally, the title and product image.
public static function fetchUpdatedData($domain, $id, $get_title = false) {
$title = $image = "";
//http://www.amazon.ca/dp/1234567890
$url = AMAZON_URL_PREFIX . $domain . AMAZON_URL_SUFFIX . $id;
$contents = file_get_contents($url);
if ($contents === false) {
//return early if URL is not found
return false;
}
phpQuery::newDocument($contents);
if ($get_title) {
$title = pq(TITLE_DOM_ID);
$image = pq(IMAGE_DOM_ID);
}
return array('sales_rank' => pq(SALES_RANK_DOM_ID),
'title' => $title,
'image' => $image);
}
If the URL is not found - which often happens as one Amazon store might not carry a specific product - then instead of doing more processing, we return false.
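file_get_contents also emits a warning when a request fails. One possible way to keep the early return while silencing that warning and bounding the wait is sketched below. The function name fetchPage, the ten second timeout and the user agent string are all assumptions, not part of the original scraper.

```php
<?php
// Hypothetical wrapper: returns the page body as a string, or false on failure.
function fetchPage(string $url, int $timeoutSeconds = 10)
{
    // Stream context options for HTTP requests; the values are illustrative.
    $context = stream_context_create(array(
        'http' => array(
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'RankFinder/1.0 (example)',
        ),
    ));

    // @ suppresses the E_WARNING raised on failure; we rely on the
    // false return value instead.
    return @file_get_contents($url, false, $context);
}

$contents = fetchPage('/no/such/path/for/this/example.html');
if ($contents === false) {
    echo 'Nothing fetched, skipping this domain', PHP_EOL;
}
```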
Using phpQuery is very simple. We just need to initialize the document we want to use:
phpQuery::newDocument($contents);
and look for specific elements
return array('sales_rank' => pq(SALES_RANK_DOM_ID),
Here, pq is the equivalent of jQuery's
$(''); //or
jQuery('');
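For readers without phpQuery installed, the same kind of id lookup can be approximated with PHP's built-in DOMDocument and DOMXPath. This is a substitute technique for comparison, not what the tutorial's scraper uses, and the HTML fragment is an invented stand-in for a product page.

```php
<?php
// Invented fragment standing in for part of a product page.
$html = '<html><body><ul><li id="SalesRank"><b>Amazon Bestsellers Rank:</b>'
      . ' #123,973 in Books</li></ul></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Roughly what pq("#SalesRank") selects: the element with that id.
$xpath = new DOMXPath($dom);
$node  = $xpath->query('//*[@id="SalesRank"]')->item(0);
$text  = $node !== null ? trim($node->textContent) : '';

echo $text, PHP_EOL;
```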
Our second function takes a long sales rank string such as
"Amazon Bestsellers Rank: #123,973 in Books (See Top 100 in Books) " and finds "123973"
public static function parseRankFromDescription( $description ) {
preg_match_all( '{Amazon Bestsellers Rank:</b>\s*#?(([0-9]{0,3},)?([0-9]{0,3},)?[0-9]{1,3}) in Books}mi',
$description, $matches );
if (isset( $matches[ 1 ] ) && isset( $matches[ 1 ][ 0 ] )) {
return intval( str_replace( ',', '', $matches[ 1 ][ 0 ] ) );
}
return -1;
}
The regex will capture any of
#1 1 #300 4,300 #4,300 1,700,000 #1,700,000
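To try the pattern in isolation, here is a standalone sketch with a few sample inputs. It uses the same expression with a different delimiter and non-capturing inner groups; the function name parseRank and the sample strings are assumptions for illustration.

```php
<?php
// Standalone variant of the rank parser: returns the rank as an
// integer, or -1 when the description does not match.
function parseRank(string $description): int
{
    $pattern = '~Amazon Bestsellers Rank:</b>\s*#?'
             . '((?:[0-9]{0,3},)?(?:[0-9]{0,3},)?[0-9]{1,3}) in Books~i';

    if (preg_match($pattern, $description, $matches)) {
        // Drop the thousands separators before converting.
        return intval(str_replace(',', '', $matches[1]));
    }
    return -1;
}

$samples = array(
    '<b>Amazon Bestsellers Rank:</b> #123,973 in Books (See Top 100 in Books)',
    '<b>Amazon Bestsellers Rank:</b> #1,700,000 in Books',
    'No rank on this page',
);
foreach ($samples as $sample) {
    echo parseRank($sample), PHP_EOL;
}
```

Note that the pattern expects the closing </b> tag after the label, so it has to be given the element's markup rather than its plain text.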
XML Storage/Retrieval
xml_io.php is the toughest class to understand, and using a database or JSON would most likely simplify things.
Our default XML will be stored in a variable using PHP Heredoc syntax:
$default_xml_string = <<<EOT
<?xml version="1.0" encoding="UTF-8" ?>
<info>
<lastcheck></lastcheck>
<title></title>
<image></image>
<sites/>
</info>
EOT;
define( 'DEFAULT_XML_STRING', $default_xml_string );
In the XML file for a product, we store the last checked time, the product title, the product image and sites. Sites will later be outlined with rank data such as:
<sites>
<domain cc="ca">
<current_rank>17</current_rank>
<best_rank>4</best_rank>
<worst_rank>33</worst_rank>
</domain>
<domain cc="co.uk">
<current_rank>48</current_rank>
<best_rank>52</best_rank>
<worst_rank>189</worst_rank>
</domain>
</sites>
Here cc stands for country code.
The first part of the Xml_IO class declares variables for the DOM object, product id, XML filename, DomNode references and DomNode textContent values.
class Xml_IO {
private $dom = null;
private $id = 0;
private $filename = null;
//XPath helper, assigned in loadXML
private $xpath = null;
//nodes
private $lastchecktime_node = null;
private $title_node = null;
private $image_node = null;
private $sites_node = null;
//node textcontent
private $lastchecktime = 0;
private $title = null;
private $image = null;
public function __construct() {
$this->dom = new DOMDocument();
}
If you look at the html snippet
<div id="007">James Bond</div>, div is a DomNode, "James Bond" is the textContent of div, and id is an attribute of div with the value "007".
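The same three pieces can be inspected from PHP. A short sketch using only built-in DOM classes:

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<div id="007">James Bond</div>');

// The root element is a DOMElement, which is a kind of DOMNode.
$div = $dom->documentElement;

$name = $div->nodeName;           // the node's tag name
$text = $div->textContent;        // the text inside the node
$id   = $div->getAttribute('id'); // the value of the id attribute

echo $name, ' / ', $text, ' / ', $id, PHP_EOL;
```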
There are some getter/setter functions in Xml_IO which are simple and need no discussion. Next come two functions: one for creating an XML file if one is not found, and one for loading XML data from our file.
public function createFileIfItDoesNotExist() {
if (!file_exists( $this->filename )) {
file_put_contents( $this->filename, DEFAULT_XML_STRING );
}
}
public function loadXML( $id ) {
$this->id = $id;
$this->filename = $this->id.'.xml';
$this->createFileIfItDoesNotExist();
$this->dom->load( $this->filename );
$this->xpath = new DomXpath( $this->dom );
$this->lastchecktime_node = $this->xpath->query( "//lastcheck" )->item( 0 );
$this->title_node = $this->xpath->query( "//title" )->item( 0 );
$this->image_node = $this->xpath->query( "//image" )->item( 0 );
$this->sites_node = $this->xpath->query( "//sites" )->item( 0 );
$this->lastchecktime = (int) $this->lastchecktime_node->textContent;
$this->title = $this->title_node->textContent;
$this->image = $this->image_node->textContent;
}
Here we use load to import our XML file into the DomDocument object, DomXpath to create an Xpath object, query to find specific nodes, item(0) to find the first node in the resultant NodeList and textContent to find the inner text value of a node.
Next we have updateXML and its helpers. Depending on whether the domain already exists in the document, we either update the existing domain node or append a new one to sites.
public function updateXML( $domain, $sales_rank ) {
if ($this->siteExistsInXML( $domain )) {
//echo "update<br/>";
return $this->updateSiteInXML( $domain, $sales_rank );
} else {
//echo "append<br/>";
return $this->appendSiteInXML( $domain, $sales_rank );
}
}
public function siteExistsInXML( $domain ) {
$sites = $this->xpath->query( "//domain" );
foreach ($sites as $site) {
if ($site->attributes->getNamedItem( "cc" )->textContent == $domain) {
return $site;
}
}
return false;
}
private function updateSiteInXML( $domain, $sales_rank ) {
$ranks = null;
foreach ($this->sites_node->childNodes as $site) {
if ($site->attributes->getNamedItem( "cc" )->textContent == $domain) {
try {
$ranks = $this->updateDomainRank( $site, $sales_rank );
} catch (Exception $e) {
var_dump( $e );
}
break;
}
}
return $ranks;
}
private function appendSiteInXML( $domain, $sales_rank ) {
$site = $this->dom->createElement( "domain" );
$cc = $this->dom->createAttribute( "cc" );
$cc->value = $domain;
$site->appendChild( $cc );
$this->sites_node->appendChild( $site );
$rank = $this->createAndAppend( $site, "current_rank", $sales_rank );
$exact_rank = Scraper::parseRankFromDescription( $sales_rank );
$this->createAndAppend( $site, "best_rank", $exact_rank );
$this->createAndAppend( $site, "worst_rank", $exact_rank );
return $this->queryRankNodesFromXML( $site );
}
Without getting into the custom function calls in the above two methods, we can observe new usage of the DomDocument. Here we use
->childNodes to get child nodes, ->attributes->getNamedItem( "cc" ) to get a specific attribute of a node by name,
createElement to create elements, createAttribute to create attributes, and appendChild to attach a created node to
a parent node. createAndAppend is a custom helper function which we will go over towards the end of this section.
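As an aside, the loop that compares each cc attribute could alternatively be expressed as a single XPath query with an attribute predicate. This is a sketch of that alternative, not the tutorial's code; the function name findDomainNode and the sample XML are assumptions.

```php
<?php
// Hypothetical alternative lookup: returns the matching <domain> node,
// or false when no node has the given cc attribute.
function findDomainNode(DOMXPath $xpath, string $cc)
{
    // The domain codes used here contain no quotes; anything coming
    // from user input would need escaping before being embedded.
    $nodes = $xpath->query('//domain[@cc="' . $cc . '"]');
    return $nodes->length > 0 ? $nodes->item(0) : false;
}

$dom = new DOMDocument();
$dom->loadXML(
    '<info><sites>'
    . '<domain cc="ca"><current_rank>17</current_rank></domain>'
    . '<domain cc="co.uk"><current_rank>48</current_rank></domain>'
    . '</sites></info>'
);
$xpath = new DOMXPath($dom);

$found = findDomainNode($xpath, 'co.uk');
echo $found ? $found->getAttribute('cc') : 'not found', PHP_EOL;
```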
To adjust the stored ranks we have two methods below:
public function queryRankNodesFromXML( $site_node ) {
if ($site_node) {
$current = $this->xpath->query( "current_rank", $site_node )->item( 0 );
$exact = Scraper::parseRankFromDescription( $current->textContent );
$best = $this->xpath->query( "best_rank", $site_node )->item( 0 );
$worst = $this->xpath->query( "worst_rank", $site_node )->item( 0 );
return array('current' => $current,
'exact' => $exact,
'best' => $best,
'worst' => $worst);
}
return array();
}
private function updateDomainRank( $site, $sales_rank ) {
$this->updateNode( $this->xpath->query( "current_rank", $site )->item( 0 ), $sales_rank );
$ranks = $this->queryRankNodesFromXML( $site );
if (!empty( $ranks )) {
if ($ranks[ 'exact' ] != -1) {
if ($ranks[ 'exact' ] < intval( $ranks[ 'best' ]->textContent )) {
$this->updateNode( $ranks[ 'best' ], $ranks[ 'exact' ] );
}
if ($ranks[ 'exact' ] > intval( $ranks[ 'worst' ]->textContent )) {
$this->updateNode( $ranks[ 'worst' ], $ranks[ 'exact' ] );
}
}
}
return $ranks;
}
The first function queries the existing stored ranks. The second function compares the updated current rank with the
all time best and worst ranks, updating them if necessary.
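Stripped of the DOM handling, the comparison is small. Here it is as a pure function, assuming plain integers and the same -1 "could not parse" marker; the name updateBestWorst is hypothetical.

```php
<?php
// Hypothetical pure version of the best/worst comparison.
// A lower number is a better rank; -1 means the rank could not be parsed.
function updateBestWorst(int $current, int $best, int $worst): array
{
    if ($current === -1) {
        // Leave the stored values alone when there is nothing to compare.
        return array('best' => $best, 'worst' => $worst);
    }
    return array(
        'best'  => min($best, $current),
        'worst' => max($worst, $current),
    );
}

print_r(updateBestWorst(3, 4, 33));  // a new best rank
print_r(updateBestWorst(40, 4, 33)); // a new worst rank
```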
To write the modified file back to disk we have the method:
public function updateAndWriteFile() {
try {
$this->updateNode( $this->lastchecktime_node, time() );
$this->updateNode( $this->title_node, (htmlentities( $this->title ) ) );
$this->updateNode( $this->image_node, (htmlentities( $this->image ) ) );
file_put_contents( $this->filename, utf8_encode( $this->dom->saveXML() ) );
} catch (Exception $e) {
var_dump( $e );
}
}
Here we output the XML with saveXML, then encode it as UTF-8 and save it to disk. The updateNode function is a custom helper
function, which we will display next.
We have a couple of custom helper functions which reduce some common code into single line function calls:
private function createAndAppend( $parent, $child_node_name, $value ) {
$child = $this->dom->createElement( $child_node_name, $value );
$parent->appendChild( $child );
return $child;
}
private function updateNode( $node, $value ) {
if ($node) {
if ($node->firstChild) {
$node->replaceChild( $this->dom->createTextNode( $value ), $node->firstChild );
} else {
$node->appendChild( $this->dom->createTextNode( $value ) );
}
}
}
Of note, replaceChild takes an existing node and replaces it with another. The previous node needs to exist though. If
it does not, then we simply append our new node.
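The two branches can be seen side by side in the standalone sketch below, which sets the text of one empty node and one node that already has content. The function name setNodeText and the sample values are assumptions for illustration.

```php
<?php
// Standalone version of the replace-or-append logic.
function setNodeText(DOMDocument $dom, DOMNode $node, string $value): void
{
    $text = $dom->createTextNode($value);
    if ($node->firstChild) {
        // The node already has content: swap the first child for the new text.
        $node->replaceChild($text, $node->firstChild);
    } else {
        // The node is empty: there is nothing to replace, so append.
        $node->appendChild($text);
    }
}

$dom = new DOMDocument();
$dom->loadXML('<info><lastcheck></lastcheck><title>Old</title></info>');

$lastcheck = $dom->getElementsByTagName('lastcheck')->item(0);
$title     = $dom->getElementsByTagName('title')->item(0);

setNodeText($dom, $lastcheck, '1700000000'); // takes the append branch
setNodeText($dom, $title, 'New title');      // takes the replace branch

$result = $dom->saveXML($dom->documentElement);
echo $result, PHP_EOL;
```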
That wraps up the XML component of our program. Not so bad when you take it a little bit at a time.
View
Our view class, view.php, is much more straightforward. In fact I will display it below in its entirety.
<?php
class View {
private $body;
public function __construct() {
$this->body = "";
}
public function appendToBody( $input ) {
$this->body .= $input;
}
public function productInformationAsHTML( $title, $image ) {
$html = '<div style="float: left; width: 240px; margin-right: 20px;"><h2>'.$title."</h2>";
$html .= $image;
$html .= '</div>';
return $html;
}
public function domainRankAsHTML( $asin, $domain, $ranks, $entities = true ) {
$html = "<div class='domainContainer'>";
if (!empty( $ranks )) {
$html .= "<span class='domainName'>";
$html .= "<a href='".AMAZON_URL_PREFIX.$domain.AMAZON_URL_SUFFIX.$asin."' rel='external'>";
$html .= $domain."</a>";
$html .= "</span><br/>";
$html .= "<div class='domainRank'>";
if ($entities) {
$html .= htmlentities( $ranks[ 'current' ]->textContent, ENT_QUOTES, 'UTF-8' );
} else {
$html .= $ranks[ 'current' ]->textContent;
}
$html .= "<strong>Best Rank:</strong> ".$ranks[ 'best' ]->textContent."<br/>";
$html .= "<strong>Worst Rank:</strong> ".$ranks[ 'worst' ]->textContent."<br/><br/>";
$html .= "</div>";
}
$html .= "</div>";
return $html;
}
public function display( $id ) {
print $this->body;
}
}
?>
The class takes in parameter data and wraps it in structural HTML tags such as <div>, adding labels and styling such as bold text.
The main program
The first thing our main script, amazon_rank_finder_interactive.php, does is set our error reporting level and set the $asin variable,
based on a default value and $_POST values. In production, you will want to turn display_errors off and log errors instead.
<?php
error_reporting(E_ALL ^ E_STRICT);
ini_set('display_errors', 'on');
define('DEFAULT_ASIN', "1430235608");
$asin = DEFAULT_ASIN;
if (isset($_POST) && isset($_POST['asin'])) {
$asin = preg_replace("/[^a-zA-Z0-9]/", "", $_POST['asin']);
}
?>
We sanitize the input string so that it contains only alphanumeric characters (correct me please if an ASIN can contain other characters).
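A quick sketch of what that filter does to a few inputs. The function name sanitizeAsin and the sample strings are assumptions for illustration.

```php
<?php
// Hypothetical wrapper around the same character filter:
// everything except ASCII letters and digits is removed.
function sanitizeAsin(string $input): string
{
    return preg_replace('/[^a-zA-Z0-9]/', '', $input);
}

var_dump(sanitizeAsin(" 1430-235608\n"));           // whitespace and hyphen removed
var_dump(sanitizeAsin('<script>alert(1)</script>')); // markup characters removed
```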
Next we define our HTML5 markup, include a small stylesheet and add an input form that lets us search for a specific product ASIN.
<!doctype HTML>
<html>
<head>
<title>Amazon Rank Finder</title>
<link rel="stylesheet" href="style.css"/>
</head>
<body>
<form action="<?php echo htmlspecialchars($_SERVER['PHP_SELF'], ENT_QUOTES, 'UTF-8'); ?>" method="POST">
<input value="<?php echo str_pad($asin, 10, '0', STR_PAD_LEFT); ?>"
type="text" name="asin" />
<input type="Submit"/>
</form>
<div>
The one line above that needs a little explanation is:
<input value="<?php echo str_pad($asin, 10, '0', STR_PAD_LEFT); ?>"
This ensures that our search term is at least ten characters long. For example, if the ASIN is 0012345678 but you enter 12345678, the Amazon /dp/ URL still expects the leading zeros. str_pad adds 0s to the left of the string if necessary to make it ten characters long; strings that are already ten characters or longer are left unchanged.
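A short sketch of the padding behaviour, using the example ASIN from the paragraph above:

```php
<?php
// A short value is left-padded with zeros up to ten characters.
$short = str_pad('12345678', 10, '0', STR_PAD_LEFT);

// A value that is already ten characters long is returned unchanged.
$exact = str_pad('1430235608', 10, '0', STR_PAD_LEFT);

// str_pad never truncates, so longer input also comes back unchanged.
$long = str_pad('123456789012', 10, '0', STR_PAD_LEFT);

echo $short, PHP_EOL, $exact, PHP_EOL, $long, PHP_EOL;
```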
Next we include our component files, and define our run() function. This function will load (or create) our XML file and check if it is time to fetch remote content. If it is time, then we fetch and parse remotely, update and save the XML. If not, we merely load and display our saved XML data.
<?php
require_once('config.php');
require_once('xml_io.php');
require_once('scraper.php');
require_once('view.php');
class AmazonRankFinder {
private $id = 0;
private $domains = null;
//objects
private $model = null;
private $view = null;
public function __construct() {
$this->model = new XML_IO;
$this->view = new View;
}
public function run($id, Array $domains = array('ca', 'com', 'co.uk')) {
$this->domains = $domains;
$this->id = $id;
$this->model->loadXML($this->id);
if ($this->timeToCheck()) {
$this->checkRemoteContent();
$this->model->updateAndWriteFile();
} else {
$this->checkCachedContent();
}
$this->view->display($this->id);
}
private function timeToCheck() {
return (time() - $this->model->getLastCheckTime() > RECHECK_INTERVAL);
}
Notice that the run function takes an $id (ASIN) and Array of domains as arguments. The class controls the flow of the program and
coordinates the model and view.
Let's see what happens if we are getting remote content:
private function checkRemoteContent() {
$need_title_and_image = true;
$this->view->appendToBody("Checking id: $this->id now...<br/>");
foreach ($this->domains as $domain) {
$results = Scraper::fetchUpdatedData($domain, $this->id, $need_title_and_image);
if ($results) {
if ($need_title_and_image) {
$this->model->setTitle($results['title']);
$this->model->setImage($results['image']);
$this->view->appendToBody(html_entity_decode($this->view->productInformationAsHTML(
$results['title'], $results['image'])));
$need_title_and_image = false;
$this->view->appendToBody("<div style='float: left;'>");
}
//adds best/worst view
$ranks = $this->model->updateXML($domain, $results['sales_rank']);
if (!empty($ranks)) {
$this->view->appendToBody(html_entity_decode(
$this->view->domainRankAsHTML($this->id, $domain, $ranks, false)));
}
}
}
$this->view->appendToBody('</div>');
}
We scrape data. Next we process it, appending to our view and updating our loaded XML object.
Alternatively, let's see what happens when we serve up cached data:
private function checkCachedContent() {
$this->view->appendToBody(html_entity_decode($this->view->productInformationAsHTML(
$this->model->getTitle(), $this->model->getImage())));
$this->view->appendToBody("<div style='float: left;'>Last checked id: $this->id at " . date('m-d-Y h:i:sa', $this->model->getLastCheckTime()) . "<br/>");
foreach ($this->domains as $domain) {
$site_node = $this->model->siteExistsInXML($domain);
$ranks = $this->model->queryRankNodesFromXML($site_node);
if (!empty($ranks)) {
$this->view->appendToBody(html_entity_decode(
$this->view->domainRankAsHTML($this->id, $domain, $ranks)));
}
}
$this->view->appendToBody('</div>');
}
} //end of class AmazonRankFinder
In this case our work is simplified. We load our XML data and append the view with it. We never modify the XML object once it is loaded.
Finally, we check that our form has been submitted and, if it has, run our program.
if (isset($_POST) && isset($_POST['asin'])) {
$amazon_rank_finder = new AmazonRankFinder();
$amazon_rank_finder->run($asin, array('ca', 'com', 'co.uk'));
}
?>
</div>
</body>
</html>
We also add closing HTML tags.
And that is about that!
Ideas
Data Collection
We currently keep track of the all time best and worst rankings that we have collected. You could expand upon this idea and keep track of the average of our collected ranks, or plot trends, etc.
Automating execution
You will want to automate running the program to collect data using a cron job. Since Amazon updates rankings hourly, it makes sense to have the cron job also run hourly. The crontab schedule for an hourly run would be "0 * * * *" (minute, hour, day of month, month, day of week). You would execute your script by calling the page directly. For example, if your server has curl installed, you can use the command "curl http://path/to/your/script/amazon_rank_finder.php".
When you call the script directly, you need to either always run the program with a hard coded ASIN value, or change the form method to GET and pass a query string with an appropriate ASIN value. Otherwise, the page will load but the program will not fetch any results.
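One possible way to support both the form and a direct call is to look for the ASIN in POST first, then in the query string, and fall back to a default. This is only a sketch; the function name resolveAsin is hypothetical and the original script reads $_POST only.

```php
<?php
// Hypothetical helper: picks the ASIN from POST, then GET, then a default,
// and applies the same alphanumeric filter used elsewhere in this tutorial.
function resolveAsin(array $post, array $get, string $default): string
{
    $raw = $post['asin'] ?? $get['asin'] ?? $default;
    return preg_replace('/[^a-zA-Z0-9]/', '', (string) $raw);
}

// In the real page the first two arguments would be $_POST and $_GET.
echo resolveAsin(array(), array('asin' => '1430235608'), '0000000000'), PHP_EOL;
```

With something like this in place, a scheduled request could pass the ASIN as a query string parameter instead of submitting the form.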
Feedback
I hope you learn something from this tutorial and welcome questions, errata or constructive feedback.

