Tag: cURL

Downloading PDF files

If you’ve done the Thirty Day Challenge, you’ll know how tedious it can be to download all the PDF files. Norio created a cool downloading script that saved me a lot of time. I was just too lazy to sit and write something like it, so I’m stealing his!

<?php
// make sure our download doesn't time out or get interrupted by closing the browser
set_time_limit(0);
ignore_user_abort(1);
// destination to download to
$file_dir = "sites/default/files/30dc";
// create the destination directory if it doesn't exist
if (!is_dir($file_dir)) mkdir($file_dir, 0777, true); // true = also create missing parent directories
// go through each day of training (1-31)
for ($i = 1; $i <= 31; $i++) {
  // download the HTML contents of the training page for that day
  if ($page = file_get_contents("http://www.thirtydaychallenge.com/training/2009day".sprintf("%02d", $i).".php")) {
    // provide some feedback on where we are
    echo "<b>Day $i:</b><br />";
    // flush output to browser - see php.net/flush
    flush();
    // directory to download the current day's PDFs to
    $daydir = $file_dir."/day$i";
    // create the directory if it doesn't exist
    if (!is_dir($daydir)) mkdir($daydir);
    // grab all the URLs to the PDFs (regular expressions are awesome!)
    preg_match_all('~(http://media\.thirtydaychallenge\.com\.s3\.amazonaws\.com/training09/([0-9A-Za-z_]+\.pdf))~', $page, $matches);
    // go through each url we grabbed above
    foreach ($matches[1] as $key => $filename) {
      // check if the file already exists (no use in re-downloading PDFs we have)
      if (!file_exists("{$daydir}/{$matches[2][$key]}")) {
        // provide some feedback on where we are
        echo "Downloading {$matches[2][$key]}.<br />";
        // flush output to browser
        flush();
        // download the pdf and store it locally
        file_put_contents("{$daydir}/{$matches[2][$key]}", file_get_contents($matches[1][$key]));
      }
    }
  }
}
?>

Check out the full post at Boff.co.za

HTTP Headers

In various circumstances I’ve needed PHP to pull the HTTP headers that a web server returns. I’ve found cURL to be an excellent solution for this.

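  // collects the raw header text; shared between the two functions below
  $htmlheader = "";

  // request $url and return the raw response headers as a single string
  function html_header($url) {
    global $htmlheader;
    $htmlheader = "";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // have cURL pass each header line to readHeader()
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'readHeader');
    // return the body from curl_exec() instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);

    return $htmlheader;
  }

  // callback: append the header line and tell cURL how many bytes were handled
  function readHeader($ch, $header) {
    global $htmlheader;
    $htmlheader .= $header;
    return strlen($header);
  }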

The variable will contain something like this:

HTTP/1.1 200 OK
Date: Thu, 23 Jul 2009 20:56:23 GMT
Server: Apache
X-Powered-By: PHP/5.2.0-8+etch13
Content-Type: text/html; charset=UTF-8
Via: 1.1 bc1-rba
Transfer-Encoding: chunked
Connection: Keep-Alive
Age: 0

There are two functions above. html_header is the function that makes the call to the URL and returns the collected header information. The header information itself is captured by the readHeader function, which cURL calls once for each header line it receives.

The CURLOPT_HEADERFUNCTION callback has to return the number of bytes it handled, so it can’t return the header text itself; that is where the need for the global variable comes in. There might be a more elegant way of doing this, perhaps by building the function into a class of its own, but I hope the above shows you how to get the information you require.
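As a rough sketch of the class idea, the callback can be a method that appends to a property instead of a global variable. The names HeaderCollector, collect, raw and fetch are made up for this example and are not part of cURL; the only cURL feature relied on is that CURLOPT_HEADERFUNCTION accepts any PHP callable.

```php
<?php
// Hypothetical helper class: keeps the collected headers in a property
// instead of a global variable.
class HeaderCollector
{
    private $raw = '';

    // cURL calls this once per header line. It must return the number of
    // bytes it handled, otherwise cURL aborts the transfer.
    public function collect($ch, $header)
    {
        $this->raw .= $header;
        return strlen($header);
    }

    // The header text collected so far.
    public function raw()
    {
        return $this->raw;
    }

    // Request $url and return the raw response headers as one string.
    public function fetch($url)
    {
        $this->raw = '';
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADERFUNCTION, array($this, 'collect'));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_exec($ch);
        curl_close($ch);

        return $this->raw;
    }
}
```

With this in place, something like `$collector = new HeaderCollector(); $headers = $collector->fetch($url);` would take the place of the html_header call, assuming $url holds the address you want to inspect.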

Faking your User Agent with cURL

You’ll often need to fake your User Agent, not to do anything untoward, but to test various aspects of your website. Perhaps you want your website to display differently in Internet Explorer 7 than in Firefox 3.5. Whatever your reason, here is a simple solution I use for my web crawlers.

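// $url is assumed to already hold the address you want to fetch
// Firefox 2 on Windows XP; a real browser's user agent string starts at "Mozilla/5.0"
$userAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6";
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);   // send our chosen User-Agent header
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);       // treat HTTP status codes of 400 and above as failures
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);    // follow redirects
curl_setopt($ch, CURLOPT_AUTOREFERER, true);       // set the Referer header when following redirects
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);    // return the page instead of printing it
curl_setopt($ch, CURLOPT_VERBOSE, false);
$html = curl_exec($ch);
curl_close($ch);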

With a bit of coding you can fill up an array with all the user agent strings you can find on the Internet, and randomly use them as the user agent when using cURL.
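A minimal sketch of that idea follows. The two strings in the pool are only examples, and random_user_agent and fetch_with_random_agent are hypothetical helper names invented here, not cURL functions.

```php
<?php
// Illustrative pool of user agent strings; extend it with whichever
// browsers you want to imitate.
$userAgents = array(
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
);

// Pick one entry of the pool at random (hypothetical helper).
function random_user_agent(array $agents)
{
    if (count($agents) === 0) {
        throw new InvalidArgumentException("The user agent pool is empty.");
    }
    return $agents[array_rand($agents)];
}

// Fetch $url while presenting a randomly chosen user agent (hypothetical helper).
function fetch_with_random_agent($url, array $agents)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, random_user_agent($agents));
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch); // false if the request failed
    curl_close($ch);

    return $html;
}
```

Calling `fetch_with_random_agent($url, $userAgents)` for each page would then present a different browser from request to request, assuming $url holds the page you want to crawl.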