If you’ve done the Thirty Day Challenge, you’ll know how tedious it can be to download all the PDF files. Norio created a cool downloading script that saved me a lot of time. I was just too lazy to sit and write something like it, so I’m stealing his!
<?php
// make sure our download doesn't time out or get interrupted by closing the browser
set_time_limit(0);
ignore_user_abort(1);

// destination to download to
$file_dir = "sites/default/files/30dc";

// create the destination directory (and any missing parents) if it doesn't exist
if (!is_dir($file_dir)) mkdir($file_dir, 0777, true);

// go through each day of training (1-31)
for ($i = 1; $i <= 31; $i++) {
    // download the HTML contents of the training page for that day
    if ($page = file_get_contents("http://www.thirtydaychallenge.com/training/2009day".sprintf("%02d", $i).".php")) {
        // provide some feedback on where we are
        echo "<b>Day $i:</b><br />";
        // flush output to browser - see php.net/flush
        flush();

        // directory to download the current day's PDFs to
        $daydir = $file_dir."/day$i";
        // create the directory if it doesn't exist
        if (!is_dir($daydir)) mkdir($daydir);

        // grab all the URLs to the PDFs (regular expressions are awesome!)
        // the dots are escaped so they only match a literal "."
        preg_match_all('~(http://media\.thirtydaychallenge\.com\.s3\.amazonaws\.com/training09/([0-9A-Za-z_]+\.pdf))~', $page, $matches);

        // go through each url we grabbed above
        foreach ($matches[1] as $key => $url) {
            // local path the PDF will be stored at
            $target = "{$daydir}/{$matches[2][$key]}";
            // check if the file already exists (no use in re-downloading PDFs we have)
            if (!file_exists($target)) {
                // provide some feedback on where we are
                echo "Downloading {$matches[2][$key]}.<br />";
                // flush output to browser
                flush();
                // download the pdf and store it locally
                file_put_contents($target, file_get_contents($url));
            }
        }
    }
}
?>
In various circumstances I’ve needed PHP to pull the HTTP headers that a web server returns. I’ve found that cURL provides an excellent solution for this.
$htmlheader = "";

function html_header($url) {
    global $htmlheader;
    $htmlheader = "";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // have cURL pass every header line to readHeader()
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'readHeader');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);
    return $htmlheader;
}

function readHeader($ch, $header) {
    global $htmlheader;
    // append this header line to what we have collected so far
    $htmlheader .= $header;
    // cURL expects the number of bytes handled
    return strlen($header);
}
The $htmlheader variable (which html_header also returns) will contain something like this:
HTTP/1.1 200 OK
Date: Thu, 23 Jul 2009 20:56:23 GMT
Server: Apache
X-Powered-By: PHP/5.2.0-8+etch13
Content-Type: text/html; charset=UTF-8
Via: 1.1 bc1-rba
Transfer-Encoding: chunked
Connection: Keep-Alive
Age: 0
There are two functions above. html_header is the function that makes the call to the URL and returns the collected header information. The header information itself is captured by the readHeader function, which cURL calls once for each header line.
The CURLOPT_HEADERFUNCTION callback has to return the length of the header it was given, so it can't hand back the header text itself. That is where the need for the global variable comes in. There might be a more elegant way of doing this, perhaps by building this into a class of its own. But I hope the above shows you how to get the information you require.
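One possible way around the global, short of writing a full class, is a closure that appends to a variable captured by reference. This is only a sketch and assumes PHP 5.3 or later; make_header_collector and fetch_headers are hypothetical names of my own, not part of the code above.

```php
<?php
// Hypothetical helper: returns a callback suitable for CURLOPT_HEADERFUNCTION
// that appends every header line it receives to $buffer.
function make_header_collector(&$buffer)
{
    return function ($ch, $header) use (&$buffer) {
        $buffer .= $header;
        // cURL expects the number of bytes handled
        return strlen($header);
    };
}

// Hypothetical replacement for html_header() that needs no global variable.
function fetch_headers($url)
{
    $headers = "";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, make_header_collector($headers));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    // the handle is released when $ch goes out of scope
    return $headers;
}
```

Calling fetch_headers with a URL should give back the same kind of header text as html_header, and because each call has its own buffer, two calls can't trample each other's results.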
You’ll often need to fake your User Agent, not to do anything untoward, but to test various aspects of your website. Perhaps you want your website to display differently in Internet Explorer 7 than in Firefox 3.5. Whatever your reason, here is a simple solution I use for my web crawlers.
// Firefox 2 on Windows XP
$userAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6";

// $url holds the address of the page to fetch
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_VERBOSE, false);
$html = curl_exec($ch);
With a bit of coding you can fill up an array with all the user agent strings you can find on the Internet, and randomly use them as the user agent when using cURL.
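As a rough sketch of that idea, something like the following could work. The function names random_user_agent and fetch_with_random_agent are hypothetical, and the three strings in $userAgents are only sample values; swap in whichever user agents you actually want to test with.

```php
<?php
// Sample pool of user agent strings (illustrative values only)
$userAgents = array(
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.1) Gecko/20090715 Firefox/3.5.1",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
);

// Hypothetical helper: pick one entry from a non-empty list at random
function random_user_agent(array $agents)
{
    return $agents[array_rand($agents)];
}

// Hypothetical helper: fetch a page using a randomly chosen user agent
function fetch_with_random_agent($url, array $agents)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, random_user_agent($agents));
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // returns the page contents, or false on failure
    return curl_exec($ch);
}
```

Each call to fetch_with_random_agent($url, $userAgents) then goes out with whichever string array_rand happened to pick.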