I’m always interested in new ways to scrape websites, so when I was given a recent project I thought it would be the perfect time to test out a class I found recently, Simple HTML DOM Parser. This is what’s so great about the PHP development community: they share.
So with a few lines of code, you are able to parse a complete HTML page and get various information from it. Here I will look at getting links on a website.
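include 'simple_html_dom.php'; // Simple HTML DOM Parser; adjust the path to where you saved it
$url = "http://www.phpdeveloping.co.za/";
$html = file_get_html($url);
// file_get_html() returns false when the page could not be loaded
if ($html && ($links = $html->find('a')))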
{
foreach($links as $link)
{
echo $link->href."\r\n";
echo $link->title."\r\n";
}
}
As simple as that!
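If you would rather not pull in an extra library, something similar can probably be done with PHP's built-in DOM extension. The sketch below swaps Simple HTML DOM Parser for DOMDocument; extract_links() is a made-up helper name, and the sample HTML is only there so the snippet runs on its own.

```php
<?php
// Sketch: list links with the built-in DOM extension instead of Simple HTML DOM.
// extract_links() is a hypothetical helper name, not part of any library.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $links[] = [
            'href'  => $anchor->getAttribute('href'),
            'title' => $anchor->getAttribute('title'), // empty string when missing
        ];
    }
    return $links;
}

$sample = '<p><a href="/about" title="About us">About</a> <a href="/contact">Contact</a></p>';
foreach (extract_links($sample) as $link) {
    echo $link['href'] . ' ' . $link['title'] . "\n";
}
```

To run it against a live page you would pass in the HTML fetched with cURL or file_get_contents.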
With all the fancy tracking and web analytics software out there, you’d wonder why someone would still want to have their own tracking functionality on a website. I find it easier to customize a little script than to figure out how to get it working in the analytics software. Sometimes you really just want something simple done, and don’t want to go through all the effort to figure it out.
Here is a handy little script you can use.
<?php
//saves ip address and timestamp
$str=date("Y-m-d H:i:s") . ": ". $_SERVER['REMOTE_ADDR'] . "\n";
file_put_contents("ip_list.txt", $str, FILE_APPEND);
header("content-type: image/gif");
//43byte 1x1 transparent pixel gif
echo base64_decode("R0lGODlhAQABAIAAAAAAAAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==");
?>
What the above does is simply log the IP address of the site visitor and the time they visited to a plain text file in the root directory. It then sends a 1×1 pixel transparent GIF to the browser.
Why the output of a GIF file? So that in any of your HTML documents you can just put an img tag whose src points to this script, and it will call the script and display a transparent GIF. So it won’t affect the way your website is displayed, but it will still record the visitor’s details. It also doesn’t rely on JavaScript, which is always a plus for me.
You can expand the script even more by changing it to include more statistics. You can make it as complex or as simple as you like.
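As an idea of what "more statistics" could look like, here is a minimal sketch that adds the referrer and user agent to each log line. The function name build_log_line() and the tab-separated layout are my own assumptions for illustration, not part of the script above.

```php
<?php
// Sketch: build a richer log line. build_log_line() and the tab-separated
// layout are assumptions for illustration only.
function build_log_line(array $server, string $timestamp): string
{
    $fields = [
        $timestamp,
        $server['REMOTE_ADDR'] ?? '-',
        $server['HTTP_REFERER'] ?? '-',
        $server['HTTP_USER_AGENT'] ?? '-',
    ];
    // Request headers are client-controlled: strip tabs and line breaks so
    // one visit always stays on one line.
    $fields = array_map(function ($value) {
        return str_replace(["\t", "\r", "\n"], ' ', (string) $value);
    }, $fields);
    return implode("\t", $fields) . "\n";
}

echo build_log_line(
    ['REMOTE_ADDR' => '203.0.113.7', 'HTTP_USER_AGENT' => 'ExampleBrowser/1.0'],
    '2009-07-23 20:56:23'
);
```

In the tracking script you would pass $_SERVER and date("Y-m-d H:i:s") and hand the result to file_put_contents with FILE_APPEND, as before.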
A client recently asked about the Dreamweaver STE format and retrieving a password from it. I’ve never done it, but a little research into the encryption format quickly provided a solution.
The password is basically a string of numbers and letters that represents a series of hexadecimal numbers, each of which has been modified a bit.
Let’s say the encrypted password is 70627576
It goes like:
- Each hexadecimal number is 2 digits, so “70627576” would be 70, 62, 75 and 76
- Subtract the position of the number from itself, starting with 0.
- 70 is in the 0 position, so subtract 0 from it
- 62 is in the 1 position, so subtract 1 from it
- 75 is in the 2 position, so subtract 2 from it
- 76 is in the 3 position, so subtract 3 from it
So, 70:62:75:76 becomes 70:61:73:73.
- Once done, convert each to ASCII. Here 70, 61, 73 and 73 give “pass”.
Now for the code to do this
$encoded = "70627576";
$letters = str_split($encoded, 2); // split into 2-digit hexadecimal chunks
$password = '';
for ($i = 0; $i < count($letters); $i++) {
$password .= chr(hexdec($letters[$i]) - $i);
}
echo $password;
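To double-check the decoding, it can help to run the steps in reverse. The sketch below adds each character's zero-based position to its ASCII code and writes the result as two hexadecimal digits. ste_encode() is a hypothetical name, and the uppercase hex output is an assumption on my part.

```php
<?php
// Sketch of the reverse operation. ste_encode() is a hypothetical name;
// uppercase hex output is an assumption.
function ste_encode(string $password): string
{
    $encoded = '';
    for ($i = 0; $i < strlen($password); $i++) {
        // Add the character's position to its ASCII code, then write it as two hex digits.
        $encoded .= sprintf('%02X', ord($password[$i]) + $i);
    }
    return $encoded;
}

echo ste_encode('pass'); // 70627576
```

Encoding "pass" gives back the 70627576 used in the example, which is a quick sanity check for short ASCII passwords.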
I came across the range function a few months ago, and it has saved me so many lines of repetitive code. I got so tired of using for loops to do the simplest things.
Check out a few examples to see how this function can make your life so much easier.
Print all letters from a to z:
foreach(range('a', 'z') as $letter) {
echo $letter;
}
Print all numbers from 0 to 12
foreach(range(0, 12) as $number) {
echo $number;
}
Print the numbers from 0 to 100, but only show every 10th one:
foreach(range(0, 100, 10) as $number) {
echo $number;
}
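A few more things range can do, shown as a small sketch. The values are just examples I picked.

```php
<?php
// range() also counts downwards when the start is larger than the end.
echo implode(' ', range(5, 1)), "\n";          // 5 4 3 2 1

// The step works for letters too.
echo implode(' ', range('a', 'i', 2)), "\n";   // a c e g i

// Because range() returns a plain array, it combines with the array functions.
echo array_sum(range(1, 100)), "\n";           // 5050
```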
Whenever you are doing form processing, it is handy to have an email validation function.
Here is one I found and tweaked a bit. It has given me a lot of good mileage!
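function check_email_address($email) {
    // ereg() was removed in PHP 7, so preg_match() is used throughout.
    // First check that there is exactly one @ and that the lengths are sane.
    if (!preg_match("/^[^@]{1,64}@[^@]{1,255}$/", $email)) {
        return false;
    }
    // Split into local part and domain, then check each dot-separated piece of the local part.
    $email_array = explode("@", $email);
    $local_array = explode(".", $email_array[0]);
    for ($i = 0; $i < sizeof($local_array); $i++) {
        if (!preg_match("/^(([A-Za-z0-9!#$%&'*+\/=?^_`{|}~-][A-Za-z0-9!#$%&'*+\/=?^_`{|}~\.-]{0,63})|(\"[^(\\|\")]{0,62}\"))$/", $local_array[$i])) {
            return false;
        }
    }
    // Unless the domain is an IP address, check each label of the domain name.
    if (!preg_match("/^\[?[0-9\.]+\]?$/", $email_array[1])) {
        $domain_array = explode(".", $email_array[1]);
        if (sizeof($domain_array) < 2) {
            return false; // not enough parts to be a domain
        }
        for ($i = 0; $i < sizeof($domain_array); $i++) {
            if (!preg_match("/^(([A-Za-z0-9][A-Za-z0-9-]{0,61}[A-Za-z0-9])|([A-Za-z0-9]+))$/", $domain_array[$i])) {
                return false;
            }
        }
    }
    return true;
}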
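If you don't need your own rules, PHP's built-in filter extension offers a ready-made check. Here is a minimal sketch; is_valid_email() is just a wrapper name I made up, and the built-in rules are not identical to the function above, so the two may disagree on unusual addresses.

```php
<?php
// Sketch: the same kind of check using PHP's built-in filter extension.
// is_valid_email() is a hypothetical wrapper name.
function is_valid_email(string $email): bool
{
    // filter_var() returns the address on success and false on failure.
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(is_valid_email('user@example.com'));  // bool(true)
var_dump(is_valid_email('user@@example.com')); // bool(false)
```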
Here is a function to check if a string begins with another string:
function startsWith($haystack,$needle,$case=true) {
if($case) {
return (strcmp(substr($haystack, 0, strlen($needle)),$needle)===0);
}
return (strcmp(strtolower(substr($haystack, 0, strlen($needle))),strtolower($needle))===0);
}
Here’s a nice function to check if a string ends with another string:
function endsWith($haystack,$needle,$case=true) {
if($case){
return (strcmp(substr($haystack, strlen($haystack) - strlen($needle)),$needle)===0);
}
return (strcmp(strtolower(substr($haystack, strlen($haystack) - strlen($needle))),strtolower($needle))===0);
}
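As a possible variation, the same two checks can be written without building temporary substrings, using strncmp, strncasecmp and substr_compare. This is only a sketch; the snake_case names are hypothetical and chosen so they don't clash with the functions above.

```php
<?php
// Sketch: prefix and suffix checks without temporary substrings.
// starts_with() and ends_with() are hypothetical names.
function starts_with(string $haystack, string $needle, bool $case = true): bool
{
    $length = strlen($needle);
    return $case
        ? strncmp($haystack, $needle, $length) === 0
        : strncasecmp($haystack, $needle, $length) === 0;
}

function ends_with(string $haystack, string $needle, bool $case = true): bool
{
    $length = strlen($needle);
    if ($length === 0) {
        return true;  // every string ends with the empty string
    }
    if ($length > strlen($haystack)) {
        return false; // the needle cannot fit
    }
    // A negative offset makes substr_compare() start counting from the end.
    return substr_compare($haystack, $needle, -$length, $length, !$case) === 0;
}

var_dump(starts_with('PHP Developing', 'php', false)); // bool(true)
var_dump(ends_with('index.php', '.PHP'));              // bool(false)
```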
Regular expressions can be very confusing at times, but they are extremely powerful to use when coding. They make validation of user input a lot easier.
The example below will show you how to use a simple regular expression to check whether a URL is valid or not.
$url = "http://www.phpdevelopment.co.za/";
if (preg_match('/^(http|https|ftp):\/\/([A-Z0-9][A-Z0-9_-]*(?:\.[A-Z0-9][A-Z0-9_-]*)+):?(\d+)?\/?/i', $url)) {
echo "Your url is fine.";
} else {
echo "Your url is not fine.";
}
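Once a URL has passed validation you usually want its parts as well. The sketch below uses the built-in filter_var and parse_url instead of a hand-written expression, and then applies the same scheme restriction (http, https, ftp) as the regular expression. describe_url() is a hypothetical helper name.

```php
<?php
// Sketch: validate with the filter extension, then split the URL with parse_url().
// describe_url() is a hypothetical helper name.
function describe_url(string $url): ?array
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null;
    }
    $parts = parse_url($url);
    // Mirror the regular expression by only accepting these schemes.
    if (!in_array(strtolower($parts['scheme'] ?? ''), ['http', 'https', 'ftp'], true)) {
        return null;
    }
    return ['scheme' => $parts['scheme'], 'host' => $parts['host'] ?? ''];
}

print_r(describe_url('http://www.phpdevelopment.co.za/'));
var_dump(describe_url('not a url')); // NULL
```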
I’ve made a few website scrapers over the last few months, and have enjoyed it very much. Something that I needed quite a bit was to extract all objects (images, CSS files, etc.) that a website refers to.
This is the function I used to get images from an HTML document.
function get_images($html)
{
    $images = array();
    // Grab every src="..." (or src='...') attribute, up to the closing quote
    preg_match_all('/(img|src)\=(\"|\')[^\"\'\>]+/i', $html, $media);
    // Strip the leading src=" so only the URL remains
    $data = preg_replace('/(img|src)(\"|\'|\=\"|\=\')(.*)/i', "$3", $media[0]);
    foreach ($data as $url) {
        $info = pathinfo($url);
        // Keep only URLs with a known image extension (case-insensitive)
        if (isset($info['extension']) &&
            in_array(strtolower($info['extension']), array('jpg', 'jpeg', 'gif', 'png'))) {
            array_push($images, $url);
        }
    }
    return $images;
}
This function takes as input the HTML content as a string. You can get this using cURL or file_get_contents. It returns an array of all the images it found on that page.
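For the other kinds of objects, such as stylesheets and scripts, one possible approach is the built-in DOM extension with an XPath query per object type. This is a sketch under my own assumptions: get_assets() is a hypothetical name, and it only matches link tags whose rel attribute is exactly "stylesheet".

```php
<?php
// Sketch: collect stylesheet and script URLs with the built-in DOM extension.
// get_assets() is a hypothetical name.
function get_assets(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings silently; real pages are rarely valid HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $assets = ['css' => [], 'js' => []];
    foreach ($xpath->query('//link[@rel="stylesheet"]/@href') as $attr) {
        $assets['css'][] = $attr->nodeValue;
    }
    foreach ($xpath->query('//script/@src') as $attr) {
        $assets['js'][] = $attr->nodeValue;
    }
    return $assets;
}

$page = '<html><head><link rel="stylesheet" href="style.css"><script src="app.js"></script></head><body></body></html>';
print_r(get_assets($page));
```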
In various circumstances I’ve needed to have PHP pull the HTTP headers that a web server returns. I’ve found cURL to provide an excellent solution to this.
$htmlheader = "";
function html_header($url) {
global $htmlheader;
$htmlheader = "";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'readHeader');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec ($ch);
curl_close ($ch);
return $htmlheader;
}
function readHeader($ch, $header) {
global $htmlheader;
$htmlheader .= $header;
return strlen($header);
}
The string returned by html_header will contain something like this:
HTTP/1.1 200 OK
Date: Thu, 23 Jul 2009 20:56:23 GMT
Server: Apache
X-Powered-By: PHP/5.2.0-8+etch13
Content-Type: text/html; charset=UTF-8
Via: 1.1 bc1-rba
Transfer-Encoding: chunked
Connection: Keep-Alive
Age: 0
There are two functions above. html_header is the function that makes the call to the URL and returns the collected header information. The header information is actually captured by the readHeader function, which cURL calls once for every header line.
readHeader has to return the length of the header it was given back to cURL, so it can’t return the header text itself. That is where the need for the global variable comes in. There might be a more elegant way of doing this, perhaps by building this function into a class of its own. But I hope the above shows you how to get the information you require.
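One possible way to avoid the global, sketched here with a closure rather than a class. fetch_headers() and parse_headers() are hypothetical names, and the first one assumes the cURL extension is loaded. The second function is an optional extra that turns the raw text into a name => value array.

```php
<?php
// Sketch: header capture without a global variable.
// fetch_headers() and parse_headers() are hypothetical names.
function fetch_headers(string $url): string
{
    $raw = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // The closure appends to $raw by reference, so no global is needed.
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $header) use (&$raw) {
        $raw .= $header;
        return strlen($header); // cURL expects the number of bytes handled
    });
    curl_exec($ch);
    curl_close($ch);
    return $raw;
}

// Turn the raw header text into a name => value array.
function parse_headers(string $raw): array
{
    $headers = [];
    foreach (preg_split('/\r?\n/', trim($raw)) as $line) {
        if (strpos($line, ':') === false) {
            continue; // skips the status line, e.g. "HTTP/1.1 200 OK"
        }
        list($name, $value) = explode(':', $line, 2);
        $headers[trim($name)] = trim($value);
    }
    return $headers;
}

print_r(parse_headers("HTTP/1.1 200 OK\r\nServer: Apache\r\nContent-Type: text/html; charset=UTF-8\r\n\r\n"));
```

With a live server you would call parse_headers(fetch_headers($url)). If a header name appears more than once, this sketch keeps only the last value.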