I’ve made a few website scrapers over the last few months, and have enjoyed it very much. Something I needed quite often was to extract all the objects (images, CSS files, etc.) that a website refers to.
This is the function I used to get images from an HTML document.
function get_images($html) {
    $images = array();
    // Grab every src="..." (or img="...") attribute, stopping before the closing quote.
    preg_match_all('/(img|src)\=(\"|\')[^\"\'\>]+/i', $html, $media);
    unset($html);
    // Strip the attribute name and opening quote so only the URL remains.
    $urls = preg_replace('/(img|src)(\"|\'|\=\"|\=\')(.*)/i', "$3", $media[0]);
    foreach ($urls as $url) {
        $info = pathinfo($url);
        // Keep only URLs whose file extension is a common image type.
        if (isset($info['extension'])) {
            if (($info['extension'] == 'jpg') || ($info['extension'] == 'jpeg') || ($info['extension'] == 'gif') || ($info['extension'] == 'png'))
                array_push($images, $url);
        }
    }
    return $images;
}
This function takes the HTML content as a string, which you can get using cURL or file_get_contents. It returns an array of the image URLs it found on that page.
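The regex approach matches any src attribute (scripts and iframes included) and then filters by extension, so it can miss URLs with uppercase extensions or query strings. As a rough alternative sketch, the same job can be done with PHP's built-in DOM extension. The name get_images_dom and the sample HTML below are made up for illustration and are not part of the original function; treat this as one possible variant rather than a drop-in replacement.

```php
<?php
// Hypothetical helper (not from the original post): collects image URLs
// from <img> tags using PHP's DOM extension instead of regular expressions.
function get_images_dom($html) {
    $images = array();
    // loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return $images;
    }

    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $allowed = array('jpg', 'jpeg', 'gif', 'png');
    foreach ($doc->getElementsByTagName('img') as $img) {
        $url = $img->getAttribute('src');
        // Inspect the path only, so "photo.jpg?v=2" still counts as a jpg.
        $path = parse_url($url, PHP_URL_PATH);
        if (!is_string($path)) {
            continue;
        }
        // Compare case-insensitively so "PNG" and "png" are treated alike.
        $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if (in_array($ext, $allowed, true)) {
            $images[] = $url;
        }
    }
    return $images;
}

// Sample input, assumed for illustration: two images and one script tag.
$sample = '<html><body><img src="a.jpg"><img src=\'b.PNG?v=2\'>'
        . '<script src="app.js"></script></body></html>';
print_r(get_images_dom($sample));
```

Because it only looks at img elements, this version ignores script and iframe sources without relying on the extension check alone. The trade-off is that it will not pick up images referenced from CSS or from other attributes.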

