Basic Web Scraping: Pulling Out Data From Google Search Results

[UPDATE (22-AUG-2009): THIS IS THE NEW WORKING VERSION.]

Today we are going to discuss a slightly advanced topic, not in the sense that it is difficult to understand (I always try to make things easier anyway) but in that its use may not be immediately apparent. What we are going to do today is called Web Scraping. Web scraping means retrieving data from the web and pulling useful information out of it for our own use. Of course this won’t be the next best web scraper; rather, it will lay a basic foundation by showing how simple a web scraper can be.

OK let’s kick off guys!

As is obvious we are going to scrape Google’s Web Search Results to retrieve the number of pages indexed for a search term.

Scraping Google Results for Number of Pages Indexed

To retrieve results for a search term we need the URL. For this, fire up your favorite browser, go to the search engine’s (Google, or whatever) homepage, type in any search query and hit Enter.

OK, now look at the address bar. In my case it looked like the one below; yours should be similar:

http://www.google.com/search?hl=en&q=learning+c&btnG=Google+Search

On inspection you can see your search term in the URL, in ‘URL Encoded’ form (some characters, such as spaces, are changed to codes). There we have it: you can place any search keyword in the URL (URL-encoded, which is very simple with PHP’s built-in urlencode() function) and fetch that page. But how do we do that in a script, you might ask, because that is what we need.

Well, by using the following function:

file_get_contents();

[UPDATE: WE'LL BE USING THE FOLLOWING USER-DEFINED FUNCTION INSTEAD. READ COMMENTS FOR MORE INFORMATION:

function my_fetch($url, $user_agent = 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)')
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com/');
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}

]

If you have been following this blog for some time, you might remember we once used file_get_contents() in my Creating a Simple Shout Box in PHP post to fetch contents from a local file. Its beauty is that it can fetch remote (HTTP) files too.

$data = file_get_contents("http://www.google.com/search?hl=en&q=learning+c&btnG=Google+Search");

[UPDATE: NOW USING:

$data = my_fetch("http://www.google.com/search?hl=en&q=learning+c&btnG=Google+Search");

]

The above code will fetch the Google Search Results for the keyword we searched for in the browser; $data will contain the HTML source.
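The search URL above is hard-coded for one term. As a small sketch of the URL-encoding step mentioned earlier, the helper below puts any search term into the same URL pattern. Note that build_search_url() is a hypothetical name used only for illustration; it is not part of the original script.

```php
<?php
// Hypothetical helper (not in the original post): builds the search URL
// for an arbitrary term, using the URL pattern seen in the address bar.
function build_search_url($term)
{
    // urlencode() turns spaces into '+' and other special characters into %XX codes
    return "http://www.google.com/search?hl=en&q=" . urlencode($term) . "&btnG=Google+Search";
}

echo build_search_url("learning c"), "\n";
echo build_search_url("c++ & php"), "\n";
```

The first call gives back exactly the URL we copied from the browser. In the second, the characters that have a special meaning in a URL ('+' and '&') are replaced by codes, so they cannot be mistaken for part of the query string.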

Since we have to scrape the total number of pages indexed for a particular search term (displayed as “Results 1 - 10 of about XXXX …”), we need to find some text near that number (XXXX in this case). Here that text is simply “Results 1 - 10 of about”. It is also unique throughout the page, so if we can find it in the returned code we can easily find the data we need. One more thing: we can make the search easier by first stripping the HTML from the returned code so that only text remains. This part can be implemented as below:

    $data=my_fetch("http://www.google.com/search?hl=en&q=".$s."&btnG=Google+Search");

    //strip off HTML
    $data=strip_tags($data);

    $find='Results 1 - 10 of about ';
    $find2=' for';

    //have text beginning from $find
    $data=strstr($data,$find);

    //find position of $find2
    $pos=strpos($data,$find2);

    //take substring out, which'd be the number we want
    $search_number=substr($data,strlen($find), $pos-strlen($find));
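To try the extraction steps above without hitting the network, here is a minimal sketch that wraps them in a function and runs them on a sample string. The function name extract_result_count() and the sample markup are made up for illustration, and the checks for a missing marker are an addition the snippet above does not have.

```php
<?php
// Hypothetical helper (not in the original post): returns the text between
// the two markers, or null when either marker is missing.
function extract_result_count($html)
{
    $find  = 'Results 1 - 10 of about ';
    $find2 = ' for';

    // strip off HTML so that only text remains
    $text = strip_tags($html);

    // have text beginning from $find; strstr() returns false if it is absent
    $text = strstr($text, $find);
    if ($text === false) {
        return null;
    }

    // find position of the first $find2 after the marker
    $pos = strpos($text, $find2, strlen($find));
    if ($pos === false) {
        return null;
    }

    // take substring out, which'd be the number we want
    return substr($text, strlen($find), $pos - strlen($find));
}

// Made-up sample markup, only for illustration
$sample = '<div>Results <b>1</b> - <b>10</b> of about <b>1,230,000</b> for <b>learning c</b></div>';

$count = extract_result_count($sample);
echo "Pages Indexed: $count\n";

// the value is a string with commas; remove them to get an integer
echo (int) str_replace(',', '', $count), "\n";
```

Returning null when a marker is not found is just one possible choice; it lets the calling code tell "no number found" apart from an empty result.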

Here is the complete code:

<html>
<head>
<title>Google Result Scraper</title>
</head>

<body>
<p align="center" style="font-size: 500%"><font color="#0000FF">G</font><font
     color="#FF0000">o</font><font color="#FFFF00">o</font><font
     color="#0000FF">g</font><font color="#00FF00">l</font><font
     color="#FF0000">e</font><font size="2"><br />
Result Scraper</font></p>

<?php
function my_fetch($url, $user_agent = 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)')
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com/');
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}

//check the request itself, so that a missing 's' gives no notice
if (isset($_GET['s']))
{
    $s = $_GET['s'];

    //escape the term before echoing it back into the page
    echo "<p><i>Search for " . htmlspecialchars($s) . "</i></p>";

    $s = urlencode($s);

    $data = my_fetch("http://www.google.com/search?hl=en&q=" . $s . "&btnG=Google+Search");

    //strip off HTML
    $data = strip_tags($data);

    //now $data only has text NO HTML
    //these have to be found in the fetched data
    $find = 'Results 1 - 10 of about ';
    $find2 = ' for';

    //have text beginning from $find
    $data = strstr($data, $find);

    //find position of $find2
    //there might be many occurrences
    //but it'd give position of the first one,
    //which is what we want, anyway
    $pos = strpos($data, $find2);

    //take substring out, which'd be the number we want
    $search_number = substr($data, strlen($find), $pos - strlen($find));

    echo "Pages Indexed: $search_number";
}
else
{
?>

<form name="form1" id="form1" method="get" action="">
<div align="center">
<p>  <input name="s" type="text" id="s" size="50" />
<input type="submit" name="Submit" value="Count" /></p>
</div>
</form>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>
<?php
}
?>
</p>
<p align="right"><font size="2">by <a
     href="http://learning-computer-programming.blogspot.com/">Learning
Computer Programming</a></font></p>
</body>
</html>

Wow, our first scraper is complete. It has a nice interface: you type in a search phrase, click ‘Count’ and there you are. It displays the number of pages that contain that term, the same as on Google.

Have fun guys and do comment!

P.S.: You might want to read String Manipulation Function in PHP I and String Manipulation Function in PHP II if you are not very familiar with the string manipulation functions we are using in the code above.
