
Wednesday, June 11, 2008

Basic Web Scraping: Pulling Out Data From Google Search Results


[UPDATE (22-AUG-2009): THIS IS THE NEW WORKING VERSION.]

Today we are going to discuss a slightly advanced topic, not in the sense that it'd be difficult to understand (I always try to make things easy anyway) but that you won't find an apparent use for it. What we are going to do today is called Web Scraping. By the way, web scraping means retrieving data from the web and pulling useful information out of it for our own use. Of course this won't be the next best web scraper; rather, it lays a basic foundation showing how simple a web scraper can be.

OK let’s kick off guys!

As is obvious we are going to scrape Google’s Web Search Results to retrieve the number of pages indexed for a search term.

Scraping Google Results for the Number of Pages Indexed

To retrieve results for a search term we need the URL. For this, fire up your favorite browser, browse to the search engine's (Google, or whatever) homepage, type in any search query and hit Enter.

OK, now look at the address bar. In my case it looked like the one below; yours should be similar:

http://www.google.com/search?hl=en&q=learning+c&btnG=Google+Search

On inspection you can see our search term in the URL, which is 'URL encoded' (some characters, such as spaces, are changed to codes). There we have it: we can place any search keyword (URL-encoded, which is very simple with PHP's built-in function) into that URL and fetch the page. But how do we do that in a script, you might ask? Because that is what we need.
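Just to make the URL-building part concrete, here is a tiny sketch (the keyword is only an example) using PHP's built-in urlencode() function:

<?php
// urlencode() turns spaces into '+' and other special characters into %XX codes
$keyword = 'learning c';
$url = 'http://www.google.com/search?hl=en&q=' . urlencode($keyword) . '&btnG=Google+Search';
echo $url;
// prints: http://www.google.com/search?hl=en&q=learning+c&btnG=Google+Search
?>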

Well, we fetch it using the following function:

file_get_contents();

[UPDATE: WE'LL BE USING THE FOLLOWING USER-DEFINED FUNCTION INSTEAD. READ COMMENTS FOR MORE INFORMATION:

function my_fetch($url, $user_agent='Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)')
{
    $ch = curl_init();
    curl_setopt ($ch, CURLOPT_URL, $url);
    curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt ($ch, CURLOPT_HEADER, 0);
    curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($ch, CURLOPT_REFERER, 'http://www.google.com/');
    $result = curl_exec ($ch);
    curl_close ($ch);
    return $result;
}

]

If you have been following this blog for some time, you might remember we once used it in my Creating a Simple Shout Box in PHP post to fetch contents from a local file. Yeah, its beauty is that it can fetch remote (HTTP) files too.

$data = file_get_contents("http://www.google.com/search?hl=en&q=learning+c&btnG=Google+Search");

[UPDATE: NOW USING:

$data = my_fetch("http://www.google.com/search?hl=en&q=learning+c&btnG=Google+Search");

]

The above code will fetch the Google Search Results for the keyword we searched for in the browser; $data will contain the HTML source.
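As a side note, if you'd rather stay with file_get_contents(), a rough alternative (not the approach used in the updated code here, and the User-Agent string below is just the same example one) is to pass it a stream context that sends a User-Agent header:

<?php
// sketch: make file_get_contents() send a User-Agent header via a stream context
$context = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)\r\n"
    )
));
$data = file_get_contents("http://www.google.com/search?hl=en&q=learning+c&btnG=Google+Search", false, $context);
?>

Either way the idea is the same: requests that don't look like they come from a browser may simply not be served.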

Since we have to scrape the total number of pages indexed for a particular search term (displayed as "Results 1 - 10 of about XXXX …"), we need to find some fixed text near that number (XXXX in this case). Here that text is simply "Results 1 - 10 of about"; it's also unique throughout the page, so if we can find it in the returned code we can easily find the data we need. One more thing: we can make the searching easier by first stripping the HTML from the returned code so that only text remains. This part can be implemented as below:

$data = my_fetch("http://www.google.com/search?hl=en&q=" . $s . "&btnG=Google+Search");

// strip off HTML
$data = strip_tags($data);

$find = 'Results 1 - 10 of about ';
$find2 = ' for';

// have text beginning from $find
$data = strstr($data, $find);

// find position of $find2
$pos = strpos($data, $find2);

// take substring out, which'd be the number we want
$search_number = substr($data, strlen($find), $pos - strlen($find));
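If the string juggling above looks confusing, here is a tiny standalone example (the page text and the count are made up, purely for illustration) showing what each step produces:

<?php
// made-up sample of the stripped page text
$data = 'Web Images Maps Results 1 - 10 of about 1,230,000 for learning c. (0.18 seconds)';

$find  = 'Results 1 - 10 of about ';
$find2 = ' for';

// strstr() keeps everything from $find onwards:
// "Results 1 - 10 of about 1,230,000 for learning c. (0.18 seconds)"
$data = strstr($data, $find);

// strpos() gives the position of the first " for" in that string
$pos = strpos($data, $find2);

// the characters between the two markers are the number we want
$search_number = substr($data, strlen($find), $pos - strlen($find));
echo $search_number; // prints: 1,230,000
?>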

Here is the complete code:

<html>
<head>
<title>Google Result Scraper</title>
</head>

<body>
<p align="center" style="font-size: 500%"><font color="#0000FF">G</font><font
     color="#FF0000">o</font><font color="#FFFF00">o</font><font
     color="#0000FF">g</font><font color="#00FF00">l</font><font
     color="#FF0000">e</font><font size="2"><br />
Result Scraper</font></p>

<?php
// pretend to be a browser when fetching the page
function my_fetch($url, $user_agent='Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)')
{
    $ch = curl_init();
    curl_setopt ($ch, CURLOPT_URL, $url);
    curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt ($ch, CURLOPT_HEADER, 0);
    curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($ch, CURLOPT_REFERER, 'http://www.google.com/');
    $result = curl_exec ($ch);
    curl_close ($ch);
    return $result;
}

if (isset($_GET['s']))
{
    $s = $_GET['s'];
    echo "<p><i>Search for $s</i></p>";
    $s = urlencode($s);
    $data = my_fetch("http://www.google.com/search?hl=en&q=" . $s . "&btnG=Google+Search");

    // strip off HTML
    $data = strip_tags($data);
    // now $data only has text, NO HTML

    // these have to be found in the fetched data
    $find = 'Results 1 - 10 of about ';
    $find2 = ' for';

    // have text beginning from $find
    $data = strstr($data, $find);

    // find position of $find2
    // there might be many occurrences,
    // but it gives the position of the first one,
    // which is what we want anyway
    $pos = strpos($data, $find2);

    // take substring out, which'd be the number we want
    $search_number = substr($data, strlen($find), $pos - strlen($find));

    echo "Pages Indexed: $search_number";
}
else
{
?>

<form name="form1" id="form1" method="get" action="">
<div align="center">
<p>  <input name="s" type="text" id="s" size="50" />
<input type="submit" name="Submit" value="Count" /></p>
</div>
</form>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>
<?php
}
?>
</p>
<p align="right"><font size="2">by <a
     href="http://learning-computer-programming.blogspot.com/">Learning
Computer Programming</a></font></p>
</body>
</html>

Wow, our first scraper is complete. It has a nice interface: you type in a search phrase, click 'Count' and there you are. It displays the number of pages that contain that term, just as on Google.

Have fun guys and do comment!

P.S.: You might want to read String Manipulation Function in PHP I and String Manipulation Function in PHP II if you are not very familiar with the string manipulation functions we are using in the code above.


33 comments:

  1. Wow. Impressive. Considering that I don't know any php I was delighted to find that your code worked first time.

    It would be nice if Google could add a search results count function to their AJAX APIs, but in the meantime there's no other way to access that data.

    Thanks,

    SEO Alchemist

    http://www.marketappeal.co.uk/

  2. @SEO Alchemist

    Thanks

    Enjoy!

  3. Anonymous, 8:08 PM

    Hi. I've tried this parser , but it doesn't work. It seems like file_get_contents doesn't get any data from google search :(

  4. At the time of writing, the script was working fine.

    Don't know for now because maybe Google have stopped serving requests from unknown User Agents.

    BTW, accessing pages like this doesn't supply the User-Agent string to the remote server which might be the reason why it's not working anymore.

    You may try the same thing using cURL or Snoopy both of which can be set to send User-Agent string.

  5. My web site: www.documentseeker.com, utilizes CURL functions with simple_html_dom. You provide keywords and select a document type and it scrapes Google for the results. A great tool for someone looking for ebooks or informational documents!

  6. Just curious if there is any ethical standard (or not) on what counts as an acceptable amount of scraping.

  7. @ Sean

    The following link will be useful:
    http://en.wikipedia.org/wiki/Web_scraping#Legal_issues

  8. :( Google captcha?

  9. great front end but I don't get any results
    I (for test purposes) cut and copied script, made a file for it then saved as .php
    here:
    http://link-directory.org/testing/scripts.php

    I entered a search term (cars) and the app sent me here:
    http://link-directory.org/testing/scripts.php?s=cars&Submit=Count
    I get no results though. am I setting up the hosting file elements correctly? could you give so advice to get this to work thanks.

  10. RE: could you give some advice to get this to work thanks

    so'? I meant some - sorry mistype (yakes)

  11. Hi Jay,

    Many peoples have told me that the script is not working anymore. This is because the script tries to fetch data from Google, but they are not serving data to anything they consider robots (possibly user agent string).

    Consider using the following function

    //pretend to be a browser
    function my_fetch($url,$user_agent='Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)')
    {
    $ch = curl_init();
    curl_setopt ($ch, CURLOPT_URL, $url);
    curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt ($ch, CURLOPT_HEADER, 0);
    curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($ch, CURLOPT_REFERER, 'http://www.pcpropertymanager.com/wsnlinks/');
    $result = curl_exec ($ch);
    curl_close ($ch);
    return $result;
    }//function

    I haven't checked the code but I guess iyt should work. Just copy and paste this in the script (in PHP section) and instead of calling "file_get_contents()" call "my_fetch()".

    Hope this helps.

  12. Anonymous, 3:02 PM

    Hey Arvind this still gives me a Captcha... :-( hope I can find a solution to this problem need to scrape 10.000 result pages and after the first 400 google kicked my ass with the captcha stuff :( loL!

  13. Hi Anonymous,

    Please don't use it to abuse websites, this was just to illustrate how things work.

    As far as I know, Google won't let you access more than 1000 results for any keyword.

  14. Arvind, many thanks for this tutorial, but I haven't fully understood it yet; could you explain this scraping tutorial a bit further? And do not forget to include examples (results) of the code you've given. It is just my suggestion. Thanks anyway........

  15. Arvind Gupta said (6:26 PM):


    RE: Many peoples have told me that the script is not working anymore. This is because the script tries to fetch data from Google, but they are not serving data to anything they consider robots (possibly user agent string).

    Consider using the following function

    _______________________________________________

    now that's a cool script! I've been looking for a virtual browser bot for a while now, this I'll look at.

  16. RE:I haven't checked the code but I guess iyt should work. Just copy and paste this in the script (in PHP section) and instead of calling "file_get_contents()" call "my_fetch()".
    Hope this helps.

    I got a bad execution on the "my_fetch()" query

    Fatal error: Call to undefined function my_fetch() in /home/linkdir/public_html/crazy/scripts.php on line 19

    so I went back to "file_get_contents()" for the execution. Script still fails to retrieve though. Could this be because of

    curl_setopt ($ch, CURLOPT_REFERER, 'http://www.pcpropertymanager.com/wsnlinks/');

    http://www.pcpropertymanager.com/wsnlinks/ returning a Error 404?

    or maybe google blocks virtual browser queries?

    Anyway I'll hunt around for a patch and hopefully get this gem working.

  17. Hi Jay,

    Please first copy the function code into the PHP code area after which you can make a call to "my_fetch" instead of "file_get_contents".

    One more thing, instead of the following line:

    curl_setopt ($ch, CURLOPT_REFERER, 'http://www.pcpropertymanager.com/wsnlinks/');

    use:

    curl_setopt ($ch, CURLOPT_REFERER, 'http://www.google.com/');

  18. Hi Jay,

    I've updated the code and now it's working; please use the new version.

  19. Jijo

    Thank you very much

  20. Hi technitrous08,

    It's my pleasure!

  21. Very impressive code.

  22. I would say that is really a proper explanation of extracting data and forming good results.

  23. This is a fantastic script! Thanks so much for providing it. I tried to find a guide to making it display multiple items, but couldn't find anything. Any way to make it display more than one item with just one keyword? I don't mind a web link of something to read! Thanks again,

    Sean

  24. Hi Sean,

    Thanks.

    What exactly do you mean by "multiple items"?

    This script scrapes the "page count" for a keyword, which can only be one value per keyword.

    Could you clarify, please?

  25. I am interested in finding a way to get the top xx results for a keyword search in Google and extract the 10 URLs to a text file.

    Can your script be enhanced to cater for that operation?

    Thanks

  26. Hi happy-camper,

    Of course, use some regex to parse and scrape out the URLs.

  27. Could you give me some sample code to accomplish this, Arvind?

    Thanks

  28. The scraped result counts are not matching now?

  29. I believe that Google have changed how they display results pages again, breaking the above script.

    Do you have another update please?

  30. @Arvind... Fantastic tutorial, nice and simple and the coding is spot on, great that you have updated it as the time has gone on, thanks :D

  31. Undefined index $s
    on line 27
    anyone could help plz

  32. Does anyone receive an error from being blocked by Google? If you do it too many times then Google thinks it's an automated script? Has anyone tried services like http://googlescraping.com to resolve this?

