Web Scraping: Gathering "Related Searches" Keyword Data From Google Search Results

You might first want to read Basic Web Scraping: Pulling Out Data From Google Search Results.

In the Basic Web Scraping post we created a simple web scraper that retrieved Google’s search result page for a keyword and scraped the “Number of Pages” indexed for it. That was a good starting point for learning web scraping, so in this post we’re going to extend that scraper to return one more piece of information: “Related Keywords”.

Related Searches is a group of keywords that Google displays at the bottom of its search result pages. They are not displayed for every keyword you search for, only for keywords that are somewhat broad. These “Related Searches” are the keywords that Google has evaluated to be related to the one searched for.

To scrape this, we first need to analyze the HTML code of a Google search result page and find out where the Related Searches block sits in it. Typically it looks like this:

<h2 class=r>Searches related to:<b> computer programming</b></h2><table border=0 cellpadding=0 cellspacing=0 style="margin-top:6px"><tr style="font-size:84%"><td style="padding:0 30px 6px 0" valign=top><a href="/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=lS4&q=computer+programming+schools&revid=92395746&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=1">computer programming <b>schools</b></a></td><td>&nbsp;</td><td style="padding:0 30px 6px 0" valign=top><a href="/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=lS4&q=computer+programming+careers&revid=92395746&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=2">computer programming <b>careers</b></a></td><td>&nbsp;</td><td style="padding:0 30px 6px 0" valign=top><a href="/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=lS4&q=computer+programming+languages&revid=92395746&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=3">computer programming <b>languages</b></a></td><td>&nbsp;</td><td style="padding:0 30px 6px 0" valign=top><a href="/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=lS4&q=computer+programming+c%2B%2B&revid=92395746&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=4">computer programming <b>c++</b></a></td><td>&nbsp;</td></tr><tr style="font-size:84%"><td style="padding:0 30px 6px 0" valign=top><a href="/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=lS4&q=computer+programming+information&revid=92395746&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=5">computer programming <b>information</b></a></td><td>&nbsp;</td><td style="padding:0 30px 6px 0" valign=top><a href="/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=lS4&q=basic+computer+programming&revid=92395746&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=6"><b>basic</b> computer programming</a></td><td>&nbsp;</td><td style="padding:0 30px 6px 0" valign=top><a 
href="/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=lS4&q=computer+programming+for+dummies&revid=92395746&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=7">computer programming <b>for dummies</b></a></td><td>&nbsp;</td><td style="padding:0 30px 6px 0" valign=top><a href="/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=lS4&q=game+programming&revid=92395746&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=8"><b>game</b> programming</a></td><td>&nbsp;</td></tr></table>

Remember, our goal here is to find a footprint: a consistent piece of code that appears on every page we request and from which the needed data (Related Searches) can be pulled out.

Here, we can safely assume the block starts with:

<h2 class=r>Searches related to:

and ends with:

</td></tr></table>

So if we search the code for these two markers, we can easily pull out the required information.
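The idea of cutting out whatever sits between two footprints can be sketched as a small helper. The function name extract_between() and the shortened sample page are assumptions made for illustration; they are not part of the original script.

```php
<?php
// Hypothetical helper (not in the original script): returns the text found
// between $start and the first $end that follows it, or null if either
// footprint is missing.
function extract_between($haystack, $start, $end)
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return null;             // opening footprint not present
    }
    $from += strlen($start);     // skip past the opening footprint

    $to = strpos($haystack, $end, $from);
    if ($to === false) {
        return null;             // closing footprint not present
    }
    return substr($haystack, $from, $to - $from);
}

// Made-up, shortened page used only for illustration
$page = 'junk<h2 class=r>Searches related to:<b> php</b></h2>'
      . '<table border=0><tr><td><a href="#">php <b>tutorial</b></a>'
      . '</td></tr></table>more junk';

$block = extract_between($page, '<h2 class=r>Searches related to:', '</td></tr></table>');
var_dump($block !== null);       // the block was found
```

Returning null when a footprint is missing lets the caller tell "no Related Searches on this page" apart from an empty block.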

We can now scrape the required information in three steps:

  1. Retrieve the required page.
  2. Look for the consistent part of code that uniquely encloses or identifies the required information. Scrape that block.
  3. Pull the required data out.
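As a rough sketch, steps 2 and 3 can be wrapped in one function that works on a page that has already been retrieved. The function name parse_related_searches() and the shortened sample markup are assumptions for illustration. Step 1 is replaced here by a hard-coded string so the example runs without a network request.

```php
<?php
// Steps 2 and 3 as a function; step 1 (retrieving the page) is left to the caller.
// parse_related_searches() is a hypothetical name, not part of the original script.
function parse_related_searches($html)
{
    $start = '<h2 class=r>Searches related to:';
    $end   = '</table>';

    // Step 2: scrape the block that begins at the opening footprint
    $block = strstr($html, $start);
    if ($block === false) {
        return array();          // no Related Searches on this page
    }
    $pos = strpos($block, $end);
    if ($pos === false) {
        return array();          // closing footprint missing
    }
    $block = substr($block, 0, $pos);

    // Drop the "Searches related to: ..." heading, keep the table
    $table = strstr($block, '<table');
    if ($table === false) {
        return array();
    }

    // Step 3: strip the markup and split on the &nbsp; separator cells
    $keywords = array();
    foreach (explode('&nbsp;', strip_tags($table)) as $piece) {
        $piece = trim($piece);
        if ($piece !== '') {
            $keywords[] = $piece;
        }
    }
    return $keywords;
}

// Step 1 stand-in: a shortened, made-up copy of the markup shown earlier
$html = '<h2 class=r>Searches related to:<b> computer programming</b></h2>'
      . '<table border=0><tr><td><a href="#">computer programming <b>schools</b></a></td>'
      . '<td>&nbsp;</td><td><a href="#"><b>game</b> programming</a></td>'
      . '<td>&nbsp;</td></tr></table>';

print_r(parse_related_searches($html));
```

Keeping the parsing separate from the download makes it possible to test the footprints against a saved copy of a result page.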

The following is the code listing:

<html>
<head>
<title>Google Result Scraper</title>
</head>

<body>
<p align="center" style="font-size:500%"><font color="#0000FF">G</font><font color="#FF0000">o</font><font color="#FFFF00">o</font><font color="#0000FF">g</font><font color="#00FF00">l</font><font color="#FF0000">e</font> 
  <font size="2"><br />Result Scraper</font></p>

<?php
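//read the search term from the query string
//(null when the form has not been submitted yet)
$s = isset($_GET['s']) ? $_GET['s'] : null;

if (isset($s))
{
    //******* F I R S T   P A R T *********
    //Find the number of pages indexed for the searched term
    //(from the previous post)

    //escape the term before echoing it back into the page
    echo "<p><i>Search for " . htmlspecialchars($s) . "</i></p>";

    $s = urlencode($s);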

    
    $main_data = file_get_contents("http://www.google.com/search?hl=en&q=" . $s . "&btnG=Google+Search");

    //strip off HTML
    $data = strip_tags($main_data);
    //now $data only has text, NO HTML

    //these have to be found in the fetched data
    $find = 'Results 1 - 10 of about ';
    $find2 = ' for';

    //keep the text beginning from $find
    $data = strstr($data, $find);

    //find the position of $find2;
    //there might be many occurrences,
    //but strpos() gives the position of the first one,
    //which is what we want anyway
    $pos = strpos($data, $find2);

    //take the substring out, which is the number we want
    $search_number = substr($data, strlen($find), $pos - strlen($find));

    echo "Pages Indexed: $search_number";
    
    
    
    //******** S E C O N D   P A R T *******
    //Find related searches

    echo "<h3>Related Keywords</h3>";

    //these have to be found in the fetched data
    $find = '<h2 class=r>Searches related to:';
    $find2 = '</table>';
    $find3 = '<table border=0';

    //keep the text beginning from $find
    $data = strstr($main_data, $find);

    //find the position of $find2;
    //there might be many occurrences,
    //but strpos() gives the position of the first one,
    //which is what we want anyway
    $pos = strpos($data, $find2);

    //take the substring out; this is the block of data
    //that we'll pull the required information from
    $related_search_block = substr($data, strlen($find), $pos - strlen($find));

    //pull out the required data, stripping off "Searches related to.."
    $related_searches = strstr($related_search_block, $find3);
    //now we only have the keywords, still formatted with tables

    //strip off HTML, taking off the table and other formatting
    $related_searches = strip_tags($related_searches);
    //now we have the keywords (plain text) separated from
    //each other by "&nbsp;"

    //explode it
    $keywords = explode('&nbsp;', $related_searches);
    //now we have a nice array holding all the keywords

    //print the keywords,
    //nicely formatted using <ol> (ordered list)
    echo "<ol>";
    foreach ($keywords as $keyword)
    {
        //skip the empty pieces left over by explode()
        if (trim($keyword) != '')
            echo "<li>$keyword</li>";
    }
    echo "</ol>";
}
else
{
?>

<form name="form1" id="form1" method="get" action="">
  <div align="center">
    <p> 
      <input name="s" type="text" id="s" size="50" />
      <input type="submit" name="Submit" value="Find Related Keywords" />
    </p>
  </div>
</form>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>
  <?php
}
?>
</p>
<p align="right"><font size="2">by <a href="http://learning-computer-programming.blogspot.com/">Learning 
  Computer Programming</a></font></p>
</body>
</html>
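
The listing relies on strip_tags() and on the "&nbsp;" separator cells to split the keywords. A different technique, shown here only as an alternative sketch, is to hand the scraped block to PHP's DOM extension and read the text of each link. The function name related_keywords_from_block() and the sample block are assumptions for illustration, and the sketch assumes the DOM extension is enabled.

```php
<?php
// Alternative sketch (not the method used in the listing): read the link
// texts with DOMDocument instead of strip_tags() + explode().
// related_keywords_from_block() is a hypothetical name.
function related_keywords_from_block($block_html)
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect parser warnings quietly
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($block_html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $keywords = array();
    foreach ($doc->getElementsByTagName('a') as $link) {
        // textContent joins the plain text and the <b>...</b> part of each link
        $text = trim($link->textContent);
        if ($text !== '') {
            $keywords[] = $text;
        }
    }
    return $keywords;
}

// Made-up, shortened block used only for illustration
$block = '<table border=0><tr>'
       . '<td><a href="/search?q=php+tutorial">php <b>tutorial</b></a></td><td>&nbsp;</td>'
       . '<td><a href="/search?q=learn+php"><b>learn</b> php</a></td><td>&nbsp;</td>'
       . '</tr></table>';

print_r(related_keywords_from_block($block));
```

This variant does not depend on how the keywords are separated, only on each keyword being a link inside the block.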

Although this script is not very useful as it is (after all, it’s meant for learning purposes), you can make it more useful by extending it. One nice extension would be to have it search for related keywords and then use each of those keywords as the base for further searches. This way it would be able to supply you with many keywords related to the one entered. Best of all, the keywords will be Googly Related!
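
One possible shape for that extension is sketched below, under stated assumptions. The function name expand_keywords(), its $max_depth and $max_total limits, and the $fetch_related callable are all hypothetical. The callable stands in for "search this keyword and return its related keywords", and here it is fed from a made-up array so the sketch runs without touching Google.

```php
<?php
// Hypothetical sketch of the suggested extension: use every related keyword
// as the base for a further search, breadth first, up to a depth and size limit.
// $fetch_related is an assumed callable: keyword in, array of related keywords out.
function expand_keywords($seed, $fetch_related, $max_depth = 2, $max_total = 50)
{
    $seen  = array($seed => true);      // keywords already collected
    $queue = array(array($seed, 0));    // pairs of (keyword, depth)

    while ($queue && count($seen) < $max_total) {
        list($keyword, $depth) = array_shift($queue);
        if ($depth >= $max_depth) {
            continue;                   // deep enough, do not search this one
        }
        foreach ($fetch_related($keyword) as $related) {
            if (isset($seen[$related])) {
                continue;               // already collected
            }
            $seen[$related] = true;
            $queue[] = array($related, $depth + 1);
            if (count($seen) >= $max_total) {
                break;
            }
        }
    }

    unset($seen[$seed]);                // report only the discovered keywords
    return array_map('strval', array_keys($seen));
}

// Stand-in data instead of real searches (made up for illustration)
$fake = array(
    'programming'      => array('game programming', 'computer programming'),
    'game programming' => array('game programming schools', 'programming'),
);
$lookup = function ($keyword) use ($fake) {
    return isset($fake[$keyword]) ? $fake[$keyword] : array();
};

print_r(expand_keywords('programming', $lookup));
```

The two limits keep the number of searches bounded, since every new keyword would otherwise trigger another request.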
