Crawl Google, they do the same to you ; )

Posted by Felix Geisendörfer, on Jun 10, 2008 - in PHP & CakePHP » Other

Hey folks,

Marc Grabanski just had the great idea of using Google to help with migrating a site to a new domain or URL schema: get a list of all the pages Google has indexed for your site, then use that list as the basis for checking whether your migration worked. This is very convenient because you do not need to know all of your own URLs, and you only get the relevant ones (pages that are not in Google are unlikely to get much traffic).

So here is some quick CakePHP code for crawling Google instead of being crawled by them:

php
class GoogleIndexShell extends Shell {
  function main() {
    App::import('Core', 'HttpSocket');
    if (empty($this->args)) {
      $this->err('Usage: cake google_index <domain>');
      return;
    }
    $site = $this->args[0];
    $Socket = new HttpSocket();
    $links = array();

    // Page through the results, $num at a time
    $start = 0;
    $num = 100;
    do {
      $r = $Socket->get('http://www.google.com/search', array(
        'hl' => 'en',
        'as_sitesearch' => $site, // restrict results to this site
        'num' => $num,
        'filter' => 0, // include results Google would otherwise omit as similar
        'start' => $start,
      ));
      // Result links are the anchors whose class attribute starts with "l"
      if (!preg_match_all('/href="([^"]+)" class="?l"?/is', $r, $matches)) {
        if ($start === 0) {
          $this->err('Error: Could not parse Google results');
          return;
        }
        break; // an empty page after the first one means there are no more results
      }
      $links = array_merge($links, $matches[1]);
      $start += $num;
    } while (count($matches[1]) >= $num); // a short page is the last page

    $links = array_unique($links);
    $this->out(sprintf('-> Found %d links on google:', count($links)));
    $this->hr();
    $this->out(implode("\n", $links));
  }
}
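To see what the regular expression in the shell actually picks up, here is a small stand-alone sketch. The markup is made up for illustration (it is not real Google output) and the function name extractResultLinks is hypothetical; only the pattern itself comes from the shell above.

```php
<?php
// Return the href of every link whose class attribute directly follows
// the href and starts with "l" (quoted or unquoted).
function extractResultLinks($html) {
    preg_match_all('/href="([^"]+)" class="?l"?/is', $html, $matches);
    return $matches[1];
}

// Hypothetical sample markup: two "result" links and one unrelated link.
$html = <<<HTML
<a href="http://debuggable.com/" class="l">Debuggable</a>
<a href="http://debuggable.com/contact" class=l>Contact</a>
<a href="/preferences" class="gb">Preferences</a>
HTML;

// Prints the two debuggable.com URLs; the "gb" link is skipped.
print_r(extractResultLinks($html));
```

Note that the pattern depends on the class attribute coming right after the href, so reordered attributes would be enough to make it miss every link.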

Usage is as simple as running:

./cake google_index debuggable.com

This should produce output like this:

Welcome to CakePHP v1.2.0.7125 beta Console
---------------------------------------------------------------
App : app
Path: /Users/felix/dev/www/php5/debuggable/app
---------------------------------------------------------------
-> Found 293 links on google:
---------------------------------------------------------------
http://debuggable.com/
http://debuggable.com/contracting
http://debuggable.com/contact
http://debuggable.com/workshops
http://debuggable.com/open-source/fixtures-shell
http://debuggable.com/open-source/google-analytics-api
http://debuggable.com/posts/thinking-what:480f4dd5-5f1c-4d37-99b0-4768cbdd56cb
http://debuggable.com/posts/jquerycamp07:480f4dd6-8d40-44e1-8551-4a58cbdd56cb
...
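With that list in hand, the actual migration check could look roughly like the sketch below. This is only one way to do it: the function names (statusCode, fetchStatus, findBrokenUrls) are made up for this example, and treating 200 and 301 as the only acceptable answers is an assumption you may want to change for your own site.

```php
<?php
// Extract the numeric status code from a status line such as
// "HTTP/1.1 301 Moved Permanently"; returns 0 if it cannot be parsed.
function statusCode($statusLine) {
    return preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m) ? (int)$m[1] : 0;
}

// Request a URL and return the status code of the first response.
// get_headers() puts the first status line at index 0, even if
// redirects are followed afterwards.
function fetchStatus($url) {
    $headers = @get_headers($url);
    return $headers ? statusCode($headers[0]) : 0;
}

// Return array(url => status) for every URL whose status is not accepted.
// $fetch is injectable so the logic can be tried without network access.
function findBrokenUrls($urls, $fetch = 'fetchStatus', $ok = array(200, 301)) {
    $broken = array();
    foreach ($urls as $url) {
        $code = call_user_func($fetch, $url);
        if (!in_array($code, $ok, true)) {
            $broken[$url] = $code;
        }
    }
    return $broken;
}
```

If you saved the shell output to a file (say links.txt, a hypothetical name), something like print_r(findBrokenUrls(file('links.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES))); would list every indexed URL that no longer answers with 200 or 301.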

Oh, and if you want to see more shell sample code, also check out our FixtureShell and the blog post for it.

-- Felix Geisendörfer aka the_undefined

PS: Please note that this is a quick hack, and any non-trivial change in the markup Google uses will break it. It is only meant for temporary use.