Crawl Google, they do the same to you ; )
Posted by Felix Geisendörfer, on Jun 10, 2008 - in PHP & CakePHP » Other
Hey folks,
Marc Grabanski just had the great idea of using Google to help with migrating your site to a new domain or URL scheme. Get a list of all the pages Google has indexed from your site, then use that list as the basis for checking whether your migration worked. This is very convenient because you do not have to know all your URLs yourself, and you only get the relevant ones (pages that are not in Google are unlikely to get much traffic).
So here is some quick CakePHP code for crawling Google instead of being crawled by them:
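class GoogleIndexShell extends Shell {
    function main() {
        // The site to look up is the first shell argument, e.g. debuggable.com
        if (empty($this->args[0])) {
            $this->out('Usage: ./cake google_index <domain>');
            return;
        }
        $site = $this->args[0];

        App::import('HttpSocket');
        $Socket = new HttpSocket();

        $links = array();
        $start = 0;
        $num = 100;
        do {
            // Assumed endpoint: Google's regular search URL, restricted to $site
            $response = $Socket->get('http://www.google.com/search', array(
                'hl' => 'en',
                'as_sitesearch' => $site,
                'num' => $num,
                'filter' => 0,
                'start' => $start,
            ));
            // Assumed pattern: a guess at Google's result-link markup,
            // adjust it to whatever markup Google currently serves
            $found = preg_match_all('/<a href="([^"]+)" class="?l"?/i', $response, $matches);
            if ($found) {
                $links = array_merge($links, $matches[1]);
            }
            $start = $start + $num;
        } while ($found >= $num); // a short page means we reached the end

        $this->out(sprintf('-> Found %d links on google:', count($links)));
        $this->hr();
        $this->out(join("\n", $links));
    }
}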
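If you want to play with the paging logic without hitting Google, here is a minimal sketch of the same start/num loop in plain PHP. The `collectAllLinks()` function and the `$fetchPage` callback are hypothetical names made up for this illustration; the callback stands in for the real search request and just slices a fake result list.

```php
<?php
/**
 * Collect results page by page until a short page signals the end.
 *
 * $fetchPage is a hypothetical callback: it takes ($start, $num) and
 * returns an array with the links found on that page.
 */
function collectAllLinks($fetchPage, $num = 100) {
    $links = array();
    $start = 0;
    do {
        $page = call_user_func($fetchPage, $start, $num);
        $links = array_merge($links, $page);
        $start += $num;
    } while (count($page) >= $num); // fewer than $num results: last page
    return $links;
}

// Stub standing in for the real search request: 250 fake results.
$all = array();
for ($i = 0; $i < 250; $i++) {
    $all[] = 'http://example.com/page-' . $i;
}
$fetchPage = function ($start, $num) use ($all) {
    return array_slice($all, $start, $num);
};

$links = collectAllLinks($fetchPage, 100);
echo count($links), "\n"; // prints 250
```

With 250 fake results and a page size of 100, the loop makes three requests (100, 100, 50) and stops after the short third page.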
Usage is as simple as running:
./cake google_index debuggable.com
Which should produce an output like this:
Welcome to CakePHP v1.2.0.7125 beta Console
---------------------------------------------------------------
App : app
Path: /Users/felix/dev/www/php5/debuggable/app
---------------------------------------------------------------
-> Found 293 links on google:
---------------------------------------------------------------
http://debuggable.com/
http://debuggable.com/contracting
http://debuggable.com/contact
http://debuggable.com/workshops
http://debuggable.com/open-source/fixtures-shell
http://debuggable.com/open-source/google-analytics-api
http://debuggable.com/posts/thinking-what:480f4dd5-5f1c-4d37-99b0-4768cbdd56cb
http://debuggable.com/posts/jquerycamp07:480f4dd6-8d40-44e1-8551-4a58cbdd56cb
...
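To actually use that list for checking a migration, something along these lines might do. It is only a rough sketch in plain PHP: `extractUrls()`, `findBrokenUrls()` and `fetchStatusWithGetHeaders()` are hypothetical helpers, not part of the shell. It also assumes that a 200 or a 301 counts as "migrated fine", which you may want to change. The demo uses stubbed status codes so it runs offline.

```php
<?php
/** Pull URLs out of the shell's console output, ignoring banner lines. */
function extractUrls($output) {
    $urls = array();
    foreach (preg_split('/\s+/', trim($output)) as $token) {
        if (preg_match('#^https?://#', $token)) {
            $urls[] = $token;
        }
    }
    return array_values(array_unique($urls));
}

/**
 * Return url => status for every URL whose status is not in $okCodes.
 * $fetchStatus is a hypothetical callback: URL in, HTTP status code out.
 */
function findBrokenUrls($urls, $fetchStatus, $okCodes = array(200, 301)) {
    $broken = array();
    foreach ($urls as $url) {
        $status = call_user_func($fetchStatus, $url);
        if (!in_array($status, $okCodes, true)) {
            $broken[$url] = $status;
        }
    }
    return $broken;
}

/** One possible fetcher: reads the status line of the first response. */
function fetchStatusWithGetHeaders($url) {
    $headers = @get_headers($url);
    if ($headers === false || !preg_match('#^HTTP/\S+\s+(\d{3})#', $headers[0], $m)) {
        return 0; // request failed or the status line was unreadable
    }
    return (int) $m[1];
}

$output = "-> Found 3 links on google:\n"
    . "http://debuggable.com/\n"
    . "http://debuggable.com/contact\n"
    . "http://debuggable.com/workshops\n";
$urls = extractUrls($output);

// Made-up statuses for the demo; pass 'fetchStatusWithGetHeaders'
// instead of $stub to check the real site.
$fakeStatuses = array(
    'http://debuggable.com/' => 200,
    'http://debuggable.com/contact' => 301,
    'http://debuggable.com/workshops' => 404,
);
$stub = function ($url) use ($fakeStatuses) {
    return isset($fakeStatuses[$url]) ? $fakeStatuses[$url] : 0;
};

$broken = findBrokenUrls($urls, $stub);
print_r($broken);
```

Passing the status fetcher in as a callback keeps the checking logic testable without network access.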
Oh and if you want to see more shell sample code, also check out our FixtureShell and the blog post for it.
-- Felix Geisendörfer aka the_undefined
PS: Please note that this is a quick hack, and any non-trivial change in the markup Google uses will break it. It is only meant for temporary use.
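Since the hack depends on markup that can change at any time, it may be worth making the parsing step fail loudly instead of quietly returning nothing. A minimal sketch, assuming the link pattern is passed in by the caller; `extractLinksOrFail()` is a hypothetical helper, and the markup and pattern below are invented for illustration and are not Google's:

```php
<?php
/**
 * Extract links with a caller-supplied pattern and throw when nothing
 * matches, which usually means the markup has changed.
 */
function extractLinksOrFail($html, $pattern) {
    if (!preg_match_all($pattern, $html, $matches) || empty($matches[1])) {
        throw new RuntimeException('No links matched; has the markup changed?');
    }
    return $matches[1];
}

// Hypothetical markup and pattern, for illustration only.
$pattern = '/<a href="([^"]+)" class="result">/';
$html = '<a href="http://example.com/a" class="result">A</a>'
      . '<a href="http://example.com/b" class="result">B</a>';

$links = extractLinksOrFail($html, $pattern);

try {
    extractLinksOrFail('<div>redesigned page</div>', $pattern);
    $failed = false;
} catch (RuntimeException $e) {
    $failed = true; // the redesigned page no longer matches the pattern
}
```

One caveat: a legitimately empty last page of results would also raise the exception, so a caller that pages through results may want to treat that case separately.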