Just recently Google announced a great feature as part of the Webmaster Tools: you can fetch your site with the Googlebot. At first I thought they would reveal what content gets extracted from the site and how they might proceed from there, but they just seem to crawl your site and show you the HTTP header fields and the site’s content.
In this post I’d like to present some Java code using the latest and greatest version of HttpClient that allows you to crawl any site, have a look at the HTTP header fields and the site’s content, and measure how long it took to download the site. It’s almost the same as what Google’s feature does.
The team building HttpClient does a great job improving and perfecting the software. I really like using this fully featured client implementation when it comes to retrieving data via HTTP. It’s easy to set up, and configuring it as you see fit is a snap.
I’ve used it for this little project too because I only needed four lines of code to get it running. In my opinion the HttpClient API can’t get any better than this. Check out the following code:
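    HttpClient httpclient = new DefaultHttpClient();   // create the client
    HttpGet httpget = new HttpGet(url);                // build a GET request for the URL
    HttpResponse response = httpclient.execute(httpget); // send it and wait for the response
    HttpEntity entity = response.getEntity();          // the response body, may be null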
Once you’ve done that you can use the HttpResponse and HttpEntity objects to retrieve the HTTP header fields and the downloaded content; the EntityUtils class comes in handy here.
If you run my implementation it prints the following information: HTTP return code, HTTP header fields, the downloaded content and the time it took to execute the request. This is similar to the information offered by Google’s Fetch as Googlebot feature. As you can see, it’s pretty easy to implement on your own if you haven’t got special requirements.
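To make that report a bit more concrete, here is a minimal sketch of how the four pieces of information could be collected and printed. Note the assumptions: this is not my HttpClient implementation. To keep it free of dependencies it swaps in the JDK's own HttpURLConnection, the class and method names (FetchReport, format, fetch, readAll) are made up for illustration, the body is decoded as UTF-8 regardless of what the server declares, and http://example.com/ is only a placeholder URL.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of HttpClient or of the original implementation.
public class FetchReport {

    // Builds the report text: return code, header fields, content, elapsed time.
    static String format(int statusCode, Map<String, List<String>> headers,
                         String content, long elapsedMillis) {
        StringBuilder sb = new StringBuilder();
        sb.append("Status: ").append(statusCode).append('\n');
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            // HttpURLConnection files the status line under a null key; skip it.
            if (e.getKey() == null) {
                continue;
            }
            sb.append(e.getKey()).append(": ")
              .append(String.join(", ", e.getValue())).append('\n');
        }
        sb.append('\n').append(content).append('\n');
        sb.append("Time: ").append(elapsedMillis).append(" ms\n");
        return sb.toString();
    }

    // Fetches the URL with the JDK client and measures the whole request.
    static String fetch(String url) throws IOException {
        long start = System.nanoTime();
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            int status = conn.getResponseCode();
            // Error responses (4xx/5xx) deliver their body on the error stream.
            InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            String content = (in == null) ? "" : readAll(in);
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            return format(status, conn.getHeaderFields(), content, elapsedMillis);
        } finally {
            conn.disconnect();
        }
    }

    // Reads the stream fully and decodes it as UTF-8 (an assumption, see above).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try (InputStream src = in) {
            int n;
            while ((n = src.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; pass your own as the first argument.
        System.out.println(fetch(args.length > 0 ? args[0] : "http://example.com/"));
    }
}
```

With HttpClient the same structure should carry over: the status code and header fields would come from the HttpResponse, the content from the HttpEntity via EntityUtils, and the timing would simply wrap the execute call.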
Nevertheless, Google’s Fetch as Googlebot feature is a really nice thing and I think they’ll expand its features as a greater part of the Webmaster Tools. In this post I wanted to show you that it’s pretty easy to build a similar tool on your own. By the way, using e.g. the Web developer plugin for Firefox might be a good alternative too.
Note that Google probably invested a lot more work to get this feature rolling, and it’s very likely that it wasn’t that easy for them to make it part of the Webmaster Tools. This post might make you think that the feature was an easy one to build; you guessed it, that’s probably not the case.