Google recently announced a great feature as part of the Webmaster Tools: you can fetch your site as the Googlebot. At first I thought they would reveal what content gets extracted from the site and how they proceed from there, but they just seem to crawl your site and show you the HTTP header fields and the site’s content.
In this post I’d like to present some Java code using the latest and greatest version of HttpClient that allows you to crawl any site, look at the HTTP header fields and the site’s content, and measure how long the download took. That is almost the same as what Google’s feature does.
The Eclipse project with the code for this post can be downloaded as tar.gz or zip. You can browse the code online here.
Implementation
The team building HttpClient does a great job improving and perfecting the software. I really like using this fully featured client implementation when it comes to retrieving data via HTTP. It’s easy to set up, and configuring it as you see fit is a snap.
I’ve used it for this little project too because I only needed four lines of code to get it running. In my opinion the HttpClient API can’t get any better than this. Check out the following code:
HttpClient httpclient = new DefaultHttpClient();
HttpGet httpget = new HttpGet(url);
HttpResponse response = httpclient.execute(httpget);
HttpEntity entity = response.getEntity();
Once you’ve done that, you can use the HttpResponse and HttpEntity objects to retrieve the HTTP header fields and the downloaded content; the EntityUtils class comes in handy here.
If you run my implementation, it prints the following information: the HTTP status code, the HTTP header fields, the downloaded content and the time it took to execute the request. This is similar to the information offered by Google’s Fetch as Googlebot feature. As you can see, it’s pretty easy to implement this on your own if you don’t have special requirements.
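If you want to try the idea without adding a dependency, here is a minimal sketch. It is not the code from the downloadable project: it assumes Java 11 or later and swaps Apache's HttpClient for the JDK's built-in java.net.http.HttpClient. The names FetchSketch and FetchResult and the example.com default URL are made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FetchSketch {

    /** Hypothetical holder for the four pieces of information listed above. */
    static final class FetchResult {
        final int status;
        final Map<String, List<String>> headers;
        final String body;
        final long millis;

        FetchResult(int status, Map<String, List<String>> headers, String body, long millis) {
            this.status = status;
            this.headers = headers;
            this.body = body;
            this.millis = millis;
        }
    }

    /** Sends one GET request and measures how long the complete download takes. */
    static FetchResult fetch(String url) throws Exception {
        // The default JDK client does not follow redirects, so a 301 is reported as-is.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();

        long start = System.nanoTime();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        long millis = (System.nanoTime() - start) / 1_000_000;

        return new FetchResult(response.statusCode(), response.headers().map(), response.body(), millis);
    }

    /** Renders the result as plain text: status, headers sorted by name, time, then the body. */
    static String format(FetchResult r) {
        StringBuilder out = new StringBuilder();
        out.append("Status: ").append(r.status).append('\n');
        for (Map.Entry<String, List<String>> h : new TreeMap<>(r.headers).entrySet()) {
            out.append(h.getKey()).append(": ").append(String.join(", ", h.getValue())).append('\n');
        }
        out.append("Time: ").append(r.millis).append(" ms\n\n");
        out.append(r.body);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The default URL is only a placeholder; pass your own as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        System.out.println(format(fetch(url)));
    }
}
```

Keeping the formatting separate from the network call means the report layout can be checked without going online. The measured time covers the whole request, including downloading the body.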
Conclusion
Nevertheless, Google’s Fetch as Googlebot feature is a really nice thing, and I think they’ll expand it as the Webmaster Tools grow. In this post I wanted to show you that it’s pretty easy to build a similar tool on your own. By the way, a tool such as the Web Developer plugin for Firefox might be a good alternative too.
Note that Google probably invested a lot more work to get this feature rolling, and it’s very likely that it wasn’t that easy for them to make it part of the Webmaster Tools. This post might make you think the feature was an easy one to build. You guessed it: that’s probably not the case.
Will you finish this project?
No, I won’t be building this project. This post was just a proof of concept showing that a feature like this is possible with just a few lines of code if you start off simple.