Monday, March 26, 2012

Spring Hadoop 1.0.0.M1 has a bug!

There appears to be a bug in the recently released Spring Hadoop 1.0.0.M1. I ran into it while working on an implementation that uses Spring Hadoop and the Apache Hadoop API. The bug can be reproduced as described below, and after spending many hours with a debugger, I think I have finally narrowed down the cause.

Spring Hadoop lets you declare a Hadoop namespace (hdp) and, as described in the reference docs, provides an hdp:configuration tag, shown below, for externally configuring your namenode, replication, mapred and other settings. The resources attribute takes Spring resources and is used here to specify the location of an external configuration file (a custom resource).
<hdp:configuration resources="classpath:/custom-site.xml"/>
The tag is, in essence, a wrapper around a factory bean that produces a Configuration object after parsing the defined resources. When you read in the configuration as above and then try to obtain a FileSystem from it, for example through the FileSystem get method, a "Stream closed" exception is thrown. You can see the stack trace in this Stack Overflow post.

If you look at the source code for Spring Hadoop on GitHub, it is essentially Spring layered over the Apache Hadoop API. Spring Hadoop's ConfigurationFactoryBean and Apache Hadoop's Configuration together cause this exception.

Basically, an input stream is opened for the custom resource (to parse its fields into an XML DOM tree) and is closed once the resource has been read. When the FileSystem get method later triggers a reload of the configuration, the same stream object, now closed, is read again, and that raises an IOException. In short, the issue is re-reading a stale input stream.
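To make the failure mode concrete, here is a minimal plain-JDK sketch of it. It does not use Spring Hadoop or Hadoop classes at all; the class and method names (StaleStreamDemo, loadAndClose, secondLoadFails) are made up for illustration, and a BufferedInputStream stands in for whatever stream the resource actually hands out.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the bug: a loader that is handed the SAME
// stream object on every (re)load, and closes it after each read.
public class StaleStreamDemo {

    /** Reads the stream to the end, then closes it, like a one-shot resource parser. */
    static String loadAndClose(InputStream in) throws IOException {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            in.close();
        }
    }

    /** Returns true if loading a second time from the same stream object fails. */
    static boolean secondLoadFails() {
        byte[] xml = "<configuration/>".getBytes(StandardCharsets.UTF_8);
        InputStream cached = new BufferedInputStream(new ByteArrayInputStream(xml));
        try {
            loadAndClose(cached); // first load: stream is fresh, this works
        } catch (IOException e) {
            throw new IllegalStateException("first load should succeed", e);
        }
        try {
            loadAndClose(cached); // "reload": same object, already closed
            return false;
        } catch (IOException e) {
            return true; // BufferedInputStream refuses reads after close()
        }
    }

    public static void main(String[] args) {
        System.out.println("second load fails: " + secondLoadFails());
    }
}
```

The point of the sketch is only that the second read goes to an object that was already closed by the first one; nothing about the XML content matters.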

One solution, which we are using for now, is to write a custom ConfigurationFactoryBean class that creates a new Configuration instance using the existing one as its parent (possibly passed in as an argument) and adds the resources as URLs. If you do not want to write a custom bean, you can use Spring properties and SpEL (Spring Expression Language) to configure any custom settings. In the examples on GitHub, Costin appears to use a properties file to configure Hadoop-related settings.
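Why registering a URL helps can be shown with another plain-JDK sketch. This is an analogy, not our actual factory bean: the names (ReopenByUrlDemo, loadFrom, reloadSucceeds) and the temporary file are invented for the example. The idea is that a loader holding a location can open a fresh stream on every load, so closing the previous one does no harm.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration: keep the resource's URL, not an open stream,
// and open a new stream each time the content is (re)loaded.
public class ReopenByUrlDemo {

    /** Opens a fresh stream for the URL, reads it fully, and closes it. */
    static String loadFrom(URL location) throws IOException {
        InputStream in = location.openStream();
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            in.close();
        }
    }

    /** Returns true if two consecutive loads from the same URL both succeed. */
    static boolean reloadSucceeds() {
        try {
            // Temporary file standing in for custom-site.xml.
            Path file = Files.createTempFile("custom-site", ".xml");
            try {
                Files.write(file, "<configuration/>".getBytes(StandardCharsets.UTF_8));
                URL location = file.toUri().toURL();
                String first = loadFrom(location);  // initial load
                String second = loadFrom(location); // reload opens a new stream
                return first.equals("<configuration/>") && first.equals(second);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("reload succeeds: " + reloadSucceeds());
    }
}
```

In the real workaround, the equivalent step would be handing each resource's URL to the Hadoop Configuration instead of an already-opened stream.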

Hopefully this bug will be fixed soon!

