Monday, March 26, 2012

Spring Hadoop 1.0.0.M1 has a bug!

There appears to be a bug in the recently released Spring Hadoop 1.0.0.M1. I ran into it while working on an implementation that uses Spring Hadoop and the Apache Hadoop API. The bug can be reproduced as described below, and after spending many hours with a debugger, I think I have finally narrowed down the cause.

Spring Hadoop lets you define a Hadoop namespace (hdp) and, as mentioned in the reference docs, provides an hdp:configuration tag, shown below, to externally configure your namenode, replication, mapred, and other settings. The resources attribute is a Spring resource, used here to specify the location of any external configuration files (a custom resource).
<hdp:configuration resources="classpath:/custom-site.xml"/>
The tag is, in essence, a wrapper around a factory bean that produces a Configuration object after parsing the defined resources. When you read in the configuration as above and then try to obtain a FileSystem, for example with
FileSystem.get(conf);
a java.io.IOException: Stream closed exception is thrown. You can see the stack trace in this Stack Overflow post.

If you look at the source code for Spring Hadoop on GitHub, it is essentially Spring layered on top of the Apache Hadoop API. Spring Hadoop's ConfigurationFactoryBean and Apache Hadoop's Configuration together cause this exception.

Basically, an input stream is opened for the custom resource (to parse the pre-defined fields into an XML DOM tree) and is closed after the resource is read. When the FileSystem get method later reloads the configuration, it reads from that same stream again, and because the stream is already closed, an IOException is thrown. Reading the same stale input stream is the issue.
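The failure mode can be illustrated without Spring or Hadoop at all. The sketch below is only an analogy built on plain JDK classes, not the actual Spring Hadoop code path: StaleStreamDemo and its methods are hypothetical names, and a BufferedInputStream over an in-memory byte array stands in for the stream opened on the custom resource. It reads the stream once, closes it, and then attempts a second read on the same object.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StaleStreamDemo {

    // Reads the stream to the end and closes it, much as a parser
    // would after loading a resource. Returns the number of bytes read.
    static int readAllAndClose(InputStream in) throws IOException {
        int count = 0;
        try {
            while (in.read() != -1) {
                count++;
            }
        } finally {
            in.close();
        }
        return count;
    }

    // Returns the message of the IOException raised by the second read,
    // or null if the second read unexpectedly succeeds.
    static String secondReadFailure() {
        byte[] xml = "<configuration/>".getBytes(StandardCharsets.UTF_8);
        // Stand-in (assumption) for the stream opened on the custom resource.
        InputStream stream = new BufferedInputStream(new ByteArrayInputStream(xml));

        try {
            readAllAndClose(stream); // first load: succeeds, then closes the stream
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        try {
            readAllAndClose(stream); // "reload": reuses the already closed stream
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("Second read failed with: " + secondReadFailure());
    }
}
```

The second call fails because the stream object keeps no way to reopen its source; once closed, it stays closed.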

One solution, which we are using for now, is to write a custom ConfigurationFactoryBean class that creates a new Configuration instance using the existing one as its parent (possibly passed in as an argument) and adds the resources as URLs. If you do not want to write a custom bean, you can use Spring properties and SpEL (Spring Expression Language) to configure any custom settings. In the examples on GitHub, Costin appears to be using a properties file to configure Hadoop-related settings.
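Why adding resources as URLs helps can be sketched with plain JDK classes: a URL names a location, so a new stream can be opened on every load instead of reusing one that was already consumed. This is only an illustration of the idea, not our actual bean and not the Hadoop or Spring Hadoop API; UrlBackedResourceDemo and UrlResource are hypothetical names, and a temporary file stands in for custom-site.xml.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class UrlBackedResourceDemo {

    // Hypothetical stand-in for a configuration resource registered by location.
    static final class UrlResource {
        private final URL location;

        UrlResource(URL location) {
            this.location = location;
        }

        // Opens a fresh stream on every call, reads it fully, and closes it.
        String load() throws IOException {
            try (InputStream in = location.openStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    // Loads the same resource twice and reports whether both loads
    // returned the original content.
    static boolean reloadSucceeds() {
        try {
            // Temporary file used as a placeholder for custom-site.xml (assumption).
            Path file = Files.createTempFile("custom-site", ".xml");
            try {
                Files.write(file, "<configuration/>".getBytes(StandardCharsets.UTF_8));
                UrlResource resource = new UrlResource(file.toUri().toURL());
                String first = resource.load();  // initial load
                String second = resource.load(); // reload: a new stream is opened
                return first.equals("<configuration/>") && first.equals(second);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Reload succeeded: " + reloadSucceeds());
    }
}
```

In a real custom factory bean the same principle would apply: hand the Configuration a location it can reopen, not a stream that has already been read.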

Hopefully this bug will be fixed soon!
