## Friday, March 30, 2012

### Psychology of Baby Steps Principle

I have borrowed the term "Baby Steps Principle" (BSP) from a colleague, a senior software engineer at work. Full credit for the terminology goes to him, but the general perspective presented in this blog entry is my way of trying to organize some thoughts on it. Apologies for the length of this ramble.

We all, at some point or another, apply BSP: the concept of developing or building things in an incremental fashion -- work on simple, constrained versions of a problem, and use the insights obtained to build up a solution to the complex, harder one. This approach to problem solving is ingrained in our minds, generally from school, irrespective of whether we come from an engineering or a science discipline.

Even though we are quite intimate with this idea, oftentimes we forget to use it. The following situation might seem familiar: you are trying to solve a problem, somewhat overwhelmed as you aggressively throw ideas at it; your subconscious suggests that BSP might work, but you ignore it completely and waste hours, only to ultimately resort to BSP and find the insight that lands the perfect solution. A look at the psychology of BSP might give us a better understanding of how to use it to our advantage.

The dogma of BSP, or incremental development, centers on one cardinal concept: get simpler versions working first, and build on them to reach the complex version. The simpler version of the problem is essentially a highly constrained version of the original. Once a solution to the simpler version is found, it can provide insight into how to relax the constraints and approach the complex problem. Iterating on simplified versions of a problem is the key to solving the original.

We apply BSP in software development very frequently. For example, if you are developing a web MVC (Model View Controller) framework, why not build a bare-bones, minimal version first, get it working, and then build in your features to transform it into the end product? In scaling a system to 1 million users, we usually build a version for 100 or 1,000 users first, and grow that number based on insights from building the relatively small-scale version.

The above examples work because of a multitude of factors, but the key among them might be the motivation one obtains from small, successful milestones. Successful increments give a sense of achievement, which motivates one to move on to the next step, and removes the frustration of getting nowhere while struggling with the original problem.

In software, this greatly helps with debugging as well. It becomes easier to narrow down where a bug might lie after having reached a successful milestone. From the engineer's psychological perspective, this translates to less stress, since identifying bugs becomes easier, which in turn leads to better workplace motivation and increased productivity.

The examples are not limited to software. If you have heard of dual decomposition, you will see BSP playing a role there too. The basic tenet of dual decomposition is to decompose a complex, intractable problem into smaller, tractable ones; solve those smaller problems efficiently; and approximate the solution to the original problem by building up from the solutions to the smaller ones. Figuring out how to decompose the original problem is itself a great milestone, being able to solve each of the decomposed problems is another, and cleverly combining them -- and iterating -- to approximate the solution to the original problem is yet another.
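To make that concrete, here is a generic consensus-form sketch of dual decomposition (a standard textbook formulation, not tied to any particular application): each term gets its own local copy of the shared variable, and Lagrange multipliers enforce agreement.

```latex
% A problem whose terms are coupled through a shared variable x is
% rewritten in consensus form: each term gets a local copy x_i.
\min_{x}\ \sum_{i=1}^{n} f_i(x)
\;\Longleftrightarrow\;
\min_{x_1,\dots,x_n,\,z}\ \sum_{i=1}^{n} f_i(x_i)
\quad \text{s.t.} \quad x_i = z,\ \ i = 1,\dots,n.

% Dualizing the agreement constraints with multipliers \lambda_i
% decouples the problem into n small, tractable subproblems:
g(\lambda) \;=\; \sum_{i=1}^{n}\ \min_{x_i} \bigl[\, f_i(x_i) + \lambda_i^{\top} x_i \,\bigr],
\qquad \text{subject to } \sum_{i=1}^{n} \lambda_i = 0.

% Maximizing g(\lambda) (e.g., by subgradient ascent) and iterating
% recombines the subproblem solutions into an approximation of the
% solution to the original problem.
```

Each inner minimization is a baby step: a small, constrained problem whose solution feeds back into the next iteration on the original one.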

From the advantages outlined above, being able to apply BSP should clearly be useful. How can we ensure we are using it when needed? The answer is quite simple -- listen to that subconscious or gut feeling. If you are feeling uncomfortable or overwhelmed while approaching a particular problem, it might be a good time to take a breather. Take a step back and think about constraining and simplifying the problem. Start working on the simplified version, try to gain some insight, and incrementally remove constraints to reflect more of the original problem; you are likely to hit on an elegant solution. Take advantage of the smaller milestones, and enjoy these successes as you work your way toward solving the larger, complex problem.

The concept of a minimum viable product (MVP) from the Lean principle also reflects the ideas inherent in BSP: build up to a full, successful product by working on smaller milestones, evaluating them, and using the insights obtained to work incrementally on the next steps. I think we can all benefit from BSP, and it is just a matter of listening to that subconscious or gut feeling -- trust me, I am an engineer!

## Wednesday, March 28, 2012

### More data, more noise, or more rare events

We are in a data splurge. Everyone is interested in data: how to gather it efficiently, how to store it, and, most importantly, what to make of it. Data plays a key part in everything from search, ad, and movie recommendation to the development of social-media-based products. We want to know more and more about our users and be more personalized, so we collect more data in an attempt to cover every aspect of our users' likes, preferences, and habits.

But how does one decide what data is worth collecting? Or do we just collect everything we can get our hands on? How does one find the balance between collecting noisy data and capturing those informative events that will give us the crucial insight?

## Monday, March 26, 2012

### Spring Hadoop 1.0.0.M1 has a bug!

There appears to be a bug in the recently released Spring Hadoop 1.0.0.M1. I ran into it while working on an implementation using Spring Hadoop and the Apache Hadoop API. The bug can be reproduced as described below, and after spending many hours with a debugger, I think I have finally narrowed down the cause.

Spring Hadoop lets you define a Hadoop namespace (hdp) and, as described in the reference docs, provides an hdp:configuration tag, as below, to externally configure your namenode, replication, mapred, and other settings. The resources attribute takes Spring resources, used here to specify the location of any external configuration files (custom resources).
<hdp:configuration resources="classpath:/custom-site.xml"/>
The tag is, in essence, a wrapper around a factory bean that produces a Configuration object after parsing the defined resources. When you read in the configuration as above and then try to obtain a FileSystem, for example via
FileSystem.get(conf);
a java.io.IOException: Stream closed exception is thrown. You can see the stack trace in this Stack Overflow post.

If you look at the source code for Spring Hadoop on GitHub, it is essentially Spring layered over the Apache Hadoop API. The interaction between Spring's ConfigurationFactoryBean and Apache Hadoop's Configuration is what causes this exception.

Basically, an input stream is opened for the custom resource (to parse the pre-defined fields into an XML DOM tree) and is closed after the resource is read; when the FileSystem get method subsequently reloads the configuration, the same, already-closed stream is read again, and trying to read that stale input stream throws the IOException.
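The failure mode itself can be reproduced with nothing but the JDK. The sketch below is a simplified analogy of my own (Spring Hadoop's actual stream handling lives inside its factory bean, and the class and method names here are hypothetical), but it shows the same effect of re-reading a closed stream:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;

public class StreamClosedDemo {

    // Mimics the factory bean's behavior: read the resource once, close the
    // stream, then read the same stream object again (as the configuration
    // reload does). Returns true if the second read throws an IOException.
    static boolean reReadFails() throws IOException {
        File resource = File.createTempFile("custom-site", ".xml");
        try (FileWriter w = new FileWriter(resource)) {
            w.write("<configuration/>");
        }

        InputStream in = new FileInputStream(resource);
        in.read();   // first parse of the resource succeeds
        in.close();  // stream is closed once the resource has been read

        boolean failed;
        try {
            in.read();       // second read of the same, stale stream
            failed = false;
        } catch (IOException expected) {
            failed = true;   // java.io.IOException: Stream Closed
        }
        resource.delete();
        return failed;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("second read throws: " + reReadFails());
    }
}
```

The fix, accordingly, has to avoid handing the already-consumed stream back to Hadoop's Configuration.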

One solution, which we are using for now, is to write a custom ConfigurationFactoryBean class that creates a new Configuration instance using the existing one as its parent (possibly passed in as an argument), and adds resources as URLs. If you do not want to write a custom bean, you can use Spring properties and SpEL (Spring Expression Language) to configure any custom settings. In the examples on GitHub, Costin appears to be using a properties file to configure Hadoop-related settings.

Hopefully this bug will be fixed soon!

## Saturday, March 24, 2012

### Principle of least coding

I would not call myself a software engineering veteran. As you read the piece below, think of it as a hodgepodge of ideas that have been on my mind for a while, and an attempt to obtain some clarity in the process of scribbling them down. Please feel free to leave your thoughts and comments.

When I say "Principle of Least Coding", I want to be upfront and mention right away that I do not mean "obscure" coding. The term "least" should not be confused with obscure or obfuscated code. In a lot of cases, one can write an exotic one-liner that performs exactly the same logic, but the obfuscation causes more pain for future developers. Least coding is not just a principle; it is about elegance. If you are a coding veteran, you very likely already know what direction I am heading. A key point in writing this post is to clarify and organize my thoughts on a collection of principles -- which I refer to as the Principle of Least Coding -- that every engineer knows or has heard about, but still oftentimes forgets to incorporate in his or her daily endeavor of creation.

Least coding is the amount of code that just solves the problem of interest, in a comprehensive manner. There are no extraneous lines of logic that perform anything more than necessary to accurately solve the problem at hand, and the code is as general as possible. Such code is also simple, which makes it elegant. When you see such code, you get a feeling of satisfaction and happiness -- the code does just enough to solve the problem at hand, nothing more, nothing less.

There is also an intimate relation between least coding and good design. When one designs a solution well, the amount of code one has to write is minimized. Introducing configurability and minimizing coupling among lines of code or logic are examples of good design that also adhere to the principle of least coding.

One can take a piece of code and actually convert it into least coding. This is what we usually refer to as refactoring: slimming down a piece of code or logic by subtracting out the extra, unnecessary parts to make it lean.
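As a toy illustration (a contrived example of my own, not from any particular codebase), refactoring in this sense replaces hand-rolled, duplicated branching with a leaner expression of the same logic:

```java
public class RefactorDemo {

    // Before: extra branching that restates the same comparison logic.
    static int maxOfThreeVerbose(int a, int b, int c) {
        if (a >= b && a >= c) {
            return a;
        }
        if (b >= a && b >= c) {
            return b;
        }
        return c;
    }

    // After: the same behavior with the duplication subtracted out.
    static int maxOfThree(int a, int b, int c) {
        return Math.max(a, Math.max(b, c));
    }

    public static void main(String[] args) {
        System.out.println(maxOfThreeVerbose(3, 7, 5)); // prints 7
        System.out.println(maxOfThree(3, 7, 5));        // prints 7
    }
}
```

The behavior is unchanged; only the unnecessary branching is gone, which is exactly what makes the refactored version easier to read and harder to get wrong.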

Writing too much code that performs a lot of logic makes it susceptible to bugs. The only piece of code without bugs is the one that has not been written yet; as soon as code is brought into existence in your editor, the chance of bugs creeping in becomes larger than zero. The point being: if you do not write too much code, you minimize the chances of bugs creeping in that will come back to haunt you in the future.

So how can we go about writing code that is least coding? In general, when one works out the design well in advance of writing code -- which, as has been apparent many times from personal and anecdotal experience, is what one should be doing instead of sitting down in front of a terminal and just typing -- one can arrive at simple, generic logic that minimizes the amount of code one needs to write. Applying refactoring to the first draft can then make it lean, transforming the final product into an elegant solution that adheres to the principle of least coding. An iterative thinking process and a continual application of these basic principles can make any junk code beautiful.

## Friday, March 23, 2012

### Setting up a local hadoop cluster

I am running a hadoop cluster on my local machine, for development purposes, in pseudo-distributed mode, in which each hadoop daemon runs in a separate Java process on a single machine.
I thought I would outline the basic setup process here, in case someone else finds it handy. It is actually quite straightforward to get a hadoop cluster running locally. Here are the basic steps:
• You need to download a hadoop distribution. The one I am using is from Cloudera, and you can obtain a tarball from here.
• Expand the tarball into some convenient directory on your local machine. You will find a bin directory which contains all the necessary files for managing hadoop, and a conf directory where the hadoop configurations are located. You will have to modify core-site.xml, hdfs-site.xml, and mapred-site.xml to specify name nodes, replication, etc., and customize your cluster setup. More on this at the bottom.
• You need to ensure remote login works, that is, check that you can ssh localhost without being asked for a password. If you already have a public key, you can run the following command (assuming a DSA key) to enable password-less local ssh login.

cat $HOME/.ssh/id_dsa.pub >>$HOME/.ssh/authorized_keys

• Set the following environment variables, adjusting the paths to your JDK and the directory where you expanded the tarball:

export JAVA_HOME=/path/to/your/jdk
export HADOOP_HOME=/path/to/expanded/hadoop/tarball
export PATH=$PATH:$HADOOP_HOME/bin

• Create a hadooptemp directory under /tmp (matching hadoop.tmp.dir in core-site.xml below) using the following command:

sudo mkdir -m 777 /tmp/hadooptemp

• Format the name node:

hadoop namenode -format

• Bring up pseudo-distributed mode by invoking start-all.sh. You can stop it using stop-all.sh.
You can set up override configurations for hadoop following the examples below.

An example core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadooptemp/${user.name}</value>
  </property>
</configuration>
An example hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop/cache/hadoop/dfs/name</value>
  </property>
</configuration>
And an example mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
  <property>
    <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
    <value>true</value>
  </property>
</configuration>
