## Sunday, September 2, 2012

### Quick note: Installing numpy, scipy, and matplotlib

After some trial and error trying to install NumPy, SciPy, and Matplotlib, it appears the easiest way to do it is via MacPorts. I am using Python 2.7, so I installed the corresponding versions. I already had gcc and gfortran installed.

sudo port install py27-numpy
sudo port install py27-scipy
sudo port install py27-matplotlib

After installation finishes, especially for NumPy and SciPy, do a quick run of the unit tests to make sure your Python environment is set up correctly. I am planning on playing around with the scikit-learn package.
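A quick sanity check of the installs from the MacPorts Python might look like the sketch below (note that NumPy's bundled test suite needs the nose package to be installed as well):

```python
# Import the freshly installed package and print its version as a basic check.
import numpy

print(numpy.__version__)

# These should also import cleanly if the MacPorts ports installed correctly:
# import scipy; print(scipy.__version__)
# import matplotlib; print(matplotlib.__version__)

# Run the bundled unit tests (requires the 'nose' test runner):
# numpy.test()
```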

## Sunday, August 19, 2012

### The best programming advice from the experts

InformIT has a collection of short articles from some of the most prominent programmers, titled "The Best Programming Advice I Ever Got." A quick summary of their advice, for my own reference:

• Take time to understand the error message at the top of an exception/stack trace before making additional changes to code.
• Answer questions -- you will learn a lot in the process of finding answers.
• Stay out of other people's code.
• Think before debugging.
• TDD -- Test Driven Design / Test Driven Development.
• Make code usable before making code reusable.

## Tuesday, June 19, 2012

### Cassandra quick notes (Part I)

I had a number of quick notes on Cassandra, which I thought others might find useful as well. Since my original set of notes is pretty long, I am breaking it up into two parts (this is the first part). If you are interested in the "operations" aspects of Cassandra, I would recommend looking at this book; a lot of these pointers come from it.
• Calculate ideal initial tokens
$\text{init_token} = \text{node_num_zero_indexed} \times \frac{2^{127}}{\text{num_nodes}}$
• Adjust replication factor to work with quorum
$\text{nodes_for_quorum} = \frac{\text{rep_factor}}{2} + 1$
• Anti Entropy Repair
Anti Entropy Repair (AES) is a very intensive data repair mechanism, and should preferably be run at times of low traffic. It can result in duplicate data on nodes, which can be removed using nodetool compact, or can be avoided during the repair process by using the -pr option. AES should be scheduled at intervals no longer than gc_grace_seconds. AES should be run in the following situations:
• Change in replication factor
• Joined nodes without auto bootstrap
• Lost or corrupted files (such as SSTables, indexes, or commit logs)
$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT repair -pr

• Nodetool cleanup
Use nodetool cleanup to remove copies of data from nodes that are no longer responsible for them. Cleanup is intensive; run it after topology changes, or when using hinted handoff with write consistency ANY.

$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT cleanup
• Use nodetool snapshot for backup
Snapshot makes hard links of the files in the data directory under a subfolder, "snapshot/<timestamp>".

$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT snapshot

• Clear snapshots with nodetool

$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT clearsnapshot

• Nodetool to move nodes in ring

$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT move <new_token>

• Nodetool to remove a "downed" node

$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT removetoken <token_value>

When a node is removed, Cassandra actively begins replicating the missing data until it is stored on the number of nodes specified by the replication factor.
• Removing a "live" node

$ <cassandra_home>/bin/nodetool -h $LIVE_NODE_TO_REMOVE -p $PORT decommission

• Get quick stats using nodetool
The following are pretty handy for quickly glancing at Cassandra cluster state.

$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT tpstats
$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT cfstats
$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT compactionstats
$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT cfhistograms

• Monitor GC events in Cassandra log
Cassandra has options in conf/cassandra-env.sh that cause Java to print garbage collection messages to the log file.

$ grep "GC inspection" /var/log/cassandra/system.log
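The initial token and quorum formulas above can be checked with a short sketch (this assumes the RandomPartitioner's 2^127 token space; the helper names are mine):

```python
def initial_token(node_index, num_nodes):
    """Ideal initial token for a zero-indexed node in an evenly balanced ring."""
    return node_index * (2 ** 127 // num_nodes)

def nodes_for_quorum(replication_factor):
    """Replicas that must respond for a QUORUM read or write (integer division)."""
    return replication_factor // 2 + 1

# A hypothetical 4-node ring: tokens are evenly spaced across the token space.
tokens = [initial_token(i, 4) for i in range(4)]
print(tokens)

# With replication factor 3, quorum needs 2 replicas.
print(nodes_for_quorum(3))  # -> 2
```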

## Thursday, June 14, 2012

### My reading list 2012, so far

I am happy to say that I have managed to adhere to my goal of reading regularly. Below is a list of the books I have read so far, halfway into the year:
• Refactoring: Improving the Design of Existing Code by Martin Fowler
• Outliers: The Story of Success by Malcolm Gladwell
• The Lean Startup: How Today's Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses by Eric Ries
• The Armchair Economist: Economics and Everyday Life by Steven Landsburg
> This has been on my reading list for a while, finally got around to it this year.
• Maximum City: Bombay Lost and Found by Suketu Mehta.
(updated 08/19/2012)
• Cassandra High Performance Cookbook by Edward Capriolo.
• Quiet: The Power of Introverts in a World That Can't Stop Talking by Susan Cain.
• Getting Real: The Smarter, Faster, Easier Way to Build a Successful Web Application by 37signals.
(updated 11/15/2012)
• Moonwalking with Einstein: The Art and Science of Remembering Everything by Joshua Foer.

## Thursday, April 26, 2012

### Hitting 1000 miles a year!

I have been biking back and forth to work regularly, as well as running every other day during the week. I have managed to keep this schedule over the last few months. Here is a casual estimate of my running and biking distances (in miles) per year based on this current practice:
• Biking back and forth to work
• 5 days a week, approximately 2 miles each way, for a total of 4 miles per day and 20 miles per week.
• Running three days a week
• Approximately 2 miles every session, for a total of 6 miles per week.
The above adds up to 26 miles a week, which is 104 miles a month, leading to 1248 miles a year. Assuming I stick with this routine for the whole year, that doesn't seem too bad for a cs+math geek, eh!
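The arithmetic above can be sketched out:

```python
# Weekly biking: 5 days, 2 miles each way.
bike_per_week = 5 * (2 * 2)        # 20 miles
# Weekly running: 3 sessions of roughly 2 miles each.
run_per_week = 3 * 2               # 6 miles

per_week = bike_per_week + run_per_week   # 26 miles
per_month = per_week * 4                  # 104 miles, assuming 4 weeks a month
per_year = per_month * 12                 # 1248 miles

print(per_week, per_month, per_year)  # -> 26 104 1248
```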

## Wednesday, April 11, 2012

### Pair Design

A lot of the time when writing software, a pair programming approach is used. It is an agile software development technique in which two programmers work together at one workstation. At any time, one is the "driver" and the other the "navigator", and the roles are switched at intervals. Basically, the "driver" writes the code while the "navigator" simultaneously reviews the code in progress.

I was introduced to this approach during my undergraduate studies, where we had to use it whenever we worked on an assignment in groups of two. Pivotal Labs seems to be using the pair programming approach quite effectively. I am not sure which other companies use this approach, but it seems the pair programming idea can be extended to the design process as well.

Generally, an engineer (let's call this engineer the primary engineer) gets a task -- maybe fixing a bug, improving a feature, or building a new API -- all of which involve thinking through and designing a solution. The primary engineer designs the solution, and the paired engineer is there to discuss it, addressing its limitations, strengths, and possible improvements before coding commences. The paired engineer acts as a sounding board for the primary engineer to think aloud about the proposed solution. Of course, the roles reciprocate.

I am not saying don't evaluate designs as a team; that is very valuable, and is usually reserved for something significant. But for day-to-day design of solutions, having an assigned sounding board can help the primary engineer think aloud about the problem. Paired design partners would rotate, maybe every sprint, for example. This way, each team member also gets a chance to work with others, and an informal review process for designs naturally sets in.

Just a thought.

## Friday, April 6, 2012

### Switching to using MathJax for LaTeX support

I reset my LaTeX support in Blogspot to use MathJax. Looks like it is rendering: $$A = B^T$$

Setting up LaTeX support with MathJax is straightforward. Just put the following before the </head> element in your Blogger template (the configuration goes in its own script block so that it runs before MathJax loads):
<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
      displayMath: [ ['$$','$$'], ["\\[","\\]"] ]
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
  });
</script>
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>

## Wednesday, April 4, 2012

### 14 Design Lessons Learned from Berkeley DB

The original article can be found here, but I wanted to put together a summary list of the design lessons in the article for my own later reference.

1. Software should be designed and built as a cooperating set of modules. Each module should have a well-defined API boundary, which can (and should) change as needed, but a boundary is always present.
2. Software design is a way to force oneself to think through the entire problem, and it is really worth thinking through before writing code.
3. Software architecture will degrade over time; there is no way around it.
4. Naming (variables, methods, functions, etc.) and coding style should be consistent, since good naming and formatting provides a lot of information. Failing to follow expected coding conventions should be considered a significant offense.
5. Make upgrade decisions carefully. Consider carefully which changes go into minor updates versus major releases.
6. In designing a library, respect namespace. Do not make engineers put in an effort to memorize reserved words, constants, etc. just to get your library working without naming collisions.
7. Make functionality shared if it appears more than once. Write a test suite for general-purpose routines. If the code is difficult to write, write and maintain it separately so that other code cannot corrupt it.
8. Use encapsulation and layering, even if it seems unnecessary. Life will be easier down the road.
9. Think stability over increased concurrency.
10. If one is thinking of writing special purpose code to handle something, step back and see if things can be simpler.
11. If one doesn't have the time to make a change right now, then one won't find the chance later.
12. "...Every piece of code should do a small number of things and there should be a high-level design encouraging programmers to build functionality out of smaller chunks of functionality, and so on..." (from Berkeley DB article)
13. Every bug is important.
14. "...Our goal as architects and programmers is to use the tools at our disposal: design, problem decomposition, review, testing, naming and style conventions, and other good habits, to constrain programming problems to problems we can solve...." (from Berkeley DB article)

## Friday, March 30, 2012

### Psychology of Baby Steps Principle

I have borrowed the term "Baby Steps Principle" (BSP) from a colleague, a senior software engineer, at work. Full credit for the terminology goes to him, but the general perspective presented in this blog entry is my way of trying to organize some thoughts on it. Apologies for the length of this ramble.

We all, at some point or another, apply the "Baby Steps Principle" (BSP). BSP is the concept of developing or building things in an incremental fashion -- work on simple, constrained versions of a problem, and use the insights obtained to build up a solution to the complex, harder problem. This approach to solving problems is ingrained in our minds generally from school, irrespective of whether we come from an engineering or science discipline.

Even though we are quite intimate with this idea, we often forget to use it. The following situation might seem familiar: you are trying to solve a problem, somewhat overwhelmed, aggressively applying many ideas; your subconscious is telling you that maybe BSP will work, but you ignore it and waste hours, until you ultimately resort to BSP and find the insight that lands the perfect solution. A look at the psychology of BSP might help us better understand how to use it to our advantage.

The dogma of BSP, or incremental development, centers around this cardinal concept: get simpler versions working first, and build on that to get to the complex version. The simpler version of the problem is essentially a highly constrained version of the original problem. Once a solution to the simpler version is found, it can provide an insight into how to relax the constraints and approach a solution to the complex problem. Iterating on simplified versions of a problem is the key to solving the original problem.

We apply BSP in software development very frequently. For example, if you are developing a web MVC (Model View Controller) framework, why not build a bare, minimal version first, get it working, and then build in your features to transform it into the end product? In scaling a system to 1 million users, we usually build a version for 100 or 1000 users first, and try to grow that number based on insights from building that relatively small-scale version.

The above examples work because of a multitude of factors, but among these, the key might be the motivation one obtains from small successful milestones. Successful increments give a sense of achievement, which motivates one to move on to the next step. That sense of achievement removes the frustration of getting nowhere while struggling with the original problem.

In software, this greatly helps in debugging as well. It becomes less difficult to understand where the bug might lie after having reached a successful milestone. From the psychological perspective of the engineer, this translates to less stress since the process to identify bugs becomes easier, which in turn leads to better workplace motivation and increased productivity.

The examples are not just limited to software. If you have heard of dual decomposition, then you will see BSP also playing a role here. The basic tenet of dual decomposition is to decompose a complex, intractable problem into smaller tractable ones. The idea is to solve these smaller problems efficiently, and approximate the solution to the complex, intractable problem by building up from the solutions to these smaller problems. Figuring out how to decompose the original problem is itself a great milestone, being able to solve each of the decomposed problems is another milestone, followed by combining them cleverly to approximate the solution to the original problem and iterating.

From the cursory advantages outlined above, being able to apply BSP should be quite useful. How can we ensure we are using BSP when needed? The answer is quite simple -- listen to that subconscious or gut feeling. If you are feeling uncomfortable or overwhelmed while approaching a particular problem, it might be a good time to take a breather. Take a step back and think about constraining and simplifying the problem. Start working on the simplified version of the problem. Try to get some insight and incrementally remove constraints to reflect more of the original problem, and you are likely going to hit on an elegant solution to the original problem. Take advantage of the smaller milestones, and enjoy these successes as you work your way towards solving the larger, complex problem.

The concept of the minimum viable product (MVP) from the Lean Startup also reflects the ideas inherent in BSP. Build up to a full, successful product by working on smaller milestones, evaluating them, and using the insights obtained to incrementally work on the next steps. I think we can all benefit from BSP, and it is just a matter of listening to that subconscious or gut feeling -- trust me, I am an engineer!

## Wednesday, March 28, 2012

### More data, more noise, or more rare events

We are in a data splurge. Everyone is interested in data: how to gather it efficiently, how to store it, and most importantly, what to make of it. Data plays a key part in everything from search, ad, and movie recommendation to the development of social-media-based products. We want to know more and more about our users, to be more personalized. So we collect more data, in an attempt to cover every aspect of our users' likes, preferences, and habits.

But how does one decide what data is worth collecting? Or do we just collect everything we can get our hands on? How does one find the balance between collecting noisy data and capturing those informative events that will give us crucial insight?

## Monday, March 26, 2012

### Spring Hadoop 1.0.0.M1 has a bug!

There appears to be a bug in the recently released Spring Hadoop 1.0.0.M1. I ran into this bug while working on an implementation using Spring Hadoop and Apache Hadoop API. The bug can be reproduced as described below, and after spending many hours with a debugger, I think I have finally narrowed down the cause.

Spring Hadoop lets you define a Hadoop namespace (hdp), and, as mentioned in the reference docs, uses an hdp:configuration tag, as below, to externally configure your namenode, replication, mapred, and other settings. The resources attribute is a Spring resource used here to specify the location of any external configuration files (a custom resource).
<hdp:configuration resources="classpath:/custom-site.xml"/>
The tag, in essence, is a wrapper around a factory bean that spits out a Configuration object after parsing the defined resources. When you read in the configuration as above and then try to obtain a FileSystem, for example via
FileSystem.get(conf);
a java.io.IOException: Stream closed exception is thrown. You can see the stack trace in this stackoverflow post here.

If you look at the source code for Spring Hadoop on GitHub, it is essentially Spring wrapped around the Apache Hadoop API. Spring's ConfigurationFactoryBean and Apache Hadoop's Configuration together cause this exception.

Basically, an input stream is opened for the custom resource (to parse the pre-defined fields into an XML DOM tree) and is closed after the resource is read; the subsequent reloading of the configuration by the FileSystem get method causes an IOException because the same stream, which is already closed, is read again. Trying to read the same stale input stream is the issue.
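The failure mode itself is generic and easy to reproduce outside of Spring or Hadoop. Here is a sketch in plain Python (Python raises a ValueError where Java throws java.io.IOException: Stream closed, but the mechanics are the same):

```python
import io

# Stand-in for the custom resource: an in-memory stream over the config XML.
stream = io.StringIO("<configuration/>")

# The factory bean parses the resource once, then closes the stream.
first_parse = stream.read()
stream.close()

# A later "reload" of the configuration tries to read the very same stream.
try:
    stream.read()
except ValueError as e:
    print("error:", e)  # I/O operation on closed file
```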

One solution, which we are using for now, is to write a custom ConfigurationFactoryBean class that creates a new Configuration instance using the existing one as its parent (possibly passed in as an argument), adding the resources as URLs. If you do not want to write a custom bean, you can use Spring properties and SpEL (Spring Expression Language) to configure any custom settings. In the examples on GitHub, Costin appears to be using a properties file to configure Hadoop related settings.

Hopefully this bug will be fixed soon!

## Saturday, March 24, 2012

### Principle of least coding

I would not call myself a software engineering veteran. As you read the piece below, think of it as a hodgepodge of ideas that have been on my mind for a while, and an attempt to obtain some clarity in the process of scribbling this. Please feel free to leave your thoughts and comments.

When I say "Principle of Least Coding", I want to be upfront and mention right away that I do not mean "obscure" coding. The term "least" should not be confused with obscure or obfuscated code. In a lot of cases, one can write an exotic one-liner that performs exactly the same logic, but the obfuscation causes more pain for future developers. Least coding is not just a principle, but a matter of elegance. If you are a coding veteran, you very likely already know what direction I am heading. A key point in writing this post is to try to clarify and organize my thoughts on a collection of principles, which I refer to as the Principle of Least Coding, which every engineer knows or has heard about, but still often forgets to incorporate in his or her daily endeavor of creation.

Least coding is the amount of code that just solves the problem of interest, in a comprehensive manner. There are no extraneous lines of logic that perform anything more than necessary to accurately solve the problem at hand. The code is as general as possible. Such code is also simple, which makes it elegant. When you see such code, you get a feeling of satisfaction and happiness -- the code does just enough to solve the problem at hand, nothing more and nothing less.

There is also an intimate relation between least coding and good design. When one designs a solution well, the amount of code one has to write is minimized. Introducing configurability and minimizing coupling among lines of code or logic are examples of good design that also adhere to the principle of least coding.

One can take a piece of code and actually convert it into least coding. This is what we usually refer to as refactoring. Refactoring slims down a piece of code or logic by subtracting out extra, unnecessary parts to make it lean.

Writing too much code that performs a lot of logic makes it susceptible to bugs. The best piece of bug-free code is the one that has not been written yet. As soon as code is brought into existence in your editor, the chance of bugs creeping in becomes larger than zero. The point being: if you do not write too much code, the chances of bugs creeping in and coming back to haunt you in the future are minimized.

So how can we go about writing code that adheres to least coding? In general, when one works out the design well in advance before writing code -- which is something one should be doing instead of just sitting down in front of a terminal and starting to type -- one can arrive at simple, generic logic that minimizes the amount of code one needs to write (as has been apparent many times from personal and anecdotal experience). Applying refactoring to the first draft of code can make it lean, transforming the final product into an elegant solution that adheres to the principle of least coding. An iterative thinking process, and a continual application of basic principles, can make any junk code beautiful.

## Friday, March 23, 2012

### Setting up a local hadoop cluster

I am running a Hadoop cluster on my local machine, for development purposes, in pseudo distributed mode. In pseudo distributed mode, the Hadoop daemons run in separate processes.
Just thought I would outline the basic setup process here, in case someone else finds it handy. It is actually quite straightforward to get a Hadoop cluster running locally. Here are the basic steps:
• You need to download a Hadoop distribution. The one I am using is from Cloudera, and you can obtain a tarball from here.
• Expand the tarball into some convenient directory on your local machine. You will find a bin directory which contains all necessary files for managing Hadoop, and a conf directory where the Hadoop configurations are located. You will have to modify core-site.xml, hdfs-site.xml, and mapred-site.xml to specify name nodes, replication, etc. to customize your cluster setup. More on this at the bottom.
• You need to ensure remote login works; that is, check that you can ssh localhost without being asked for a password. If you already have a public key, you can run the following command (assuming a DSA key) to enable password-less local ssh login.

cat $HOME/.ssh/id_dsa.pub >> $HOME/.ssh/authorized_keys

• Set the following environment variables:

• Create a hadooptemp directory under /tmp (see hadoop.tmp.dir in core-site.xml below) using the following command:

sudo mkdir -m 777 /tmp/hadooptemp

• Format the name node:

hadoop namenode -format

• Bring up pseudo mode by invoking start-all.sh. You can stop pseudo mode using stop-all.sh.
You can set up override configurations for hadoop following the examples below.

An example core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadooptemp/${user.name}</value>
  </property>
</configuration>
An example hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop/cache/hadoop/dfs/name</value>
  </property>
</configuration>
And an example mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
  <property>
    <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
    <value>true</value>
  </property>
</configuration>
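A quick way to sanity-check the XML overrides before starting the daemons is to parse them back out; a small sketch using only the Python standard library (the inline XML here is just a stand-in for your conf/core-site.xml):

```python
import xml.etree.ElementTree as ET

# Stand-in for the contents of conf/core-site.xml.
CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>"""

root = ET.fromstring(CORE_SITE)
# Collect every <property> into a name -> value dict.
props = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
print(props["fs.default.name"])  # -> hdfs://localhost:8020
```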
