Sunday, September 2, 2012

Quick note: Installing NumPy, SciPy, and Matplotlib

After some trial and error in trying to install NumPy, SciPy, and Matplotlib, it appears the easiest way to do it is via MacPorts. I am using Python 2.7, so I installed the corresponding versions. I already had gcc and gfortran installed.

sudo port install py27-numpy
sudo port install py27-scipy
sudo port install py27-matplotlib

After installation finishes, especially for NumPy and SciPy, do a quick run of the unit tests to make sure your Python environment is set up correctly; a minimal check is sketched below. I am planning on playing around with the scikit-learn package.
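
For a quick sanity check, something like the following should work (the test() helpers ship with NumPy and SciPy, though in these versions they rely on the nose package; the full suites can take a few minutes):

import numpy
import scipy

# Confirm the MacPorts-installed versions are the ones being imported.
print(numpy.__version__)
print(scipy.__version__)

numpy.test()  # run the NumPy unit test suite
scipy.test()  # run the SciPy unit test suite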

Sunday, August 19, 2012

The best programming advice from the experts

InformIT has a collection of short articles from some of the most prominent programmers, titled "The Best Programming Advice I Ever Got." A quick summary of their advice, for my own reference:

  • Write less code (I have written about this before).
  • Take time to understand the error message at the top of an exception/stack trace before making additional changes to code.
  • Read, read, and read -- just make sure what you read is high-quality, informative material.
  • Answer questions -- you will learn a lot in the process of finding answers.
  • Stay out of other people's code.
  • Think before debugging.
  • TDD -- Test Driven Design / Test Driven Development.
  • Make code usable before making code reusable.

Tuesday, June 19, 2012

Cassandra quick notes (Part I)

I had a number of quick notes on Cassandra, which I thought others might find useful as well. Since my original set of notes is pretty long, I am breaking it up into two parts (this is the first part). If you are interested in the "operations" aspects of Cassandra, I would recommend looking at this book, from which a lot of these pointers are drawn.
  • Calculate ideal initial tokens (this formula and the quorum formula below are sketched in code at the end of this post)
$ \text{init\_token} = \text{node\_num\_zero\_indexed} \times \frac{2^{127}}{\text{num\_nodes}} $
  • Adjust replication factor to work with quorum
$ \text{nodes\_for\_quorum} = \left\lfloor \frac{\text{rep\_factor}}{2} \right\rfloor + 1 $
  • Anti Entropy Repair
Anti Entropy Repair (AES) is a very intensive data repair mechanism and should preferably be run at times of low traffic. It can result in duplicate data on nodes, which can be removed using nodetool compact, or limited up front by repairing only each node's primary range with the -pr option. The AES schedule interval should be less than or equal to gc_grace_seconds. AES should be run in the following situations:
    • Change in replication factor
    • Joined nodes without auto bootstrap
    • Lost or corrupted files (such as SSTables, indexes, or commit logs)
$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT repair -pr
  • Nodetool cleanup
Use nodetool cleanup to remove copies of data that a node is no longer
responsible for. Cleanup is intensive; run it after topology changes, or when
using hinted handoff with write consistency ANY.
$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT cleanup
  • Use nodetool snapshot for backup
Snapshot makes hard links of the files in the data directory under a subfolder,
"snapshots/<timestamp>"
$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT snapshot
  • Clear snapshots with nodetool
$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT clearsnapshot
  • Nodetool to move nodes in ring
$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT move <new_token>
  • Nodetool to remove a "downed" node
$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT removetoken <token_value>
When a node is removed, Cassandra actively begins replicating the missing data until it is stored on the number of nodes specified by the replication factor.
  • Removing a "live" node
$ <cassandra_home>/bin/nodetool -h $LIVE_NODE_TO_REMOVE -p $PORT decommission
  • Get quick stats using nodetool
The following are pretty handy for quickly gleaning the state of a Cassandra cluster.
$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT tpstats

$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT cfstats

$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT compactionstats

$ <cassandra_home>/bin/nodetool -h $HOST -p $PORT cfhistograms <keyspace> <column_family>
  • Monitor GC events in Cassandra log
Cassandra has options in conf/cassandra-env.sh that cause the JVM to print
garbage collection messages to the log file.
$ grep "GC inspection" /var/log/cassandra/system.log
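
A minimal Python sketch (mine, not from the book) of the two formulas at the top of this list, assuming a hypothetical 4-node cluster with a replication factor of 3:

num_nodes = 4   # hypothetical cluster size
rep_factor = 3  # hypothetical replication factor

# Ideal initial tokens for the RandomPartitioner's 0..2**127 token space.
for node in range(num_nodes):
    print("node %d: initial_token = %d" % (node, node * 2 ** 127 // num_nodes))

# Number of replicas that must respond for a QUORUM read or write.
nodes_for_quorum = rep_factor // 2 + 1
print("nodes for quorum: %d" % nodes_for_quorum)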

Thursday, June 14, 2012

My reading list 2012, so far

I am happy to say that I have managed to adhere to my goal of reading regularly. Below is a list of the books I have read so far, halfway into the year:
  • Refactoring: Improving the Design of Existing Code by Martin Fowler
  • Outliers: The Story of Success by Malcolm Gladwell
  • The Lean Startup: How Today's Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses by Eric Ries
  • The Armchair Economist: Economics and Everyday Life by Steven Landsburg
    > This has been on my reading list for a while, finally got around to it this year.
  • Maximum City: Bombay Lost and Found by Suketu Mehta.
(updated 08/19/2012)
  • Cassandra: High Performance Cookbook by Edward Capriolo.
  • Quiet: The Power of Introverts in a World That Can't Stop Talking by Susan Cain.
  • Getting Real: The Smarter, Faster, Easier Way to Build a Successful Web Application by 37signals.
(updated 11/15/2012)
  • Moonwalking with Einstein: The art and science of remembering everything by Joshua Foer.

Thursday, April 26, 2012

Hitting 1000 miles a year!

I have been biking back and forth to work regularly, as well as running every other day during the week. I have managed to keep to this schedule over the last few months. Here is a casual estimate of my running and biking distances (in miles) a year based on this current practice:
  • Biking back and forth to work
    • 5 days a week, approximately 2 miles each way, for a total of 4 miles per day and 20 miles per week.
  • Running three days a week
    • Approximately 2 miles every session, for a total of 6 miles per week.
The above adds up to 26 miles in a week, which is 104 miles a month (assuming four weeks per month), leading to 1248 miles a year. Assuming I stick with this routine for the whole year, that doesn't seem too bad for a cs+math geek, eh!
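Spelling out the arithmetic:
$$ (5 \times 4) + (3 \times 2) = 26 \;\text{mi/week}, \quad 26 \times 4 = 104 \;\text{mi/month}, \quad 104 \times 12 = 1248 \;\text{mi/year} $$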

Wednesday, April 11, 2012

Pair Design

A lot of times when writing software, a pair programming approach is used. It is an agile software development technique in which two programmers work together at one workstation. At any time, one is the "driver" and the other the "navigator", and the roles are switched at intervals. Basically, the "driver" writes the code while the "navigator" simultaneously reviews the code in progress.

I was introduced to this approach during my undergraduate studies, where we had to use it whenever we worked on an assignment in groups of two. Pivotal Labs seems to be using pair programming quite effectively. I am not sure which other companies use this approach, but it seems the pair programming idea can be extended to the design process as well.

Generally an engineer (let's call this engineer the primary engineer) gets a task -- maybe fixing a bug, improving a feature, or building a new API -- all of which involve thinking through and designing a solution. The primary engineer designs the solution, and the paired engineer is there to discuss it, addressing its limitations, strengths, and possible improvements before coding commences. The paired engineer acts as a sounding board for the primary engineer to think aloud about the proposed solution. Of course, the roles are reciprocal.

I am not saying don't evaluate designs as a team; that is very valuable, and is usually reserved for something significant. But for the day-to-day design of solutions, having an assigned sounding board can help the primary engineer think aloud about the problem. The design pairs will rotate, maybe every sprint, for example. This way, each team member will also get a chance to work with others, and an informal review process for designs will naturally set in.

Just a thought.

Friday, April 6, 2012

Switching to using MathJax for LaTeX support

I reset my LaTeX support in blogspot to use MathJax. Looks like it is rendering: $$A = B^T$$

Setting up LaTeX support using MathJax is straightforward. Just put the following script before the closing </head> tag in your Blogger template.
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js">
  MathJax.Hub.Config({
    extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
      displayMath: [ ['$$','$$'], ["\\[","\\]"] ]
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
  });
</script>

Wednesday, April 4, 2012

14 Design Lessons Learned from Berkeley DB

The original article can be found here, but I wanted to put together a summary list of the design lessons in the article for my own later reference.

  1. Software should be designed and built as a cooperating set of modules. Each module should have a well-defined API boundary, which can (and should) change as needed, but a boundary is always present.
  2. Software design is a way to force oneself to think through the entire problem, and it is really worth thinking through before writing code.
  3. Software architecture will degrade over time; there is no way around it.
  4. Naming (variables, methods, functions, etc.) and coding style should be consistent, since good naming and formatting provides a lot of information. Failing to follow expected coding conventions should be considered a significant offense.
  5. Make upgrade decisions carefully, and consider minor updates to major releases just as carefully.
  6. In designing a library, respect the namespace. Do not make engineers memorize reserved words, constants, etc. just to get your library working without naming collisions.
  7. Make functionality shared if it appears more than once. Write a test suite for general purpose routines. If the code is difficult to write, write and maintain it separately so that other code cannot corrupt it.
  8. Use encapsulation and layering, even if it seems unnecessary. Life will be easier down the road.
  9. Think stability over increased concurrency.
  10. If one is thinking of writing special purpose code to handle something, step back and see if things can be simpler.
  11. If one doesn't have the time to make a change right now, then one won't find the chance later.
  12. "...Every piece of code should do a small number of things and there should be a high-level design encouraging programmers to build functionality out of smaller chunks of functionality, and so on..." (from Berkeley DB article)
  13. Every bug is important.
  14. "...Our goal as architects and programmers is to use the tools at our disposal: design, problem decomposition, review, testing, naming and style conventions, and other good habits, to constrain programming problems to problems we can solve...." (from Berkeley DB article)

Friday, March 30, 2012

Psychology of Baby Steps Principle

I have borrowed the term "Baby Steps Principle" (BSP) from a colleague, a senior software engineer at work. Full credit for the terminology goes to him, but the general perspective presented in this blog entry is my way of trying to organize some thoughts on it. Apologies for the length of this ramble.

We all, at some point or another, apply the "Baby Steps Principle" (BSP). BSP is the concept of developing or building things in an incremental fashion -- work on simple, constrained versions of a problem, and use the insights obtained to build up a solution to the complex, harder problem. This approach to solving problems is generally ingrained in our minds from school, irrespective of whether we come from an engineering or science discipline.

Even though we are quite intimate with this idea, oftentimes we forget to use it. The following situation might seem quite familiar: you are trying to solve a problem, somewhat overwhelmed as you aggressively apply many ideas at once; your subconscious tells you that maybe BSP will work, but you completely ignore it, waste hours, and ultimately resort to BSP, finding the insight that helps land the perfect solution. A look at the BSP psyche might help us get a better understanding of how to use BSP to our advantage.

The dogma of BSP, or incremental development, centers around this cardinal concept: get simpler versions working first, and build on that to get to the complex version. The simpler version of the problem is essentially a highly constrained version of the original problem. Once a solution to the simpler version is found, it can provide an insight into how to relax the constraints and approach a solution to the complex problem. Iterating on simplified versions of a problem is the key to solving the original problem.

We apply BSP in software development very frequently. For example, if you are developing a web MVC (Model View Controller) framework, why not build a bare-bones version first, get it working, and then start building in your features to transform it into the end product? In scaling a system to 1 million users, we usually build a version for 100 or 1000 users first, and try to grow that number based on insights from building that relatively small-scale version.

The above examples work because of a multitude of factors, but among them, the key might be the motivation one obtains from small successful milestones. Successful increments give a sense of achievement, which motivates one to move on to the next step. This sense of achievement removes the frustration of not getting anywhere while struggling with the original problem.

In software, this greatly helps in debugging as well. It becomes easier to narrow down where a bug might lie after having reached a successful milestone. From the psychological perspective of the engineer, this translates to less stress, since the process of identifying bugs becomes easier, which in turn leads to better workplace motivation and increased productivity.

The examples are not just limited to software. If you have heard of dual decomposition, then you will see BSP also playing a role here. The basic tenet of dual decomposition is to decompose a complex, intractable problem into smaller tractable ones. The idea is to solve these smaller problems efficiently, and approximate the solution to the complex, intractable problem by building up from the solutions to these smaller problems. Figuring out how to decompose the original problem is itself a great milestone; being able to solve each of the decomposed problems is another; and combining them cleverly to approximate the solution to the original problem, then iterating, is yet another.
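
As a concrete sketch (my notation, not tied to any particular text), dual decomposition splits a coupled objective by duplicating the shared variable and relaxing the coupling constraint with a Lagrange multiplier $\lambda$:

$$ \min_{x} \; f(x) + g(x) \quad \Longleftrightarrow \quad \min_{x, z} \; f(x) + g(z) \;\; \text{subject to} \;\; x = z $$

$$ q(\lambda) = \min_{x} \left[ f(x) + \lambda^{T} x \right] + \min_{z} \left[ g(z) - \lambda^{T} z \right] $$

Each inner minimization is one of the smaller, tractable subproblems, and maximizing the dual $q(\lambda)$ -- for example, by subgradient ascent -- coordinates their solutions into an approximation of the original problem.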

Given the advantages outlined above, being able to apply BSP should be quite useful. How can we ensure we are using BSP when needed? The answer is quite simple -- listen to that subconscious or gut feeling. If you are feeling uncomfortable or overwhelmed while approaching a particular problem, it might be a good time to take a breather. Take a step back and think about constraining and simplifying the problem. Start working on the simplified version, try to get some insight, and incrementally remove constraints to reflect more of the original problem; you are likely to hit on an elegant solution. Take advantage of the smaller milestones, and enjoy these successes as you work your way towards solving the larger, complex problem.

The concept of the minimum viable product (MVP) from the Lean Startup also reflects the ideas inherent in BSP: build up to a full, successful product by working on smaller milestones, evaluating them, and using the insights obtained to incrementally work on the next steps. I think we can all benefit from BSP, and it is just a matter of listening to that subconscious or gut feeling -- trust me, I am an engineer!

Wednesday, March 28, 2012

More data, more noise, or more rare events

We are in a data splurge. Everyone is interested in data: how to gather it efficiently, how to store it, and most importantly, what to make of it. Data plays a key part in everything from search and ad or movie recommendation to the development of social-media-based products. We want to know more and more about our users and be more personalized. So we collect more data, in an attempt to cover every aspect of our users' likes, preferences, and habits.

But how does one decide what data is worth collecting? Or do we just collect everything we can get our hands on? How does one find the balance between collecting noisy data and capturing those rare, informative events that will give us the crucial insight?