andrew makes things

a blog about programming, machine learning, and whatever else I'm working on

Parsing Bash in JavaScript in Chrome With Browserify

For a side project, I wanted to be able to use js-shell-parse to parse complex Bash commands in JavaScript, in a Chrome extension. (More on this craziness in a future post!)

The js-shell-parse library is targeted at node, and it makes frequent use of require and of various npm packages. Being a node noob, I hadn’t used Browserify before, but it turned out to be exactly what I needed: a tool to bundle complex dependency chains of node packages for the browser. Here are the steps to convert js-shell-parse into a single, compiled bundle:

# Clone the repo
git clone

# Install the various npm dependencies and browserify (you may need to use sudo)
npm install -g pegjs pegjs-override-action isarray array-map browserify

# Run the included build script
node build.js > js-shell-parse.js

# Create a very simple loader script called 'loader.js' that contains one line
echo "window.jsShellParse = require('./js-shell-parse');" > loader.js

# Run browserify on the loader, which will parse the AST and bundle all dependencies
browserify loader.js -o compiled-js-shell-parse.js

Finally, you can include the output in any website!

<script src="compiled-js-shell-parse.js"></script>

and try this in the console:

var structure = jsShellParse('echo "The date is: `date`" > output');
"[{"type":"command","command":{"type":"literal","value":"echo"},"args":[{"type":"concatenation","pieces":[{"type":"literal","value":"The date is: "},{"type":"command-substitution","commands":[{"type":"command","command":{"type":"literal","value":"date"},"args":[],"redirects":[],"env":{},"control":";","next":null}]}]}],"redirects":[{"type":"redirect-fd","fd":1,"op":">","filename":{"type":"literal","value":"output"}}],"env":{},"control":";","next":null}]"

If you’re playing with js-shell-parse, the tests are helpful to see what kinds of shell/bash commands it can parse. (Pretty much everything!)

Never Forget Your Umbrella Again, With Huginn

Huginn is a tool that I’ve been working on, with the support of generous open source collaborators, for about a year. Huginn is a light-weight infrastructure for building data-gathering and data-reacting tasks for your everyday life. Think of it as an open source Yahoo! Pipes, IFTTT, or Zapier. It wouldn’t surprise me if, in the future, Huginn evolves into something like Google Now, but without the creepiness factor, because you control and host your own data.

I haven’t done a very good job of sharing all of the things that can be built with Huginn, but I’m resolved to start.

So, in this post, a very simple example:

Problem: I always forget to check the weather, leave my umbrella at home, and get soaked.
Solution: Let Huginn send you an email (or SMS) each morning when rain is expected in your area.
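In Huginn, this flow is a small chain of agents. Sketched roughly below (WeatherAgent, TriggerAgent, and EmailAgent are real Huginn agents, but the option keys are from memory and vary by version, so treat them as assumptions):

```
WeatherAgent: checks tomorrow's forecast for your location each morning
    { "location": "94103", "api_key": "your-weather-api-key" }

TriggerAgent: passes events along only when the forecast mentions rain
    { "rules": [{ "path": "conditions", "type": "regex", "value": "rain" }] }

EmailAgent: emails you each event it receives
    { "subject": "Rain expected today, bring your umbrella!" }
```

Each agent consumes the events emitted by the one before it, so the email only fires on rainy-forecast mornings.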

Archive a PDF of Your Posterous Blog

My wife and I had a private travel blog on Posterous. Unfortunately, Posterous was acqui-hired by Twitter and is shutting down, so I spent a few minutes figuring out how to save a PDF of the blog. Chrome and Firefox both did a poor job of saving a decent-looking PDF of my long blog, so I installed wkhtmltopdf and made one myself. Here’s how.

Install wkhtmltopdf, then run:

wkhtmltopdf --enable-plugins --margin-bottom 0 --margin-top 0 --margin-left 0 --margin-right 0 "" archive.pdf

If your blog has a password, log in with a browser and copy the session cookie. Firefox makes this easy: in the Developer Toolbar, enter "cookie list".

wkhtmltopdf --cookie cookiename cookievalue ...

Finally, if you’re on a Mac and used the pre-built wkhtmltopdf disk image, you can still run it from the command line:

/path/to/ ...

Command Line Accounting With Ledger and Reckon, an Update

I’ve been using ledger, combined with a custom Ruby gem called reckon, to balance my small business’s accounts for the last few years. The command line, Bayesian statistics, and Double Entry Accounting! What could be better? Here’s how I do it.

First, I export the year’s transaction history from Chase (in my case) and save it as a CSV file called chase-2012.csv. It looks something like this:

Type,Post Date,Description,Amount
DEBIT,12/31/2012,"ODESK***BAL-27DEC12 650-12345 CA           12/28",-123.45
DEBIT,12/24/2012,"ODESK***BAL-20DEC12 650-12345 CA           12/21",-123.45
DEBIT,12/24/2012,"GH *GITHUB.COM     FP 12345 CA        12/23",-12.00

Then, I make a new ledger file called 2012.dat and start it with:

2012/01/01 * Checking
    Assets:Bank:Checking            $10,000.00
    Equity:Opening Balances

Where the $10,000.00 is the hypothetical starting balance of my bank account on the first day of 2012. Since I’ve been using ledger, this is just the balance of the account from the summary that I generated at the end of 2011.

Now, I run reckon, initially with the -p option to see its analysis of the CSV file:

> reckon -f chase-2012.csv -v -p --contains-header

What is the account name of this bank account in Ledger? |Assets:Bank:Checking| 
I didn't find a high-likelyhood money column, but I'm taking my best guess with column 4.
| Date       | Amount     | Description                                              |
| ...        | ...        | ...                                                      |
| 2012/12/24 | -$12.00    | DEBIT; GH *GITHUB.COM FP 12345 CA 12/23                  |
| 2012/12/24 | -$123.45   | DEBIT; ODESK***BAL-20DEC12 650-12345 CA 12/21            |
| 2012/12/31 | -$123.45   | DEBIT; ODESK***BAL-27DEC12 650-12345 CA 12/28            |

It looks like reckon has guessed the correct columns from the CSV, so now I run it in “learning” mode. It loads in my data from 2011 and uses it to guess at labels for my 2012 data, using a simple Naive Bayes classifier.

> reckon -f chase-2012.csv -v -o 2012.dat -l 2011.dat --contains-header

| 2012/12/24 | -$12.00 | DEBIT; GH *GITHUB.COM FP 12345 CA 12/23        |
To which account did this money go? ([account]/[q]uit/[s]kip) |Expenses:Web Hosting:Github| 
| 2012/12/24 | -$123.45 | DEBIT; ODESK***BAL-20DEC12 650-12345 CA 12/21   |
To which account did this money go? ([account]/[q]uit/[s]kip) |Expenses:Programming| 
| 2012/12/31 | -$123.45 | DEBIT; ODESK***BAL-27DEC12 650-12345 CA 12/28   |
To which account did this money go? ([account]/[q]uit/[s]kip) |Expenses:Programming|

In each of these cases, the Bayesian classifier correctly guessed the appropriate label for these expenses based on last year’s data.
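Each transaction I confirm gets appended to 2012.dat as an ordinary ledger entry. For the GitHub charge above, the result looks roughly like this (a sketch of the entry format, not reckon’s verbatim output):

```
2012/12/24 GH *GITHUB.COM FP 12345 CA 12/23
    Expenses:Web Hosting:Github              $12.00
    Assets:Bank:Checking
```

The second posting is left blank, so ledger infers the balancing -$12.00 from the checking account.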

Now, with a fully-updated 2012.dat file, I run it through ledger and get the following hypothetical results:

> ledger -f 2012.dat -s bal

        $20,000  Assets:Bank:Checking
       $-10,000  Equity:Opening Balances
        $258.90  Expenses
        $246.90    Programming
         $12.00    Web Hosting
         $12.00      Github
    $-10,258.90  Income
    $-10,258.90    Some source of income that makes this math work

I like to do all of this work inside of my Dropbox folder in case I delete or overwrite a file by mistake.

Want to do all of this yourself? Start by visiting the ledger site, or by installing ledger with Homebrew:

> brew install ledger

Then install reckon:

> (sudo) gem install reckon

Have fun!

Reckon is available on GitHub, and, for updates, you should follow me on Twitter.

Running Ruby Inside of Ruby (in the Best Way Ever)

There are no good Ruby sandboxing options right now. You can sort of use $SAFE levels and taint checking, you can sort of use Shikashi, you can use the secure gem to run in a separate process, and you can, with much care, use chrooted and jailed virtual machines or Linux containers. None of these options met my exacting standards, meaning they’re not ridiculous. Therefore, I’m introducing…


An unholy amalgam of therubyracer’s V8 engine and emscripted-ruby to allow a truly sandboxed Ruby-on-Ruby environment.

Check it out on GitHub!

Compressing Code

What can we learn about a code base or a language based on its compressibility? My pet theory is that less compressible code will be, on average, better code, because less compressible code implies more factoring, more reuse, and fewer repetitions.

[Chart: compressibility by library and language. Larger numbers (longer bars) indicate more compressibility.]

Below are the compressibility results for some popular libraries and languages. To generate this data, I downloaded each package or library, extracted the source files, removed comments, and gzipped the files with maximum compression. The numbers represent the ratio of uncompressed source file size to compressed source file size. Smaller ratios imply less compressible code.

Unsurprisingly, Java was the most compressible language- all that boilerplate! Python was the least compressible language, with Java, on average, being about twice as compressible. I was somewhat surprised that JavaScript was the second best language in terms of incompressibility.

It’s probably disingenuous to draw strong conclusions from these results, but I still find them intriguing. What, if anything, do you think they mean?

Aside: Compression itself is a fascinating subject. Being able to compress something is fundamental to being able to understand it. If you can rewrite a 2-page document into 2 paragraphs, while still expressing its core ideas, you’ve deeply understood the material. Hence the existence of the Hutter Prize, a standing challenge in Artificial Intelligence to further compress a corpus of English text. Hence, also, the existence of specialized image compression algorithms that compress human faces better than anything else because they understand, in software, what a human face generally looks like.

If I can compress an image of your face, I can probably also recognize it. Imagine that I have thousands of photos of faces. Using some linear algebra and creative encodings, I can figure out the commonalities and differences among these faces. Basically, I can derive a set of common noses, a set of common eyes, a set of common brows… and, given a new face photo, I can compute a mixture of these common attributes for the new face. Perhaps it, roughly speaking, has 60% of Common Nose 6 and 40% of Common Nose 12. Well, then I can represent the picture of this new nose as roughly two numbers, the “amounts” of Nose 6 and of Nose 12, and suddenly I’ve compressed a collection of hundreds or thousands of pixels- a photo of a nose- into just two numbers. We could also go in reverse, taking a photo, calculating the percentages of different common features, and then looking up in a database to see who we know whose face expresses those same feature percentages. We can compress, and, thus, we can recognize. (Interested in this? See Eigenfaces as well as PCA and ICA for face analysis.)

My Experiences With Personal Outsourcing

Over the last few years I’ve been experimenting with outsourcing. I’ve done this both for personal and business projects. In the personal domain, some people call this “lifesourcing”: the practice of modularizing and outsourcing parts of your life that you don’t enjoy so that you can maximize the parts that you do. It’s outsourcing (with many of the same pros and cons), but for your personal life.

A growing number of sites have popped up recently to facilitate lifesourcing, and while these sites aren’t strictly needed- you can still find skilled people to help you on Craigslist, for example- they make this sort of outsourcing even easier.

I’d like to talk about some tips and tricks, but first, let me list a few of the things that I have outsourced over the last couple of years.

Personal Things

  • My wife and I outsourced handwritten, cursive wedding invitations on TaskRabbit. (My mother-in-law preferred them to be handwritten, my handwriting stinks, and my wife didn’t have the time.)
  • Also for our wedding, someone on Fiverr polished our save-the-date photo.
  • After the wedding, we hired a wonderful, well-traveled woman on TaskRabbit to help plan our honeymoon in South America.
  • We paid a TaskRabbit to scan all of our wedding cards for posterity.
  • Currently, we have a virtual assistant from oDesk who helps us with the occasional life task. She has proofread documents and called gyms around San Francisco, looking for ones with good pools.
  • We’ve given friends custom, hand-painted watercolor birthday cards from Fiverr and custom wedding presents from Etsy. (One couple swears it’s their favorite wedding present!) We designed the art (roughly), and the artists skillfully brought it to life.
  • When I have outstanding questions, I turn to Aardvark, Yahoo! Answers, and other outsourced question answering services. I would gladly pay for a better one.
  • A carpenter on Craigslist designed and built a custom, adjustable standing desk for me. (Arguably outsourcing, arguably not.)

(Micro-)business Things

  • I’ve hired artists and logo designers on Fiverr. (I actually bought some excellent artwork and had an ongoing business relationship with an artist who I found for $5 on Fiverr.)
  • Workers on oDesk and TaskRabbit have helped me brainstorm domain names.
  • I’ve paid users on Mechanical Turk and later on oDesk to label data for me for some Machine Learning research.
  • I’ve brainstormed with a worker from Coffee & Power.
  • oDeskers have also written software, created graphics, written blog posts, sent emails, maintained website communities, and researched ideas for me.

Okay, so clearly I’ve experimented with this a fair bit. Here are some of the things that I’ve learned:

  • Outsourcers take maintenance. It only makes sense to outsource something that would take you more than some threshold amount of time to do yourself. Repetitive tasks are great candidates.
  • Be creative- what could you do, if only you had the time?
  • Outsource things you’re bad at, or simply hate doing.
  • Art is a great thing to outsource. Finding someone’s work that you like on Etsy is fun and addictive and custom gifts make a lasting impression, often costing the same as something far more mundane.
  • If you’re trying to get art off of Fiverr, I recommend contacting 5-10 different providers, having them all do the work for $5 each, and then continuing to work with your favorite. This same strategy, of redundant hiring and then consolidating, works well across many lifesourcing and outsourcing domains.
  • You should think about hiring people on oDesk in the same way as you would any other interview process. Ask to see work, look at portfolios, and, ideally, provide interview challenges that directly map to the work they will be doing for you. In my case, when I hired someone to maintain one of my websites, my interview questions revolved around writing example emails and deciding which links were worth posting. When I hired people to classify a dataset, I gave them access to the real classification application and had them do a sample set. If they did well, I hired them.
  • If you’re going to go through the trouble of interviewing and hiring on oDesk, I strongly recommend codifying your interview and training instructions as reusable documents. When your current worker(s) inevitably leave or flake out, you can hire and train the next set more quickly. You can also hire more than one at a time for added redundancy.
  • There are a shockingly large number of people on this planet who speak (nearly) perfect English, have sharp wits, and are looking for work. If you have tasks that you can pay them a fair wage to solve, you’re helping everyone. And remember, a fair wage in the Philippines (where many people speak English perfectly) is significantly less than in the US. Do your cost of living research and pay fairly and generously!
  • Accountability and incentives are important. I left Mechanical Turk and instead interviewed and hired individuals on oDesk for data labeling tasks because I received better-quality work and more consistency in whom I was working with.

Are you a lifesourcer or a micro-outsourcer? What have you learned?