andrew makes things

a blog about programming, machine learning, and whatever else I'm working on

Machine Learning Project Ideas

Ryan Stout and I are giving a talk at RailsConf about Machine Learning tomorrow. To go along with the talk, here is a list of project ideas to get your creative juices flowing:

  • A robust email and mailing address typo corrector for web forms.
  • A Rickroll-detecting browser plugin that warns you before you follow a link that will likely result in Rickrolling. (Rickroll Protection As A Service?)
  • A per-user clicktrail analyzer that predicts which links a user is most likely to follow, given their history. Use this to highlight or promote high-likelihood links.
  • A user info and usage pattern analyzer that classifies users by likelihood of upgrading to a premium plan.
  • A RubyGem for classifying user generated content into appropriate, inappropriate, spam, NSFW, etc.
  • Along the same lines: a nudity detector for uploaded images.
  • A RubyGem for code optimization based on the current backtrace, possibly using reinforcement learning. For example:
  # This probabilistically selects a choice based on the
  # current backtrace and the history of reinforcement signals seen.
  optimize do
    choice do
      # some code path that ultimately triggers a "reward" or "punishment" signal
    end
    choice do
      # some other code path
    end
  end
  • A story karma predictor that estimates the final score on Hacker News of any article, based on textual content and the poster’s info.
  • A system that classifies support requests by their estimated severity.
  • Make things easier for your users:
    • give them default settings selected by users similar to themselves
    • default to pages they use often; expand modules they interact with frequently
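
The reinforcement-learning optimizer idea in the list above could be approximated with a simple bandit. What follows is only a minimal sketch under stated assumptions, not the design of any existing gem: an epsilon-greedy selector that keeps a running average reward per choice, keyed by a context string standing in for the backtrace. The names `ChoiceOptimizer`, `choose`, `reward`, and `epsilon` are all hypothetical.

```javascript
// Hypothetical epsilon-greedy selector: one set of running averages per context.
class ChoiceOptimizer {
  constructor(numChoices, epsilon = 0.1, random = Math.random) {
    this.numChoices = numChoices;
    this.epsilon = epsilon; // probability of exploring a random choice
    this.random = random;   // injectable so tests can be deterministic
    this.stats = new Map(); // context -> [{ count, mean }, ...]
  }

  statsFor(context) {
    if (!this.stats.has(context)) {
      const fresh = Array.from({ length: this.numChoices }, () => ({ count: 0, mean: 0 }));
      this.stats.set(context, fresh);
    }
    return this.stats.get(context);
  }

  // Pick a choice index for this context (for example, a backtrace string).
  choose(context) {
    const stats = this.statsFor(context);
    if (this.random() < this.epsilon) {
      // Explore: pick uniformly at random.
      return Math.floor(this.random() * this.numChoices);
    }
    // Exploit: pick the choice with the best average reward so far.
    let best = 0;
    for (let i = 1; i < stats.length; i++) {
      if (stats[i].mean > stats[best].mean) best = i;
    }
    return best;
  }

  // Record a reward (positive) or punishment (negative) signal.
  reward(context, choice, value) {
    const s = this.statsFor(context)[choice];
    s.count += 1;
    s.mean += (value - s.mean) / s.count; // incremental average
  }
}
```

A caller would invoke `choose` with the current context, run the matching code path, and later report the outcome with `reward`.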

Once you know what’s possible, it’s hard to find a project that wouldn’t benefit from some machine learning.

Have other ideas? Want to discuss these? Post them in the comments and follow @tectonic for updates.

Fixing the Chrome Background Refresh Bug

There is a bug in the current version of Chromium (hence Google Chrome) that sometimes fails to redraw CSS background images when they’re hidden and then re-shown. This issue appeared on Mavenlink’s Tour page. Thomas Fuchs discusses some possible solutions, but none of those worked for us. Here is our ugly solution:

Please let me know if you find something better!

Hacking Google for Fun and Profit

At the end of last year, Google announced their Vulnerability Reward Program, which rewards security researchers for reporting security and privacy holes in Google properties. This sounded like an interesting challenge, so I set out to find security holes. I found three, got paid, and am now in the Google Security Hall of Fame. All in all, a rewarding experience.

Below I describe the three security holes that I found.

Determining if a user has emailed another user

In my opinion, this is the most subtle, but also the most disturbing, of the three bugs. As with the other bugs that I found, this was an example of Cross Site Request Forgery: the practice of convincing a user’s browser to make a request on their behalf to a remote server. This type of attack generally only works when the user is logged in to the remote service. In this case, if a user was already logged into Gmail (and they usually are), a malicious website could make a series of requests for Gmail profile images and, based on whether each image loaded, determine whether or not the visitor had communicated with another Gmail user. This worked because Gmail, as a well-intentioned privacy measure, would only show profile images to a viewer if they had had mutual contact. Here is some example code that worked at the time:

// Probe whether the logged-in Gmail user is allowed to see this user's profile photo.
function checkUsername(username, callback) {
    var image = new Image();
    // The photo only loads if the two users have had mutual contact.
    image.onload = function() {
        callback(true);
    };
    image.onerror = function() {
        callback(false);
    };
    // The timestamp parameter keeps the browser from answering out of its cache.
    image.src = "https://mail.google.com/mail/photos/" + username + "%40gmail.com?1&rp=1&pld=1&r=" + (new Date()).getTime();
}

checkUsername("fbi-reports", function(hasEmailed) {
    alert("The current visitor " + (hasEmailed ? "has" : "has not") + " emailed the FBI.");
});

checkUsername("wikileaks", function(hasEmailed) {
    alert("The current visitor " + (hasEmailed ? "has" : "has not") + " emailed WikiLeaks.");
});

It should be clear why this is a serious privacy concern. If you suspected someone of being a whistleblower, for example, you could make a page that probed a bunch of revealing email addresses and checked to see if any had been contacted. Luckily, Google reports that they have now fixed this bug. Cross Site Request Forgery attacks can usually be prevented by adding a CSRF token (a unique and user-specific token) to every request.
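
To make the CSRF token idea concrete, here is a minimal server-side sketch for Node.js. It is an illustration under assumptions, not how Google fixed anything: the token is derived by HMAC-ing a session ID with a server secret, and the names `SECRET`, `csrfTokenFor`, and `isValidCsrfToken` are hypothetical.

```javascript
const crypto = require("crypto");

// Hypothetical server-side secret; a real app would load this from configuration.
const SECRET = crypto.randomBytes(32);

// Derive a token bound to one user session by HMAC-ing the session ID.
function csrfTokenFor(sessionId) {
  return crypto.createHmac("sha256", SECRET).update(sessionId).digest("hex");
}

// Check a submitted token in constant time to avoid leaking it through timing.
function isValidCsrfToken(sessionId, submittedToken) {
  const expected = Buffer.from(csrfTokenFor(sessionId), "hex");
  const given = Buffer.from(String(submittedToken), "hex");
  return given.length === expected.length && crypto.timingSafeEqual(given, expected);
}
```

The server would embed `csrfTokenFor(sessionId)` in each form it renders and reject any state-changing request whose token fails `isValidCsrfToken`. A third-party page cannot read the token, so it cannot forge a valid request.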

Identification of a user’s Gmail address

This bug would have allowed a malicious website to determine your Google username if you were simultaneously logged into your Google account and typed anything into a seemingly innocuous web form. One of the fields in the form would actually be an iframe pointing to a public Google Document. When the user typed into the field, they would really be entering text into the Google Document, and what appeared to be their cursor in the field would actually be the Google Document insertion point. When a user typed into the field, the attacker could determine their username (and hence email address) by observing the publicly-displayed list of current document editors.

Again, this is a type of Cross Site Request Forgery, specifically known as Clickjacking, which can be especially hard to prevent. There are many types of Clickjacking, almost all of which use iframes. One approach, which I used here, is to artfully display content from a target site in such a way as to look like it’s part of the current page. Another approach is to hide the iframe invisibly under the user’s cursor, moving it as the cursor moves, and causing the user to click on the other site without realizing it.

Google correctly used the X-XSS-Protection and X-Frame-Options headers, but some browsers do not honor these. The solution to this one is tricky, but it is generally to use frame busting, to provide appropriate headers, to use CSRF tokens, and to not expose any user information without a direct user interaction.
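
Two of the defenses named above, appropriate headers and frame busting, can be sketched roughly as follows. This is a simplified illustration: the `DENY` value is just one valid setting for X-Frame-Options, and the function names `antiFramingHeaders` and `bustFrame` are hypothetical.

```javascript
// Response headers a server might send to ask browsers not to frame its pages.
function antiFramingHeaders() {
  return { "X-Frame-Options": "DENY" };
}

// Classic frame-busting check: if this page is not the top-level window,
// navigate the top window to our own URL. Returns true if it tried to break out.
function bustFrame(win) {
  if (win.top !== win.self) {
    win.top.location = win.self.location;
    return true;
  }
  return false;
}
```

In a browser this would be called as `bustFrame(window)` early in page load. Scripted frame busting can be circumvented in some situations, which is one reason to combine it with the header rather than rely on either alone.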

Deletion of all future email

The third bug that I found was a fairly severe security hole that affected a portion of Gmail users. Due to a missing CSRF token during the first step of the filter creation flow in the HTML-only version of Gmail, a malicious site could trick visitors into creating a Gmail filter that would delete all future received email. This worked in the current (at the time) version of Firefox, but not in Chrome or Safari due to their correct handling of the X-Frame-Options header. I didn’t test it in IE.

This security hole was exploitable via a combination of a classic Cross Site Request Forgery with a Clickjacking attack. First, I discovered that it was possible to submit the first part of the filter creation flow in an iframe using JavaScript because Google had forgotten to include a unique CSRF token in the form.

<form id='form' method='POST' target='iframe' action='https://mail.google.com/mail/h/ignored/?v=prf' enctype='multipart/form-data'>
    <input type=hidden name='cf1_hasnot' value='adfkjhsdf'>
    <input type=hidden name='s' value='z'>
    <input type=hidden name='cf2_tr' value='true'>
    <input type=hidden name='cf1_attach' value='false'>
    <input type=hidden name='nvp_bu_nxsb' value='Next Step'>
    <input type='submit' style='display: none'>
</form>

I then positioned the iframe such that the “Create Filter” button on the subsequent page would fill the frame without showing the button border; only the word “Create” was visible. A fake button was then shown around the iframe, styled to match Gmail, so that when users believed they were submitting a form with a button labeled “Create,” they were really creating a malicious and destructive filter in Gmail.

Google says this has now been fixed.

Google’s Response

In all three cases, Google responded promptly to my security report and fixed the bug within a reasonable amount of time. I was given two $500 awards for the three bugs. Google generously doubled these amounts when I chose to donate them to charity, so the Athens Conservancy and the Buckeye Forest Council, two of my favorite local charities in Athens, OH, received one thousand dollars each, care of Google.

These were subtle bugs. They took trial and error to find. However, in total, I only spent a few spare evenings of my time. If Google’s products, some of the most secure in the world, are susceptible to these sorts of attacks, you can bet many others are as well. Every programmer makes these mistakes sometimes. Security is too complicated for anyone to get right all of the time. Check your code!

Take your security into your own hands… or, why you should hack Google too!

Many companies try to silence security bug reporters through legal threats and sometimes even action, driving discoverers of bugs underground and onto the black market where such knowledge can do real harm. Google has set an admirable example by creating a program that is enlightened, responsive, and well-run, and I hope other companies move in the same direction.

I had a great time using jsFiddle to explore and demonstrate bugs. You can do the same: check out their guidelines and do your part to improve the security of products that you love.

Enjoyed this post? You should follow me on Twitter.

How to Make Your Rails App Tweet the Twitter

Suppose you want to build a Rails application for tracking popular links, and you want it to post the most popular links to Twitter automatically. This quick tutorial will show you how to do that using the newest version of the Ruby twitter gem. A little while ago I added the ability for Freebies Finder to tweet popular freebies. I recently had to do this for another site and decided that a tutorial was in order.

Set up the accounts

We’ll pretend that our website is called AwesomeLinks.com. Sign up for two Twitter accounts, AwesomeLinks and AwesomeLinksDev. We need to create a Twitter application through which our website can post to these accounts. Do this by logging into AwesomeLinks and visiting https://dev.twitter.com/apps/new. Select ‘Client’ as the Application Type, skip the Callback URL, and select Read & Write access. Twitter will give you an OAuth consumer key and secret, which you will soon need.

Why Plug Computers Are a Security Nightmare

The increasing availability of low profile “wall-wart” plug computers like the SheevaPlug can be viewed as an emerging threat to physical network security. For $99, a budding industrial espionagist could buy the SheevaPlug developer kit or the consumer TonidoPlug, install some easily-available network intrusion testing software, and illicitly “test” the security of a competitor’s network.

While many of these techniques have been known for a while, the small form factor of plug computers and consumer netbooks, coupled with their rapidly decreasing price, could enable disposable intrusion tools and open new avenues for attack. The current $99 SheevaPlug has no wireless capability and limited storage, but the manufacturer has just announced an expanded model with Wi-Fi, Bluetooth, and an internal hard drive. Even without these advances, current generation plug computers can easily be expanded with a USB memory stick or external USB hard drive, a USB wireless interface, and more. For little more than $100 one could make a practically undetectable wireless bug that can be deployed in seconds.

In fact, soon you may be able to just buy an all-in-one penetration plug computer, the PlugBot.

edit: as was pointed out to me after posting this article, the described device already exists and can tunnel out over 3G. http://pwnieexpress.com/pwnplug3g.html

With two wireless adapters and some simple software, a plug computer becomes a wireless bridge capable of automatically cracking wireless networks in range (both WEP and WPA are vulnerable these days; see aircrack). Most locations have multiple 3rd party networks overlapping their physical space, which, if cracked, could be used as back channels for the plug computer to phone home. The attacker could then tunnel into the company network undetected and completely bypass the company’s external defenses by routing through an available 3rd party wireless network. From the perspective of the attacked network, even if the intrusion is noticed, it appears to come from within their own physical space.

A number of other uses come to mind for such devices:

  • Passive sniffing of internal network traffic using dsniff and sending it back to an attacker. Many networks aren’t sufficiently secured once you’re past the perimeter firewalls.
  • Man-in-the-middle proxying: physically connect two Ethernet interfaces and use the plug computer to sniff all traffic entering and leaving a workstation.
  • Remote surveillance: attach a camera or other sensor payload and use it as an over-the-internet video bug.

Again, much of this has been possible for a while, but form factor is everything. Also, I haven’t seen people talking about the possibility of bridging multiple available wireless networks together for attack obfuscation and to avoid connecting through a company’s edge network. I don’t think companies pay enough attention to passive physical monitoring and intrusion threats like this, especially given the insecurity of wireless encryption standards. What do you think?

How I Do Command-line Accounting: Ledger and Reckon

Ledger is a powerful yet simple double-entry accounting system with a command line interface, making it perfect for those of us who prefer our text editor and flat files over any confusing and inflexible accounting program.

I’ve been using Ledger for about a year now, including for balancing last year’s budget. In the process, I’ve needed both to import CSV files of financial data from various sources and to label that data with the most appropriate account. To do this, I’ve written a Ruby gem called Reckon that can usually guess bank data CSV headings, and can also use Bayesian learning to automatically categorize each entry. I think of this as Mint for the command line, but where you don’t have to trust a 3rd party with your bank account passwords.
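
To give a feel for the Bayesian categorization step, here is a minimal naive Bayes sketch. It is not Reckon's actual code (Reckon is a Ruby gem, and its internals are not shown here). The class name `NaiveBayes`, its methods, and the sample account names are all hypothetical, and add-one smoothing is an assumption made for the example.

```javascript
// Hypothetical minimal naive Bayes classifier for transaction descriptions.
class NaiveBayes {
  constructor() {
    this.wordCounts = new Map(); // account -> Map(word -> count)
    this.docCounts = new Map();  // account -> number of training entries
    this.vocabulary = new Set();
    this.totalDocs = 0;
  }

  static tokenize(text) {
    return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  }

  // Learn from one already-labeled entry.
  train(description, account) {
    if (!this.wordCounts.has(account)) {
      this.wordCounts.set(account, new Map());
      this.docCounts.set(account, 0);
    }
    this.docCounts.set(account, this.docCounts.get(account) + 1);
    this.totalDocs += 1;
    const counts = this.wordCounts.get(account);
    for (const word of NaiveBayes.tokenize(description)) {
      counts.set(word, (counts.get(word) || 0) + 1);
      this.vocabulary.add(word);
    }
  }

  // Return the account with the highest log posterior, or null if untrained.
  classify(description) {
    const words = NaiveBayes.tokenize(description);
    let best = null;
    let bestScore = -Infinity;
    for (const [account, counts] of this.wordCounts) {
      let total = 0;
      for (const c of counts.values()) total += c;
      // Start from the log prior for this account.
      let score = Math.log(this.docCounts.get(account) / this.totalDocs);
      for (const word of words) {
        // Add-one smoothing so unseen words do not zero out the score.
        const p = ((counts.get(word) || 0) + 1) / (total + this.vocabulary.size);
        score += Math.log(p);
      }
      if (score > bestScore) {
        bestScore = score;
        best = account;
      }
    }
    return best;
  }
}
```

After training on a handful of labeled entries, `classify` can suggest an account for each new CSV row, and the suggestion can be confirmed or corrected by hand.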

Getting started with Ledger is pretty easy. I recommend skimming the official documentation (pdf).

I usually start out the year by making a new year.dat file and entering my business bank account’s starting balance:

2010/01/01 * Checking
    Assets:Bank:Checking            $5,000.00
    Equity:Opening Balances

Next, I run Reckon on a CSV dump file from my bank: reckon -f bank.csv -p

Rails RSpec Tests Are CPU Bound

Today I experimented with running a large Rails RSpec test suite on a RAM disk. My hope was that by hosting either the MySQL server or the Rails project directory on the RAM disk, test execution would be significantly faster. If this were the case, I would feel (more) compelled to buy a certain new device with a solid-state disk drive. Unfortunately, while I now have some slick scripts to bring up a RAM disk with either my Rails project or MySQL running on it, the improvements were on the order of 10 seconds over a 10 minute test run (the load time of Rails). Thus, it is clear that these tests are CPU bound, not disk IO bound, and an SSD wouldn’t help.

If you have an SSD, can you corroborate this?

(Tests were performed on a brand new Quad-Core Intel Core i7 iMac.)

Revolutionary

Sometime in the late ’80s I convinced my parents to get an external hard drive for our Mac Plus to augment our extravagant two floppy disk drives. We got a 70MB drive for about $400, which is about $500 in today’s dollars. Now you can get 2TB for $100, or 10TB for $500. This is an increase of about 14 million percent in about 20 years! Talk about technological change! I have no doubt that this growth - that storage is now basically free - can enable incredible new technologies. What can we do with this?
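
The arithmetic behind that figure can be checked in a few lines. This sketch uses the numbers from the paragraph above and assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes).

```javascript
// Late-'80s drive: 70 MB for about $500 in today's dollars.
const oldBytesPerDollar = 70e6 / 500;
// Today: 10 TB for $500.
const newBytesPerDollar = 10e12 / 500;

// How many times more storage a dollar buys now.
const factor = newBytesPerDollar / oldBytesPerDollar;
// The same change expressed as a percentage increase.
const percentIncrease = (factor - 1) * 100;

console.log(Math.round(factor), Math.round(percentIncrease / 1e6) + " million percent");
// prints: 142857 14 million percent
```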