While reading Steve Job’s biography, it’s hard to avoid thinking about his death, and the knowledge and information that was in his head that is now lost to people.
But it also got me thinking about another small piece of information that we’ll probably never know. Passwords, like them or not, are a pretty important part of our modern lives. We type them all the time, so much so that the act is often an instinctual, brain-stem level activity that we don’t even think about, unless we get it wrong.
Passwords should be “easy to remember, but hard to guess” — something that you know, but is difficult for someone else to guess because of the sheer number of possibilities. For 10-letter passwords, containing letters and numbers, there are 62 10 possible configurations (839,299,365,868,340,224):
Okay, so a password is basically a piece of information that hopefully only we know. Is that all there is to safety? But what happens if someone breaks into a server? How do companies ensure that your data is somewhat safe even when in hands of an adversary?
Hashing
Thankfully, passwords aren’t usually stored in plain text, because if hackers got access to the database, they would know your password, giving access to your account as well as possibly being able to use it on other sites. Indeed, if you ever get an email with your plaintext password, you should probably not trust the service.
What is normally stored are passwords that have been put through a hash function. Hash functions are one-way functions that produce a “hash” or a “digest” output deterministically for each input piece of text. You can think of it as a “black box” that takes some text and spits out some other value. This value, the hash , is unique for each input, and can be used to determine whether two inputs were the same.
To explain how this works, let’s create a hash function that generates a number from an input text. It does so by converting each letter in a text string to its corresponding position in the alphabet, and summing them together. Let’s see how this looks for the password “hello” :
If, for each user, you store this number, you’ll still be able to check whether they typed the right input, because the hash of the input will match the hash that is stored. There is no need to keep the plaintext password and it can be discarded. When testing whether a candidate password is correct, we only need to compare its hash with the stored hash for the user.
The important thing to note is that you can’t easily deduce the original password from the hash — you can only hash candidate passwords to see if they match. However, the dummy hash function has a problem, namely a lot of collisions, where many inputs lead to the same hash output. For example, “hello” and “plain” both would have a hash value of 52 with the above setup. Real cryptographic hash functions like MD5, SHA1, or SHA256 have a far lower probability of collisions, because they create hash values that are more uniformly distributed. Here’s what “hello” looks like when passed through a few crytographic hash functions in the python console:
>>> import hashlib
>>> hashlib.md5("hello").hexdigest()
'5d41402abc4b2a76b9719d911017c592'
>>> hashlib.sha1("hello").hexdigest()
'aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d'
>>> hashlib.sha1("hellp").hexdigest()
'd4baaad7a68a379b1e133e0fea0603051b0124ca'
Notice the more uniform distribution of the hash outputs — even though the texts “hello” and “hellp” only differ by one character, the output of the hash is vastly different.
so is this safe?
Even though the password is no longer stored in plain text, its security is still quite low. Thanks to increasingly powerful processors and the uniquity of parallel computing, it’s possible to quickly try hundreds of millions of hashes to try to find the input password that corresponds to a hash value. Indeed, password crackers don’t even need to buy fancy computers nowadays, since they can use cloud computing muscle like Amazon GPU compute instances, which can be easily scaled up just by adding more computers.
A program for “cracking” hashed passwords to retrieve the plain text password generally works by generating guess passwords, passing them through the hash function, and then comparing the result to the hash to be cracked. If they match, then the original password is found:
Password cracking is embarrasingly parallel, meaning that it’s almost trivial to split the larger problem into subtasks that can be farmed out for parallel computation. For example, if scanning through the dictionary, each letter can be tackled by a separate machine in parallel. With a GPU, it’s also possible to massively parallelize this computation even on one machine.
Indeed, when the hash functions used are simple ones like MD5, parallel computing isn’t even necessary. When Gawker’s servers were compromised and 1.3 million password hashes were posted publicly, 400K of the passwords were cracked by trying dictionary words, common passwords, and random text on just one computer. The most common passwords were quite predictable:
- “123456” (2,516 users)
- “password” (2,188 users)
- “12345678” (1,205 users)
- “qwerty” (696 users)
- “abc123” (498 users)
It isn’t even strictly necessary to try to find the input by computing hashes, which is CPU intensive: if the password is something simple like “password” or “123456” , it’s can easily be looked up in a table of pre-computed hashes, an example of the CPU-memory tradeoff.
You can download large tables of precomputed hashes, and some of the most common ones can even be found on Google, such as the MD5 hash of “password” , 5f4dcc3b5aa765d61d8327deb882cf99. Obviously, storing each hash for lookup would require an enormous amount of memory, but more advanced methods exist for looking up inputs from hashes, such as rainbow tables.
Salts
To make lookup tables less effective, smart authentication providers will add a salt to the password before hashing the text. A salt is a unique text string that is normally appended to the password before it is hashed and the hash stored. For example, I could always add four random characters to the end of the passwords provided by my users:
Salts would then be stored alongside the hash in the database. The result of each user having a different salt is that, even though the salt is known, the table of precomputed hashes without the same salt would be useless. Instead, all of the hashes would have to be recomputed to crack the passwords, for each password to be cracked. Usually, salts are generated randomly and stored along with the password, so that cracking one hash would only yield one particular (plain password,hash) combination.
Other cryptographic techniques
Key stretching is another technique used to increase the burden on password crackers. In key stretching, the hashed password is re-hashed many times to generate the stored result, to increase the cost of computing the hash. This might introduce a delay of a second for the end user, which is acceptable. But to someone attempting to crack a password, this delay is much more of a hinderance because of the sheer number of possible passwords.
Key strengthening, also known as using secret salts is a variation on salts. In addition to a regular salt, a “secret” salt is also added to the password data, but it is discarded after the hash function runs. Thus, authentication will require a brute force search even for legitimate users, but only in the space of the secret salt. For example, if the secret salt is just another letter, then 26 hashes will have to be attempted, which is not significant for a legitimate user authenticating, but adds considerable expensive for crackers.
Sharing passwords: the weakest link
Unfortunately, all of the cryptographic sophistication is for naught if you share the same password across multiple services and authentication platforms. Consider a user who uses the same password on Google, Amazon, and Gawker. Like happened a few months ago, even if Google and Amazon had their security in tip top shape, Gawker was using a weak hash function without a salt, and 400K passwords were deciphered.
Now, if the user shares the same password across all pages, then all the work that Google and Amazon did is far less effective at maintaining a user’s privacy — the cracker can just use the decoded password from the hacked site, together with an email or username, and try it at Gmail, Facebook, your bank website, your Amazon account, etc.
In particular, you shouldn’t trust any site that is not using HTTPS (secure) in a login context, unless you don’t care about your password being stolen. Just look and see whether there is https
or http
in the URL. If it’s just HTTP, that means your password is being sent in plain text over the internet, so it’s susceptible to being picked up along the way by packet sniffers, who look for exactly this kind of thing.
Generating better passwords, per site
For many reasons, it’s very important to use a different password for each different service that you use, particularly for services that are high-value, such as banks, money accounts, shopping accounts, and personal email. Yes, this sounds annoying, but you can still generate memorable passwords using heuristics, that you can both remember and are very strong. If you have super strong passwords, but just keep them written down somewhere, it defeats the purpose of having security, so it’s important to remember them.
- The Roman rhetorical tradition gives us the method of loci, which is a way to remember things based on physical location, which we humans are particularly good at. For example, for your bank account, you could use the password “3 blocks north of Glendale” — just associate some location you know about with the bank (it shouldn’t even be the bank, just some other location) and just associate the location with the password. You’ll find it quite easy to remember things like this, but it’s incredibly difficult to guess.
- If you use a lot of websites, you can make your own mnemonic device to create complicated but easy to remember passwords based on the website name. For example, you can think of what song you might associate with a particular webpage, and then create a password from some of that song’s lyrics, taking the second letter of each word. You can also use colors from the page, the name of the page, mental associations, or something based on how you first found the site, among others. The point it to create a rule that you remember, so that even if you forget a particular password you can remember how you came up with it. University of Cambridge researchers found that mnemonic passwords were just as strong as “random” text strings.
…folk belief is that random passwords are better than those based on mnemonic phrases. However, each appeared to be just as strong as the other.
Phishing
From reading above, I hope you appreciate how advanced password storage has become. Hackers have also appreciated this. It’s become far more profitable to ask users for their passwords directly rather having to tackle the intractible computational problems involved with cracking.
For this reason, phishing and its variants are much more common today. Phishing is when an authentication provider is spoofed by an impostor with the intent of collecting authentication credentials. Spear phishing, where a user is sent a personalized message pretending to be a known service, has proven to be a particularly effective. Is such an attack, the attacker might have knowledge that the target user has recently ordered something from Amazon, so they would send an email pretending to be Amazon saying that there is a problem with the order. The user, not percieving such an email as suspicious, might then sign in to a bogus page upon which the password is collected.
Phishing traditionally happens through emails, although more recent attack vectors such as tabnabbing happen online. In tabnabbing, malicious javascript changes the look of a page to that of a familiar service, with the hope of collecting login information from users. To avoid these kinds of attacks, it’s important to always glance and see whether the URL is what it should be when you are doing your online business:
Better authentication
There are three general approaches to human authentication :
- Something unique you know (eg. passwords, PINs)
- Something unique that you have (eg. keys, badges, smart cards)
- Something unique about you (eg. fingerprints, retina, DNA)
Indeed, passwords fall entirely into the first category. On the other hand, in the real world, most physical security involves “something you have”, usually a key. Increasingly, this is being applied to the digital realm, so that to authenticate you need to something with you, and not only a password.
Two factor authentication, where you have a physical key generator as well as a password you know, has been around in industry for a while. It’s the standard for ATM withdrawls — you have the card, and you also know something (the PIN). It’s now being introduced to the online world when Google recently introduced two factor support for accounts, which you should enable if you want to increase the security of your account.
There are problems with physical keys as well - especially when they can be easily stolen. Common house keys can even be copied from photos taken hundreds of feet away. Fingerprints or DNA can be faked or lifted off of your body. Online, there is a problem with lack of implementation and support — for example, the iOS email client on iPhones currently doesn’t directly support two-factor authentication.
Summary
“100% secure” isn’t possible, because of the sheer number of players involved in authentication — but keeping your online accounts reasonably secure isn’t hard. If you run a site that has user logins, ensure that you’re using the appropriate cryptographic techniques. And if you’re a user, create complicated but memorable passwords using mnemonics that you don’t share between sites. If you follow these simple steps, you can minimize the chance that you will end up being the victim of an account hijacking.
We may never know what Steve Job’s passwords were, but that’s a good thing.