A Self-Inflicted Data Breach

Published on April 15th, 2022

Some History

In December of 2021, Gravatar suffered a data breach. It wasn't your typical data breach. They were not compromised in some grand heist. Instead, their public data was harvested at scale. The data included an MD5 hash of each user's email address, which was easily dehashed, leading to millions of people's personal data being exposed.

The Gravatar breach hit the news and was promptly added to high-profile watch lists such as Have I Been Pwned. This blog post is not about Gravatar. It's about a similar breach that time has left behind and that, to this day, isn't included in Have I Been Pwned (or any other service, as far as I could tell).

A Self-Inflicted Data Breach

This blog post is about Stack Exchange's data breach that occurred in 2013. This breach included all of Stack Exchange's forums, including Stack Overflow.

The odd thing about Stack Exchange's breach is that it was entirely self-inflicted. Every year Stack Exchange published public datasets for their forums. User email addresses were hashed (MD5, hello again) to obfuscate them. Of course, that doesn't work. A plain MD5 hash can be checked against any guessed email address, so Stack Exchange virtually handed out everyone's email addresses.
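To make the weakness concrete, here is a minimal Python sketch of why an MD5 hash does not hide an email address. It assumes the address was hashed as-is, with no salt and UTF-8 encoding, which may not match exactly how the dump was produced. The addresses and the `email_md5` helper are hypothetical and only for illustration.

```python
import hashlib


def email_md5(email: str) -> str:
    """Return the hex MD5 digest of an email address (UTF-8, no salt)."""
    return hashlib.md5(email.encode("utf-8")).hexdigest()


# Hypothetical address standing in for a hash found in a public dump.
published_hash = email_md5("alice@example.com")

# Anyone holding the published hash can confirm a guess by hashing it.
for guess in ("bob@example.com", "alice@example.com"):
    if email_md5(guess) == published_hash:
        print(f"{published_hash} -> {guess}")
```

The same input always produces the same digest, so the hash works as a lookup key rather than a secret. The only thing an attacker needs is a good list of guesses.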

Some time in 2013 the email hash issue was brought to light and Stack Exchange promptly removed the email hashes from their dumps, but the damage was already done. The original 2013 dump is not impossible to find, even today.

Yet the data breach isn't reported anywhere. Why not? Obviously the scale of the breach is smaller than Gravatar's, but does that really matter? There are far smaller breaches on HIBP. It seems like a case of selective amnesia.

How difficult is it to crack hashes anyway? I wanted to find out what it takes to crack a large set of email hashes, and to get an idea of how many people were impacted by Stack Exchange's data breach. I grabbed the original Stack Overflow dataset from 2013 (the one with the email hashes) and got to work.

Recovering Hashes

The Stack Overflow dataset contains 1,877,942 unique email hashes. How does one crack a hash anyway? I decided to give Hashcat a try. Hashcat is a utility that is used to crack passwords. Dehashing email addresses is a lot like cracking passwords. The primary difference is that there are no wordlists out there to help you crack email hashes.

I built my own wordlist using known email addresses from other data sources. The wordlist I built contains 1,678,117,445 unique email addresses (weighing in at 36GB). I fired up Hashcat in MD5 mode (-m 0) with:

hashcat -m 0 hashes.txt emails.txt
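For readers who haven't used Hashcat, the following Python sketch shows roughly what a wordlist attack does: hash every candidate and check whether the digest is one of the targets. It is not how Hashcat is implemented (Hashcat does this on the GPU and far faster), and the `recover` function and the sample addresses are hypothetical.

```python
from __future__ import annotations

import hashlib
from typing import Iterable


def recover(target_hashes: set[str], candidates: Iterable[str]) -> dict[str, str]:
    """Map each target hash to the first candidate email that produces it."""
    recovered: dict[str, str] = {}
    for line in candidates:
        email = line.rstrip("\r\n")  # drop the line ending, keep the rest as-is
        digest = hashlib.md5(email.encode("utf-8")).hexdigest()
        if digest in target_hashes and digest not in recovered:
            recovered[digest] = email
    return recovered


# Tiny in-memory stand-ins for hashes.txt and emails.txt (hypothetical data).
targets = {
    hashlib.md5(b"alice@example.com").hexdigest(),
    hashlib.md5(b"nobody-has-this@example.org").hexdigest(),
}
wordlist = ["bob@example.com\n", "alice@example.com\n", "carol@example.net\n"]

found = recover(targets, wordlist)
print(f"recovered {len(found)} of {len(targets)} hashes")
```

Because `candidates` can be any iterable of lines, an open file handle can be passed in directly, so a large wordlist would be streamed rather than loaded into memory. Only the set of target hashes has to fit in RAM.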

With my Nvidia RTX 3080 it only took 3 minutes and 17 seconds to process. Hashcat was able to recover 51.81% (972,933) of the hashes. Just shy of a million. I am sure more hashes can be recovered using additional techniques, like brute forcing known email patterns, but I didn't want to invest any more time in figuring that out. I also have a creeping feeling that my normalized wordlist may have reduced the recovery rate. I have a hunch that Stack Overflow may not have normalized email addresses before storing them.
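The normalization hunch is easy to illustrate. The sketch below assumes "normalized" means lowercased, which is my reading and may not be exactly what was done to the wordlist. It shows that a lowercased candidate cannot match a hash computed from a mixed-case address, and that trying a few case variants of the raw address can. The `case_variants` helper and the address are hypothetical.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the hex MD5 digest of a string encoded as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def case_variants(email: str) -> list[str]:
    """Return a few plausible spellings of one address, without duplicates."""
    if "@" not in email:
        return [email]
    local, _, domain = email.rpartition("@")
    variants = [
        email,                        # as found in the source data
        email.lower(),                # fully lowercased
        f"{local}@{domain.lower()}",  # only the domain lowercased
    ]
    return list(dict.fromkeys(variants))  # de-duplicate, keep order


raw = "Alice@Example.com"        # hypothetical address as the user typed it
stored_hash = md5_hex(raw)       # what a site would publish if it hashed as-is

# A lowercased wordlist entry misses, because the digest is different.
print(md5_hex(raw.lower()) == stored_hash)

# Trying several spellings of the raw address finds the match.
print(any(md5_hex(v) == stored_hash for v in case_variants(raw)))
```

This only helps if the wordlist still has the original spelling to start from. Once every entry has been lowercased, the mixed-case form is gone and cannot be rebuilt without guessing.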

All in all, cracking the email addresses was fairly simple and straightforward. I'd like to see the entire 2013 Stack Exchange data breach added to more monitoring services. Who knows, maybe it will be now.