In Part 2 I talked about identifying sock puppets at Birther Report by exploiting information in its avatar system. The underlying email addresses are obscured by a cryptographic digest, but is there a way around that?
It started two years ago with my article, “Troll hunter.” It’s about a Swedish group’s attempt to expose individuals who posted at a right-wing web site, taking advantage of a poor security model in the popular Disqus commenting system. Disqus provided, through a public interface (API), a cryptographic digest or hash of a commenter’s email address. Troll Hunter’s approach was to collect a huge number of email addresses (around 200 million), compute their cryptographic hashes and match them to commenters on the right-wing web site. When the email hash matches the commenter’s hash, then the commenter’s email address is exposed.
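The matching step described above is easy to sketch in Python. This is only an illustration of the idea, not Troll Hunter's code: the function names and sample addresses are made up, and trimming and lowercasing the address before hashing is an assumption (it is what Gravatar does; other services may differ).

```python
import hashlib

def email_md5(email: str) -> str:
    """MD5 hex digest of an email address, trimmed and lowercased first."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()

def match_hashes(candidate_emails, target_hashes):
    """Return {hash: email} for each candidate whose digest is a target hash."""
    targets = set(target_hashes)
    found = {}
    for email in candidate_emails:
        digest = email_md5(email)
        if digest in targets:
            found[digest] = email
    return found

# Made-up addresses, for illustration only.
candidates = ["alice@example.com", "bob@example.com", "carol@example.com"]
# Pretend these hashes were harvested from a comment system.
targets = [email_md5("Bob@Example.com "), "0" * 32]

matches = match_hashes(candidates, targets)
for digest, email in matches.items():
    print(digest, "->", email)
```

With a set of target hashes, each lookup is constant time, so the cost is dominated by hashing the candidate list once.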
Disqus subsequently changed its API, and the specific approach used by Troll Hunter no longer works, but I wondered if a similar approach would work at Birther Report. There were two initial goals: one was to determine if any prominent person was secretly a birther, and the second was to figure out the identity of the BR commenter named ★FALCON★. BR uses the IntenseDebate plug-in, and to my knowledge it has no public API. It does, however, leak user email MD5 hashes for commenters who use avatars supplied by gravatar.com, that is, most of them.
In order to display the avatar (unless the user signs in with Facebook), IntenseDebate generates a URL, for example this one for me:
The bit between the slash and the question mark (“56 1b b7 4e 93 a2 40 0e d2 35 cd 5d 3f c5 fa 43”) is the MD5 hash of my email address here at obamaconspiracy.org. All it takes to get that URL is to right-click on the avatar and select “Copy image address” (in Chrome) from the context menu. Even some generic-looking avatars may have an MD5 hash, sometimes even users with the name “Guest.” Without an API, harvesting these gravatar MD5 hashes and entering them into a database is a tedious and time-consuming manual task, but I did it over several months, and collected 711 of them from BR (not counting a huge number of sock puppets I discovered and discarded). While my focus was Birther Report, it was not the only web site I looked at and found leaking MD5 hashes. CDR Kerchner, Fellowship of the Minds, drkatesview, Citizen WElls, Western Journalism, JAG Hunter, Impeach Obama Campaign and wtpotus were some others. Fortunately, not all websites required manual right-clicking, copying and pasting. Some could be scanned with automation that read the site’s HTML and navigated from page to page. All in all, I recorded 4,308 screen names and 4,162 distinct email hashes from 27 sites (not all of the harvested email hashes belonged to birthers and not all sites were exclusively birther sites).
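For the sites that could be scanned automatically, pulling the hash out of an avatar URL is a one-line pattern match. Here is a rough Python sketch. The hash is the one quoted above with the spaces removed; the rest of the example URL (host, path and the `s=50` query string) is a hypothetical shape chosen for illustration, and the function names are mine.

```python
import re

# 32 hexadecimal characters sitting between a slash and a question mark.
HASH_IN_URL = re.compile(r"/([0-9a-fA-F]{32})\?")

def hash_from_avatar_url(url: str):
    """Return the lowercase MD5 hash embedded in an avatar URL, or None."""
    match = HASH_IN_URL.search(url)
    return match.group(1).lower() if match else None

def hashes_from_html(html: str):
    """Return the distinct hashes found anywhere in a page's HTML, sorted."""
    return sorted({h.lower() for h in HASH_IN_URL.findall(html)})

# Hypothetical URL shape; only the hash comes from the article.
example_url = "https://www.gravatar.com/avatar/561bb74e93a2400ed235cd5d3fc5fa43?s=50"
print(hash_from_avatar_url(example_url))
```

Running `hashes_from_html` over each downloaded page and storing the results with the screen name is all the "automation" amounts to; the hard part was the sites where it had to be done by hand.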
The next step was to collect lots of email addresses. While that process was largely automated, it also took months. Various Internet web sites contain bulk lists of emails in various formats, typically a hundred or two per page. Some accidentally leave lists around. I found a magazine’s subscriber list. I found lists of results from hacking attacks posted on the Internet at dazzlepod.com. I used email addresses listed in birther lawsuits, sloppy redaction by Orly Taitz, and amazingly an XML export of all the comments from a prominent birther website that was just lying around for Google to find. (I notified the site owner that the file existed and I believe it has since been deleted.) Eventually, I collected 146 million email addresses in my Microsoft SQL Server database, far more than I ever expected. I would let scanning and scraping programs run for days to get the email addresses from tens of thousands of pages of email listings. Some sites figured out what I was doing and blocked my IP address. I went to the Google cache. I could not have done this without my programming background and sometimes 10-hour days coding.
In none of this was anyone “hacked” nor any web site penetrated. No passwords were guessed. No malware was employed. No social engineering was used. All of the collected information, both hashes and email addresses, was freely available on the Internet. I just looked really hard and really long and really smart. Long story short, many emails were identified, but not Falcon’s.
Now we enter the second phase of the project that became known as OARPA (Obot Advanced Research Projects Administration). I dropped hints about OARPA, but they were largely misdirection. OARPA started out as software to generate email addresses: firstname.lastname@example.org, email@example.com, firstname.lastname@example.org …. I collected huge lists of first and last names, lists of common words and uncommon words, and I assembled them in various ways into trial email addresses. I hashed about a trillion combinations of words, prefixes, suffixes, special characters and digits. ★FALCON★’s email address was low-hanging fruit for this approach because it consisted of a common word plus some digits at a popular email domain. In the end, however, it was not the email address that gave ★FALCON★ away, but his own rambling self-disclosures on various web sites (more on that later).
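To give a flavor of what "assembling them in various ways" means, here is a small Python sketch of two generators: a word followed by a few digits, and first and last names joined by a separator. It is an illustration under my own assumptions, not the OARPA code; the word lists, separators, digit limit and domain are all placeholders.

```python
import hashlib
from itertools import product

def email_md5(email: str) -> str:
    """MD5 hex digest of a trimmed, lowercased email address."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()

def word_digit_candidates(words, domains, max_digits=2):
    """Yield word@domain, then word followed by 1..max_digits digits."""
    for word, domain in product(words, domains):
        yield f"{word}@{domain}"
        for n in range(1, max_digits + 1):
            for num in range(10 ** n):
                yield f"{word}{num:0{n}d}@{domain}"  # zero-padded, e.g. "07"

def name_candidates(first_names, last_names, domains, separators=(".", "", "_")):
    """Yield first<sep>last@domain for every combination."""
    for first, last, sep, domain in product(first_names, last_names, separators, domains):
        yield f"{first}{sep}{last}@{domain}"

# Placeholder inputs; the target is a made-up address hashed for the demo.
target = email_md5("widget42@example.com")
hit = next(
    (c for c in word_digit_candidates(["gadget", "widget"], ["example.com"])
     if email_md5(c) == target),
    None,
)
print(hit)
```

Each word costs 111 guesses per domain with two digits allowed, which is why a common word plus some digits falls so quickly.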
This brute force approach was very productive, but I still had other nuts to crack. Barry Soetoro Esq. was still unidentified. I asked for help.
OARPANET was a distributed processing framework where remote computers could connect to the central OARPA server and check out a range of email guesses to scan and a list of unknown hashes. These were subsets of all possible letters, numbers and special symbols of a specific length (AAAAAAA, AAAAAAB, AAAAAAC …). The software was really pretty cool, including web services, and multi-processing (users could specify how many of their computers’ cores to dedicate to OARPA). A tremendous amount of effort went into optimizing the process for speed. I could check network progress remotely from my smartphone.
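The idea of checking out a range can be sketched by treating every fixed-length string as a number, so that a work unit is just a start and stop index. The Python below is a simplified illustration, not the OARPANET software: the alphabet is limited to uppercase letters, the unit size and domain are arbitrary, guesses are lowercased before hashing (an assumption), and the server, web services and multi-core scheduling are left out.

```python
import hashlib
import string

ALPHABET = string.ascii_uppercase  # assumption: the real scans also used digits and symbols

def index_to_string(index: int, length: int, alphabet: str = ALPHABET) -> str:
    """Map 0, 1, 2, ... to AAAAAAA, AAAAAAB, AAAAAAC, ... (for length 7)."""
    base = len(alphabet)
    chars = []
    for _ in range(length):
        index, rem = divmod(index, base)
        chars.append(alphabet[rem])
    return "".join(reversed(chars))

def work_units(length: int, unit_size: int, alphabet: str = ALPHABET):
    """Yield (start, stop) index ranges that together cover the whole space."""
    total = len(alphabet) ** length
    for start in range(0, total, unit_size):
        yield start, min(start + unit_size, total)

def scan_unit(start, stop, length, domain, target_hashes, alphabet=ALPHABET):
    """Hash every guess in one work unit and return {hash: email} for hits."""
    found = {}
    for i in range(start, stop):
        email = f"{index_to_string(i, length, alphabet)}@{domain}".lower()
        digest = hashlib.md5(email.encode("utf-8")).hexdigest()
        if digest in target_hashes:
            found[digest] = email
    return found

# Tiny demo: 3-letter names at a placeholder domain, one made-up target.
targets = {hashlib.md5(b"abc@example.com").hexdigest()}
hits = {}
for start, stop in work_units(length=3, unit_size=5000):
    hits.update(scan_unit(start, stop, 3, "example.com", targets))
print(hits)
```

Because each unit is independent, units can be handed to different machines or cores in any order, and a unit that never reports back can simply be handed out again.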
I collected over 3.6 million domain names, far too many to pair with all the generated random user names. Only a small list of common domains was used for most scans, but one particularly fruitful technique was to take known email names and try them combined with the full list of domains.
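That last technique, reusing a known user name at every domain, might look like the following in Python. This is a sketch with invented names and placeholder domains; it assumes the domain list has already been de-duplicated.

```python
def local_part(email: str) -> str:
    """Return the part of an email address before the last '@'."""
    return email.rsplit("@", 1)[0]

def cross_domain_candidates(known_emails, domains):
    """Yield each distinct known user name paired with every domain."""
    locals_seen = sorted({local_part(e).strip().lower() for e in known_emails})
    for name in locals_seen:
        for domain in domains:
            yield f"{name}@{domain.lower()}"

# Made-up inputs for illustration.
known = ["jsmith@example.com", "JSmith@example.org"]
domains = ["example.com", "example.net"]
candidates = list(cross_domain_candidates(known, domains))
print(candidates)
```

The appeal is that people reuse the same name at several providers, so a short list of real names against millions of domains is a far better bet than random names against the same domains.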
OBOT volunteers installed the software, and the network hummed along for months, generating and testing several million random email addresses per second, but maybe finding no more than one new email address match on a good day. Still, it paid off, and we did finally guess Barry Soetoro Esq.’s address in a scan of random 7-character strings at a common email domain. A computer that I bought primarily as a dedicated OARPA scanner got the trophy for nailing Barry and 57 others in the final round.
By mid-2015, OARPANET was shut down due to diminishing returns. Generated email account names were getting longer and hits scarcer. There is no way that every possible email domain could be searched against trillions (yes trillions) of sequentially generated email addresses. Unstructured user names consisting of letters, numbers and special characters were exhaustively searched up to 8 characters in length. From all sources we matched 2,098 screen names to email addresses (including Dr. Deb, two Joe Mannixes and furtive), and specifically 68.6% of those taken from Birther Report. In total 1,961 distinct email addresses were uncovered.
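A back-of-the-envelope calculation shows why. The Python below uses the 3.6 million domains mentioned earlier and takes "trillions" as just one trillion user names; the rate of five million guesses per second is my stand-in for "several million," so treat the result as an order of magnitude only.

```python
DOMAINS = 3_600_000            # domain names collected (from the article)
USER_NAMES = 10 ** 12          # "trillions", taken here as one trillion (a lower bound)
RATE_PER_SECOND = 5_000_000    # assumption standing in for "several million per second"
SECONDS_PER_YEAR = 365.25 * 24 * 3600

total_guesses = DOMAINS * USER_NAMES
years = total_guesses / RATE_PER_SECOND / SECONDS_PER_YEAR

print(f"{total_guesses:.1e} guesses would take about {years:,.0f} years")
```

Under those assumptions the full cross product would keep the whole network busy for something on the order of twenty thousand years, which is why most scans stuck to a short list of common domains.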
Some birthers were more careful than others. Anyone taking even moderate precautions would never have told my project anything. A few of the 2,226 forum names we didn’t crack include:
- Birther1 (this is Mike Volin, no secret, but his email address at BR is unknown)
- charlesmountain (two addresses)
- Fast Falcon
- Grand Birther
- Guest (several)
- John Gault
- Logical Patriot
- Miki Booth
- Orly Taitz
- Reality Checker
- TANGENT 01
The final phase of the OARPA project was to match email addresses to actual people. This is somewhat of an art. My commercial experience in record matching helped me to understand how easy it is to make a false match; it’s confirmation bias. I wrote about the difficulty with false positives in my article, “Confirmation v. prediction.” While a little automation was developed to reduce the manual effort of searching and recording information, that step involved nothing particularly innovative. It’s all Google. Here’s a screenshot of my Information Manager for Dr. Deb. “BF115” in the “Source” column refers to the 115th run of the Brute Force scanner.
Each item recorded has an estimated confidence number along with it. For some, we found a lot of information–for others, nothing. With Barry Soetoro, Esq. I was able to connect him to a Facebook page, but that appeared to be under a fake name. A particularly information rich scenario was when someone had registered a domain under their real name using the same email address on the registration that they used to comment.
[Update: BSE was eventually identified in the Fall of 2017 by matching his email address to a website that contained his real name and location. His name is not one you would recognize.]
If a birther commented here or at the birther site that leaked the exported comment file, then I also had an IP address (an IP address may lead to a geographic location, although this is not 100% reliable). I also found a few LinkedIn profiles, Facebook pages (e.g., for Dr. Deb), resumes, work addresses, domain registrations and miscellaneous stuff. I decided not to record any phone numbers, although some were available.
During the entire process, no prominent individual was found commenting at any birther website.
The final product was a huge HTML file of everything. There are three copies in three locations with three custodians, so if anything happens to me …
A related project involved software and a database to collect IntenseDebate comments for around 70 selected individuals (including myself), both to prevent their loss if deleted and, more importantly, to make it easier to search them. I can add someone new and the software will load all previous comments from their IntenseDebate profile page. The software, if run soon after a comment is made, captures a more accurate time stamp than is available from IntenseDebate later. Another big advantage is its ability to follow an IntenseDebate user across web sites, which was particularly helpful in assembling the bread crumbs to his identity that ★FALCON★ left across several sites.
In the final analysis I have to ask myself why go to all that trouble to gather information that will never be released. Part of the answer, and I think probably the main answer, is the challenge. It was one last big project for a retired software developer. It was a hard problem. It forced me to learn new things. It also proved that I’m smarter than the average birther.
As for the Birthers, they still don’t know who RC is, and you’re not going to find out from me.
This concludes the Confession of Dr. Conspiracy.