The data collection challenge

Why would hackers be hawking a massive tranche of mostly stale data?

18 January 2019

If, like me, you received an email this week from the web site HaveIBeenPwned.com, you may, also like me, be wondering what to do with the information contained therein.

My email informed me that…

“You signed up for notifications when your account was pwned in a data breach and unfortunately, it’s happened. Here’s what’s known about the breach:”

The data breach that is being referred to is of course the massive Collection#1 breach revealed by HaveIBeenPwned.com owner and operator Troy Hunt.

Hunt reported, via an extensive and meticulously researched blog post, that Collection#1 is a mishmash of information collected from thousands of individual breaches and sources, cobbled together and being sold around various underground marketplaces for as little as $60 in places.

The collection is some 773 million unique emails and around half a billion plain text passwords. Now before you go panicking, Hunt has done some extensive analysis and found that only 22 million or so of the credentials stolen are new and have not appeared in other breach hauls — a paltry 22 million!

Deja vu
A cursory check of the email address for which I requested ‘pwned’ notifications says that it has appeared 8 times in breaches already, from sources such as the October 2013 Adobe hack, the 2012 Dropbox hack and the 2016 LinkedIn incident.

Unfortunately, the advice on what to do remains stubbornly familiar. Change passwords immediately. Follow strong password guidance. Do not re-use passwords between sites. Do not share passwords, and do not email them. Use a password manager if possible; there are a number of good ones, and some are even free.
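For the more hands-on reader, here is a minimal sketch of how one might check whether a given password already appears in known breach data, using the Pwned Passwords range lookup run by the same HaveIBeenPwned service. The function names and the User-Agent string are my own inventions for illustration, not anything taken from Hunt's post. The idea is that only the first five characters of the password's SHA-1 hash are sent over the network, and the comparison against the returned list happens locally.

```python
import hashlib
import urllib.request


def split_sha1(password: str) -> tuple[str, str]:
    """Return the (5-char prefix, 35-char suffix) of the uppercase SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range(suffix: str, body: str) -> int:
    """Look for our suffix in a 'SUFFIX:COUNT' response body; 0 means not listed."""
    for line in body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0


def pwned_count(password: str) -> int:
    """Query the range endpoint; only the 5-character hash prefix leaves the machine."""
    prefix, suffix = split_sha1(password)
    url = "https://api.pwnedpasswords.com/range/" + prefix
    # The User-Agent value below is a hypothetical placeholder.
    request = urllib.request.Request(url, headers={"User-Agent": "pwned-check-sketch"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return count_in_range(suffix, response.read().decode("utf-8"))
```

Calling `pwned_count("some password")` returns the number of times that password shows up in the service's data, with zero meaning it is not listed. A non-zero answer is a good reason to retire the password everywhere it is used.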

However, the issue here is more the fact that such a massive dump of stolen credentials — the directory containing Collection#1 comprised some 12,000 files and 87GB of data — can float about underground sites and be sold for as little as it has.

The general consensus among security professionals is that the data here is stale, having appeared numerous times before in such breach hauls. Consequently, the threat of identity theft in this instance is low. Professional criminals would know that any valid credentials have already been used, and that any fresh attempt to use them would likely raise too many flags, be caught early and draw unwanted attention.

This is, of course, not to say that some bright spark might not be attempting to do so, but nevertheless, the one thing that can be said with certainty is that there is some purpose behind this collection and aggregation of data.

But what could that be?

What use?
Well, as soon as I read Hunt’s deep dive into the data, the overwhelming impression was of the sheer size of the collection, which raises the question: why amass so much data that at first appears to be of little use?

And then a comment from one Bob McArdle of Trend Micro came to mind. Speaking at the panel discussion at the Cyber Security Skills conference, McArdle was talking about whether hackers were using AI in cyber attacks. He said that, from experience, AI needs masses of data to be trained properly, and that amassing it is not something the blackhats generally have either the time or the inclination to do. Hence, in his opinion, we were still some way off seeing AI being used in such circumstances.

In the context of current speculation that organisations are using unconventional means to amass data for training AI and machine learning systems, this takes on a different complexion.

I have seen a number of respected infosec pros, as well as others, say it is entirely plausible that the current meme of the 10yrChallenge, whereby people post pictures of themselves on social media that are 10 years apart, is merely a data gathering exercise engineered to have people voluntarily provide data for AI/ML training.

In the same way that many so-called DNA ancestry services are actually just ways of having people submit their DNA for anonymised use in research, it is thought that the 10 year challenge will amass data about where people are from, what age they are and their general state of health at two now fixed points in their life. It also opens up the potential for a comparative analysis of the two pictures, informed by all of this data and then compared to others in the same area, age group, ailment group and so on.

So, imagine these data sources being combined with nearly a billion valid identities from Collection#1.

Data potential
What potential is there for the blackhats in having such data at their disposal? Well, defeating or subverting facial recognition, or possibly two-factor authentication, is one. Identifying susceptibility to certain diseases might be another. If you know you are in a risk group for a certain condition, and so do the blackhats, targeted scams saying you have an acute risk might get you to part with cash for supposed cures. Such scams have been uncovered on many occasions before.

Irrespective of the specifics, the broad trend is very worrying.

Cyber criminals may well be gathering masses of data to train AI systems for cyber crime. This might be just the most casually traded tip of the iceberg that reveals a whole Pandora’s Box of potential criminal scams.

So the next time someone sends you a 10yrChallenge — or similar — think twice. Ask what use that could be put to by someone else. Then change your passwords anyway, to be sure to be sure.
