[Novalug] IMPORTANT: the "novalug" list is moving
Rich Kulawiec
rsk@gsp.org
Wed Jul 30 21:56:21 EDT 2014
On Thu, Jul 31, 2014 at 09:25:10AM +1200, Mark Smith wrote:
> i notice that users have unobscured email addresses on a web page
> that is unauthenticated. doesn't that make it a good target for
> scraping by spammers?
(a) No, it doesn't and (b) even if it did, it's unimportant.
Let me take this opportunity to explain that glib answer and do a little
education along the way. Grab a cup of coffee...or a beer...or the
beverage of your choice. Because this is going to go on for a bit.
Okay, more than a bit. But it's a complex topic and the reality
of the situation is somewhat counter-intuitive.
[ I'm plagiarizing myself from a comment on the mailman-users list
several years ago, so my apologies if you've already seen this. ]
Summary: Spammers now have so many ways of "harvesting" addresses from so
many systems, and so many ways of exchanging those with each other, that
any email address which is actually used WILL eventually be harvested.
(What "eventually" means varies widely, of course, but whatever it means
today...it will mean less tomorrow.) Pretending that address obfuscation
in mailing list [or newsgroup] archives will have any meaningful effect
on this process gives users a false sense of security and has absolutely
zero anti-spam value.
Summary of the summary: It's pointless.
Explanation: Spammers maintain extensive databases of email addresses.
Some of those databases are merely lists of addresses; others are
more sophisticated and include data such as "harvested-date",
"harvesting-method", "last-seen date", "last-seen-context",
"last-known-valid date" and more. Some of these databases are private;
others are available for sale/lease. Some are maintained by spammers
themselves, others by spammer support services which don't directly
engage in spamming.
The harvesting engines used to acquire email addresses to populate
those databases are myriad, as are the methods by which spammers
acquire the raw data to use as input to them. *Some* of those methods,
and there are MANY more, include:
- subscribing to mailing lists
- acquiring Usenet news (NNTP) feeds
- querying mail servers
- acquiring corporate email directories
- insecure LDAP servers
- insecure AD servers
- use of backscatter/outscatter
- use of auto-responders
- use of mailing list mechanisms
- use of abusive "callback" mechanisms
- dictionary attacks
- construction of plausible addresses (e.g. "firstname.lastname")
- purchase of addresses in bulk on the open market
- purchase of addresses from vendors, web sites, etc.
- purchase of addresses from registrars, ISPs, web hosts, etc.
- domain registration (some registrars ARE spammers) [1]
- misplaced/lost/sold media (hard disk, tape, CD, DVD,
USB stick, etc.)
and perhaps most significantly:
- harvesting of the mail, address books and any other files
present on any of the hundreds of millions of compromised systems [2]
Let's talk about that last one.
Consider for example: the first time a newly-created address is used
by someone who is sending a message TO it, it's now present on their
system: in their saved outbound mail, or perhaps in their address book
(if they have one), or in some cache. Any sensible malware resident on
their system will of course pick it up and eventually hand it over to a
harvesting agent. (Competent malware will harvest it in real time *and*
associate it with the sender's address.)
And if that particular system happens to be clean? Doesn't help much,
because the more times that address is used, the more systems it's
present on. And the more systems it's present on, the greater the
probability that one of them is already compromised or will be soon.
Thus even if we eliminate the originating end-user system as a possible
source, we still have to consider the outbound mail server used by that
end-user system, which is also a candidate for compromise. And then the
inbound mail server used by the recipient, and then the recipient end-user
system. And if there's some filtering appliance or intermediate system
in place at either end, then it's a possible compromise point as well.
If the message is forwarded to a third party, then another set of systems
is in play. If mail server logs are rolled up and moved to some central
location, then that system must also be included. If backups are made,
then any addresses present on live systems are present in their backups,
and subsequently may be present on any system where the backups are
read/restored. And finally, if the destination of a mail message isn't
an individual user, but an entire mailing list, then we must multiply the
number of possible harvesting points by at least the number of people on
the mailing list plus a factor for mail servers/gateways/filters/etc.
(modulo overlaps). This in turn means that messages sent to lists
of any appreciable size (say, 1000 members) will turn up on considerably
more than 1000 systems -- and the chances that all 1000-plus are secure
are microscopic. [ And remember: it only takes one. What if the system
I'm typing this on right now, a system which has a complete archive of
novalug back to 2006 in Unix mbox format, gets compromised? Or how about
*your* system? There are 443 addresses on the novalug roster. Presume one
computer per address. Do you think all 443 are secure? I think that
would be very nice, but I also think it's extremely unlikely. ]
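[ To put rough numbers on "extremely unlikely", here is a minimal
sketch. It assumes every system is compromised independently and with
the same probability p. Both the independence assumption and the
values of p are illustrative guesses, not measurements, and the
function name is made up for this example. ]

```python
# Illustrative only: assumes each system is compromised independently
# with the same probability p. The independence assumption and the
# values of p below are hypothetical, not measured figures.

def prob_all_secure(n_systems: int, p_compromised: float) -> float:
    """Probability that none of n_systems is compromised."""
    return (1.0 - p_compromised) ** n_systems


if __name__ == "__main__":
    # 443 is the novalug roster size; 1000 is the "appreciable size" list.
    for n in (443, 1000):
        for p in (0.001, 0.01, 0.05):
            print(f"n={n:4d}  p={p:.3f}  P(all secure)={prob_all_secure(n, p):.6f}")
```

Under those assumptions, even if only 1 system in 100 is compromised,
the chance that all 443 are clean comes out to roughly 1%, and it
drops further as the list grows.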
Please note that the previous paragraph's recitation only covered the LAST
vector I enumerated in the [indented] list above: compromised systems.
That laundry list of methods also affords many, *many* other opportunities
for addresses to find their way into spammers' hands. As just one pointed
example out of a lot more that I could cite: how do you know that
the address user@example.com which has just subscribed to the list you
run is a real person and not just the front-end for an address-harvester
that will pick up every address used to send traffic to the list?
You don't.
[ Incidentally: I've caught address-harvesters on lists that
I've run. Not often, but I have to surmise that I probably
haven't caught them all. And won't. ]
And so on. There are far too many other harvesting methods to enumerate,
all of which have been discussed at great length in anti-spam forums for
many years, and are depressingly familiar to experienced practitioners
working in the field. (Of which I'm one. I've been studying spam
for a very, very long time.)
The bottom line is that any email address which is actually used [3],
including any email address used to send traffic to a mailing list,
is GOING to be harvested. It's only a matter of when, not if, and
"when" is getting sooner all the time. There's nothing you or I or
anyone else can do about this because there are too many vectors and
not only do we not control most of them, we don't control the ones
that are the most important.
Incidentally, everyone (including me) can produce anecdotal tales of
addresses that have remained surprisingly under-targeted by spammers
over long periods of time. But this is clearly not the way to bet: it
is in spammers' interests to harvest as many addresses as possible
and to use them as soon and as often as possible. Note, however,
that some addresses are *deliberately* under-targeted, so lack of
substantial spam traffic to a given address is NOT an indicator that the
address hasn't been harvested. That's because along with target lists,
spammers maintain "suppression" lists, which they use to avoid hitting
the addresses of people they think are likely to cause issues for them. [4]
And obviously, people with postmaster or mailing list roles would be
good candidates for membership on those lists. Skipping those would
be inconsequential when sending spam to a few hundred million addresses,
so I trust it's obvious why spammers benefit from doing that.
With all this in mind, it's clearly pointless to pretend that address
obfuscation in archives provides any protection at all. [5] It would be
better to remove the functionality entirely than to continue to maintain the
facade that it actually has any anti-spam value. Everyone should simply
presume that all email addresses are in the hands of spammers and prepare
defenses accordingly -- because even if that's not quite true yet, it will
be soon. Very soon.
Oh. One more thing, as long as I'm going on about this. There are some
people who like to pretend that obfuscation of the form "user at foo
dot example dot com" will work.
It won't. As demonstrated here:
echo "user at foo dot example dot com" | perl -pe 's/[ ]+dot[ ]+/./g; s/[ ]+at[ ]*/@/g'
Expanding that trivial bit of Perl to cover hundreds of extant variants
is left as an exercise for the reader...but rest assured that spammers did
that homework YEARS ago and all that obfuscation of this form accomplishes
is to annoy real live humans. And besides, the moment a human un-obfuscates
the address *and uses it*, then everything I said above applies anyway.
There's no way around that inevitability, because mail systems use and
require the unobfuscated form.
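To show how little work the first few variants take, here is a minimal
sketch in Python. The handful of patterns it covers (bare "at"/"dot"
plus the bracketed forms) is an illustrative sample I picked for the
example, not an inventory of what real harvesters handle, and the
function name is hypothetical.

```python
import re

# Hypothetical sample of obfuscation variants: "(at)", "[at]", "{at}",
# or a bare "at" surrounded by whitespace, and the same for "dot".
# Real-world variants are far more numerous; this is only a sketch.
_DOT = re.compile(r"\s*[(\[{]\s*dot\s*[)\]}]\s*|\s+dot\s+", re.IGNORECASE)
_AT = re.compile(r"\s*[(\[{]\s*at\s*[)\]}]\s*|\s+at\s+", re.IGNORECASE)


def deobfuscate(text: str) -> str:
    """Rewrite 'user at foo dot example dot com' style text as an address."""
    text = _DOT.sub(".", text)   # every "dot" variant becomes "."
    text = _AT.sub("@", text)    # every "at" variant becomes "@"
    return text


if __name__ == "__main__":
    for sample in (
        "user at foo dot example dot com",
        "user [at] foo (dot) example {dot} com",
        "user(at)foo(dot)example(dot)com",
    ):
        print(deobfuscate(sample))
```

Like the Perl one-liner, this will also mangle ordinary prose that
happens to contain " at " or " dot ", so a real harvester would check
that the result looks like an address. That is a few more lines, not a
barrier.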
Conclusion:
Trying to hide/obfuscate email addresses is the security equivalent of
Wile E. Coyote holding an umbrella over his head while a grand piano
plummets toward him. It's never worked. It's not working. It's not
going to work. It's just wishful thinking/folklore/mythology.
Footnotes:
[1] I deliberately didn't mention mass WHOIS queries. While some efforts
in this direction were made by spammers years ago, they've found it far
more efficient and cost-effective to simply buy WHOIS data in bulk.
There's always someone who wants to sell, and a CD/DVD or USB stick
will suffice. This is why attempts by registrars to rate-limit queries
or restrict access are not only foolish, but disingenuous: spammers
already *have* the data, and can acquire updates at will, and they
are clearly doing so via processes that lead back to internal compromise
at the registrars themselves.
[2] The exact number of such systems is not only unknown, but unknowable,
since any compromised system which (a) doesn't make its presence
known (b) to a suitable detector will remain undetected indefinitely.
However, two things are clear: (1) any estimate under 200 million should
be laughed out of the room, and (2) there is no reason to suspect that
the number is decreasing, and there are numerous reasons to suspect that
it's increasing. Note, incidentally, that some detectors have reported
observing 200,000 new such systems in a single day; and further note that
it's now quite routine for individual botnets with several million
*known* members to turn up.
[3] Addresses which aren't used at all may remain out of spammer view
for considerable time, depending on the care with which they're selected
and maintained. However, this obviously excludes addresses used for
participation in mailing lists, addresses used for general correspondence,
addresses given to third parties, and addresses which can be plausibly
inferred from already-known ones. It's also beyond the technical
ability of nearly everyone, because most people have better things
to do with their time than study the tactics of spammer address
harvesting. ;)
[4] For the purpose of this discussion, I'm just talking about suppression
lists which enumerate individual email addresses. It's well-known that
spammers also maintain suppression lists of MX's, domains, network
allocations, ASNs, etc., in an attempt to avoid hitting spamtraps
and/or hitting the mailboxes of those who might be in a position to file
complaints or take action against them.
[5] The only people left who are impeded in the slightest by obfuscation
are NON-spammers: that is, people who are trying to contact someone
who has previously sent a message to some mailing list.
---rsk