[Novalug] SPF etc - mail not getting delivered [VERY LONG]

Rich Kulawiec rsk@gsp.org
Tue Dec 13 09:22:38 EST 2016


On Wed, Oct 12, 2016 at 05:39:07PM -0400, Rich Kulawiec via Novalug wrote:
> On Wed, Oct 12, 2016 at 10:58:44AM -0500, Beartooth via Novalug wrote:
> > 	Is there a replacement?
> 
> I still don't quite get the question, but if you mean "are there things
> that are better?", the answer is yes.  I've written about them at
> considerable length elsewhere, so I'll be hyper-brief here.

And now I'm going to be hyper-long, because I've finally gotten a chance
to finish writing a message I started composing two months ago.

[ This is a VERY LONG message.  If you follow up, please edit extensively
as a courtesy to others. ]

If you want to do this better, then you need to have control of the
MTA (mail transport agent) that handles your mail.  Pretty much the
only way to do that is to run your own: after all, services aren't
going to let you configure theirs. ;)   Of course not everybody
can do that and not everybody wants to do that, but if you can and
you want to, here's a look at one technique I strongly recommend.

Let me use a recent message which traversed the Linux
kernel mailing list to illustrate one way to approach this.
This is an example, and of course it's thus only one of many
possible examples.  Think of this as one particular walk through a very
large forest -- just an illustration of one of the many
basic approaches to stopping spam.  There are MANY other paths,
some of which are useful, some of which are dubious, and some of
which are (variously) wrong/stupid/abusive.  There are enough of
these things to fill a book.  Which is why I'm writing one.

The theme of this approach is "what can we ascertain by looking
at DNS records for hosts that are trying to send us email?"

Some terminology first (and these are not rigorous -- that's intentional):

	MTA = mail transport agent, e.g., sendmail, postfix, etc.
	DNS A record = hostname-to-IP address mapping
	DNS PTR record = IP-address-to-hostname mapping
	FCrDNS = Forward-Confirmed reverse DNS, e.g., matching A and PTR
	reject = refusing to accept a message during the SMTP conversation
	bounce = accepting a message then trying to return it later
	SMTP 2xx = shorthand for an SMTP response code like 220.  2xx
		response codes indicate message acceptance.
	SMTP 4xx = shorthand for an SMTP response code like 451.  4xx
		response codes mean "not now, try again later".
	SMTP 5xx = shorthand for an SMTP response code like 550.  5xx
		response codes mean "rejected, do not ever try again".
	HELO/EHLO = the hostname a sending MTA provides to a receiving
		MTA during the SMTP conversation
	FQDN = fully-qualified domain name, e.g., mx.example.net
	bracketed quad IP = an IPv4 address in brackets, e.g. [192.168.1.2]
	MX = refers either to a DNS Mail eXchanger record, that is, a record
		which specifies hosts that accept mail, or to the hosts
		themselves.  Use context to distinguish.
	FP = false positive: message classified as spam but isn't
	FN = false negative: message isn't classified as spam but it is
	RFC 1918 = RFC that specifies private/reserved IP space,
		e.g. 192.168.0.0/16 -- which is not routable on the
		public Internet
	DNSBL = DNS blacklist.  The RBL was the first instance of such
		a beast.  DNSBLs can be queried with IP addresses and
		return records that indicate things about the address;
		in the simplest case, it's a binary "listed/not-listed".
		DNSBLs can be very useful in collaboratively blocking
		spam and other abuse.
	RHSBL = Right-Hand-Side blacklist.  Instead of listing IP
		addresses, these list domains/subdomains/hosts.
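
To make the DNSBL entry concrete, here is a minimal Python sketch of how
a query name is conventionally built: the IPv4 octets are reversed and
prepended to the list's zone, and the result is looked up as an ordinary
A record (an answer means "listed", NXDOMAIN means "not listed").  The
zone dnsbl.example.org and the function name are placeholders of mine,
not a real list and not taken from any particular MTA.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


# dnsbl.example.org is a hypothetical zone used only for illustration.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

The sketch only builds the name; performing the lookup and interpreting
the returned record is left to whatever resolver you use.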

And let me give you a glimpse of what a typical conversation between
two MTAs looks like.   This is what a message sent from ukiah to taos,
from one of my addresses to another, looks like at the SMTP protocol
level -- comments in brackets.

	[ ukiah opens TCP connection to taos on port 25 ]

220-taos.firemountain.net ESMTP Sendmail Sat, 5 Nov 2016 18:27:04 -0400 (EDT)

	[ taos, the receiving side, has just said hello ]

HELO ukiah.firemountain.net

	[ ukiah, the sending side, identifies itself to taos ]

250 taos.firemountain.net Hello ukiah.firemountain.net [207.114.3.55], pleased to meet you

	[ taos ACKs ukiah's HELO ]

MAIL From:<rsk@gsp.org>

	[ initiate mail transmission, starting with sender ]

250 2.1.0 <rsk@gsp.org>... Sender ok

	[ sender is accepted by the receiving mail system ]

RCPT To:<rsk@firemountain.net>

	[ specify recipient ]

250 2.1.5 <rsk@firemountain.net>... Recipient ok

	[ recipient is accepted by the receiving mail system ]

DATA

	[ ukiah tells taos that the message body is imminent ]

354 Enter mail, end with "." on a line by itself

	[ taos tells ukiah to send it ]

blah blah blah
 .

	[ ukiah sends message; I indented the period here for clarity ]

250 2.0.0 uA5MR4UR018752 Message accepted for delivery

	[ taos accepts message with SMTP response 250/2.0.0 and
	provides the queue ID for it ]

QUIT

	[ ukiah is done sending mail ]

221 2.0.0 taos.firemountain.net closing connection

	[ taos ACKs the QUIT and closes the TCP connection ] 
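
If you want to follow a transcript like this programmatically, the part
of each reply that drives decisions is the three-digit code.  What
follows is a small Python sketch, assuming one reply line per string;
the function names and the wording of the class labels are mine.

```python
def parse_reply(line):
    """Split one SMTP reply line into (code, is_final_line, text)."""
    code = int(line[:3])
    # "250-..." means more lines of this reply follow; "250 ..." is the last.
    is_final = line[3:4] != "-"
    return code, is_final, line[4:]


def reply_class(code):
    """Map a reply code to the informal meanings used in this message."""
    meanings = {
        2: "accepted",
        3: "go ahead, send more",
        4: "not now, try again later",
        5: "rejected",
    }
    return meanings.get(code // 100, "unknown")


for line in (
    "220-taos.firemountain.net ESMTP Sendmail",
    '354 Enter mail, end with "." on a line by itself',
    "250 2.0.0 uA5MR4UR018752 Message accepted for delivery",
):
    code, is_final, text = parse_reply(line)
    print(code, reply_class(code), "final" if is_final else "continued")
```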


What I'd like you to focus on here is the first part of the exchange,
up until the point that taos ACKs ukiah's HELO.  At that point in
the conversation, there is already enough information in taos' 
possession to make a go/no-go decision about the message -- that is,
enough to tell whether it should let this conversation go any further,
or whether ukiah has already put proof on the table that the message
is bogus and should be rejected.  I'll explain why using an example
message in a moment, but I want to emphasize that this particular
approach is one I recommend as a first step in MTAs, because it's
(a) simple (b) efficient (c) difficult to game (d) repeatable.

	[ That last one, repeatable, is really important.  All anti-spam
	systems make mistakes.  It's inevitable.  So one important aspect
	of that is to make them *consistently*.  Everyone who's ever
	debugged code knows that the hardest bugs to find are the ones
	that come and go.  So it's highly desirable to design and build
	anti-spam systems that -- when wrong -- *stay* wrong.  This gives
	everyone a fighting chance of seeing the mistake, figuring out
	the mistake, and fixing the mistake. ]

The general idea of this approach is that you can make a lot of decisions
based just on hostnames and DNS records.  That's because real live
actual mail servers:

	- have real names, not generic names
	- have FCrDNS
	- HELO/EHLO as their hostname or at least a host in the same domain
	- HELO/EHLO as a hostname that passes FCrDNS
	- if also an MX, is not a CNAME
	- if also an MX, has a hostname that passes FCrDNS
	- if also an MX, resolves to a public IP address (not RFC 1918)

	[ Not every host that handles outbound mail handles inbound mail.
	That's why I say "if also an MX". ]
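
The FCrDNS item on that list can be sketched in a few lines of Python.
This is an illustration under stated assumptions, not how any particular
MTA does it: it uses the system resolver via the socket module (which
may consult local files as well as DNS), it only handles IPv4, and it
treats every lookup error as "no record".  The function names are mine.
The lookups are passed in as parameters so the logic can be exercised
with canned data.

```python
import socket


def ptr_names(ip):
    """Return the hostnames from a reverse lookup of ip (empty if none)."""
    try:
        name, aliases, _addresses = socket.gethostbyaddr(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return []
    return [name, *aliases]


def a_addresses(hostname):
    """Return the IPv4 addresses hostname resolves to (empty if none)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def has_fcrdns(ip, ptr_lookup=ptr_names, a_lookup=a_addresses):
    """True if some PTR name for ip resolves forward to that same ip."""
    return any(ip in a_lookup(name) for name in ptr_lookup(ip))


# Offline demonstration with made-up records (documentation addresses).
fake_ptr = {"203.0.113.5": ["mx.example.net"], "203.0.113.9": ["9.dyn.example.net"]}
fake_a = {"mx.example.net": ["203.0.113.5"]}
for ip in ("203.0.113.5", "203.0.113.9"):
    ok = has_fcrdns(ip, lambda i: fake_ptr.get(i, []), lambda h: fake_a.get(h, []))
    print(ip, "FCrDNS ok" if ok else "no FCrDNS")
```

Calling has_fcrdns("207.114.3.55") with the default lookups would query
the live resolver, so the answer depends on DNS at the time you run it.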

Go back and look at the conversation above.  Check the DNS records
for the hosts involved.  All these tests passed.

Some of this stuff is specified by RFC.   Some of it is longstanding best
practice.  All of it is being increasingly enforced by mail systems,
so even if you don't like the RFCs and don't care about best practices,
you should do it anyway...or never, ever, whine about your email being
refused, because, as John Levine has pointed out, the total budget across
all mail receivers for solving senders' problems is $0.  Besides, this is
trivial stuff to get right: if you can't manage this, then you have
no business running a mail system.

	[ Also, if your MX's resolve to RFC 1918 addresses...you
	won't be getting a lot of mail.  Think about it. ]
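
Checking whether an address falls in RFC 1918 space is simple enough to
show directly.  A minimal Python sketch using the standard ipaddress
module follows; the three networks are the ones RFC 1918 defines, and
the function name is mine.  I list the networks explicitly rather than
relying on the module's broader is_private attribute, which also covers
ranges that RFC 1918 doesn't mention.

```python
import ipaddress

RFC1918_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(ip):
    """True if ip is an IPv4 address inside one of the RFC 1918 blocks."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    return any(addr in net for net in RFC1918_NETS)


for ip in ("192.168.1.2", "172.31.255.255", "172.32.0.1", "207.114.3.55"):
    print(ip, is_rfc1918(ip))
```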

Anyway, keep those hostname/DNS items above in mind as we dive into this.

The example spam message that I want to discuss is here:

	http://www.firemountain.net/~rsk/lkml.html

and I've defanged it so that the payload URL won't work any more.
(It's not included inline here because I don't want to trip any filters
on your side.  Which is why content-sensitive filters are a horribly
bad idea and shouldn't be used...but that is another entire very long
message in its own right.)  This is a piece of spam that made it into
the linux-kernel mailing list (hence: lkml) a while back.

There are a number of things about this message that should have led
to its rejection by vger.kernel.org.  Let's focus on these lines in
the headers, for starters:

	Received: from 103-245-153-249.host.neural.net.au ([103.245.153.249]:63709
        "EHLO mail.digitaljunction.com.au" rhost-flags-OK-FAIL-OK-FAIL)
        by vger.kernel.org with ESMTP id S1750981AbcJILTb (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Sun, 9 Oct 2016 07:19:31 -0400

Note that this is the set of Received headers as the message hit
vger.kernel.org.  Under most circumstances, you should only trust
the last set of headers, i.e. the ones written by *your* server, because
any before that can be and often are forged.  But in this case, I'm going
to believe that vger.kernel.org is telling the truth, because, sheesh,
why wouldn't it?

	[ Also I happen to have a couple million samples of messages
	which have traversed that server, and no evidence that vger's
	ever faked a header.  So let's run with it.  This is, after
	all, an *example*. ]

These headers tell us that a system at 103.245.153.249 with a PTR
record of 103-245-153-249.host.neural.net.au connected to vger
on destination port 25 from source port 63709 and identified itself,
using the EHLO directive during the SMTP conversation,
as mail.digitaljunction.com.au.
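
Pulling those fields out of the header mechanically might look like the
Python sketch below.  The regular expression is written against this one
header as vger formatted it; Received headers are not standardized to
this level of detail, so treat the pattern (and the function name, which
is mine) as specific to this example rather than a general parser.

```python
import re

# The first lines of the header quoted above, as written by vger.
RECEIVED = '''Received: from 103-245-153-249.host.neural.net.au ([103.245.153.249]:63709
        "EHLO mail.digitaljunction.com.au" rhost-flags-OK-FAIL-OK-FAIL)
        by vger.kernel.org with ESMTP id S1750981AbcJILTb'''

PATTERN = re.compile(
    r'from\s+(?P<ptr>\S+)\s+'                       # name from the PTR record
    r'\(\[(?P<ip>[\d.]+)\]:(?P<port>\d+)\s+'        # connecting IP and source port
    r'"(?P<verb>EHLO|HELO)\s+(?P<helo>[^"]+)"'      # what the client claimed to be
)


def parse_received(header):
    """Return a dict of ptr/ip/port/verb/helo, or None if the shape differs."""
    match = PATTERN.search(header)
    return match.groupdict() if match else None


print(parse_received(RECEIVED))
```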

	[ Why, you are probably asking yourself, did it EHLO instead
	of HELO?  Because once upon a time RFC 821 defined HELO.
	Later, the SMTP service extension RFCs (RFC 1869 and its
	predecessors) defined EHLO as part of ESMTP, which had more
	features, and RFC 2821 folded it into the base specification.
	So any mail server out there these days should EHLO when
	sending, but should be gracious enough to accept HELO when
	receiving.

	So why did my ukiah-taos example use HELO?  Because I did
	that manually, e.g., I ran "telnet taos.firemountain.net 25"
	from ukiah, and my fingers type HELO automatically unless I
	really slow down and think about it.  Old habit. ]

First clue: while 103.245.153.249 has a PTR record of
103-245-153-249.host.neural.net.au, 103-245-153-249.host.neural.net.au
has no A record, and thus there is no FCrDNS.

Second clue: the sending system has a generic hostname with its IP
address included in it.  Bah.  Real mail servers have real names.
Systems with generic names, or with IP addresses in their names,
or with subdomains like "adsl" and "dynamic" and so on, do not have
real names and are not real mail servers.
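
A rough Python sketch of a "does this look like a generic name?" test
follows.  It is a heuristic of my own construction, not a published
rule set: "adsl" and "dynamic" come from the paragraph above, the other
keywords are assumptions I added, and any real deployment would need a
longer, locally tuned list plus an exceptions mechanism.

```python
import re

# "adsl" and "dynamic" are from the text; the rest are illustrative additions.
GENERIC_WORDS = {"adsl", "dynamic", "dsl", "dyn", "dhcp", "pool", "ppp", "dialup"}

# Four numbers separated by dots or hyphens, e.g. 103-245-153-249.
EMBEDDED_IP = re.compile(
    r"(?<!\d)(\d{1,3})[.-](\d{1,3})[.-](\d{1,3})[.-](\d{1,3})(?!\d)"
)


def looks_generic(hostname):
    """Heuristic: True if the name embeds an IPv4 address or a generic keyword."""
    host = hostname.rstrip(".").lower()
    match = EMBEDDED_IP.search(host)
    if match and all(int(part) <= 255 for part in match.groups()):
        return True
    # Compare each dot/hyphen-separated token, minus trailing digits.
    tokens = re.split(r"[.-]", host)
    return any(tok.rstrip("0123456789") in GENERIC_WORDS for tok in tokens)


for name in ("103-245-153-249.host.neural.net.au", "taos.firemountain.net"):
    print(name, looks_generic(name))
```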

Third clue: the sending system did not EHLO as its canonical name.
It didn't even EHLO as a name in the same domain indicated by rDNS,
that is, it didn't claim to be something.something.neural.net.au.
Real mail servers EHLO/HELO as their canonical name.  (Note that
the EHLO/HELO doesn't have to match the putative domain of the sender's
email address.  The mail server I'm sending this through uses its canonical
name while emitting mail from a dozen other domains.  But unless you have
a seriously good reason to do otherwise, and you don't, your mail
servers should have matching FCrDNS and HELO/EHLO.)
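
The comparison between the HELO/EHLO name and the rDNS name can be
sketched in Python as follows.  Note the big assumption: "same domain"
is decided here by naively comparing the last N labels, with N supplied
by the caller.  Doing this properly needs knowledge of where registrable
domains begin (com.au vs. com, for instance), which this sketch does not
have.  The function names and return strings are mine.

```python
def normalize(name):
    """Lower-case a hostname and drop any trailing dot."""
    return name.rstrip(".").lower()


def tail_labels(name, count):
    """Return the last `count` dot-separated labels of name."""
    return name.split(".")[-count:]


def helo_vs_rdns(helo, ptr_name, domain_labels=2):
    """Classify the HELO name against the PTR name.

    Returns "exact", "same-domain", or "mismatch".  domain_labels is a
    naive stand-in for real registrable-domain logic.
    """
    helo, ptr_name = normalize(helo), normalize(ptr_name)
    if helo == ptr_name:
        return "exact"
    if tail_labels(helo, domain_labels) == tail_labels(ptr_name, domain_labels):
        return "same-domain"
    return "mismatch"


# The example from the text; three labels because of the .au second level.
print(helo_vs_rdns("mail.digitaljunction.com.au",
                   "103-245-153-249.host.neural.net.au", domain_labels=3))
```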

This message should have been rejected based on that information alone,
i.e., information available to the receiving mail server before it
even *saw* the incoming message.  vger.kernel.org should have given
it a 5xx SMTP response and dropped the TCP connection before the sending
side even got to the SMTP directive "DATA".

Now some folks will point out that once in a while, this leads to
false positives because some halfwit out there has done a horrible
job of configuring DNS for their mail server.  Fair enough.  But I
don't want mail from servers run by halfwits, any more than I want
mail from spammers.

Besides, rejecting all of their mail is really the only effective
way of signalling to them that they've screwed up, because people
who get this wrong are nearly always the same people who will also fail
to have functioning postmaster, hostmaster, abuse, etc. addresses...so even
if you want to be nice and go out of your way and tell them about their
mistake, they have their fingers firmly in their ears and can't hear you.

	[ These addresses ("postmaster" et al.) are specified in RFC
	2142 and elsewhere.  If you run a mail server, you had better
	damn well have a functioning "postmaster" address.  If you run
	ANYTHING connected to the Internet, you should have a functional
	"abuse" address.  There have been blacklists that enumerate
	operations which fail at this, e.g., "RFC-IGNORANT". ]

Fourth clue: The EHLO string is mail.digitaljunction.com.au, which
is a FQDN and thus may be valid.

	[ The HELO/EHLO argument must be either a FQDN or a bracketed
	dotted-quad IP address.  If a FQDN, it must resolve in DNS.
	Thus foo.example.com (provided it resolves) and [192.168.1.2]
	are valid; foo and 10.1.2.3 are not. ]
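
The syntax half of that rule can be sketched in Python.  This checks
form only; it does not check that a FQDN actually resolves, and it only
understands bracketed IPv4 literals (IPv6 literals are out of scope
here).  The function name and the label pattern are mine.

```python
import ipaddress
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen.
LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def helo_syntax_ok(arg):
    """True if arg is a bracketed IPv4 literal or is shaped like a FQDN."""
    if arg.startswith("[") and arg.endswith("]"):
        try:
            ipaddress.IPv4Address(arg[1:-1])
        except ValueError:
            return False
        return True
    labels = arg.rstrip(".").split(".")
    # Reject bare names like "foo" and unbracketed addresses like 10.1.2.3.
    if len(labels) < 2 or labels[-1].isdigit():
        return False
    return all(LABEL.match(label) for label in labels)


for arg in ("foo.example.com", "[192.168.1.2]", "foo", "10.1.2.3"):
    print(arg, helo_syntax_ok(arg))
```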

But...let's look up the IP address of that host:

	mail.digitaljunction.com.au has address 117.55.232.21

Okay, so it resolves, but it doesn't match the IP address
the connection is coming from.  And:

	Host 21.232.55.117.in-addr.arpa. not found: 3(NXDOMAIN)

It doesn't have a PTR record, which means it's NOT a mail server,
despite the hostname suggesting that it earnestly wants to be.

Fifth clue: The putative sender is basicmark@victoriakasunic.com.
The single MX for that domain is mail.victoriakasunic.com.  That
host resolves to 103.245.153.249, but the PTR record for 103.245.153.249
points to 103-245-153-249.host.neural.net.au, not back to the MX's name.
In other words, sender's domain has an MX without FCrDNS.

	[ "putative" sender because we are not looking at proof
	that this IS the sender.  Note that even with things
	like DKIM, we still don't have proof that the sender
	is the PERSON specified: we only have the assertion
	of the sending host (which might be compromised) that
	the sender's account (which might be compromised) is
	the one originating the message.  This is why all the
	blather about "stopping email forgery" in a time when
	we routinely see hundreds of millions of compromised hosts
	and billions of compromised accounts is just happyland
	wishful thinking...at best. ]

So armed with these five clues, there's no reason to accept this
message.  (Actually: my mail servers would reject it based on
the first one.  That's sufficient.)  It's very, VERY likely spam
or from a mail server so horribly misconfigured that no sane person
would want to accept its traffic.  Either way, like I said above:
send a 550 SMTP response, drop the TCP connection, and move on.
No need to even *look* at the message: it's irrelevant.

	[ That means that this conversation should have been terminated
	as soon as the far side said EHLO.  Who cares who the message
	claims to be from or to?  Who cares what's in the message itself?
	The far side has already made a darn good argument that it's
	not a real mail server: there's no need to waste bandwidth, CPU,
	etc. letting it stack the proof even higher.  This may seem
	like a trivial point, but it's not: a large/busy mail server
	can waste a substantial portion of its resources talking to
	bogus hosts like this and processing messages -- with things
	like SpamAssassin -- that it simply does not need to.  One of
	the fundamental principles of mail system defense is that you
	should always deploy defenses in order from "uses least resources"
	to "uses most resources", which is why this is a good one to
	have pretty near the front of the list. ]
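
The "cheapest defenses first" principle amounts to a short-circuiting
loop.  A toy Python sketch follows; the check functions are stand-ins
I made up (a real FCrDNS check would do DNS lookups, a real content
scan would call something like SpamAssassin), and the list order is
assumed to run from least to most expensive.

```python
def first_failure(checks, session):
    """Run (name, check) pairs in order; return the first failing name or None."""
    for name, check in checks:
        if not check(session):
            return name  # stop here: later, costlier checks never run
    return None


calls = []  # records which checks actually ran


def fcrdns_ok(session):
    calls.append("fcrdns")
    return session["fcrdns"]


def content_scan_ok(session):
    calls.append("content")
    return True


# Ordered cheapest first: a DNS check before a full content scan.
CHECKS = [("fcrdns", fcrdns_ok), ("content", content_scan_ok)]

result = first_failure(CHECKS, {"fcrdns": False})
print(result, calls)
```

With a session that fails FCrDNS, the content scan is never invoked,
which is the whole point.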


Now let's talk about some other topics pertinent to this message.

Note the X-Greylist header.  It suggests that vger.kernel.org
imposed a 1600-second delay on the sender, which means that the sender
queued and retried, which means that it's probably a bot with an SMTP
engine sophisticated enough to do that.  (Or it's a real
mail server, albeit one horribly configured, that's been hijacked.)

	[ Grey/graylisting means sending back an SMTP 4xx response
	on the first delivery attempt and continuing to send that
	response until time T, T usually between 10 minutes and 2 hours,
	has elapsed.  The idea is that real mail servers will back off,
	queue, and retry at some sensible interval.  Bots, especially
	1st and 2nd-generation ones, don't have queueing mechanisms and
	won't retry.  This stops spam and identifies bots.  Of course
	some delirious mail servers retry at 1-minute intervals -- this
	is abusive -- or don't retry -- this is broken -- or misinterpret
	a 4xx as a 5xx -- this is horribly broken -- or or or...

	Anyway, graylisting also works against some queueing bots, because
	the delay buys time for spamtraps to pick up hits from them and
	to trigger DNSBL listings.  Thus: 192.168.7.11 tries at 10:35 AM,
	gets graylisted for 90 minutes, queues, tries again at 11:35 AM
	(because its retry interval is 1 hour), is still graylisted,
	requeues, tries again at 12:35 PM -- but the DNSBL listing for
	it went up at 12:02 PM and now it gets an outright rejection with
	a 5xx SMTP response.   Yes, this means that the receiving side
	got lucky, but with the large number of IP addresses in play,
	and the large number of spamtraps, this actually happens. ]
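
The timeline in that aside can be replayed with a small Python model.
This is a sketch under assumptions of mine: the graylist is keyed on
the (IP, sender, recipient) triple, the DNSBL is modeled as a dict of
listing times rather than a DNS query, the date is arbitrary, and the
bare codes 451/550/250 stand in for full SMTP replies.

```python
from datetime import datetime, timedelta


class Greylist:
    def __init__(self, delay):
        self.delay = delay
        self.first_seen = {}  # (ip, sender, recipient) -> time of first attempt

    def still_deferred(self, ip, sender, recipient, now):
        """Record the first attempt; True until `delay` has elapsed since it."""
        first = self.first_seen.setdefault((ip, sender, recipient), now)
        return now - first < self.delay


def respond(greylist, dnsbl_listed_at, ip, sender, recipient, now):
    """Return 550 if the IP is listed by `now`, 451 if graylisted, else 250."""
    listed_at = dnsbl_listed_at.get(ip)
    if listed_at is not None and listed_at <= now:
        return 550
    if greylist.still_deferred(ip, sender, recipient, now):
        return 451
    return 250


greylist = Greylist(timedelta(minutes=90))
listings = {"192.168.7.11": datetime(2016, 11, 5, 12, 2)}  # listed at 12:02
for hour in (10, 11, 12):  # attempts at 10:35, 11:35, 12:35
    when = datetime(2016, 11, 5, hour, 35)
    code = respond(greylist, listings, "192.168.7.11",
                   "a@example.com", "b@example.net", when)
    print(when.strftime("%H:%M"), code)
```

The first two attempts get 451 and the third gets 550, matching the
sequence described above.  Without the DNSBL listing, the third attempt
would have been accepted.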


Now then, to take one step back and consider a larger view:

All of this is great if you're running the MTA at vger.kernel.org,
but none of it will help you much if you're on the LKML and this was
therefore delivered to you anyway.

You *could* deploy SpamAssassin or another content-sensitive program
on this.  In fact: somebody did.  Look at the headers.  And it didn't work:
this message is an example of a FN, and thus I suppose, illustrative
of why SA fails so badly: spammers have copies of it too and can
test their messages against it (configured a bazillion different ways,
thank you cloud computing) in order to find form/content likely to pass
most of the instantiations of SA that are deployed in the field.

	[ Alternate explanation: they didn't deploy SA, and those SA headers
	are complete fabrications.  But that just illustrates another
	problem: since anyone can make those up anytime/anywhere
	they want, they don't mean anything unless they're your own. ]

Sure, SA's configuration could be changed to deal with this...just like
the last 3,827 times it's been changed.  And just like the next 2,264
times it'll need to be changed.  Or vger.kernel.org could just enforce
basic DNS and SMTP sanity checks and summarily reject everything that fails
them without expending the bandwidth to provisionally accept the message
and the CPU cycles to crunch it with SA.

	[ To explain: you can't feed the message-body to SA or anything
	else until you have it.  So you have to let the SMTP conversation
	continue into the DATA portion, where the sending side transmits
	the message over the network.  You need to wait for the whole
	thing to come across -- so that you're done with DATA.  Then you
	can process it with SA or whatever you like.  Once you're done,
	you need to go back to the SMTP conversation and tell the sending
	side what your response is.  So you might spend a lot of time
	and bandwidth, in the aggregate, accepting lots of traffic in
	the DATA phase...even though you have zero need to do that. ]

Which is why blocking spam as near its source as possible is best.
This message should have/could have been blocked on attempted delivery
at vger.kernel.org.  But after vger accepted it and fed it through its
instance of majordomo and majordomo handed it off to the outbound MTA on
vger and that MTA delivered it to my MTA...it's pretty much too late.
All that's left now is making the best of a bad situation.  To explain:

It's too late to reject it, because my MTA isn't having a conversation
with the original sending MTA; it's having a conversation with vger.
If I reject it, I'm either going to make more work for whoever's
the postmaster at vger and/or I'm going to create the incorrect 
impression at vger that my address is undeliverable.  Neither is helpful.
And bouncing it would be abusive.

	[ A reject is what happens when MTA 1 is trying to deliver
	a message to MTA 2 and the latter rejects it with a 5xx
	SMTP response.  A bounce is what happens when MTA 2 accepts
	the message and then changes its mind and *then* tries to
	send it back where it came from.  This is wrong and abusive,
	and all mail systems should be configured to avoid it.
	It's difficult to get this completely right -- there are
	a few edge cases that are tricky -- but there is absolutely
	no excuse for not getting it darn close to right.  Systems
	which bounce emit a form of spam known as backscatter
	(or more accurately, outscatter) and they are often --
	correctly -- blacklisted for abuse, because spammers find
	them, target them, and use them.  Reject good, bounce bad. ]

If I can't reject it and I shouldn't bounce it, then my only other
option, at the MTA level, is to accept it.  And if I accept it at
the MTA level, with a 2xx response code, then it's going to be
delivered (locally), because it has to be.  Like I said, it's
much too late to do anything really useful with this, because
the message is now too far from its origin.

And thus I suppose this illustrates a general principle: if you're
getting traffic from mailing lists, then you shouldn't be trying to do
anti-spam work at the MTA level on that subset of traffic...because pretty
much everything you could do is wrong.  You need to defer that task to the
people running the mail servers hosting the mailing lists, because they're
the only ones in a position to do something useful.  As someone who's on
more mailing lists than any of you, I can tell you that some folks do a
terrific job of this.  Some do a terrible job (and the kernel.org lists,
unfortunately, fall into that category).

The fix for the latter isn't technical.  It's human (always much harder)
and consists of convincing those responsible to do better.

Now the good news: most of these sorts of checks, e.g., DNS/rDNS checking,
MX checking, etc., are all enforceable in open-source MTAs.

	[ If you're not running an open-source MTA, you're an idiot.
	I don't care if your corporation wants to use braindamaged crap
	like Exchange internally, you should NEVER expose that to the
	open Internet -- for your sake and ours.  Traffic going in/out
	should be gatewayed through a sensibly-configured instance of
	a real MTA like sendmail, postfix, exim, or courier running on a
	real operating system. ]

It's pretty much just a matter of switching on those checks *and*
monitoring the logs so that you can see what they're doing.  You may
also have to occasionally make exceptions for operations you really want
email from but which are not properly configured.

	[ For example, I'm aware of a sizable university-related credit
	union which has had its DNS records for its mail servers set up
	incorrectly for over a decade.  This of course immediately calls
	into question their competence: if they can't get something
	this rudimentary right, then why should anyone believe that
	they can handle something much more complex like security?  Yes,
	I've told them.  Repeatedly.  I talk to the wind: the wind does
	not hear.  (Greg Lake memorial reference) ]

These DNS/rDNS/etc. checks are also very efficient: they don't take
a lot of CPU, a lot of memory, or a lot of bandwidth.  And if you're
running a local-only caching resolver on the mail server itself (something
that you really should be doing) then you can take advantage of that
to make these checks even more efficient.

And they're hard to game, because even though the owners of bots
have control over those systems, they don't control forward DNS
(that's usually handled by the domain owner, the ISP, or the web host)
and they don't control reverse DNS (similar).  Thus they're not in
a position to fix broken FCrDNS even if they want to, and this is
a good thing.

Also this is a repeatable (go back to what I said waaaay above) measure.
It will behave exactly the same way every time, which means that if it's
due to something like a DNS oops, and if whoever is running the mail
server is *paying attention to their own logs*, then it should be
obvious and should get fixed quickly.  And then the blocking will
evaporate without any need to change anything on the receiving MTA side.

A nuance: sometimes these checks fail because of a temporary, not a
permanent, DNS botch, e.g., a response timeout.  Sensible MTAs like
sendmail will note this and instead of rejecting with a 5xx, will defer
with a 4xx.  This is goodness, because it means that if/when the sending
side queues and retries, DNS might be working again and the problem will
go away...with the only downside being delayed delivery, which is of
course inconsequential.  Provided everyone does what they should, e.g.,
correctly functioning resolvers, reasonable retry attempt duration,
etc., this is a self-correcting problem.  (I saw quite a bit of this
during the most recent IoT-driven 'net meltdown, both with inbound and
outbound traffic.  As the DDoS subsided, mail queues slowly drained.)
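
The reject-versus-defer distinction can be written down as a tiny
decision function.  This Python sketch is my own framing, not sendmail's
logic: each DNS check is assumed to report one of three outcomes, a
definitive failure wins over a temporary one, and 550/451 are
representative 5xx/4xx codes.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"          # DNS answered: record missing or mismatched
    TEMPFAIL = "tempfail"  # no usable answer, e.g. a response timeout


def smtp_code_for(outcomes):
    """Return 550 to reject, 451 to defer, or None to let SMTP continue."""
    outcomes = list(outcomes)
    if Outcome.FAIL in outcomes:
        return 550  # permanent: the sender has to fix something
    if Outcome.TEMPFAIL in outcomes:
        return 451  # temporary: the sender should queue and retry
    return None


print(smtp_code_for([Outcome.PASS, Outcome.TEMPFAIL]))
print(smtp_code_for([Outcome.PASS, Outcome.FAIL]))
print(smtp_code_for([Outcome.PASS, Outcome.PASS]))
```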

Of course if the DNS botch persists long enough, then the maximum
queueing interval (often 3 days) will be exceeded, the sending side
will stop retrying and will return to sender.  This doesn't happen
often because it requires that the keepers of the sending mail server
fail to notice lots and lots and LOTS of 4xx deferrals, outbound
queues filling up, etc.: pretty much, they have to be ignoring their
own mail server(s) to miss this.  Which most people don't.  And if they do?
I still don't want email from operations run by halfwits, see above.

Anyway: this long, winding message is, like I said at the beginning,
only one path through a large forest.  But it does happen to be a path
that I recommend exploring if you run a mail system, because it's
effective and efficient.  DNS/hostname checks like this stop a large
fraction of the spam that shows up here, and they do so using very
minimal resources.

---rsk

