Discussion:
[Bug 1785] make DNS cache generic, change MX query tests to use DNS cache
b***@hughes-family.org
2003-04-15 12:32:37 UTC
Permalink
http://www.hughes-family.org/bugzilla/show_bug.cgi?id=1785





------- Additional Comments From ***@pathname.com 2003-04-15 05:32 -------
Well, after profiling with DProf on some recent spam and ham, I'm not entirely
certain the MX tests are a big performance problem (not even close to taking
a lot of time unless DProf is somehow missing it), but I still think it would
be a good idea to make the DNS code more general.

This is somewhat contradictory with the idea that we need backgrounded DNS
lookups. I'm not sure what to think, but I assume we had a foregrounded
version at one point, right?

I'll attach the profile (DNS and *lots* of RBLs since many are in testing, but
no DCC or Razor) in a moment.




------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


-------------------------------------------------------
This sf.net email is sponsored by: ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
b***@hughes-family.org
2003-04-15 12:35:44 UTC
Permalink
http://www.hughes-family.org/bugzilla/show_bug.cgi?id=1785





------- Additional Comments From ***@pathname.com 2003-04-15 05:35 -------
Created an attachment (id=883)
profile of DNS on 250 spam and 250 ham




b***@hughes-family.org
2003-04-15 12:36:18 UTC
Permalink
http://www.hughes-family.org/bugzilla/show_bug.cgi?id=1785





------- Additional Comments From ***@pathname.com 2003-04-15 05:36 -------
attached ... it's a gzip file




b***@hughes-family.org
2003-04-15 17:36:06 UTC
Permalink
http://www.hughes-family.org/bugzilla/show_bug.cgi?id=1785





------- Additional Comments From ***@optusnet.com.au 2003-04-15 10:36 -------
Subject: Re: [SAdev] make DNS cache generic, change MX query tests to use DNS cache
Post by b***@hughes-family.org
Well, after profiling with DProf on some recent spam and ham, I'm not entirely
certain the MX tests are a big performance problem (not even close to taking
a lot of time unless DProf is somehow missing it), but I still think it would
be a good idea to make the DNS code more general.
yep. that's good news btw.
Post by b***@hughes-family.org
This is somewhat contradictory with the idea that we need backgrounded DNS
lookups. I'm not sure what to think, but I assume we had a foregrounded
version at one point, right?
We *did* a long time ago, but Marc Merlin added the bgsend() code -- ie.
we already have backgrounded lookups, as far as I know. For RBL tests
anyway.

Not sure if we need bg lookups for lookup_ptr(), lookup_mx etc., since I
would guess any decent nameserver will already have looked up a lot of
that data anyway and loaded the glue records, so they should be quite
fast. Caching that data is probably helpful though.

Big spamd machines may gain a benefit from a local cache of DNS results.
But big spamd machines should have a local DNS cache running anyway,
in the form of a caching nameserver, I should think! So I'm unsure
if there's really a need to add more code to SpamAssassin there...

--j.
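The backgrounded-lookup scheme described above can be sketched roughly as follows. This is a Python illustration only (SpamAssassin itself is Perl and uses Net::DNS's bgsend()); the function names and the thread-pool approach are my own, not SA's API.

```python
from concurrent.futures import ThreadPoolExecutor
import socket

def reverse_ip(ip):
    """DNS blacklists are queried as <reversed-octets>.<zone>."""
    return ".".join(reversed(ip.split(".")))

def rbl_query(ip, zone):
    """A listed IP resolves to an address; NXDOMAIN (or a timeout)
    means unlisted or unreachable."""
    try:
        return zone, socket.gethostbyname(f"{reverse_ip(ip)}.{zone}")
    except OSError:
        return zone, None

def background_rbl_checks(ip, zones):
    # Fire every query at once so the caller can run CPU-bound tests
    # in the meantime, instead of waiting on each RBL in sequence.
    with ThreadPoolExecutor(max_workers=max(1, len(zones))) as pool:
        futures = [pool.submit(rbl_query, ip, z) for z in zones]
        return dict(f.result() for f in futures)
```

The point is only the shape: submit all queries up front, harvest results later.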





b***@hughes-family.org
2003-04-16 02:57:07 UTC
Permalink
http://www.hughes-family.org/bugzilla/show_bug.cgi?id=1785





------- Additional Comments From ***@pathname.com 2003-04-15 19:57 -------
Subject: Re: [SAdev] make DNS cache generic, change MX query tests to use DNS cache
Post by b***@hughes-family.org
yep. that's good news btw.
I should note that I do have a local DNS cache (bind 9), but I suspect
there are a couple of issues:

1. It seems like some lame (or down?) server failures don't get cached at
all; I'm not sure why. For example: "host -v 208.229.236.14" *always*
takes 10 seconds on my machine.

2. Some DNS blacklist lookups seem to take longer than others, and the
slowness seems tied to the IP address being looked up rather than to
the DNS blacklist server.

I set up a small test bench of 250 recent spam and 250 recent ham. I
repeated the following test three times (once to prime my cache and then
twice to see how the cached version worked).

./mass-check --net -j 8 -f corpus.small

and I logged out the number of seconds per RBL query in the SA DNS code.
Then, I averaged the lookup time per IP address and also per RBL. The per
RBL version was fine, the averages were from about 1 to 3 seconds.
However, for the per IP address version, there were a few addresses that
had VERY high averages relative to the others.

Here are the top 20 lookup times (total time, number of lookups, average
time, *reversed* IP address):

run two:

5960 234 25.47 203.226.26.203
4012 3312 1.21 249.183.17.209
2173 1282 1.70 146.77.235.207
2081 1352 1.54 206.250.35.66
1989 78 25.50 248.226.26.203
1989 78 25.50 241.226.26.203
1989 78 25.50 178.226.26.203
975 39 25.00 79.216.44.207
751 504 1.49 173.76.146.129
688 351 1.96 12.179.185.208
626 585 1.07 15.35.17.212
514 351 1.46 19.134.144.129
499 473 1.05 207.163.71.64
402 273 1.47 43.98.18.192
376 195 1.93 44.240.36.66
370 234 1.58 181.89.141.203
340 216 1.57 51.14.68.9
323 180 1.79 240.211.103.216
312 273 1.14 26.229.108.195
303 195 1.55 45.1.146.129

run three:

5938 234 25.38 203.226.26.203
4243 3312 1.28 249.183.17.209
2094 1282 1.63 146.77.235.207
1989 78 25.50 241.226.26.203
1977 78 25.35 178.226.26.203
1971 78 25.27 248.226.26.203
1775 1352 1.31 206.250.35.66
975 39 25.00 79.216.44.207
654 585 1.12 15.35.17.212
653 504 1.30 173.76.146.129
615 351 1.75 12.179.185.208
556 473 1.18 207.163.71.64
461 351 1.31 19.134.144.129
423 234 1.81 181.89.141.203
392 216 1.81 51.14.68.9
391 180 2.17 9.99.120.67
390 156 2.50 240.28.13.206
367 220 1.67 2.29.209.63
337 216 1.56 55.183.195.128
328 273 1.20 43.98.18.192

Note there are 39 blacklists ATM since a ton are being tested. The really
bad ones with 25 second averages always take about 25 seconds.
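The per-IP aggregation behind the tables above (total time, count, average, keyed on the reversed IP) is simple to reproduce. This is a sketch assuming a log of (reversed_ip, seconds) pairs, one per RBL query; the function name is illustrative.

```python
from collections import defaultdict

def per_ip_totals(records):
    """records: iterable of (reversed_ip, seconds) pairs, one per RBL
    query. Returns (total, count, average, reversed_ip) rows sorted by
    total time descending, matching the table columns above."""
    totals, counts = defaultdict(float), defaultdict(int)
    for ip, secs in records:
        totals[ip] += secs
        counts[ip] += 1
    rows = [(t, counts[ip], t / counts[ip], ip) for ip, t in totals.items()]
    return sorted(rows, reverse=True)
```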
Post by b***@hughes-family.org
Post by b***@hughes-family.org
This is somewhat contradictory with the idea that we need backgrounded
DNS lookups. I'm not sure what to think, but I assume we had a
foregrounded version at one point, right?
We *did* a long time ago, but Marc Merlin added the bgsend() code -- ie.
we already have backgrounded lookups, as far as I know. For RBL tests
anyway.
I *meant* to question whether backgrounded lookups really help performance
or hurt it. I suspect they help, but I'm not 100% sure about that. Maybe
the code just needs some refinement.
Post by b***@hughes-family.org
Not sure if we need bg lookups for lookup_ptr(), lookup_mx etc., since I
would guess any decent nameserver will already have looked up a lot of
that data anyway and loaded the glue records, so they should be quite
fast. Caching that data is probably helpful though.
I've done a few simple experiments with caching A and TXT records in the
current code and it doesn't seem to help at all, although my experiments
only cached successful lookups, not failed ones.
Post by b***@hughes-family.org
Big spamd machines may gain a benefit from a local cache of DNS results.
But big spamd machines should have a local DNS cache running anyway,
in the form of a caching nameserver, I should think! so I'm unsure
if there's really a need to add more code to SpamAssassin there...
I doubt we want to add our own DNS cache, but I'm wondering about some
short-term caching of failed lookups now.
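The short-term negative cache floated here might look something like the sketch below: remember failed lookups for a bounded TTL so repeated queries for the same dead name don't each pay the full resolver timeout. Class and method names are hypothetical, not SA code.

```python
import time

class NegativeDNSCache:
    """Remember *failed* lookups for `ttl` seconds so a repeat lookup
    of the same dead name can be skipped instead of timing out again."""
    def __init__(self, ttl=300, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock      # injectable for testing
        self.failures = {}      # name -> time the failure was recorded

    def record_failure(self, name):
        self.failures[name] = self.clock()

    def known_bad(self, name):
        recorded = self.failures.get(name)
        if recorded is None:
            return False
        if self.clock() - recorded > self.ttl:
            del self.failures[name]  # entry expired; retry the lookup
            return False
        return True
```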





b***@hughes-family.org
2003-04-16 13:30:12 UTC
Permalink
http://www.hughes-family.org/bugzilla/show_bug.cgi?id=1785





------- Additional Comments From ***@spamcop.net 2003-04-16 06:30 -------
dns2.colltech.com is dead, which is why the above lookup will have to fail the
first query before the second is attempted. What is particularly evil about
that situation is they have (for reasons beyond my understanding) the
dns2.colltech.com server listed as authoritative twice, so even in the best
case, 2/3 of all queries will fail outright. In reality, the dns1.colltech.com
server also fails the PTR lookup (it appears that this block is misdelegated).

However borken _this_ domain is, it highlights an aspect of this discussion that
I wanted to focus on: it is not appropriate for SA to make up for stupid admin
tricks. Unless you want to write an independent resolving client from within SA
code (i.e. make only non-recursive queries and walk the tree yourself), there is
no way to prevent these timeouts.

If you were to examine the response that was received after the timeout for a
query like the above that actually returned an answer (i.e. one server is down,
but another answers), you would see that the main query succeeded, even though
the subquery failed. You won't have anything to cache (i.e. NXDOMAIN or
temporary error) because the resolving client itself will try harder and get
an answer, even if it has to query multiple servers.

The only possible way to handle this in SA would be to set a shorter timeout on
the query and keep track of which lookups failed because of the timeout. The
problem with this is that some slow links will take longer than others, so it
would be unclear what timeout to use. Perhaps you could start out with a longer
timeout and then back off until three strikes, at which point you stop asking at
all (for some period of time).

I would argue that it is much more robust to rely on the machine's DNS resolver
to do what it takes, and just have SA use a timer to bail on _any_ query that
takes longer than X seconds (this should be a configuration parameter). Don't
try and keep track of previous fails; don't try and create a secondary DNS cache
inside SA; don't overthink this problem.
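The "bail on any query that takes longer than X seconds" approach can be sketched as a deadline wrapper around whatever resolver call is in use. This is illustrative Python, not SA's Perl; `timeout_secs` stands in for the configuration parameter proposed above.

```python
import concurrent.futures

def query_with_deadline(lookup, name, timeout_secs):
    """Run lookup(name), but give up and return None after
    timeout_secs, regardless of what the OS resolver is doing."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(lookup, name)
    try:
        return future.result(timeout=timeout_secs)
    except concurrent.futures.TimeoutError:
        return None
    finally:
        # Don't block waiting for a stuck resolver thread on the way out.
        pool.shutdown(wait=False)
```

One caveat worth noting: the abandoned worker thread may linger until its own resolver timeout expires; the wrapper only bounds how long the *caller* waits.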




Daniel Quinlan
2003-04-16 15:59:27 UTC
Permalink
Post by b***@hughes-family.org
However borken _this_ domain is, it highlights an aspect of this discussion
that I wanted to focus on: it is not appropriate for SA to make up for
stupid admin tricks. Unless you want to write an independent resolving
client from within SA code (i.e. make only non-recursive queries and walk
the tree yourself), there is no way to prevent these timeouts.
Prevent these timeouts, no, but SA *does* need to make up for stupid
admin tricks if they significantly affect filtering performance. The
longer DNS queries take (causing idle perl processes to build up), the
fewer sites can use DNS tests (or have to use shorter timeouts), the
more spam gets through.

I agree that we shouldn't write our own resolver. That would be an
insane extreme; I'm not sure why you think we'd want to do that.
Post by b***@hughes-family.org
The only possible way to handle this in SA would be to set a shorter timeout
on the query and keep track of which lookups failed because of the timeout.
The problem with this is that some slow links will take longer than others,
so it would be unclear what timeout to use. Perhaps you could start out
with a longer timeout and then back off until three strikes, at which point
you stop asking at all (for some period of time).
That could be helpful, but I doubt it's needed. What would help the
most, I think, would be remembering which RBLs are slow and stop using
them temporarily. I might be okay with a 30 second timeout here and
there, but if one blacklist is taking 20 seconds for every query, it
needs to be disabled until it's working better.

The current code doesn't really let us track per-RBL speed since we
harvest results in the same order every time and wait on them in order.
That might go on my list, though.

Skipping the RBLs that exceed a firm timeout is relatively easy, though.
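That firm-timeout skip could be as little state as a per-RBL penalty counter, along these lines (a sketch; names and the 100-message penalty are made up for illustration):

```python
class RBLSkipList:
    """An RBL that times out is skipped for the next `penalty`
    messages, then quietly given another chance."""
    def __init__(self, penalty=100):
        self.penalty = penalty
        self.skips_left = {}    # rbl zone -> messages still to skip

    def should_skip(self, zone):
        left = self.skips_left.get(zone, 0)
        if left > 0:
            self.skips_left[zone] = left - 1
            return True
        return False

    def record_timeout(self, zone):
        self.skips_left[zone] = self.penalty
```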
Post by b***@hughes-family.org
I would argue that it is much more robust to rely on the machine's DNS
resolver to do what it takes, and just have SA use a timer to bail on _any_
query that takes longer than X seconds (this should be a configuration
parameter). Don't try and keep track of previous fails; don't try and
create a secondary DNS cache inside SA; don't overthink this problem.
We already do exactly that. The DNS cache data structure in SA is only
used to allow us to background queries and run other CPU intensive tests
while the DNS queries are running. The contents of the cache are
currently thrown out after each message. I've done some simple
experiments with adding persistence, but only for successful positive
queries and it didn't improve performance (as expected since I run a
local DNS server that does caching).

I also figured out the cause of much of the slow RBL lookup time in my
mini-benchmark. It was partially that I wasn't accounting for the fact
we wait on each uncompleted RBL in succession (after all other tests are
done) and also a side-effect of running parallel checks (multiple RBL
lookups for the same IP could be slow or timing out at the same time and
all of them would be logged as slow).
--
Daniel Quinlan anti-spam (SpamAssassin), Linux, and open
http://www.pathname.com/~quinlan/ source consulting (looking for new work)


b***@hughes-family.org
2003-04-16 16:25:38 UTC
Permalink
http://www.hughes-family.org/bugzilla/show_bug.cgi?id=1785
Post by b***@hughes-family.org
I agree that we shouldn't write our own resolver. That would be an
insane extreme, I'm not sure why you think we'd want to do that.
I wasn't suggesting it; I was afraid that it would come up as a possible solution
to the problem. ;~)
Post by b***@hughes-family.org
Skipping the RBLs that exceed a firm timeout is relatively easy, though.
We are talking about two different things, methinks. Your example of a long
timeout was specifically an rDNS query, which SA is helpless to deal with. That
was my only point there.

It would be very nice to have a rolling average of response times from RBLs,
however, since they are much more predictable than a random ISP/admin's messed
up DNS (specifically rDNS). Then if there was a temporary problem with, say,
relays.osirusoft.com, because someone cut the fiber to the neighborhood, SA
could stop making queries to that server until it came back up. It would
require keeping track of when an RBL query timed out and re-adding that server
to the potential queue at less frequent intervals until response times come
back into an acceptable range. Sure, you say, more persistent storage, but how
else are you going to do it?
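One way to realize that rolling average is an exponential moving average per RBL zone, marking a server unusable once its smoothed response time exceeds a threshold. A sketch; the smoothing factor and the 10-second threshold are illustrative, not tuned values:

```python
class RBLHealth:
    """Track an exponential moving average of each RBL's response time
    and stop querying zones whose average drifts too high."""
    def __init__(self, alpha=0.2, max_avg_secs=10.0):
        self.alpha = alpha              # smoothing: weight of newest sample
        self.max_avg_secs = max_avg_secs
        self.avg = {}                   # zone -> smoothed seconds

    def record(self, zone, secs):
        prev = self.avg.get(zone, secs)  # first sample seeds the average
        self.avg[zone] = (1 - self.alpha) * prev + self.alpha * secs

    def usable(self, zone):
        return self.avg.get(zone, 0.0) <= self.max_avg_secs
```

The re-adding-at-longer-intervals part would sit on top of this: a zone that flips to unusable gets probed occasionally, and record() naturally pulls its average back down once the server recovers.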
Post by b***@hughes-family.org
We already do exactly that. The DNS cache data structure in SA is only
used to allow us to background queries and run other CPU intensive tests
while the DNS queries are running. The contents of the cache are
currently thrown out after each message.
You may want to take a look at firedns, which is a service that MessageWall uses
to perform parallel DNS queries either on its own or as a front end to an
existing caching resolver. I don't know if you want to have an independent
service which interfaces with the resolver library and handles all of the query
dispatches internally, but it is a very fast methodology. To be honest, I do
all of my harsh RBL blocking in MessageWall and then use SA to do the harder
categorization phase on the 60% that makes it past MW.


