North American Network Operators Group
2006.06.06 NANOG-NOTES network-level spam behaviour
- From: Matthew Petach
- Date: Wed Jun 07 07:19:12 2006
2006.06.06 Nick Feamster, Network-level spam behaviour
[slides are at:
unsolicited commercial email
feb 2005, 90% of all email is spam
common filtering techniques are
DNS blacklist queries are a significant fraction
of DNS traffic today. (DNSBLs)
Using IP address based spam black lists isn't so
How spammers evade blacklists will be discussed
Problems with content-based filters
...uh oh, some technical glitches...
Content-based properties are malleable
low cost to evasion
altering content based on scripts is too easy
customized emails are easy to generate
content based filters need fuzzy hashes over
high cost to filter maintainers
as content changes, filters need to be updated.
constantly tweaking SpamAssassin rules is a pain.
false positives are always an issue.
Content-based filters are applied at the destination
too little, too late -- wasted network bandwidth,
storage, etc.; many users receive and store the
same spam content.
Network level spam filtering is robust (hypothesis)
network-level properties are more fixed
hosting or upstream ISP (AS number)
location in the network
IP address block
are there common ISPs that host the spammers, for
Avoid receiving mail from machines that are part
Challenge--which properties are most useful for
distinguishing spam traffic from legitimate email?
very little if anything is known about these
Randy gave a lightning talk last NANOG about some
Some properties listed.
mostly botnets, of course
other techniques too
we're trying to quantify this
how we're doing this
correlations with Bobax victims
from georgia tech botnet sinkhole
other possibilities: heuristics
distance of client IP from the MX record
coordinated, low-bandwidth sending
looked at pcaps from a hijacked command and control
station, capturing bots trying to talk to it; these
are spamming bots from the Bobax drone botnet, which
is used exclusively to send spam.
two domains instrumented with MailAvenger (both on
the same network)
sinkhole domain 1
continuous spam collection since aug 2004
no real email addresses--sink everything
10 million + pieces of spam
sinkhole domain #2
recently registered Nov 2005
"clean control" domain posted at a few places
not much spam yet--perhaps being too conservative
contact page with random email contact, look at
who crawls, and then who spams the unique email
Monitoring BGP route advertisements from same network
Also capturing traceroutes, DNSBL results, passive
TCP host fingerprinting, simultaneous with spam arrival
(results in this talk focus on BGP+ spam only)
Mail Avenger is not an MTA; it sits in front of the
MTA and hands off to sendmail or postfix. It does
things like DNSBL lookups, adding headers, and passive
OS fingerprinting, as the spam is arriving.
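For reference, a DNSBL lookup is an ordinary DNS query for the client address with its octets reversed, placed under the blacklist's zone. A minimal sketch of building that query name follows. The function name and the zone `dnsbl.example.org` are placeholders, not anything from the talk, and no DNS query is actually sent here.

```python
import ipaddress


def dnsbl_query_name(client_ip: str, zone: str) -> str:
    """Build the DNS name a DNSBL lookup would query for an IPv4 client.

    Hypothetical helper for illustration; the zone is a placeholder.
    """
    # Validate the input; raises ValueError on a malformed address.
    addr = ipaddress.IPv4Address(client_ip)
    # Reverse the octets: 192.0.2.99 becomes 99.2.0.192
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
# prints 99.2.0.192.dnsbl.example.org
```

A listed address resolves to an A record under that name; an unlisted one returns NXDOMAIN.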
Also logged BGP routes from same network that got
the spam; see connectivity to the spamming machine
at the time.
Picture of collection up at MIT network.
Mail Collection: MailAvenger
best guess at operating system, p0f, DNSBL
lookups, traceroutes back to mail relay at the
time the mail was sent (used for debugging BGP)
distribution across IP space
plot /24 prefix vs how much spam coming from it.
steeper lines mean more spam from that part
of the IP space; you can see where spam is
coming from. bunch comes from apnic, cable
modem space, etc.
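A plot like this can be approximated by bucketing client IPs into /24s and accumulating counts in address order. The sketch below is only an illustration of that idea, not the speaker's tooling; the function names and sample addresses are made up.

```python
import ipaddress
from collections import Counter


def spam_per_slash24(client_ips):
    """Count spam messages per /24 (one list entry per message)."""
    counts = Counter()
    for ip in client_ips:
        # strict=False masks the host bits, giving the enclosing /24.
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        counts[net] += 1
    return counts


def cumulative_by_prefix(counts):
    """Return (prefix, cumulative fraction of spam) in address order.

    Steep jumps between consecutive points mark the parts of the
    address space that send the most spam.
    """
    total = sum(counts.values())
    running = 0
    points = []
    for net in sorted(counts):  # IPv4Network objects sort by address
        running += counts[net]
        points.append((str(net), running / total))
    return points


# Made-up sample input, for illustration only.
sample = ["10.0.0.1", "10.0.0.200", "192.0.2.5", "10.0.1.9"]
for prefix, fraction in cumulative_by_prefix(spam_per_slash24(sample)):
    print(prefix, fraction)
```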
few interesting things to note; still redoing
legitimate mail characteristics.
from georgia tech mail machines, it's legit plus
spam, need to split out better.
between 90.* and 180.*, legitimate mail mainly.
Is IP-based blacklisting enough?
Probably not: more than half of spamming client IPs
appear less than twice.
Roughly 50% of the IPs showed up less than twice;
but that's a single sinkhole domain, would help
more across multiple domains.
emphasizes need to collaborate across multiple
domains to build blacklists; any one domain
won't see repeated patterns of IPs.
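The "appears less than twice" measurement, and why pooling domains helps, can be sketched as below. This is a toy illustration with invented addresses and a hypothetical function name, not the analysis from the talk.

```python
from collections import Counter


def fraction_seen_once(sightings):
    """Fraction of distinct client IPs that appear exactly once.

    sightings: iterable of client IPs, one entry per spam message.
    """
    counts = Counter(sightings)
    if not counts:
        return 0.0
    once = sum(1 for n in counts.values() if n == 1)
    return once / len(counts)


# Invented per-domain logs, for illustration only.
domain_a = ["198.51.100.7", "203.0.113.9", "198.51.100.7"]
domain_b = ["203.0.113.9", "192.0.2.44"]

print(fraction_seen_once(domain_a))             # one domain alone
print(fraction_seen_once(domain_a + domain_b))  # pooled across domains
```

In the toy data, 203.0.113.9 looks like a one-off to each domain individually but is a repeat sender once the logs are pooled.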
Distribution across ASes
40% of spam coming from the US
BGP spectrum agility
Log IP addresses of SMTP relays
Join with BGP route advertisements seen at network
where spam trap is co-located.
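One way to picture the join: for each spam arrival, find the most specific route that covered the relay's IP and was announced at that moment. The sketch below assumes the BGP log has already been reduced to (prefix, announce time, withdraw time) records; `RouteEpoch` and `covering_route` are hypothetical names, and the prefixes and times are invented.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class RouteEpoch:
    prefix: str                 # e.g. "10.0.0.0/8"
    announced: float            # seconds, announce time
    withdrawn: Optional[float]  # None if never withdrawn in the log


def covering_route(ip, arrival, epochs):
    """Return the longest-prefix route active at `arrival` that covers `ip`."""
    addr = ipaddress.ip_address(ip)
    best = None
    best_len = -1
    for epoch in epochs:
        net = ipaddress.ip_network(epoch.prefix)
        active = epoch.announced <= arrival and (
            epoch.withdrawn is None or arrival < epoch.withdrawn
        )
        if active and addr in net and net.prefixlen > best_len:
            best, best_len = epoch, net.prefixlen
    return best


# Invented data: a /8 announced for ten minutes, plus a stable /16.
epochs = [
    RouteEpoch("10.0.0.0/8", 1000.0, 1600.0),
    RouteEpoch("10.1.0.0/16", 0.0, None),
]
print(covering_route("10.2.3.4", 1200.0, epochs))  # covered only by the /8
print(covering_route("10.2.3.4", 2000.0, epochs))  # None: the /8 is gone
```

A linear scan is fine for a sketch; a real join over a full table would want a prefix trie or interval index.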
A small club of persistent players appears to be using
somewhere between 1-10% of all spam (some clearly
intentional, others might be flapping)
about 10 minute announcement time of the /8 while
spam is flooded out.
Might be interesting to couple this with route
hijacking alerting to filter out if this is
really a hijacking vs a flapping legitimate route.
A slightly different pattern;
announce-spam-withdraw on a minute-by-minute basis.
really really egregious!
Why such big prefixes?
flexibility: client IPs can be scattered throughout
dark space within a large /8
same sender usually returns with different IP
visibility: route typically won't be filtered (nice
and short prefix length)
Characteristics of IP-agile senders
IP addresses are widely distributed across the /8 space
IP addresses typically appear only once at the sinkhole
Depending on which /8, 60-80% of these IP addresses
were not reachable by traceroute when we spot-checked
some IP addresses were in allocated, albeit unannounced
Some AS paths associated with the routes contained
reserved AS numbers
Odd AS numbers injected, usually well-known to make
it look more legitimate.
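A check for reserved AS numbers in a path could look like the sketch below. The talk did not say which reserved numbers were seen; the ranges here are the 16-bit special-purpose values in the present-day IANA registry (the list in 2006 may have differed), and the function names and sample path are made up.

```python
def is_special_purpose_asn(asn: int) -> bool:
    """True for 16-bit AS numbers that should not originate real routes.

    Ranges follow today's IANA special-purpose AS number registry;
    this is an assumption, not the list used in the talk.
    """
    return (
        asn == 0                    # reserved
        or asn == 23456             # AS_TRANS
        or 64496 <= asn <= 64511    # documentation
        or 64512 <= asn <= 65534    # private use
        or asn == 65535             # reserved
    )


def suspicious_asns(as_path):
    """Return the special-purpose AS numbers found in an AS path."""
    return [asn for asn in as_path if is_special_purpose_asn(asn)]


# Made-up AS path, for illustration only.
print(suspicious_asns([1000, 64512, 2000, 65535]))
```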
Length of short-lived BGP epochs
10% of spam coming from short-lived BGP events
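Epoch length here means the time between a prefix being announced and being withdrawn. A minimal sketch of deriving it from a time-ordered event log follows. The event tuple layout, the "A"/"W" markers, and the function name are assumptions for illustration, not the format of any particular BGP feed.

```python
def epoch_lengths(events):
    """Compute (prefix, seconds announced) for each announce/withdraw pair.

    events: iterable of (timestamp, prefix, kind), sorted by timestamp,
    where kind is "A" (announce) or "W" (withdraw). Hypothetical format.
    """
    open_since = {}
    lengths = []
    for ts, prefix, kind in events:
        if kind == "A":
            # A repeated announcement keeps the original start time.
            open_since.setdefault(prefix, ts)
        elif kind == "W" and prefix in open_since:
            lengths.append((prefix, ts - open_since.pop(prefix)))
    return lengths


# Invented log: one ten-minute epoch, then a one-minute epoch.
events = [
    (0, "10.0.0.0/8", "A"),
    (600, "10.0.0.0/8", "W"),
    (700, "10.0.0.0/8", "A"),
    (760, "10.0.0.0/8", "W"),
]
print(epoch_lengths(events))
```

Prefixes still announced at the end of the log produce no entry, so long-lived routes are not miscounted as short epochs.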
Spam from Botnets
approximate size: 100k bots
one sinkhole domain--this is ONLY spam that is
verifiable as coming from bots: IPs seen at the
hijacked command and control, intersected with the
single sinkhole domain. So a much smaller data
subset, but well correlated and verified.
Proportionally less spam from bots in 61-90
range; that tends to be where BGP route hijacks
Most Bot IP addresses do not return
65% of bots only send mail to a domain once over
Some hang around for a *long* time.
About 20% stick around for several months.
collaborative spam filtering seems to be helping
track bot IP addresses.
Most bots send low volumes of spam
most bot IP addresses send very little spam regardless
of how long they have been spamming
Effectiveness of blacklisting:
only about half of the IPs spamming from short-lived
BGP are listed in any blacklist
spam from IP-agile senders tends to be listed in fewer
Looking at 8 different spam blacklists, checking when
the spam arrives at the sinkhole.
Known Bobax drones listed in more DNSBLs than the
BGP agile senders.
About 90-95% of the Bobax bot drones are listed
in one or more DNSBLs.
Suggests some of the spamming bots are listed more
than other techniques--that is, bots are easier to
identify than BGP-agile spammers or spammers using
tracking web-based harvesting
register domain, set up MX record
post, link to page with randomly generated email addresses
a flood of email for a phishing attack for paypal.com
all to: addresses harvested in a single crawl on
January 16th 2006
emails received from IPs different from those who
X-mailer headers totally different.
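The harvesting experiment boils down to handing each page visitor a unique address and remembering who got it, so that later spam to that address points back at the crawl. A minimal sketch, assuming an in-memory table; the class name, method names, and the domain `sinkhole.example` are all hypothetical.

```python
import secrets


class HarvestTracker:
    """Issue one-off addresses and map received spam back to the crawl."""

    def __init__(self, domain):
        self.domain = domain.lower()
        self.issued = {}  # address -> (crawler IP, visit time)

    def address_for_visit(self, crawler_ip, visit_time):
        # Random local part, so each page view gets an unguessable address.
        local = secrets.token_hex(8)
        address = f"{local}@{self.domain}"
        self.issued[address] = (crawler_ip, visit_time)
        return address

    def trace(self, rcpt_to):
        """Return (crawler IP, visit time) for a recipient, or None."""
        return self.issued.get(rcpt_to.lower())


tracker = HarvestTracker("sinkhole.example")
addr = tracker.address_for_visit("192.0.2.10", 1000.0)
# Later, when spam arrives for `addr`, compare the sending IP with:
print(tracker.trace(addr))
```

Comparing the traced crawler IP with the IP that delivers the spam is what shows harvesters and senders to be different hosts.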
Lessons for better spam filters:
effective spam filtering requires a better notion of
distribution of spamming IP addresses is highly
detection based on network-wide, aggregate behaviour
may be more fruitful than focusing on individual IPs
large, emergent properties.
two critical pieces of the puzzle
securing the internet's routing infrastructure
compare distributions of spam to legitimate mail,
see if certain spaces are more likely to send spam
than legitimate mail.
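The proposed comparison could be sketched per /8 as below: for each first octet, what share of observed mail sources were spam sources. This is an illustration of the idea only; the function names and input addresses are invented, and a real study would need the cleaner legitimate-mail data mentioned earlier.

```python
from collections import Counter


def first_octet(ip):
    """Return the /8 an IPv4 address falls in, as an integer."""
    return int(ip.split(".")[0])


def spam_share_by_slash8(spam_ips, legit_ips):
    """Map each observed /8 to spam / (spam + legitimate) message counts."""
    spam = Counter(first_octet(ip) for ip in spam_ips)
    legit = Counter(first_octet(ip) for ip in legit_ips)
    shares = {}
    for octet in sorted(set(spam) | set(legit)):
        # Every octet here was seen at least once, so total is never zero.
        total = spam[octet] + legit[octet]
        shares[octet] = spam[octet] / total
    return shares


# Invented inputs, for illustration only.
spam_ips = ["10.1.1.1", "10.2.2.2", "172.16.1.1"]
legit_ips = ["172.16.2.2", "172.16.3.3", "172.16.4.4"]
print(spam_share_by_slash8(spam_ips, legit_ips))
```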
Q: Steve Bellovin, columbia university
bots from strange ASes, is tunnelling taking
place from bots to BGP speakers?
A: Not sure if there's evidence or not; some data
but Tor's latency may be too high.
Q: Fingerprinting to try to identify who is doing
things, see how many hosts are actually doing
Many addresses being used, how many hosts
does it actually represent?
A: Not sure, haven't checked that.
Haven't checked on aliasing, since not much
was seen from a single IP.
What about hosts hopping? (same host using multiple
Not sure, they didn't do that correlation.
Q: Randy Bush, IIJ, they did do OS fingerprinting,
so some of that is in the paper.
didn't do anything with the traceroutes, though.
Q: Matt asks what the difference between the two
domains was; was one of them a recognizable word
or name, or were they both random character strings?
A: they were both random character strings, but one
of them had been used to host a real website for a
while, which might explain why it gets such a huge
volume of spam compared to the other.
Q: Matt points out that for some networks, receiving
spam is actually a good thing, as it helps balance
out traffic ratios, which helps during peering
Q: Randy Bush, IIJ, responding to Matt about traffic
ratios: only backbones that are on ADSL should
care which way traffic goes. :P
Curious to work with large networks, see if filters
could be installed to detect it, and possibly take