Cedric Knight
2018-10-30 11:07:20 UTC
Hello
I thought of submitting a patch via Bugzilla, but then decided to first
ask and check that I understood the general principles of body checks,
and SpamAssassin's current approach to Unicode. Apologies for the length
of this message. I hope the main points make sense.
A fair number of webcam bitcoin 'sextortion' scams have evaded detection
and worried recipients because they include credentials relevant to the
recipient. (Incidentally, I assume the credentials and addresses are
mostly from the 2012 LinkedIn breach, but someone on the RIPE abuse list
reports that Mailman passwords were also used.) BITCOIN_SPAM_05 is catching some of
this spam, but on writing body regexes to catch the wave around 16
October, I noticed that my rules weren't matching because the source was
liberally injected with invisible characters:
Content preview: I a<U+200C>m a<U+200C>wa<U+200C>re blabla is one of
your pa<U+200C>ss. L<U+200C>ets g<U+200C>et strai<U+200C>ght
to<U+200C> po<U+200C>i<U+200C>nt. No<U+200C>t o<U+200C>n<U+200C>e
These characters are encoded as decimal HTML entities (&#8204;) in the
HTML part, and as UTF-8 byte sequences in the text/plain part.
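For anyone who wants to verify the character identity, here is the
trivial check I did (just an illustration, not part of any patch):

use strict;
use warnings;
use Encode qw(encode);

my $zwnj = "\x{200C}";                    # ZERO WIDTH NON-JOINER
printf "decimal HTML entity: &#%d;\n", ord $zwnj;   # prints &#8204;
printf "UTF-8 bytes:         %s\n",
  join ' ', map { sprintf '%02X', ord } split //, encode('UTF-8', $zwnj);
# prints E2 80 8C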
Without working these characters into a body rule pattern, that pattern
will not match, yet such Unicode 'format' characters barely affect
display or legibility, if at all. This could be a more general concern
about obfuscation. Invisible characters could be used to evade all the
ADVANCE_FEE* rules for example. There are over 150 non-printing 'Format'
characters in Unicode:
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Format:]
I find it counterintuitive that such non-printing characters match
[:print:] and [:graph:] rather than [:cntrl:], but this is how the
classes are defined at:
https://www.unicode.org/reports/tr18/#Compatibility_Properties
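A trivial way to see what the local Perl makes of this (the output
should follow the TR18 definitions above; I am only reporting what my
own Perl shows, so treat it as an illustration):

use strict;
use warnings;

my $zwnj = "\x{200C}";                    # ZERO WIDTH NON-JOINER
for my $class (qw(print graph cntrl space)) {
  printf "[:%s:]  %s\n", $class,
    $zwnj =~ /[[:$class:]]/ ? "matches" : "does not match";
}
printf "\\p{Format}  %s\n", $zwnj =~ /\p{Format}/ ? "matches" : "does not match";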
As minor points, 'Format' excludes a couple of separator characters in
the same range, which instead match [:space:]:
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:subhead=Format%20character:]
Then there is the C1 [:cntrl:] set, which some MUAs may render
silently, I think including the 0x9D byte matched by the recent
__UNICODE_OBFU_ZW (what is the significance of UNICODE in the rule name?):
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Control:]
Finally, there may be a case for also treating as 'almost invisible' the
narrow blanks such as U+200A (HAIR SPACE), U+202F (NARROW NO-BREAK
SPACE) and maybe U+205F (MEDIUM MATHEMATICAL SPACE). The Perl Unicode
database may not be completely up to date here, and Perl 5.18 doesn't
recognise U+061C, U+2066 or the U+1BCA1 range as \p{Format}, although
5.24 does.
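A quick way to see what a given Perl's Unicode database covers (nothing
SA-specific here, just a check script):

use strict;
use warnings;
use Unicode::UCD ();

printf "perl %vd, Unicode %s\n", $^V, Unicode::UCD::UnicodeVersion();
for my $cp (0x061C, 0x2066, 0x1BCA1) {
  printf "U+%04X %s \\p{Format} here\n", $cp,
    chr($cp) =~ /\p{Format}/ ? "is" : "is not";
}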
I've also seen many format characters in legitimate email, including in
the middle of 7-bit ASCII text. Google uses U+FEFF (the BOM) as a
zero-width word joiner (a use deprecated since 2002), and U+200C
apparently occurs in corporate sigs. So their mere presence isn't much
evidence of obfuscation. I presume they may prevent legitimate patterns
from being matched, including by Bayes.
So my patch was going to be something like the following, to eliminate
Format characters from get_rendered_body_text_array():
--- lib/Mail/SpamAssassin/Message.pm    (revision 1844922)
+++ lib/Mail/SpamAssassin/Message.pm    (working copy)
@@ -1167,6 +1167,8 @@
 $text =~ s/\n+\s*\n+/\x00/gs; # double newlines => null
 # $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace (incl. VT, NBSP) => space
 # $text =~ tr/ \t\n\r\x0b/ /s; # whitespace (incl. VT) => single space
+# do not render zero-width Unicode characters used as obfuscation:
+$text =~ s/[\p{Format}\N{U+200C}\N{U+2028}\N{U+2029}\N{U+061C}\N{U+180E}\N{U+2065}-\N{U+2069}]//gs;
 $text =~ s/\s+/ /gs; # Unicode whitespace => single space
 $text =~ tr/\x00/\n/; # null => newline
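As a stand-alone check of the idea (using the text from the sample above
and a made-up rule pattern):

use strict;
use warnings;

my $text = "I a\x{200C}m a\x{200C}wa\x{200C}re blabla is one of your pa\x{200C}ss.";
my $re   = qr/\bI am aware\b.{1,40}\bpass\b/;

print "before strip: ", ($text =~ $re ? "match" : "no match"), "\n";

# same character class as in the diff above
$text =~ s/[\p{Format}\N{U+200C}\N{U+2028}\N{U+2029}\N{U+061C}\N{U+180E}\N{U+2065}-\N{U+2069}]//g;

print "after strip:  ", ($text =~ $re ? "match" : "no match"), "\n";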
One problem here is that I'm not clear at this point whether $text is
intended to be a character string (UTF8 flag set) or a byte string, and
the code immediately following tests this with `if
utf8::is_utf8($text)`. \p{Format} includes U+00AD (soft hyphen), and the
byte 0xAD is also a continuation byte in the UTF-8 encoding of, for
example, the letter 'í' (U+00ED LATIN SMALL LETTER I WITH ACUTE, encoded
as 0xC3 0xAD), so it might be incorrectly removed if $text is a byte
string.
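If I've understood the semantics correctly, this is the sort of
corruption I'm worried about (purely an illustration; the words are
arbitrary):

use strict;
use warnings;
use Encode qw(encode);

my $chars = "caf\x{E9} con mar\x{ED}a";   # decoded character string
my $bytes = encode('UTF-8', $chars);      # raw UTF-8 bytes: ... 0xC3 0xA9 ... 0xC3 0xAD ...

(my $c = $chars) =~ s/\p{Format}//g;      # U+00E9 / U+00ED are not Format, so untouched
(my $b = $bytes) =~ s/\p{Format}//g;      # but the lone 0xAD byte looks like U+00AD SOFT HYPHEN

print "character string intact: ", ($c eq $chars ? "yes" : "NO"), "\n";
print "byte string intact:      ", ($b eq $bytes ? "yes" : "NO"), "\n";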
Prior to SA 3.4.1, it seems body rules would sometimes be matching
against a character string and sometimes against a byte string. This is
mentioned in bug 7490, where a single '.' was matching 'á' until SA
3.4.1. As a postscript to that bug, I suspect what was happening was
that 'normalize_charset 1' was set and _normalize() was attempting
utf8::downgrade() but failing, perhaps because the message contained
some non-Latin-1 text.
On the other hand, will `s/\s+/ /gs` fail to normalise all the Unicode
[:blank:] characters correctly unless $text is marked as a character
string? What are the design decisions here? Can I find them on this
list, the wiki or elsewhere? Also, what is the approach to the 7-bit
control characters [\x00-\x1f\x7f]?
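A minimal check of what I mean (assuming the rendered text, when not
decoded, really is raw UTF-8 bytes):

use strict;
use warnings;
use Encode qw(encode);

my $chars = "pay\x{A0}\x{2003}now";       # NBSP + EM SPACE between the words
my $bytes = encode('UTF-8', $chars);

(my $c = $chars) =~ s/\s+/ /gs;
(my $b = $bytes) =~ s/\s+/ /gs;

print "character string: ", ($c eq "pay now" ? "normalised" : "not normalised"), "\n";
print "byte string:      ", ($b eq "pay now" ? "normalised" : "not normalised"), "\n";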
Here are some significant commits that seem to be work to make the
process of decoding and rendering more reliable and more like email
client display, but which don't solve the format character issue:
http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Message.pm?r1=1707582&r2=1707597
http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Message/Node.pm?r1=1749286&r2=1749798
IMHO it would be nice if it were possible to change related behaviour
via a plugin, at the parsed_metadata() or start_rules() hook, but AFAICS
there is no way for a plugin to alter the rendered message. You can use
`replace_rules`/`replace_tag` to pre-process a rule (this fuzziness has
the advantage that the same code-point may obfuscate, say, both I and L,
but doesn't help much with invisible characters at the moment). However,
there is nothing to pre-process and canonicalise the text being matched
to simplify rule writing.
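For completeness, this is roughly what the ReplaceTags workaround looks
like today (the rule name, tag and score are my own invention, and I've
had to guess that the rendered body is raw UTF-8 bytes, which is exactly
the ambiguity above):

loadplugin Mail::SpamAssassin::Plugin::ReplaceTags

replace_start <
replace_end   >

# <ZW> = optional zero-width character, written as UTF-8 byte sequences
# (U+200B..U+200D, plus U+FEFF used as a joiner)
replace_tag   ZW   (?:\xe2\x80[\x8b-\x8d]|\xef\xbb\xbf)?
body          LOCAL_PASS_OBFU  /p<ZW>a<ZW>s<ZW>s<ZW>w<ZW>o<ZW>r<ZW>d/i
describe      LOCAL_PASS_OBFU  "password" with zero-width characters injected
score         LOCAL_PASS_OBFU  0.1
replace_rules LOCAL_PASS_OBFU

If the body were instead matched as a decoded character string, the tag
would presumably have to be written with \x{200B}-style escapes, which
is part of what prompts the questions below.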
I have often been unclear on what I need to do to get a body rule to
match accented or Cyrillic characters, sometimes checking the byte
stream in different encodings and transcribing to hex by hand. 'rawbody'
rules should no doubt match the encoded 'raw' data, but I wonder if
'body' rules would work better if they concentrated on the meaning of
the words without having to worry about multiple possible encodings and
transmission systems. So if I can venture a radical suggestion, should
body rules actually match against a character string, as they have
sometimes been doing apparently unintentionally? Could this be a
configuration setting, as a function of or in addition to normalize_charset?
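To make the difference concrete, this is what the two models mean for a
rule writer (the Cyrillic word and patterns are just examples):

use strict;
use warnings;
use Encode qw(encode);

my $word  = "\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}";   # a Cyrillic word as characters
my $bytes = encode('UTF-8', $word);

# what one could write if body text were a decoded character string:
print "char pattern: ",
  ($word =~ /\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}/ ? "match" : "no match"), "\n";

# what has to be written today against raw UTF-8 bytes:
print "byte pattern: ",
  ($bytes =~ /\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82/ ? "match" : "no match"), "\n";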
Very little cannot be represented in a character string, which seems to
be Perl's preferred model since version 5.8. Although there may be some
obscure encodings that could require some work to decode, is it better
to decode and normalise what can be decoded reasonably reliably, and
represent the rest as Unicode code points with the same value as the
bytes? (That should match \xNN for rare encodings.) Is there still a
performance issue? To make such functionality (if enabled) as compatible
as possible with existing rulesets, the Conf module might detect valid
UTF-8 literals in body regexes and decode those; and where there are
\xNN escape sequences (as in up to 62 subrules in the main rules), those
that form valid contiguous UTF-8 could be decoded too. Where there are
more complex sequences, as in __BENEFICIARY or
\xef(?:\xbf[\xb9-\xbb]|\xbb\xbf), perhaps those should have been rawbody
rules anyway, or be rewritten to be encoding-independent and avoid
Unicode subtleties like the Format characters?
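Very roughly, and not claiming this is how Conf.pm would actually do it,
the rewriting step I have in mind looks like this (decode_byte_escapes
is a made-up name):

use strict;
use warnings;
use Encode qw(decode);

# Find runs of \xNN escapes in a rule's pattern text and, where the bytes
# form valid UTF-8, rewrite them as \x{....} character escapes.
sub decode_byte_escapes {
  my ($pattern) = @_;
  $pattern =~ s{((?:\\x[0-9a-fA-F]{2}){2,})}{
    my $run   = $1;
    my $bytes = join '', map { chr hex } $run =~ /\\x([0-9a-fA-F]{2})/g;
    my $chars = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    defined $chars
      ? join '', map { sprintf '\\x{%04X}', ord } split //, $chars
      : $run;                              # not valid UTF-8: leave the escapes alone
  }ge;
  return $pattern;
}

print decode_byte_escapes('/\xef\xbb\xbf/'), "\n";      # -> /\x{FEFF}/
print decode_byte_escapes('/\xd0\xbf\xd1\x80/'), "\n";  # -> /\x{043F}\x{0440}/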
I'd be grateful for advice as to whether there's merit in filing these
concerns as one or more issues on Bugzilla, or for relevant background.
CK