Cedric Knight
2018-10-30 11:07:20 UTC
Hello
I thought of submitting a patch via Bugzilla, but then decided to first
ask and check that I understood the general principles of body checks,
and SpamAssassin's current approach to Unicode. Apologies for the length
of this message. I hope the main points make sense.
A fair number of webcam bitcoin 'sextortion' scams have evaded detection
and worried recipients because they include credentials relevant to the
recipient. (Incidentally, I assume the credentials and addresses are
mostly from the 2012 LinkedIn breach, but someone on the RIPE abuse list
reports that Mailman passwords were also used.) BITCOIN_SPAM_05 is catching some of
this spam, but on writing body regexes to catch the wave around 16
October, I noticed that my rules weren't matching because the source was
liberally injected with invisible characters:
Content preview: I a<U+200C>m a<U+200C>wa<U+200C>re blabla is one of
your pa<U+200C>ss. L<U+200C>ets g<U+200C>et strai<U+200C>ght
to<U+200C> po<U+200C>i<U+200C>nt. No<U+200C>t o<U+200C>n<U+200C>e
These characters are encoded as decimal HTML entities (&#8204;) in the
HTML part, and as UTF-8 byte sequences in the text/plain part.
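For anyone who wants to verify the character identity, here is the
trivial check I did (just an illustration, not part of any patch):

use strict;
use warnings;
use Encode qw(encode);

my $zwnj = "\x{200C}";                    # ZERO WIDTH NON-JOINER
printf "decimal HTML entity: &#%d;\n", ord $zwnj;   # prints &#8204;
printf "UTF-8 bytes:         %s\n",
  join ' ', map { sprintf '%02X', ord } split //, encode('UTF-8', $zwnj);
# prints E2 80 8C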
Without working these characters into a body rule pattern, that pattern
will not match, yet such Unicode 'format' characters barely affect
display or legibility, if at all. This could be a more general concern
about obfuscation. Invisible characters could be used to evade all the
ADVANCE_FEE* rules for example. There are over 150 non-printing 'Format'
characters in Unicode:
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Format:]
I find it counterintuitive that such non-printing characters match
[:print:] and [:graph:] rather than [:cntrl:], but this is how the
classes are defined at:
https://www.unicode.org/reports/tr18/#Compatibility_Properties
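A trivial way to see what the local Perl makes of this (the output
should follow the TR18 definitions above; I am only reporting what my
own Perl shows, so treat it as an illustration):

use strict;
use warnings;

my $zwnj = "\x{200C}";                    # ZERO WIDTH NON-JOINER
for my $class (qw(print graph cntrl space)) {
  printf "[:%s:]  %s\n", $class,
    $zwnj =~ /[[:$class:]]/ ? "matches" : "does not match";
}
printf "\\p{Format}  %s\n", $zwnj =~ /\p{Format}/ ? "matches" : "does not match";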
As minor points, 'Format' excludes a couple of separator characters in
the same range, which instead match [:space:]:
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:subhead=Format%20character:]
Then there is the C1 [:cntrl:] set, which some MUAs may render
silently, I think including the 0x9D byte matched by the recent
__UNICODE_OBFU_ZW (what is the significance of UNICODE in the rule name?):
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Control:]
Finally, there may be a case for also treating as 'almost invisible' the
narrow blanks such as U+200A (HAIR SPACE), U+202F (NARROW NO-BREAK
SPACE) and maybe U+205F (MEDIUM MATHEMATICAL SPACE). The Perl Unicode
database may not be completely up to date here, and Perl 5.18 doesn't
recognise U+061C, U+2066 or the U+1BCA1 range as \p{Format}, although
5.24 does.
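A quick way to see what a given Perl's Unicode database covers (nothing
SA-specific here, just a check script):

use strict;
use warnings;
use Unicode::UCD ();

printf "perl %vd, Unicode %s\n", $^V, Unicode::UCD::UnicodeVersion();
for my $cp (0x061C, 0x2066, 0x1BCA1) {
  printf "U+%04X %s \\p{Format} here\n", $cp,
    chr($cp) =~ /\p{Format}/ ? "is" : "is not";
}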
I've also seen many format characters in legitimate email, including in
the middle of 7-bit ASCII text. Google uses U+FEFF (the BOM) as a
zero-width word joiner (a use deprecated since 2002), and U+200C
apparently occurs in corporate sigs. So their mere presence isn't much
evidence of obfuscation. I presume they may prevent legitimate patterns
from being matched, including by Bayes.
So my patch was going to be something like the following, to eliminate
Format characters from get_rendered_body_text_array():
--- lib/Mail/SpamAssassin/Message.pm    (revision 1844922)
+++ lib/Mail/SpamAssassin/Message.pm    (working copy)
@@ -1167,6 +1167,8 @@
 $text =~ s/\n+\s*\n+/\x00/gs; # double newlines => null
 # $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace (incl. VT, NBSP) => space
 # $text =~ tr/ \t\n\r\x0b/ /s; # whitespace (incl. VT) => single space
+# do not render zero-width Unicode characters used as obfuscation:
+$text =~ s/[\p{Format}\N{U+200C}\N{U+2028}\N{U+2029}\N{U+061C}\N{U+180E}\N{U+2065}-\N{U+2069}]//gs;
 $text =~ s/\s+/ /gs; # Unicode whitespace => single space
 $text =~ tr/\x00/\n/; # null => newline
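As a stand-alone check of the idea (using the text from the sample above
and a made-up rule pattern):

use strict;
use warnings;

my $text = "I a\x{200C}m a\x{200C}wa\x{200C}re blabla is one of your pa\x{200C}ss.";
my $re   = qr/\bI am aware\b.{1,40}\bpass\b/;

print "before strip: ", ($text =~ $re ? "match" : "no match"), "\n";

# same character class as in the diff above
$text =~ s/[\p{Format}\N{U+200C}\N{U+2028}\N{U+2029}\N{U+061C}\N{U+180E}\N{U+2065}-\N{U+2069}]//g;

print "after strip:  ", ($text =~ $re ? "match" : "no match"), "\n";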
One problem here is that I'm not clear at this point whether $text is
intended to be a character string (UTF8 flag set) or a byte string, and
the code immediately following tests this with `if
utf8::is_utf8($text)`. \p{Format} includes U+00AD (soft hyphen), and the
byte 0xAD is also a continuation byte in the UTF-8 encoding of, for
example, the letter 'í' (U+00ED LATIN SMALL LETTER I WITH ACUTE, encoded
as 0xC3 0xAD), so it might be incorrectly removed if $text is a byte
string.
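If I've understood the semantics correctly, this is the sort of
corruption I'm worried about (purely an illustration; the words are
arbitrary):

use strict;
use warnings;
use Encode qw(encode);

my $chars = "caf\x{E9} con mar\x{ED}a";   # decoded character string
my $bytes = encode('UTF-8', $chars);      # raw UTF-8 bytes: ... 0xC3 0xA9 ... 0xC3 0xAD ...

(my $c = $chars) =~ s/\p{Format}//g;      # U+00E9 / U+00ED are not Format, so untouched
(my $b = $bytes) =~ s/\p{Format}//g;      # but the lone 0xAD byte looks like U+00AD SOFT HYPHEN

print "character string intact: ", ($c eq $chars ? "yes" : "NO"), "\n";
print "byte string intact:      ", ($b eq $bytes ? "yes" : "NO"), "\n";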
Prior to SA 3.4.1, it seems body rules would sometimes be matching
against a character string and sometimes against a byte string. This is
mentioned in bug 7490, where a single '.' was matching 'á' until SA
3.4.1. As a postscript to that bug, I suspect what was happening was
that 'normalize_charset 1' was set and _normalize() was attempting
utf8::downgrade() but failing, perhaps because the message contained
some non-Latin-1 text.
On the other hand, will `s/\s+/ /gs` fail to normalise all the Unicode
[:blank:] characters correctly unless $text is marked as a character
string? What are the design decisions here? Can I find them on this
list, the wiki or elsewhere? Also, what is the approach to the 7-bit
control characters [\x00-\x1f\x7f]?
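A minimal check of what I mean (assuming the rendered text, when not
decoded, really is raw UTF-8 bytes):

use strict;
use warnings;
use Encode qw(encode);

my $chars = "pay\x{A0}\x{2003}now";       # NBSP + EM SPACE between the words
my $bytes = encode('UTF-8', $chars);

(my $c = $chars) =~ s/\s+/ /gs;
(my $b = $bytes) =~ s/\s+/ /gs;

print "character string: ", ($c eq "pay now" ? "normalised" : "not normalised"), "\n";
print "byte string:      ", ($b eq "pay now" ? "normalised" : "not normalised"), "\n";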
Here are some significant commits that seem to be work to make the
process of decoding and rendering more reliable and more like email
client display, but which don't solve the format character issue:
http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Message.pm?r1=1707582&r2=1707597
http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Message/Node.pm?r1=1749286&r2=1749798
IMHO it would be nice if it were possible to change related behaviour
via a plugin, at the parsed_metadata() or start_rules() hook, but AFAICS
there is no way for a plugin to alter the rendered message. You can use
`replace_rules`/`replace_tag` to pre-process a rule (this fuzziness has
the advantage that the same code-point may obfuscate, say, both I and L,
but doesn't help much with invisible characters at the moment). However,
there is nothing to pre-process and canonicalise the text being matched
to simplify rule writing.
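For completeness, this is roughly what the ReplaceTags workaround looks
like today (the rule name, tag and score are my own invention, and I've
had to guess that the rendered body is raw UTF-8 bytes, which is exactly
the ambiguity above):

loadplugin Mail::SpamAssassin::Plugin::ReplaceTags

replace_start <
replace_end   >

# <ZW> = optional zero-width character, written as UTF-8 byte sequences
# (U+200B..U+200D, plus U+FEFF used as a joiner)
replace_tag   ZW   (?:\xe2\x80[\x8b-\x8d]|\xef\xbb\xbf)?
body          LOCAL_PASS_OBFU  /p<ZW>a<ZW>s<ZW>s<ZW>w<ZW>o<ZW>r<ZW>d/i
describe      LOCAL_PASS_OBFU  "password" with zero-width characters injected
score         LOCAL_PASS_OBFU  0.1
replace_rules LOCAL_PASS_OBFU

If the body were instead matched as a decoded character string, the tag
would presumably have to be written with \x{200B}-style escapes, which
is part of what prompts the questions below.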
I have often been unclear on what I need to do to get a body rule to
match accented or Cyrillic characters, sometimes checking the byte
stream in different encodings and transcribing to hex by hand. 'rawbody'
rules should no doubt match the encoded 'raw' data, but I wonder if
'body' rules would work better if they concentrated on the meaning of
the words without having to worry about multiple possible encodings and
transmission systems. So if I can venture a radical suggestion, should
body rules actually match against a character string, as they have
sometimes been doing apparently unintentionally? Could this be a
configuration setting, as a function of or in addition to normalize_charset?
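To make the difference concrete, this is what the two models mean for a
rule writer (the Cyrillic word and patterns are just examples):

use strict;
use warnings;
use Encode qw(encode);

my $word  = "\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}";   # a Cyrillic word as characters
my $bytes = encode('UTF-8', $word);

# what one could write if body text were a decoded character string:
print "char pattern: ",
  ($word =~ /\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}/ ? "match" : "no match"), "\n";

# what has to be written today against raw UTF-8 bytes:
print "byte pattern: ",
  ($bytes =~ /\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82/ ? "match" : "no match"), "\n";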
Very little cannot be represented in a character string, which seems to
be Perl's preferred model since version 5.8. Although there may be some
obscure encodings that could require some work to decode, is it better
to decode and normalise what can be decoded reasonably reliably, and
represent the rest as Unicode code points with the same value as the
bytes? (That should match \xNN for rare encodings.) Is there still a
performance issue? To make such functionality (if enabled) as compatible
as possible with existing rulesets, the Conf module might detect valid
UTF-8 literals in body regexes and decode those; and where there are
\xNN escape sequences (as in up to 62 subrules in the main rules), those
that form valid contiguous UTF-8 could be decoded too. Where there are
more complex sequences, as in __BENEFICIARY or
\xef(?:\xbf[\xb9-\xbb]|\xbb\xbf), perhaps those should have been rawbody
rules anyway, or be rewritten to be encoding-independent and avoid
Unicode subtleties like the Format characters?
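Very roughly, and not claiming this is how Conf.pm would actually do it,
the rewriting step I have in mind looks like this (decode_byte_escapes
is a made-up name):

use strict;
use warnings;
use Encode qw(decode);

# Find runs of \xNN escapes in a rule's pattern text and, where the bytes
# form valid UTF-8, rewrite them as \x{....} character escapes.
sub decode_byte_escapes {
  my ($pattern) = @_;
  $pattern =~ s{((?:\\x[0-9a-fA-F]{2}){2,})}{
    my $run   = $1;
    my $bytes = join '', map { chr hex } $run =~ /\\x([0-9a-fA-F]{2})/g;
    my $chars = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    defined $chars
      ? join '', map { sprintf '\\x{%04X}', ord } split //, $chars
      : $run;                              # not valid UTF-8: leave the escapes alone
  }ge;
  return $pattern;
}

print decode_byte_escapes('/\xef\xbb\xbf/'), "\n";      # -> /\x{FEFF}/
print decode_byte_escapes('/\xd0\xbf\xd1\x80/'), "\n";  # -> /\x{043F}\x{0440}/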
I'd be grateful for advice as to whether there's merit in filing these
concerns as one or more issues on Bugzilla, or for relevant background.
CK