Subtest __E_LIKE_LETTER and __LOWER_E listed many times in message header

Discussion:

Henrik Krohns

2018-12-10 06:46:32 UTC

To make this determination, the rules require the 'multiple' flag without
a cap on the number of matches, which a 'maxhits' parameter would set.

Please don't do unlimited maxhits, it's terrible if a message accidentally or
intentionally contains thousands of e's. The eval code runs all sorts of
crap for every hit, not to mention the mass of debug lines it potentially
creates.

If I read right, isn't it enough to set __LOWER_E maxhits=21 and
__E_LIKE_LETTER maxhits=211 for the clause to evaluate as true?

body __LOWER_E /e/i
tflags __LOWER_E multiple
replace_rules __E_LIKE_LETTER
body __E_LIKE_LETTER /<E>/
tflags __E_LIKE_LETTER multiple
meta MIXED_ES ( __LOWER_E > 20 ) && ( __E_LIKE_LETTER > ( (__LOWER_E * 14 ) / 10) ) && ( __E_LIKE_LETTER > ( 10 * __LOWER_E ) )
describe MIXED_ES Too many es are not es

Henrik K

2018-12-10 06:56:10 UTC

Also consider limiting __HAS_IMG_SRC, __HAS_HREF, __HAS_IMG_SRC_ONECASE,
__HAS_HREF_ONECASE

I would use non-greedy .*? in all those also

/^[^>].*<img src=/i
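
The difference between the greedy and non-greedy forms can be illustrated with a small Python sketch. This uses Python's `re` module rather than the Perl regex engine SpamAssassin runs on, and the sample line is made up, so it only shows the matching behaviour, not any performance effect inside SpamAssassin. Both patterns report a hit on the same input; the non-greedy one stops at the first `<img src=` instead of running to the end of the line and backtracking to the last one.

```python
import re

# Greedy form as quoted in the thread, and the suggested non-greedy variant.
greedy = re.compile(r"^[^>].*<img src=", re.IGNORECASE)
lazy = re.compile(r"^[^>].*?<img src=", re.IGNORECASE)

# Hypothetical sample line with two image tags.
line = 'x <img src="a.png"> text <IMG SRC="b.png"> more'

g = greedy.search(line)
l = lazy.search(line)

# Both patterns match, so a rule that only tests for a hit behaves the same.
print(bool(g), bool(l))

# The greedy match extends to the last "<img src=", the lazy one to the first.
print(g.end(), l.end())
```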

Bill Cole

2018-12-10 16:44:25 UTC

Post by Henrik K

Also consider limiting __HAS_IMG_SRC, __HAS_HREF,
__HAS_IMG_SRC_ONECASE,
__HAS_HREF_ONECASE

Done.

Post by Henrik K
I would use non-greedy .*? in all those also
/^[^>].*<img src=/i

Done.

Thanks for the input!

--
Bill Cole
***@scconsult.com or ***@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Available For Hire: https://linkedin.com/in/billcole

Bill Cole

2018-12-10 16:27:29 UTC

Post by Henrik Krohns

To make this determination, the rules require the 'multiple' flag without
a cap on the number of matches, which a 'maxhits' parameter would set.

I recognize this as an issue, and I'm trying to think up alternative
approaches. The ruleqa performance of this rule is puzzling.

Post by Henrik Krohns
If I read right, isn't it enough to set __LOWER_E maxhits=21 and
__E_LIKE_LETTER maxhits=211 for the clause to evaluate as true?

That would break the *correct* logic, which I just noticed was mangled
by a typo in the revision I made yesterday to evade the 'possible divide
by zero' mis-parse.

The goal is to identify messages where the ratio of all e-like
characters (__E_LIKE_LETTER ) to simple Latin 'e' characters (__LOWER_E)
is between 1.4 and 10. My reasoning for a range of ratios is that
messages of any significant size will use one script predominantly, but
perhaps not exclusively.

Consider a message with 200 U+0065 characters and 220 U+0435 characters:
__LOWER_E = 200, __E_LIKE_LETTER = 420. The ratio is 2.1, so this is a
message which would match the intended logic. However, with your
proposed maxhits limits: __LOWER_E = 21, __E_LIKE_LETTER = 211 so the
ratio is 10.05, no match.

Also consider a message with 200 U+0065 characters and 9 U+0435
characters: __LOWER_E = 200, __E_LIKE_LETTER = 209. The ratio is 1.045,
so this is a message which would NOT match the intended logic. However,
with your proposed maxhits limits: __LOWER_E = 21, __E_LIKE_LETTER =
209 so the ratio is 9.95, a match.
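
The arithmetic in the two examples above can be checked with a short Python sketch. It is not SpamAssassin code: `capped`, `mixed_es`, and `evaluate` are hypothetical helper names, and the cap is modelled simply as `min(count, maxhits)`, which is an assumption about how a 'multiple' rule with 'maxhits' reports its count. The ratio test is written in integer form (1.4 < ratio < 10) to follow the intended logic.

```python
def capped(count, maxhits):
    """Hit count as reported when matching stops at maxhits (assumed model)."""
    return count if maxhits is None else min(count, maxhits)


def mixed_es(lower_e, e_like):
    """Intended logic: more than 20 plain e's and 1.4 < e_like/lower_e < 10."""
    return (
        lower_e > 20
        and 10 * e_like > 14 * lower_e   # ratio above 1.4
        and e_like < 10 * lower_e        # ratio below 10
    )


def evaluate(latin_e, other_e, max_lower=None, max_like=None):
    """Return (__LOWER_E, __E_LIKE_LETTER, match) for the given character counts."""
    lower_e = capped(latin_e, max_lower)
    # __E_LIKE_LETTER counts every e-like character, plain Latin e included.
    e_like = capped(latin_e + other_e, max_like)
    return lower_e, e_like, mixed_es(lower_e, e_like)


# 200 x U+0065 plus 220 x U+0435, without and with the proposed caps.
print(evaluate(200, 220))
print(evaluate(200, 220, max_lower=21, max_like=211))

# 200 x U+0065 plus 9 x U+0435, without and with the proposed caps.
print(evaluate(200, 9))
print(evaluate(200, 9, max_lower=21, max_like=211))
```

In both cases the capped counts flip the result relative to the uncapped counts, which is the problem described above.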

Finding a fine-tuned pair of maxhits values is hard, particularly since
I don't have a good corpus of the target spam or of ham that
*apparently* (according to ruleqa stats) is being matched by the current
rule in some corpora. I've set maxhits at 250 and 400 for now on the
principle that the spam I'm really targeting has fewer than half that
many matches.

Post by Henrik Krohns
body __LOWER_E /e/i
tflags __LOWER_E multiple
replace_rules __E_LIKE_LETTER
body __E_LIKE_LETTER /<E>/
tflags __E_LIKE_LETTER multiple
meta MIXED_ES ( __LOWER_E > 20 ) && ( __E_LIKE_LETTER > ( (__LOWER_E * 14 ) / 10) ) && ( __E_LIKE_LETTER > ( 10 * __LOWER_E ) )

This is now fixed:

meta MIXED_ES ( __LOWER_E > 20 ) && ( __E_LIKE_LETTER > ( (__LOWER_E * 14 ) / 10) ) && ( __E_LIKE_LETTER < ( 10 * __LOWER_E ) )
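
To make the effect of the one-character fix concrete, here is a minimal Python sketch that transcribes both versions of the meta expression as plain functions. The function names and the sample counts are illustrative assumptions, not part of the rule set; the counts stand for (__LOWER_E, __E_LIKE_LETTER).

```python
def mixed_es_typo(lower_e, e_like):
    """Expression with the typo: both comparisons use '>'."""
    return lower_e > 20 and e_like > (lower_e * 14) / 10 and e_like > 10 * lower_e


def mixed_es_fixed(lower_e, e_like):
    """Corrected expression: the last comparison is '<', giving 1.4 < ratio < 10."""
    return lower_e > 20 and e_like > (lower_e * 14) / 10 and e_like < 10 * lower_e


# Illustrative (lower_e, e_like) pairs: ratio 2.1, ratio about 14.3, ratio 1.045.
samples = [(200, 420), (21, 300), (200, 209)]

for lower_e, e_like in samples:
    print(lower_e, e_like, mixed_es_typo(lower_e, e_like), mixed_es_fixed(lower_e, e_like))
```

With the typo, only ratios above 10 can match, so the mixed-script case with ratio 2.1 is missed; the corrected form matches it and rejects the ratio above 10.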
