N-AI Spam#
The 2022 Baliza challenge (by Bitdefender) was all about creating a C (or Python) program that could detect spam emails from the 2000s based on their title and content. My algorithm won 1st place, with an F-score of \(F_1=95.63\%\).
The programs were NOT allowed to use any kind of AI, so they were limited by the quality of the chosen heuristics. Moreover, the participants were given 150 emails to test their algorithms on, but these were different from the emails used in the final evaluation. We were not given any hints or guides, so every implementation detail was up to us.
The project's name, N-AI Spam, is a play on words. It doesn't just stand for Non-AI Spam (detector); read in Romanian, it also means "you don't have spam".
How does it work?#
Based on what I knew, I came up with a scoring system in which an email's classification changes once its score passes certain thresholds:
-
\(\text{score}\lt 35\): the email is definitely HAM (DHAM), a.k.a. non-spam;
-
\(35\leq\text{score}\lt 42\): the email is probably HAM (PHAM);
-
\(42\leq\text{score}\lt 50\): the email is a (very) suspicious HAM (SHAM);
-
\(50\leq\text{score}\): the email is SPAM (SPAM).
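The thresholds above can be sketched as a small C function. This is only an illustration: the names `verdict_t`, `classify` and `verdict_name`, as well as the use of `double` for the score, are assumptions and are not taken from the actual source code.

```c
/* Labels from the list above; the type and function names are hypothetical. */
typedef enum { DHAM, PHAM, SHAM, SPAM } verdict_t;

/* Map a score to a verdict using the thresholds 35, 42 and 50.
 * Storing the score as a double is an assumption. */
verdict_t classify(double score)
{
    if (score < 35.0) return DHAM;  /* definitely HAM */
    if (score < 42.0) return PHAM;  /* probably HAM */
    if (score < 50.0) return SHAM;  /* (very) suspicious HAM */
    return SPAM;
}

/* Human-readable label for a verdict. */
const char *verdict_name(verdict_t v)
{
    static const char *const names[] = { "DHAM", "PHAM", "SHAM", "SPAM" };
    return names[v];
}
```

Each lower bound is inclusive, so a score of exactly 42 is already a SHAM, and a score of exactly 50 is SPAM.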
Heuristics used#
Some of the heuristics used include:
-
Punctuation could indicate a SPAM email - for example, dollar signs could indicate money transfers requested by potentially malicious users, while exclamation marks could indicate urgency. A score is determined based on the number of such characters relative to the number of words counted in the email;
-
Uppercase letters are usually used in excess or not at all when it comes to SPAM email. The score is computed with the formula I found best suited for this. In the end, everything is converted relative to the total count of non-space characters;
-
Consonants rarely appear naturally in groups of 4 or more, so such a group could raise a red flag. Obviously, links should be ignored;
-
Keywords that trigger a SPAM are very important - they include words such as 'money', 'purchase', 'deposit', 'diamond', 'risk', 'bank' etc. However, there are some words that could lower this score, including 'forwarded', 'newsletter' and 'yahoo' (it was big back then). If the same word occurs multiple times, say n times in the same email, it should be counted only about \(\sqrt{n}\) times (my formula is a little bit more complicated, but for small values this works too);
-
Known spam email addresses should also be taken into account.
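A few of the heuristics above could be measured roughly as follows. This is a minimal sketch and not the code from the repository: the function names are hypothetical, and so are the choices to count only `$` and `!` as suspicious punctuation, to treat `y` as a vowel, to recognise links by `www.` or `://`, and to match keywords as case-insensitive substrings.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Number of '$' and '!' characters per whitespace-separated word. */
double punctuation_ratio(const char *text)
{
    size_t marks = 0, words = 0;
    int in_word = 0;
    for (const char *p = text; *p; ++p) {
        unsigned char c = (unsigned char)*p;
        if (c == '$' || c == '!') ++marks;
        if (isspace(c)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            ++words;
        }
    }
    return words ? (double)marks / (double)words : 0.0;
}

/* Crude link test (assumption): token starts with "www." or contains "://". */
static int looks_like_link(const char *tok, size_t len)
{
    if (len >= 4 && strncmp(tok, "www.", 4) == 0) return 1;
    for (size_t i = 0; i + 3 <= len; ++i)
        if (strncmp(tok + i, "://", 3) == 0) return 1;
    return 0;
}

/* ASCII letters other than a, e, i, o, u, y count as consonants. */
static int is_consonant(unsigned char c)
{
    return isalpha(c) && strchr("aeiouyAEIOUY", c) == NULL;
}

/* Count runs of 4 or more consecutive consonants, skipping links. */
size_t consonant_runs(const char *text)
{
    size_t runs = 0;
    const char *p = text;
    while (*p) {
        while (*p && isspace((unsigned char)*p)) ++p;   /* skip spaces */
        const char *start = p;
        while (*p && !isspace((unsigned char)*p)) ++p;  /* find token end */
        size_t len = (size_t)(p - start);
        if (len == 0 || looks_like_link(start, len)) continue;
        size_t run = 0;
        for (size_t i = 0; i < len; ++i) {
            run = is_consonant((unsigned char)start[i]) ? run + 1 : 0;
            if (run == 4) ++runs;  /* each long run is counted once */
        }
    }
    return runs;
}

/* Case-insensitive count of (possibly overlapping) keyword occurrences. */
size_t count_keyword(const char *text, const char *keyword)
{
    size_t klen = strlen(keyword), count = 0;
    if (klen == 0) return 0;
    for (const char *p = text; *p; ++p) {
        size_t i = 0;
        while (i < klen && p[i] &&
               tolower((unsigned char)p[i]) == tolower((unsigned char)keyword[i]))
            ++i;
        if (i == klen) ++count;
    }
    return count;
}
```

The raw result of `count_keyword` would then be damped before it contributes to the score, for instance with `sqrt` from `<math.h>`, so that a keyword appearing 9 times weighs about as much as 3 occurrences. Note that substring matching also counts 'bank' inside 'banking'; whether that is desirable depends on the keyword list.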
There are some other metrics that were used, but all in all, this is the algorithm. In the end, it also checks whether duplicate emails were input, since these are most likely SPAM. If an email is considered a PHAM or a SHAM, its score is reevaluated to make sure it isn't actually SPAM.
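The duplicate check could look something like the sketch below. It assumes that the email bodies are already loaded as NUL-terminated strings and that "duplicate" means byte-for-byte identical. Flagging every copy (rather than only the repeats) is also an assumption, and `mark_duplicates` is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Set is_dup[i] to 1 for every email whose body is identical to another
 * email's body, and to 0 otherwise. Returns the number of flagged emails.
 * The quadratic comparison is fine for a few hundred emails. */
size_t mark_duplicates(const char *const bodies[], size_t n, int is_dup[])
{
    size_t flagged = 0;
    for (size_t i = 0; i < n; ++i)
        is_dup[i] = 0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j)
            if (strcmp(bodies[i], bodies[j]) == 0)
                is_dup[i] = is_dup[j] = 1;
    for (size_t i = 0; i < n; ++i)
        if (is_dup[i]) ++flagged;
    return flagged;
}
```

For larger inputs, hashing each body first and comparing only emails with equal hashes would avoid most of the string comparisons.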
Final notes#
The algorithm was written in C, and it was also tested using some custom-made tests. The code is very well documented.
The original challenge also contained another task, but that one was trivial and was only used to separate those who wanted to win an award from those who weren't all that interested.
Source code#
For more info, please check out the open-source code repository on GitHub. The code is made available under the MIT license.
I value keeping my code open‑source. However, it's disheartening whenever I find that someone has copied my work without giving me proper credit. All I ask of you is to not claim my effort as your own.