Classifying spam with generalized additive neural networks
Abstract
E-mail is an important and convenient communication tool used by many people on a daily basis. For
individuals it is an inexpensive way to stay in contact with family and friends located around the
world. An e-mail address serves as an online identity when signing up for different online services
like social media (Facebook) and social networking (LinkedIn). Companies use e-mails to facilitate
communication between employees and to communicate with their clients by sending information such
as newsletters, invoice statements and promotional content. E-mails are also used for core business
marketing. Unfortunately, some of the benefits of e-mail, such as the ability to send mass
e-mails with little effort and at minimal cost to the sender, are abused by e-mail users known as
spammers. A spammer's incentive for sending unsolicited e-mails in large quantities to an indiscriminate set of recipients is mostly revenue generation. Most spam messages contain content promoting products and services, which might be a scam or a phishing attempt to steal sensitive user information such as banking details and passwords. Currently, more than 55.00% of all e-mail network traffic consists of unsolicited spam e-mails, which clutter users' inboxes. Traditional spam-filtering approaches have thus far been unsuccessful in solving the spam problem. This is partly because spammers regularly generate new spam message content, which makes it difficult for spam filters to classify spam according to a fixed pattern.
The main purpose of this study is to determine the feasibility of employing a generalized additive neural network (GANN), built with a specific automated construction algorithm, to filter spam e-mail messages.
The GANN is a relatively new supervised machine learning technique capable of recognising complex
patterns in data and able to adapt to changes over time. The use of GANN models is suggested for
classification problems where it might be important to understand the relationship between input
attributes and the expected target value. In this study the definition of spam, consequences of
unmanaged spam and current spam-filtering techniques are investigated. The current state of the
spam problem is summarised followed by a discussion on artificial neural networks that have pattern
recognition capabilities. Literature related to the GANN is reviewed with a discussion on both the
interactive and automated construction methodologies for the GANN. The latter is considered as
a possible spam filter to mitigate the spam problem. A number of spam-filtering experiments
are conducted on five publicly available spam corpora (Enron, GenSpam, PU1, SpamAssassin and
TREC2005), each with different pre-processing techniques and evaluation measures. The Bagging and
Boosting ensemble techniques, which may improve on the GANN's results, are also considered. The
GANN and the ensembles are then compared to other spam-filtering techniques applied to the five corpora
before being compared to each other. Results show that the GANN is a feasible spam filter able to
mitigate spam e-mails. It compares well to other spam-filtering techniques found in the literature. In
addition, both ensemble methods improve on the GANN's results in most cases.
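As a hedged illustration of why a GANN suits problems where the relationship between each input attribute and the target matters, the additive form usually associated with this architecture can be sketched as follows. The notation is an assumption introduced here and is not taken from the abstract: $g$ is a link function (the logit is a common choice for a spam/non-spam target), $\beta_0$ an intercept, $f_j$ a univariate function of input $x_j$ modelled by a small network with $H_j$ hidden nodes, and $w$ the network weights.

```latex
% Additive model: one univariate function per input attribute
g\bigl(\mathrm{E}[y \mid x_1, \dots, x_k]\bigr)
  = \beta_0 + \sum_{j=1}^{k} f_j(x_j)

% Each f_j as a small network: a linear (skip) term plus H_j tanh hidden nodes
f_j(x_j) = w_{0j}\, x_j
  + \sum_{h=1}^{H_j} w_{hj} \tanh\bigl(a_{hj} + b_{hj}\, x_j\bigr)
```

Under this sketch each $f_j$ depends on a single input only, so the estimated effect of one attribute on the predicted target can be inspected on its own, which is the interpretability property the abstract refers to.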