Semi-supervised segmentation within a predictive modelling context
Publisher
North-West University (South Africa), Potchefstroom Campus
Abstract
Industry standards and best practices on robust model development have been refined
over many years. Even though many software tools are available to simplify the process
today, developing a practically implementable model for long-term use still involves
substantial human input. Consequently, any methodology that improves model
accuracy or increases the efficiency with which models can be developed is
welcomed by all involved.
Segmentation of the data that are used for predictive modelling is a well-established
practice in the industry. Segmentation of subjects (i.e. observations or customers) is
defined in this study as partitioning of the subjects into distinct groups, or subsets,
with the aim of developing predictive models on each of the groups separately. The
focus of our study will be on broadening the available techniques that can be used
for statistical segmentation. Currently two main streams of statistical segmentation
exist in the industry, namely unsupervised and supervised segmentation. Both these
streams make intuitive sense, depending on the application and the requirements of the
models developed, and many examples exist where the use of either technique improved
model performance. However, each stream focuses on a single aspect (target separation
in the supervised case, explanatory variable distribution in the unsupervised case), and
combining both aspects might deliver better results in some instances.
The primary objective of this research is to develop and define a semi-supervised segmentation
algorithm as an alternative to the segmentation algorithms currently in use.
This algorithm should allow the user, when segmenting for predictive modelling, to
optimise both aspects simultaneously during the segmentation exercise, rather than
considering only the explanatory variables (as is the case with unsupervised techniques
such as clustering) or only the target variable (as is the case with supervised techniques
such as decision trees).
Once we have defined the semi-supervised segmentation algorithm that is based on
standard k-means clustering, we comprehensively analyse it by applying it in several
different ways. We illustrate visually how the algorithm differs from standard k-means
clustering and how it is able to overcome some of the known weaknesses of k-means
clustering.
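The idea of letting a k-means-style procedure weigh the target alongside the explanatory variables can be illustrated with a minimal sketch. This is not the algorithm defined in the thesis: it assumes, purely for illustration, that the target is appended to the explanatory variables as an extra scaled column before running ordinary Lloyd iterations. The function name `semi_supervised_kmeans` and the parameter `weight` are hypothetical.

```python
import numpy as np

def semi_supervised_kmeans(X, y, k, weight, n_iter=50, seed=0):
    """Illustrative only: Lloyd-style k-means on [X, sqrt(weight) * y].

    Assumption (not from the thesis): the squared distance to a centre is
    the squared distance in X plus `weight` times the squared difference
    in the target. weight = 0 reduces to standard k-means on X; larger
    values put more emphasis on target separation.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    Z = np.hstack([X, np.sqrt(weight) * y])

    # Initialise the centres with k distinct observations.
    centres = Z[rng.choice(len(Z), size=k, replace=False)]

    for _ in range(n_iter):
        # Squared Euclidean distance of every observation to every centre.
        dist = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each centre; keep the old one if a segment is empty.
        new_centres = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels

# Toy usage: four observations, one explanatory variable, binary target.
X = np.array([[0.0], [0.1], [10.0], [10.1]])
y = np.array([0, 0, 1, 1])
segments = semi_supervised_kmeans(X, y, k=2, weight=1.0)
```

In this sketch a separate predictive model would then be fitted on each resulting segment. The single `weight` parameter is one simple way to trade off the two aspects; other formulations are possible.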
We apply the algorithm to actual data sets from various industries and compare the
results to results of other known segmentation algorithms on the same data sets. A
number of popular non-linear modelling techniques are also applied to the data sets to
compare the accuracy of those techniques to the accuracy obtained with the various
segmentation techniques. Simulated data serve to identify a few key data set characteristics
that may cause one segmentation technique to outperform another. In addition,
we define data set characteristics that suit the semi-supervised segmentation technique
best. Finally, we propose two alternative semi-supervised segmentation techniques and
measure how these techniques perform on the industry data sets already analysed.
Furthermore, we augment a supervised clustering technique found in the literature and
compare its results to all other results obtained.
Description
PhD (Risk Analysis), North-West University, Potchefstroom Campus, 2017
