NWU Institutional Repository

Influence of input matrix representation on topic modelling performance

dc.contributor.author: de Waal, Alta
dc.contributor.author: Barnard, Etienne
dc.date.accessioned: 2018-03-07T10:12:25Z
dc.date.available: 2018-03-07T10:12:25Z
dc.date.issued: 2010
dc.description.abstract: Topic models explain a collection of documents with a small set of distributions over terms. These distributions over terms define the topics. Topic models ignore the structure of documents and use a bag-of-words approach which relies solely on the frequency of words in the corpus. We challenge the bag-of-words assumption and propose a method to structure single words into concepts. In this way, the inherent meaning of the feature space is enriched by more descriptive concepts rather than single words. We turn to the field of natural language processing to find processes to structure words into concepts. In order to compare the performance of structured features with the bag-of-words approach, we sketch an evaluation framework that accommodates different feature dimension sizes. This is in contrast with existing methods such as perplexity, which depend on the size of the vocabulary modelled and can therefore not be used to compare models which use different input feature sets. We use a stability-based validation index to measure a model’s ability to replicate similar solutions of independent data sets generated from the same probabilistic source. Stability-based validation acts more consistently across feature dimensions than perplexity or information-theoretic measures. [en_US]
dc.description.sponsorship: Human Language Technology Competence Area, CSIR, Meraka Institute, Pretoria, South Africa; Multilingual Speech Technologies Group, North-West University, Vanderbijlpark, South Africa [en_US]
dc.identifier.citation: Alta De Waal and Etienne Barnard, “Influence of input matrix representation on topic modelling performance”, in Proc. Annual Symp. Pattern Recognition Association of South Africa (PRASA), pp. 69-74, Stellenbosch, South Africa, 2010. [http://engineering.nwu.ac.za/multilingual-speech-technologies-must/publications] [en_US]
dc.identifier.uri: https://researchspace.csir.co.za/dspace/bitstream/handle/10204/4712/de%20Waal_2010.pdf?sequence=1&isAllowed=y
dc.identifier.uri: http://hdl.handle.net/10394/26554
dc.language.iso: en [en_US]
dc.publisher: Pattern Recognition Association of South Africa and Mechatronics International Conference [en_US]
dc.subject: Input Matrix Representation [en_US]
dc.subject: Topic Modelling Performance [en_US]
dc.subject: bag-of-words approach [en_US]
dc.title: Influence of input matrix representation on topic modelling performance [en_US]
dc.type: Presentation [en_US]
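
The abstract notes that perplexity depends on the size of the vocabulary modelled. A minimal sketch of why, assuming the standard definition of perplexity (the exponential of the mean negative log-probability per token): a model that spreads its probability uniformly over V features has perplexity exactly V, so two models with different feature sets start from different baselines. The function names and the vocabulary sizes below are hypothetical and are not taken from the paper.

```python
import math


def perplexity(token_probs):
    """Perplexity: exp of the mean negative log-probability per token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)


def uniform_perplexity(vocab_size, num_tokens=1000):
    """Perplexity of a model assigning probability 1/V to every token."""
    return perplexity([1.0 / vocab_size] * num_tokens)


# Hypothetical feature-set sizes, chosen only for illustration.
single_word_vocab = 5000   # bag-of-words features
concept_vocab = 20000      # structured "concept" features

baseline_words = uniform_perplexity(single_word_vocab)
baseline_concepts = uniform_perplexity(concept_vocab)

print(f"uniform baseline, single words: {baseline_words:.1f}")
print(f"uniform baseline, concepts:     {baseline_concepts:.1f}")
```

Because the uninformed baseline scales with the number of features, a raw perplexity difference between the two representations mixes model quality with vocabulary size, which is the motivation the abstract gives for a stability-based index instead.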

Files

Original bundle

Name: dewaal-2010-influence-input-matrix.pdf
Size: 1.01 MB
Format: Adobe Portable Document Format
Description: dewaal-2010-influence-input-matrix

License bundle

Name: license.txt
Size: 1.61 KB
Format: Item-specific license agreed upon to submission
Description: