Processing human language is a broad field with many aspects of interest. One such aspect is finding out the actual topic that a piece of written text is about. For example, given a text about a match between FC Barcelona and Manchester United, we may want to infer that the topic is football, even though the word football itself never appears in the text.
This is in fact a form of text classification and should not be confused with topic modeling. In topic modeling, we are conceptually looking for latent groups of words in the text that best describe what the text is about. Topic modeling usually needs a substantial amount of text to work with, and we cannot tell beforehand what the outcome will be, because it depends entirely on the actual text's content. In practical settings, aggregating over the results of topic modeling to find common topics is hence not straightforward.
Text classification for finding topics – in contrast to topic modeling – works with a predefined set of topics and hence allows for easy aggregation over them. A big issue with this approach, however, is that it becomes increasingly difficult as the number of classes we seek increases.
In general, for any given text classification task (topic classification being one of them), studies of inter-annotator agreement suggest that when two people are asked to pick a label for a text out of a set of two possible labels, they will disagree in roughly 30% of all cases. This holds for many classification tasks with two outcomes – for instance: is this text positive, or negative? As the number of people asked increases, the chance that they all agree unanimously goes down. As the number of possible outcomes increases, the disagreement goes up as well. Imagine going after a moderate set of 10 different topics: the resulting disagreement, and thus the error of any model trained on such labels, will be very high. An algorithmic approach cannot be expected to do much better than the people who provide its training labels.
Finding Topics – A Practical Approach
We are looking for a relatively large set of topics (say, more than 10) that we can define upfront, and we want to find out whether a text is on one of those topics or perhaps on none at all. We know that topic modeling does not allow us to specify the outcomes upfront, and that plain text classification seems doomed to give us poor accuracy. We hence need a different approach: actual topic classification.
We came up with an elegant way of doing this using word vectors. Word vectors are renowned for their ability to capture similarities between words (we will blog more on this later). We can hence use them to check whether words relating to a specific topic – though not necessarily the topic words themselves – occur in the text we are trying to classify. Using a similarity metric like cosine similarity, we can determine whether a text is similar enough to a topic to be considered to be about that topic. If no topic turns out to be similar enough, we say that the text is not on any of the topics at hand.
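The idea can be sketched in a few lines of Python. Everything below is illustrative rather than our actual implementation: the word vectors are toy three-dimensional values (in practice they would come from a pretrained model such as word2vec or fastText), and the seed words and the 0.8 threshold are assumptions made up for the example.

```python
import math

# Toy word vectors -- hypothetical values for illustration only.
# Real systems would look these up in a pretrained embedding model.
WORD_VECTORS = {
    "match":    [0.9, 0.1, 0.0],
    "goal":     [0.8, 0.2, 0.1],
    "football": [1.0, 0.1, 0.0],
    "election": [0.0, 0.9, 0.2],
    "vote":     [0.1, 1.0, 0.1],
    "politics": [0.0, 1.0, 0.2],
}

# Each topic is defined upfront by a handful of seed words (assumed lists).
TOPICS = {
    "football": ["football", "goal"],
    "politics": ["politics", "vote"],
}

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 for zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_vector(words):
    """Average the vectors of all known words; None if no word is known."""
    vecs = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def classify(text_words, threshold=0.8):
    """Return the most similar topic, or None if no topic passes the threshold."""
    text_vec = mean_vector(text_words)
    if text_vec is None:
        return None
    best_topic, best_sim = None, threshold
    for topic, seeds in TOPICS.items():
        sim = cosine(text_vec, mean_vector(seeds))
        if sim > best_sim:
            best_topic, best_sim = topic, sim
    return best_topic
```

Note that a text about a "match" with a "goal" is classified as football even though the word football never occurs in it, which is exactly the behavior described above; a text containing only unknown words falls through to None.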
Using this method, we can predefine the list of topics that we want to identify – and hence aggregate over – without suffering the deterioration that classifiers show as the number of possible outcomes grows. As an added bonus, the method needs no supervised data and can easily be applied in any language, provided we have a good set of topics and corresponding words.
We have added this functionality to our software and are now offering this as a standard feature.