As I have written before, it is not always easy to grasp the possibilities of current-day text analytics algorithms; much more can be done than we are aware of. Among the more difficult tasks of text analysis is deep parsing – a term encompassing tasks such as constituency parsing and semantic role labeling. These forms of deep parsing yield so-called parse trees, which reveal the syntactic or semantic structure of text in a hierarchical way.
Deep parsing is something that we can do reasonably well, especially nowadays with deep learning methods. The work by Collobert et al., captured in their open-source academic system SENNA, is a good example of this. There is, however, one big challenge with deep parsing.
Despite being one of the harder problems of text analysis, deep parsing is particularly useful. Having a parse tree available – be it syntactic or semantic – tells us a great deal about the structure or meaning of a text. While often not a goal on its own, this knowledge aids tremendously in solving other tasks, such as sentiment analysis or topic detection. Deep parsing is hence one of the most vital building blocks of text analysis.
The English Problem
While deep parsing is a problem that has been tackled quite well, most of the work – and this holds for many other text analysis problems as well – is highly focused on dominant languages, English in particular. One of the most accurate ways of performing deep parsing is to construct a language model on labeled data: corpora whose sentences have been annotated with parse trees by human experts. We call these corpora treebanks.
While treebank corpora exist for many languages, only a few are elaborate or specific enough to support deep parsing tasks. Apart from that, the majority of these corpora are not free, or are available for research purposes only. This makes commercial application of deep parsing in languages other than English (and a handful of other dominant languages) particularly difficult.
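To make the treebank format concrete: an entry is typically a bracketed tree in Penn Treebank style. Here is a minimal pure-Python sketch of reading such a string into a nested structure – the tags and the toy sentence are my own illustration, not taken from any particular treebank:

```python
def parse_tree(s):
    """Parse a Penn-Treebank-style bracketed string into nested lists."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()

    def helper(i):
        if tokens[i] == "(":
            label = tokens[i + 1]          # constituent label, e.g. NP
            children, i = [], i + 2
            while tokens[i] != ")":
                child, i = helper(i)       # recurse into sub-constituents
                children.append(child)
            return [label] + children, i + 1
        return tokens[i], i + 1            # a leaf word

    return helper(0)[0]

# A toy annotated sentence, as a human expert might produce it:
gold = "(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))"
print(parse_tree(gold))
# → ['S', ['NP', ['DT', 'The'], ['NN', 'cat']], ['VP', ['VBZ', 'sleeps']]]
```

A treebank is, in essence, a large collection of such bracketed strings – and producing them by hand is exactly the expensive, expert-driven work that makes these corpora scarce.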
A Structural Solution
Applying the deep learning approaches to text analysis mentioned earlier has led us to a breakthrough in deep parsing. Using deep learning, we can create language models that ‘know’ about the structure of a language, without us explicitly outlining it or handcrafting the model using expert knowledge.
We applied this mechanism to deep parsing as follows: we learn the structure of a language in which deep parsing resources are plentiful, such as English. We do the same for a specific target language in which such resources are scarce or absent, and we use the induced structure of both models to jointly learn how to perform deep parsing in the target language.
This approach to deep parsing has two major advantages.
- We do not need a manually annotated treebank in our target language to be able to perform deep parsing.
- To train a model for our target language, all we have to do is induce its structure, something that takes little time and very little human intervention.
Using this approach, we have been able to create deep parsing (constituency parsing and semantic role labeling) for Dutch, French and German in a matter of hours, without ever seeing a single manually annotated parse tree in any of these languages.
What About Performance?
Very well, you might say, but what about performance? This may be a nice story, but if the results don’t add up, no one will ever care. Evaluating the performance of these models is not always straightforward, since high-quality treebanks are not always available for these target languages. Luckily, for a couple of them they are. For German, for example, there is the Negra corpus, which is known to be of decent quality, although it is available for research use only.
We conducted a small experiment on 100 sentences of the Negra corpus and compared our induced parse trees with the Negra annotations. More than 70 of the 100 sentences yielded exactly the same parse tree with our method as in the Negra corpus. In a little over 20 other sentences, the errors made by our approach were solely due to flipping the order in which small subtrees were parsed, for example picking one noun phrase before another. The remaining errors showed more structural discrepancies, but were not completely off; they, for example, attached verbs to the wrong prepositions.
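The exact-match comparison described above can be sketched in a few lines of Python. This is my own illustration of the idea, not the evaluation code we used; the toy trees are invented for the example:

```python
def normalize(bracketed):
    """Canonicalise whitespace so trees compare on structure alone."""
    return " ".join(bracketed.replace("(", " ( ").replace(")", " ) ").split())

def exact_match_rate(predicted, gold):
    """Fraction of sentences whose predicted tree equals the gold tree."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predicted, gold))
    return hits / len(gold)

# Two toy sentences: the first tree matches the gold annotation, the second
# differs in one leaf, giving an exact-match rate of 0.5.
pred = ["(S (NP (N Test)) (VP (V ok)))", "(S (NP (N A)) (VP (V b)))"]
gold = ["(S (NP (N Test)) (VP (V ok)))", "(S (NP (N A)) (VP (V c)))"]
print(exact_match_rate(pred, gold))  # → 0.5
```

Exact match is a deliberately strict criterion; bracket-level scores such as PARSEVAL precision and recall would give partial credit to the near-miss trees described above.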
Of course, further evaluation and investigation are required before drawing more definitive conclusions, but given these promising results, we are already incorporating unsupervised deep parsing into our advanced text analysis solutions at UnderstandLing, combining it for example with Recursive Auto-Encoders to perform state-of-the-art text classification tasks such as topic detection or sentiment analysis in a rapidly growing number of European languages.
I can imagine you are interested in seeing some of this work in action. Without spoiling too much of its mystery, let me give you a single example of a parse tree induced for the Dutch sentence ‘Dit is een test in het Nederlands.’ (English: This is a test in Dutch) – in Penn Treebank notation: