If you ask me, one of the most compelling fields in language processing is authorship profiling. In this field, we try to derive specific information about an (unknown) author of a text based solely on what he or she writes, using a principle called stylometry. The idea is that what you write and how you write it define you as a person semi-uniquely. This is great for us because it means we can learn a lot about a person just by observing that person's writing.
In our approach to authorship profiling, we are interested in deriving a number of demographic traits of an author in addition to creating a psychometric profile of that author. For the demographic traits, we started off with learning gender, age and education level from texts written by users of social media platforms such as Twitter and Facebook.
For the psychometric part, things become a bit more complicated. A popular approach in the literature is to describe an author's personality using the BIG5 personality traits. Israeli professor Moshe Koppel, now with IBM’s Watson team, has long been a pioneer in this field and has published more than one fundamental paper on the topic. The results of his studies, however, while significantly better than the baseline, never go beyond 60-65% accuracy. Such an increase over the baseline may seem good, but as explained in a blog by The Privacy Foundation, the chance of getting all five traits right at once is close to 0.
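To put a rough number on that last point, here is a minimal sketch of the arithmetic. It assumes that each of the five traits is predicted with the same accuracy and that the errors are independent of each other; both are simplifying assumptions of ours, not something the cited studies state. The function name `all_traits_correct` is made up for this illustration.

```python
def all_traits_correct(per_trait_accuracy: float, n_traits: int = 5) -> float:
    """Chance that every trait is predicted correctly at the same time.

    Assumes equal accuracy per trait and independent errors (a simplification).
    """
    return per_trait_accuracy ** n_traits


# The 60-65% accuracy range mentioned above, applied to all five traits.
for accuracy in (0.60, 0.65):
    joint = all_traits_correct(accuracy)
    print(f"{accuracy:.0%} per trait -> {joint:.1%} for all five traits")
```

Under these assumptions the chance of a fully correct five-trait profile comes out at roughly 8% to 12%, so a respectable per-trait score still yields a complete personality profile that is wrong far more often than it is right.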
Moving Away from BIG5
We think that one of the issues with deriving BIG5 traits from text is that these traits are just not expressed well in text. In the anonymity of the internet, introverted people can write in the most extroverted way. On top of that, the BIG5 traits are precisely that: personality traits. They are fixed for pretty much your entire life. This means that temporal aspects of the world around us are not captured at all within these traits.
A great deal of psychological research describes a person’s psychological state at different time scales. The BIG5 personality traits are pretty much fixed throughout your life. Temperament is a measure of psychological state that changes faster. Emotion, or mood, changes faster still. In fast-changing settings such as the online world, we think it makes more sense to measure the latter two.
Of course, measuring mood and temperament is also easier, because those are actually expressed in text. In fact, we already measure them quite extensively through our emotion detection algorithms. The nice thing about measuring psychological state like this is that we can also analyze people’s states over time.
Putting it all Together
Now that we have the ingredients to extract demographic traits and psychological state from text alone, we can combine them to create profiles of authors without knowing anything about these authors besides their writing. We combine gender, age and education level with a temperamental and emotional portrait, based on, for example, just a couple of tweets.
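As an illustration of what such a combination could look like, here is a minimal sketch. It is not our production code: the `AuthorProfile` structure, the `build_profile` function and the majority-vote aggregation over texts are all assumptions made for this example, and the classifiers are simply any functions that map one text to one label.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: a classifier maps one text to one label.
Classifier = Callable[[str], str]


@dataclass
class AuthorProfile:
    demographics: Dict[str, str]   # e.g. gender, age, education level
    psychological: Dict[str, str]  # e.g. temperament, emotion


def majority_label(classify: Classifier, texts: List[str]) -> str:
    """Classify every text and return the most frequent label."""
    if not texts:
        raise ValueError("at least one text is needed to profile an author")
    return Counter(classify(text) for text in texts).most_common(1)[0][0]


def build_profile(
    texts: List[str],
    demographic_models: Dict[str, Classifier],
    psychological_models: Dict[str, Classifier],
) -> AuthorProfile:
    """Run every model over all texts and collect one label per trait."""
    return AuthorProfile(
        demographics={
            trait: majority_label(model, texts)
            for trait, model in demographic_models.items()
        },
        psychological={
            trait: majority_label(model, texts)
            for trait, model in psychological_models.items()
        },
    )


# Toy stand-ins for real classifiers, only to show the call shape.
tweets = ["Great match today!", "So happy with this!", "Back to work."]
profile = build_profile(
    tweets,
    demographic_models={"age": lambda text: "adult"},
    psychological_models={
        "emotion": lambda text: "joy" if "!" in text else "neutral"
    },
)
print(profile)
```

Keeping the demographic and psychological parts in separate fields reflects the distinction made above: the first is expected to stay stable, while the second can be recomputed per time window to follow an author's state over time.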
This powerful method of authorship profiling is now available in CEMistry, but we also offer it as a stand-alone service because we think the potential is big enough on its own. Combine this with our algorithm that determines online influence and you can profile any social media user more accurately than ever before.
To close, you might be interested to know how accurately we can do this. Of course, this differs vastly across languages and the sources the texts come from, but we managed to get accuracies that far exceed the currently published state of the art on all demographics. For example, when predicting a Twitter user’s gender, we obtained an accuracy close to 90%. Emotion detection we do with human-level accuracy.