
I’ve been using Twitter as a data source for content generation and language classification because of the ubiquity of hashtags. For those of you not in the know, hashtags are the weird #something bits you see all over Twitter. In some ways, hashtags exist for the exact purpose of classifying language: they add an extra level of context to short-form writing. Twitter makes it very easy to search and mine hashtags with its occasionally moody, but generally effective, search API.
There are, as with any mining method, a few major caveats. In my opinion, the first and foremost is one that affects all language processing and mining projects: colloquialisms and regional variations in the meaning of words. This issue shows up when you search for something like #sad. A tweet like “My girl left me #sad” would fit most people’s classification of the word sad when it is intended to mean unhappy or sorrowful. However, “That cab driver just picked his nose and wiped it on his dashboard! #sad” wouldn’t fall into the same classification as the previous tweet, even though it still uses a commonly accepted meaning of the word.
One way to get around this issue is to first train your classification system with a little hand-fed data and then use that slight training to bootstrap the system: let it label new tweets itself and learn from the ones it is reasonably sure about. While you can paint yourself into a bit of a self-referential corner with this sort of method, building it in from the beginning can play an important role in making your system scalable and letting it “grow more intelligent” as you add more training data.
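To make the bootstrapping idea concrete, here is a minimal sketch. It is not the system described in this post: the tiny naive Bayes classifier, the names TinyNaiveBayes and bootstrap, the label names, and the confidence threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Illustrative word-count naive Bayes classifier (an assumption, not the author's system)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training texts
        self.vocab = set()                       # every word seen during training

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def probabilities(self, text):
        """Return {label: probability}, using Laplace smoothing and ignoring unknown words."""
        words = [w for w in text.lower().split() if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            log_scores[label] = score
        # Convert log scores to normalized probabilities.
        top = max(log_scores.values())
        scaled = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(scaled.values())
        return {label: value / norm for label, value in scaled.items()}


def bootstrap(classifier, unlabeled_texts, threshold=0.75):
    """Self-train on texts the classifier is confident about; return the rest."""
    leftover = []
    for text in unlabeled_texts:
        probs = classifier.probabilities(text)
        label, confidence = max(probs.items(), key=lambda item: item[1])
        if confidence >= threshold:
            classifier.train(text, label)   # trust the guess and learn from it
        else:
            leftover.append(text)           # too uncertain: leave for hand labeling
    return leftover


# Hand-fed seed data (made-up examples).
clf = TinyNaiveBayes()
clf.train("my girl left me", "sorrow")
clf.train("i miss her so much", "sorrow")
clf.train("that cab driver was gross", "disgust")
clf.train("he picked his nose gross", "disgust")

uncertain = bootstrap(clf, ["she left me", "the weather today"])
```

The threshold is the guard against the self-referential corner: set it too low and the classifier reinforces its own mistakes, set it too high and it barely learns anything new.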
The second issue is ensuring that you are only mining the written languages your classification system is being trained to classify. Because Twitter is a global community, it is not uncommon for many languages to appear in search results. This issue is easier to overcome thanks to API-based language detection systems like the one provided by Google. You will occasionally get errors due to the non-traditional grammar frequently used on Twitter. It is also wise to strip out any @ and # tags before sending a tweet through the language detection system, since usernames and hashtags are not ordinary words in any language.
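Stripping the tags takes only a few lines. This is a minimal sketch, assuming a tag is an @ or # followed by word characters; the name strip_tags is made up for illustration, and the call to the language detection service is left out.

```python
import re

# An @ or # that starts a token (not preceded by a word character),
# followed by one or more word characters.
TAG_PATTERN = re.compile(r"(?<!\w)[@#]\w+")


def strip_tags(tweet_text):
    """Remove @mentions and #hashtags, then collapse the leftover whitespace."""
    without_tags = TAG_PATTERN.sub(" ", tweet_text)
    return " ".join(without_tags.split())


cleaned = strip_tags("@bob My girl left me #sad")
```

The lookbehind keeps the pattern from chewing up things like email addresses, where the @ sits in the middle of a token.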
After overcoming some of these initial issues, the benefit of having access to a constantly updated, semi-classified data set becomes evident. I am currently training a fork of my system using the method I’ve described, with five separate hashtags, to help assess the efficacy of the method. I run a cron job once every 15 minutes, parse the results, make sure I’m not processing the same tweet twice, and add each tweet that passes the initial tests to the system. While some queries produce considerably more tweets than others, my hope is that at least some of my training data will be complete enough to expand into other, less pre-qualified territory.
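The “not processing the same tweet twice” step is worth a quick sketch, since search results from consecutive runs overlap. This version assumes each parsed tweet is a dict with a unique "id" key and remembers seen IDs in a SQLite table; the function names and the table layout are made up for illustration.

```python
import sqlite3


def open_seen_store(path):
    """Open (or create) the store of tweet IDs that were already processed."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (tweet_id TEXT PRIMARY KEY)")
    return conn


def filter_new(conn, tweets):
    """Return only tweets not seen before, recording their IDs as seen."""
    fresh = []
    for tweet in tweets:
        cursor = conn.execute(
            "INSERT OR IGNORE INTO seen (tweet_id) VALUES (?)",
            (str(tweet["id"]),),
        )
        # rowcount is 1 when the ID was inserted, 0 when it already existed.
        if cursor.rowcount == 1:
            fresh.append(tweet)
    conn.commit()
    return fresh


# ":memory:" keeps this demo self-contained; a cron job would pass a file path
# so the seen IDs survive between runs.
store = open_seen_store(":memory:")
first_run = filter_new(store, [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}])
second_run = filter_new(store, [{"id": 2, "text": "b"}, {"id": 3, "text": "c"}])
```

Letting the primary key do the duplicate check means a tweet that shows up twice in the same batch is also only processed once.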
In a few weeks I plan on doing a bit of a subjective, qualitative test to discern the accuracy of this approach. I’ll post the results here when I’m done.