The SIPHS Project

Semantic Interpretation of Personal Health MessageS: linguistic analysis of online health vocabularies


Open online data such as microblogs and discussion board messages have the potential to be an incredibly valuable source of information about health in populations. Going beyond simple keyword search and harnessing this data for public health represents both an opportunity and a challenge to natural language processing (NLP). SIPHS aims to help health experts leverage social media for their own clinical and scientific studies through automatic techniques that encode messages according to a machine understandable semantic representation.

At the technological level SIPHS seeks to pioneer new methods for NLP and machine learning (ML). Social media remains a challenging area for NLP for a variety of reasons: short de-contextualised messages, high levels of ambiguity/out of vocabulary words, use of slang and an evolving vocabulary, as well as inherent bias towards sensational topics.

The fellowship seeks to harness the progress made so far in NLP for social media analysis in the commercial domain and develop it further to provide meaningful public health evidence. One key aspect not previously addressed is in the clinical coding of patient messages. Although knowledge brokering systems exist for clinical and scientific texts (e.g. MetaMap), their performance on social media messages has been poor. SIPHS aims to utilise the rich availability of ontological resources in biomedicine together with ML on annotated message data to disambiguate informal language. Research will also aim to understanding the communicative function of messages, for example whether the message reports direct experience or is related to news, humour or marketing. If these problems are successfully overcome an important barrier to data integration with other types of clinical data will be removed.




SIPHS is funded by an EPSRC Experienced Researcher Fellowship (EP/M005089/1).