BY MATTHEW LOWRY
The CRC’s Apostle™ program is developing tools to help search and analyse various data sources, including textual and multimedia data. Our Footprint application focusses on online activity. One of the features we have implemented is related term suggestion. With this feature, when a user flags a certain word or phrase of particular interest to them, the application suggests other words or phrases that might also be of interest. In this post, we will discuss some of the challenges we faced and our approach to rapidly standing up a baseline solution before expending effort on more advanced solutions.
The feature we intended to implement was not an information retrieval style "query expansion" feature where the application automatically decides that if a user is searching for "cats", then "cat" is also relevant (because it is the stem of the plural noun). The Footprint application is designed around the principle that the user is in control and decides what is important; the application's job is to help the user by, for example, suggesting variants or translations of words the user has flagged as being of particular interest.
There are many ways we could have implemented this feature. Naturally we were drawn to data-driven approaches like word2vec. One of the notable advantages of a trained vector embedding approach is the ability to generate domain-specific models. However, for the Footprint application this leads to a chicken-and-egg problem: we need to be able to make suggestions to a user before the system has collected data for that user and can use that data to generate models for making suggestions. So we would be forced to use prebuilt general models (such as the seminal model Google built from Google News data). Another problem is supporting multi-lingual suggestions. While it is certainly the case that vector embedding techniques like word2vec can be extended to build multi-lingual models, producing the required training data and building a comprehensive model for many languages would have been a monumental effort for the CRC.
Given these considerations, we decided we could rapidly and efficiently implement the feature in a way that would still be useful to users via a knowledge graph driven approach using an off-the-shelf openly available knowledge graph. For this we chose ConceptNet.
ConceptNet describes itself as "an open, multilingual knowledge graph", and "a freely-available semantic network, designed to help computers understand the meanings of words that people use". Its origins are in a project that began almost 20 years ago at MIT Media Lab to generate a knowledge graph of "common sense knowledge" through a crowd-sourcing effort.
Traditionally, knowledge graph construction efforts have focussed on taxonomic and ontological knowledge, for example the knowledge that "cockatoo is a bird" and "bird is a vertebrate". While ConceptNet aims to contain this type of "knowledge graph-ey" knowledge, it also aims to be a more general "semantic network" where the nodes and edges are less constrained by a formal ontology. For example, ConceptNet contains the knowledge that "listening to the radio causes knowledge" and "knowledge is capable of making a person sad".
ConceptNet is constructed by fusing together numerous sources of knowledge, including dictionary sources such as Wiktionary and Open Multilingual Wordnet. By virtue of these sources, if a user expresses an interest in the word "protest" then the application can suggest adding synonyms and related concepts like "rally", "direct action", "demonstration", etc. Of course, it's likely the user could do this by themselves, but the suggestion feature enables the user to construct a list of interesting words and phrases more quickly and reliably. Of particular interest to us is the multi-lingual nature of these sources. This gives the application the ability to suggest terms that are synonyms or related concepts in other languages. So if a user expresses an interest in the word "cockatoo" (English), ConceptNet allows an application to suggest that the corresponding terms in Arabic, Thai, or Russian may also be of interest to them.
Here's a screenshot showing this in action in the Footprint user interface:
As the user is creating a list of key words or phrases, the application automatically suggests translations and related words or phrases. However, the user remains in control of the list; they can accept or reject suggestions or request alternative suggestions.
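The multi-lingual suggestions described above depend on knowing which language each suggested term belongs to. ConceptNet identifies terms with URIs of the form /c/<language>/<term>, with underscores standing in for spaces and optional extra segments (such as a part of speech) after the term. The following is a minimal sketch, not the Footprint implementation, of how an application might group suggested terms by language; the function names parse_term_uri and group_by_language are hypothetical.

```python
from collections import defaultdict


def parse_term_uri(uri):
    """Split a ConceptNet term URI such as '/c/en/direct_action/n'
    into a (language, text) pair. Hypothetical helper for illustration."""
    parts = uri.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "c":
        raise ValueError(f"not a ConceptNet term URI: {uri!r}")
    language, term = parts[1], parts[2]
    # ConceptNet URIs use underscores where the phrase has spaces.
    return language, term.replace("_", " ")


def group_by_language(term_uris):
    """Group suggested terms by language code, dropping duplicates
    (for example the same word listed with and without a part of speech)."""
    grouped = defaultdict(list)
    for uri in term_uris:
        language, text = parse_term_uri(uri)
        if text not in grouped[language]:
            grouped[language].append(text)
    return dict(grouped)
```

Grouping by language code lets a user interface present translations separately from same-language suggestions, so the user can accept or reject each on its own terms.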
Deployment and Integration
One of the advantages of ConceptNet for us is the way it is distributed. At its core, ConceptNet is a collection of raw data files containing assertions, and tools for building an indexed relational database so the assertions can be queried. You can integrate ConceptNet into an application at this level, for example by loading a subset of ConceptNet via an embedded database directly into your application. However, ConceptNet is also distributed as a pre-built Docker container that provides a RESTful API.
Here at the D2D CRC we are Docker fans: we use it for a range of tasks including application deployment and integration testing. We already had numerous devops tools (e.g. Ansible playbooks) to help us rapidly stand up Docker containers on our hosts. Further, we had already adopted the approach of integrating analysis into the Footprint application via RESTful APIs where that made sense. So we found the prebuilt Docker container to be a quick and effective route for deployment and integration.
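To give a feel for what integrating via the RESTful API looks like, here is a minimal sketch of a client that asks a ConceptNet instance for the edges touching a term. It uses the API's /query endpoint with its node and limit parameters and reads the "edges" list from the JSON response. The base URL is a placeholder for wherever your container is published, the function names are hypothetical, and the text normalisation is a simplification of what ConceptNet itself does.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: placeholder address; substitute the host and port
# where your ConceptNet container is actually published.
BASE_URL = "http://localhost:8084"


def term_uri(text, language="en"):
    """Build a ConceptNet term URI. Simplified: lowercases the text
    and replaces spaces with underscores."""
    return "/c/{}/{}".format(language, text.strip().lower().replace(" ", "_"))


def build_query_url(text, language="en", limit=50, base_url=BASE_URL):
    """Build a /query URL asking for edges that touch the given term."""
    params = urlencode({"node": term_uri(text, language), "limit": limit}, safe="/")
    return f"{base_url}/query?{params}"


def fetch_edges(text, language="en", limit=50, base_url=BASE_URL, timeout=10):
    """Call the API and return the list of edges from the JSON response."""
    url = build_query_url(text, language, limit, base_url)
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("edges", [])
```

With a container running, fetch_edges("protest") would return the raw edges for further filtering; keeping URL construction in its own function makes that part easy to test without a live service.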
Everything Is A Movie or a Song
It is perhaps worth noting that like most things in this world, if it comes for free then it's probably not a "turn-key solution" giving you exactly what you want. One of the things we noticed when we added the related term suggestion feature into our development branch is that we saw "movie" and "song" being suggested surprisingly often. When we dug into it, it soon became clear why - a lot of words and short phrases have, at some point in time, been used as the title of a movie or song. And given the crowd-sourced nature of much of the knowledge in ConceptNet, it's not surprising that it is full of knowledge about movie and song titles.
Therefore, we found that it was necessary to put some effort into adding constraints and filters to the queries we made to ConceptNet. For example, we prioritise the suggestions shown to the user by the weight of the edge that connects the user's word or phrase to the suggestion. In ConceptNet edges are weighted according to how many sources assert the connection is true. As another example, we prioritise suggestions that are connected by certain types of edge, such as IsA edges (things that are specific types or instances of a general word or phrase the user entered) or PartOf edges (things that are parts or components of a thing the user entered).
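The kind of prioritisation described above can be sketched as follows. This is an illustration under stated assumptions rather than the Footprint code: the boost factor, the minimum weight, and the blocked labels are hypothetical values chosen for the example, and the edges are assumed to be dictionaries in the shape of the ConceptNet API's JSON (with "start", "end", "rel" and "weight" fields).

```python
# Assumptions for illustration only: these values are not from Footprint.
RELATION_BOOST = {"/r/IsA": 2.0, "/r/PartOf": 2.0}
BLOCKED_LABELS = {"movie", "song"}


def same_term(uri, query_uri):
    """True if uri is the query term, with or without extra segments
    such as a part of speech ('/c/en/protest/n')."""
    return uri == query_uri or uri.startswith(query_uri + "/")


def rank_suggestions(edges, query_uri, min_weight=1.0, max_results=10):
    """Return suggestion labels ordered by boosted edge weight."""
    scored = {}
    for edge in edges:
        start, end = edge["start"], edge["end"]
        # The suggestion is whichever endpoint is not the user's term.
        other = end if same_term(start["@id"], query_uri) else start
        if same_term(other["@id"], query_uri):
            continue  # skip edges linking the term to itself
        label = other["label"]
        if label.lower() in BLOCKED_LABELS or edge["weight"] < min_weight:
            continue
        score = edge["weight"] * RELATION_BOOST.get(edge["rel"]["@id"], 1.0)
        # Keep the best score when several edges point at the same label.
        scored[label] = max(score, scored.get(label, 0.0))
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scored, key=lambda lbl: (-scored[lbl], lbl))[:max_results]
```

Multiplying the weight by a per-relation boost is only one way to combine the two signals; sorting first by relation type and then by weight would be an equally reasonable design.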
ConceptNet gave us a rapid and efficient path to enabling our application to take the words and phrases a user expresses interest in and suggest related words and phrases, in both the language of the original word and other languages. Although the result is far from perfect, we were able to rapidly deploy and integrate a baseline feature before expending the development and integration effort required for a better solution.