From Text to Knowledge: The Information Extraction Pipeline
I am thrilled to present my latest project I have been working on. In this blog post, I will present my implementation of an information extraction data pipeline, following my passion for combining natural language processing and knowledge graphs. Later on, I will also explain why I see the combination of NLP and graphs as one of the paths to explainable AI.
If this in-depth educational content is useful for you, you can subscribe to our AI research mailing list to be alerted when we release new material.
Information extraction pipeline
What exactly is an information extraction pipeline? To put it in simple terms, information extraction is the task of extracting structured information from unstructured data such as text.
Steps in my implementation of the IE pipeline. Image by author
My implementation of the information extraction pipeline consists of four parts. In the first step, we run the input text through a coreference resolution model. The coreference resolution is the task of finding all expressions that refer to a specific entity. To put it simply, it links all the pronouns to the referred entity. Once that step is finished, it splits the text into sentences and removes the punctuations. I have noticed that the specific ML model used for named entity linking works better when we first remove the punctuations. In the named entity linking part of the pipeline, we try to extract all the mentioned entities and connect them to a target knowledge base. The target knowledge base, in this case, is Wikipedia. Named entity linking is beneficial because it also deals with entity disambiguation, which can be a big problem.
Once we have extracted the mentioned entities, the IE pipeline tries to infer relationships between entities that make sense based on the text’s context. The IE pipeline results are entities and their relationships, so it makes sense to use a graph database to store the output. I will show how to save the IE information to Neo4j .
I’ll use the following excerpt from Wikipedia to walk you through the IE pipeline.
Elon Musk is a business magnate, industrial designer, and engineer. He is the founder, CEO, CTO, and chief designer of SpaceX. He is also early investor, CEO, and product architect of Tesla, Inc. He is also the founder of The Boring Company and the co-founder of Neuralink. A centibillionaire, Musk became the richest person in the world in January 2021, with an estimated net worth of $185 billion at the time, surpassing Jeff Bezos. Musk was born to a Canadian mother and South African father and raised in Pretoria, South Africa. He briefly attended the University of Pretoria before moving to Canada aged 17 to attend Queen's University. He transferred to the University of Pennsylvania two years later, where he received dual bachelor's degrees in economics and physics. He moved to California in 1995 to attend Stanford University, but decided...