Week 4: Knowledge Extraction

Learning Objectives

This week covers techniques for extracting structured knowledge from unstructured text. You will learn about Named Entity Recognition (NER), Relation Extraction (RE), and Entity Resolution.


1. Introduction to Knowledge Extraction

Structuring the Unstructured: "Documents to Data"

Most of the world's knowledge is locked in unstructured text—PDFs, emails, news articles. A Knowledge Graph can't directly "read" a PDF.

Knowledge Extraction is the pipeline that converts free text into structured Triples.

  • Raw Text: "Apple Inc. was founded by Steve Jobs."
  • Structured Data: (Apple Inc., founded_by, Steve Jobs)

It turns "Reading" into "Database Records".

The Knowledge Extraction Pipeline

Raw Text → Preprocessing → NER → Coreference → RE → Entity Resolution → KG
Stage                 Output
Preprocessing         Tokenized, cleaned text
NER                   Entity mentions with types
Coreference           Resolved entity references
Relation Extraction   Entity-relation triples
Entity Resolution     Linked, deduplicated entities
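
Before examining each stage in detail, here is a deliberately naive, runnable preview of the whole pipeline using spaCy alone. Each section below replaces one of its simplifications (no coreference, no entity linking, a bare subject-verb-object heuristic) with a proper technique.

import spacy

# Toy preview of the pipeline; coreference and entity linking are
# deliberately omitted, and relations use a crude dependency heuristic
nlp = spacy.load("en_core_web_sm")

def text_to_triples(text):
    doc = nlp(text)  # preprocessing: tokenization, tagging, parsing
    entities = [(ent.text, ent.label_) for ent in doc.ents]  # NER
    triples = []
    for token in doc:  # naive relation extraction: subject-verb-object
        if token.dep_ == "nsubj" and token.head.pos_ == "VERB":
            for child in token.head.children:
                if child.dep_ == "dobj":
                    triples.append((token.text, token.head.lemma_, child.text))
    return entities, triples

print(text_to_triples("Steve Jobs founded Apple in 1976."))
# e.g. ([('Steve Jobs', 'PERSON'), ('Apple', 'ORG'), ('1976', 'DATE')],
#       [('Jobs', 'found', 'Apple')])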

2. Named Entity Recognition (NER)

NER identifies and classifies named entities in text into predefined categories.

Common Entity Types

Type          Examples
PERSON        John Smith, Marie Curie
ORGANIZATION  Google, United Nations
LOCATION      New York, Mount Everest
DATE          January 2024, yesterday
MONEY         $50 million, 100 euros
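
Note that the exact label set depends on the model: spaCy's English pipelines use OntoNotes labels such as ORG and GPE rather than ORGANIZATION and LOCATION. You can inspect and decode a model's labels directly:

import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.get_pipe("ner").labels)  # every label this model can predict
print(spacy.explain("GPE"))        # "Countries, cities, states"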

NER with spaCy

import spacy
 
nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. was founded by Steve Jobs in Cupertino, California."
 
doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
 
# Output:
# Apple Inc.: ORG
# Steve Jobs: PERSON
# Cupertino: GPE
# California: GPE

Fine-tuning NER Models

import spacy
from spacy.training import Example
 
# Create training data
TRAIN_DATA = [
    ("SNOMED CT is a medical terminology",
     {"entities": [(0, 9, "TERMINOLOGY")]}),
    ("The Gene Ontology describes gene functions",
     {"entities": [(4, 17, "ONTOLOGY")]}),
]
 
# Add custom entity labels
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("TERMINOLOGY")
ner.add_label("ONTOLOGY")
 
# Train the model
optimizer = nlp.initialize()  # spaCy v3; replaces the older begin_training()
for epoch in range(10):
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)
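
Once trained, the toy model runs like any other spaCy pipeline. With only two training sentences and ten epochs, its predictions are illustrative at best:

# Sanity check on the toy model (expect shaky results at this scale)
doc = nlp("SNOMED CT is used for clinical coding")
print([(ent.text, ent.label_) for ent in doc.ents])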

3. Relation Extraction

Relation Extraction identifies semantic relationships between entities.

Types of Relations

Relation Type  Example
Employment     (Steve Jobs, founded, Apple)
Location       (Apple, headquartered_in, Cupertino)
Part-Of        (California, part_of, USA)
Temporal       (Event, occurred_on, Date)

Pattern-Based Extraction

import spacy
from spacy.matcher import Matcher
 
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
 
# Pattern: PERSON founded ORG ("OP": "+" covers multi-token
# names like "Steve Jobs")
pattern = [
    {"ENT_TYPE": "PERSON", "OP": "+"},
    {"LEMMA": "found"},
    {"ENT_TYPE": "ORG", "OP": "+"}
]
matcher.add("FOUNDED_BY", [pattern], greedy="LONGEST")
 
doc = nlp("Steve Jobs founded Apple in 1976.")
matches = matcher(doc)
 
for match_id, start, end in matches:
    span = doc[start:end]
    print(f"Found relation: {span.text}")

Neural Relation Extraction

from transformers import pipeline

# transformers has no built-in "relation-extraction" task; REBEL is a
# seq2seq model, so we run it via text2text-generation and parse the
# triplet markup it generates
extractor = pipeline("text2text-generation",
                     model="Babelscape/rebel-large",
                     tokenizer="Babelscape/rebel-large")

text = "Albert Einstein developed the theory of relativity at ETH Zurich."
output = extractor(text, return_tensors=True, return_text=False)
decoded = extractor.tokenizer.batch_decode(
    [output[0]["generated_token_ids"]])[0]

# REBEL emits "<triplet> head <subj> tail <obj> relation" sequences;
# split on those markers to recover (head, relation, tail) triples
print(decoded)

4. Coreference Resolution

Coreference Resolution identifies all expressions referring to the same entity.

Example

Input:  "John bought a car. He drives it every day."
Output: "John bought a car. [John] drives [the car] every day."

Using neuralcoref

import spacy
import neuralcoref  # note: neuralcoref supports spaCy 2.x only

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)
 
text = "Marie Curie won the Nobel Prize. She was the first woman to do so."
doc = nlp(text)
 
# Get resolved text
print(doc._.coref_resolved)
# "Marie Curie won the Nobel Prize. Marie Curie was the first woman to do so."
 
# Access coreference clusters
for cluster in doc._.coref_clusters:
    print(f"Cluster: {cluster.main} -> {cluster.mentions}")

5. Entity Resolution (Entity Linking)

Entity Resolution links entity mentions to a knowledge base entry.

The Challenge

"Apple" → Apple Inc. (company) OR Apple (fruit)?
"Washington" → Washington D.C. OR George Washington OR Washington State?

Entity Linking with spaCy

import spacy

# Requires the spacy-entity-linker package, which registers the
# "entityLinker" component and links mentions to Wikidata entries
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("entityLinker", last=True)

text = "Einstein worked at Princeton University."
doc = nlp(text)

# Linked entities are exposed on the Doc, each with a Wikidata ID
for linked in doc._.linkedEntities:
    print(f"{linked.get_span()} -> Q{linked.get_id()} ({linked.get_label()})")

Custom Entity Resolution

from rapidfuzz import fuzz, process
 
# Knowledge base entries
KB = {
    "Q937": "Albert Einstein",
    "Q312": "Apple Inc.",
    "Q89": "Apple (fruit)",
    "Q61": "Washington, D.C."
}
 
def resolve_entity(mention, candidates, threshold=80):
    """Find the best-matching KB entry for a mention."""
    # Passing the dict makes rapidfuzz return (name, score, kb_id) tuples;
    # token_set_ratio lets the partial mention "Einstein" score 100
    # against "Albert Einstein" (token_sort_ratio would fall below 80)
    matches = process.extract(
        mention,
        candidates,
        scorer=fuzz.token_set_ratio,
        limit=3
    )

    name, score, kb_id = matches[0]
    if score >= threshold:
        return kb_id, name, score
    return None, None, 0
 
# Example usage
entity_id, name, score = resolve_entity("Einstein", KB)
print(f"Resolved to: {name} ({entity_id}) with score {score}")

6. Building a Knowledge Extraction Pipeline

Complete Pipeline Example

import spacy
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS
 
class KnowledgeExtractor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_lg")
        self.graph = Graph()
        self.EX = Namespace("http://example.org/")
 
    def extract(self, text):
        doc = self.nlp(text)

        # Extract entities, remembering which tokens each entity covers
        entities = {}
        token_to_entity = {}
        for ent in doc.ents:
            entity_uri = self.EX[ent.text.replace(" ", "_")]
            entities[ent.text] = entity_uri

            # Add entity to graph
            self.graph.add((entity_uri, RDF.type, self.EX[ent.label_]))
            self.graph.add((entity_uri, RDFS.label, Literal(ent.text)))

            for token in ent:
                token_to_entity[token.i] = ent.text

        # Extract relations (simplified subject-verb-object heuristic);
        # head tokens are mapped back to their full entity spans, so
        # "Electronics" resolves to "Beats Electronics"
        for token in doc:
            if token.dep_ == "nsubj" and token.head.pos_ == "VERB":
                subj = token_to_entity.get(token.i)
                pred = token.head.lemma_
                for child in token.head.children:
                    if child.dep_ == "dobj":
                        obj = token_to_entity.get(child.i)
                        if subj and obj:
                            self._add_relation(subj, pred, obj, entities)

        return self.graph
 
    def _add_relation(self, subj, pred, obj, entities):
        if subj in entities and obj in entities:
            pred_uri = self.EX[pred]
            self.graph.add((entities[subj], pred_uri, entities[obj]))
 
# Usage
extractor = KnowledgeExtractor()
text = "Apple acquired Beats Electronics for $3 billion."
graph = extractor.extract(text)
print(graph.serialize(format="turtle"))

Project: Movie Recommendation Knowledge Graph

Progress

Week  Topic                  Project Milestone
1     Ontology Introduction  Movie domain design completed
2     RDF & RDFS             10 movies converted to RDF
3     OWL & Reasoning        Inference rules applied
4     Knowledge Extraction   Collect movie information from Wikipedia
5     Neo4j                  Store in graph DB and query
6     GraphRAG               Natural language queries
7     Ontology Agent         Automatic updates for new movies
8     Domain Extension       Medical/Legal/Finance cases
9     Service Deployment     API + Dashboard
Week 4 Milestone: Automatically Extracting Movie Information from Wikipedia

This week you will automate the data collection process: instead of manually entering 10 movies, you will extract movie information from Wikipedia and IMDb.

Extraction Pipeline:

Wikipedia Text
    ↓ NER (Named Entity Recognition)
[Christopher Nolan] [Inception] [2010]
    ↓ Relation Extraction
(Nolan) --directed--> (Inception)
    ↓ Entity Resolution (Entity Linking)
Nolan = dbpedia:Christopher_Nolan
    ↓ RDF Triple Generation

Information to Extract:

  • Movies: title, release date, genre, runtime
  • People: name, birth date, role (director/actor)
  • Relationships: directed, actedIn, hasGenre

Goal: Automatically collect data for 100 movies

In the project notebook, you will implement:

  • Crawl 100 movies using the Wikipedia API (see the sketch after this list)
  • Auto-extract director/actor names with spaCy NER
  • Extract "directed", "actedIn" relationships with LLM
  • Entity Resolution: merge "Chris Nolan" = "Christopher Nolan"
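
As a starting point for the crawl, here is a minimal sketch using Wikipedia's public REST summary endpoint; the movie title and User-Agent string are placeholders:

import requests

def fetch_wikipedia_summary(title):
    """Fetch the intro text of an English Wikipedia article."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    resp = requests.get(url, headers={"User-Agent": "kg-course-bot/0.1"})
    resp.raise_for_status()
    return resp.json()["extract"]

# The summary text then feeds the NER and relation extraction steps above
print(fetch_wikipedia_summary("Inception")[:200])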

What you'll build by Week 9: an AI agent that answers "Recommend sci-fi movies in Nolan's style" by reasoning over director-genre-rating relationships in the knowledge graph.


Practice Notebook

For deeper exploration of the theory, the practice notebook covers additional topics:

  • Fine-tuning custom spaCy NER models
  • GPT-4 prompt engineering techniques
  • Embedding-based Entity Resolution
  • Multi-source data integration strategies

Interview Questions

What are the main challenges in Named Entity Recognition?

Key Challenges:

  • Ambiguity: "Apple" (company vs fruit), "Jordan" (person vs country)
  • Nested Entities: "New York Times" contains "New York"
  • Out-of-Vocabulary: New entities not in training data
  • Domain Adaptation: Models trained on news may fail on medical text
  • Multilinguality: Different entity structures across languages
  • Entity Boundaries: Determining where an entity starts/ends

Premium Content

Want complete solutions with detailed explanations and production-ready code?

Check out the Ontology & Knowledge Graph Cookbook Premium for:

  • Complete notebook solutions with step-by-step explanations
  • Real-world case studies and best practices
  • Interview preparation materials
  • Production deployment guides

Next Steps

In Week 5: Neo4j, you will learn about the Labeled Property Graph model and how to work with Neo4j.