Week 4: Knowledge Extraction
Learning Objectives
This week covers techniques for extracting structured knowledge from unstructured text. You will learn about Named Entity Recognition (NER), Relation Extraction (RE), Coreference Resolution, and Entity Resolution.
1. Introduction to Knowledge Extraction
Structuring the Unstructured: "Documents to Data"
Most of the world's knowledge is locked in unstructured text—PDFs, emails, news articles. A Knowledge Graph can't directly "read" a PDF.
Knowledge Extraction is the pipeline that converts free text into structured Triples.
- Raw Text: "Apple Inc. was founded by Steve Jobs."
- Structured Data:
(Apple Inc., founded_by, Steve Jobs)
In effect, it turns reading into database records.
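To make the target representation concrete, here is a minimal rdflib sketch (the namespace URI is illustrative) that records this triple in a graph:

```python
from rdflib import Graph, Namespace

# Illustrative namespace for the demo triple
EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.Apple_Inc, EX.founded_by, EX.Steve_Jobs))
print(g.serialize(format="turtle"))
```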
The Knowledge Extraction Pipeline
Raw Text → Preprocessing → NER → Coreference → RE → Entity Resolution → KG

| Stage | Output |
|---|---|
| Preprocessing | Tokenized, cleaned text |
| NER | Entity mentions with types |
| Coreference | Resolved entity references |
| Relation Extraction | Entity-relation triples |
| Entity Resolution | Linked, deduplicated entities |
2. Named Entity Recognition (NER)
NER identifies and classifies named entities in text into predefined categories.
Common Entity Types
| Type | Examples |
|---|---|
| PERSON | John Smith, Marie Curie |
| ORGANIZATION | Google, United Nations |
| LOCATION | New York, Mount Everest |
| DATE | January 2024, yesterday |
| MONEY | $50 million, 100 euros |
NER with spaCy
```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Apple Inc. was founded by Steve Jobs in Cupertino, California."
doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")

# Output:
# Apple Inc.: ORG
# Steve Jobs: PERSON
# Cupertino: GPE
# California: GPE
```

Fine-tuning NER Models
```python
import spacy
from spacy.training import Example

# Create training data: (text, {"entities": [(start, end, label)]})
TRAIN_DATA = [
    ("SNOMED CT is a medical terminology",
     {"entities": [(0, 9, "TERMINOLOGY")]}),
    ("The Gene Ontology describes gene functions",
     {"entities": [(4, 17, "ONTOLOGY")]}),
]

# Add custom entity labels
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("TERMINOLOGY")
ner.add_label("ONTOLOGY")

# Train the model (in spaCy v3, initialize() replaces the
# deprecated begin_training())
optimizer = nlp.initialize()
for epoch in range(10):
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)
```

3. Relation Extraction
Relation Extraction identifies semantic relationships between entities.
Types of Relations
| Relation Type | Example |
|---|---|
| Employment | (Steve Jobs, founded, Apple) |
| Location | (Apple, headquartered_in, Cupertino) |
| Part-Of | (California, part_of, USA) |
| Temporal | (Event, occurred_on, Date) |
Pattern-Based Extraction
```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Pattern: a PERSON token, a form of "found", then an ORG token.
# Each pattern entry matches a single token, so a multi-token name
# like "Steve Jobs" matches on its final token ("Jobs").
pattern = [
    {"ENT_TYPE": "PERSON"},
    {"LEMMA": "found"},
    {"ENT_TYPE": "ORG"}
]
matcher.add("FOUNDED_BY", [pattern])

doc = nlp("Steve Jobs founded Apple in 1976.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(f"Found relation: {span.text}")
```

Neural Relation Extraction
```python
from transformers import pipeline

# REBEL is a seq2seq model that generates triples as text, so it is loaded
# via text2text-generation (there is no "relation-extraction" pipeline task)
extractor = pipeline("text2text-generation",
                     model="Babelscape/rebel-large",
                     tokenizer="Babelscape/rebel-large")

text = "Albert Einstein developed the theory of relativity at ETH Zurich."
# Decode raw token ids so the special markers (<triplet>, <subj>, <obj>) survive
output = extractor(text, return_tensors=True, return_text=False)
decoded = extractor.tokenizer.batch_decode(
    [output[0]["generated_token_ids"]])[0]
print(decoded)
```
4. Coreference Resolution
Coreference Resolution identifies all expressions referring to the same entity.
Example
Input: "John bought a car. He drives it every day."
Output: "John bought a car. [John] drives [the car] every day."Using neuralcoref
```python
import spacy
import neuralcoref  # note: neuralcoref only supports spaCy 2.x

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

text = "Marie Curie won the Nobel Prize. She was the first woman to do so."
doc = nlp(text)

# Get resolved text
print(doc._.coref_resolved)
# "Marie Curie won the Nobel Prize. Marie Curie was the first woman to do so."

# Access coreference clusters
for cluster in doc._.coref_clusters:
    print(f"Cluster: {cluster.main} -> {cluster.mentions}")
```
5. Entity Resolution (Entity Linking)
Entity Resolution (also called Entity Linking) links each entity mention to its entry in a knowledge base.
The Challenge
"Apple" → Apple Inc. (company) OR Apple (fruit)?
"Washington" → Washington D.C. OR George Washington OR Washington State?Entity Linking with spaCy
```python
import spacy

# Requires the spacy-entity-linker package, which registers the
# "entityLinker" component and links mentions to Wikidata
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("entityLinker", last=True)

text = "Einstein worked at Princeton University."
doc = nlp(text)

# spacy-entity-linker exposes linked entities on the Doc
for linked in doc._.linkedEntities:
    print(f"{linked.get_span().text} -> Q{linked.get_id()}")
```

Custom Entity Resolution
```python
from rapidfuzz import fuzz, process

# Knowledge base entries
KB = {
    "Q937": "Albert Einstein",
    "Q312": "Apple Inc.",
    "Q89": "Apple (fruit)",
    "Q61": "Washington, D.C."
}

def resolve_entity(mention, candidates, threshold=80):
    """Find the best matching KB entry for a mention."""
    # token_set_ratio scores a partial mention like "Einstein" against
    # "Albert Einstein" at 100; token_sort_ratio would fall below the
    # threshold on this example
    matches = process.extract(
        mention,
        candidates,  # passing a mapping makes rapidfuzz return the key
        scorer=fuzz.token_set_ratio,
        limit=3
    )
    name, score, kb_id = matches[0]
    if score >= threshold:
        return kb_id, name, score
    return None, None, 0

# Example usage
entity_id, name, score = resolve_entity("Einstein", KB)
print(f"Resolved to: {name} ({entity_id}) with score {score}")
# Resolved to: Albert Einstein (Q937) with score 100.0
```
6. Building a Knowledge Extraction Pipeline
Complete Pipeline Example
```python
import spacy
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

class KnowledgeExtractor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_lg")
        self.graph = Graph()
        self.EX = Namespace("http://example.org/")

    def extract(self, text):
        doc = self.nlp(text)
        # Extract entities, remembering which token belongs to which mention
        # so multi-token entities like "Beats Electronics" resolve correctly
        entities = {}
        token_to_ent = {}
        for ent in doc.ents:
            entity_uri = self.EX[ent.text.replace(" ", "_")]
            entities[ent.text] = entity_uri
            for token in ent:
                token_to_ent[token.i] = ent.text
            # Add entity to graph
            self.graph.add((entity_uri, RDF.type, self.EX[ent.label_]))
            self.graph.add((entity_uri, RDFS.label, Literal(ent.text)))
        # Extract relations (simplified subject-verb-object patterns)
        for token in doc:
            if token.dep_ == "nsubj" and token.head.pos_ == "VERB":
                subj = token_to_ent.get(token.i)
                pred = token.head.lemma_
                for child in token.head.children:
                    if child.dep_ == "dobj":
                        obj = token_to_ent.get(child.i)
                        self._add_relation(subj, pred, obj, entities)
        return self.graph

    def _add_relation(self, subj, pred, obj, entities):
        if subj in entities and obj in entities:
            pred_uri = self.EX[pred]
            self.graph.add((entities[subj], pred_uri, entities[obj]))

# Usage
extractor = KnowledgeExtractor()
text = "Apple acquired Beats Electronics for $3 billion."
graph = extractor.extract(text)
print(graph.serialize(format="turtle"))
```

Project: Movie Recommendation Knowledge Graph
Progress
| Week | Topic | Project Milestone |
|---|---|---|
| 1 | Ontology Introduction | Movie domain design completed |
| 2 | RDF & RDFS | 10 movies converted to RDF |
| 3 | OWL & Reasoning | Inference rules applied |
| 4 | Knowledge Extraction | Collect movie information from Wikipedia |
| 5 | Neo4j | Store in graph DB and query |
| 6 | GraphRAG | Natural language queries |
| 7 | Ontology Agent | Automatic updates for new movies |
| 8 | Domain Extension | Medical/Legal/Finance cases |
| 9 | Service Deployment | API + Dashboard |
Week 4 Milestone: Automatically Extracting Movie Information from Wikipedia
This week, you will automate the data collection process. Instead of manually entering 10 movies, you will extract movie information from Wikipedia and IMDb.
Extraction Pipeline:
Wikipedia Text
  ↓ NER (Named Entity Recognition)
[Christopher Nolan] [Inception] [2010]
  ↓ Relation Extraction
(Nolan) --directed--> (Inception)
  ↓ Entity Resolution (Entity Linking)
Nolan = dbpedia:Christopher_Nolan
  ↓ RDF Triple Generation

Information to Extract:
- Movies: title, release date, genre, runtime
- People: name, birth date, role (director/actor)
- Relationships: directed, actedIn, hasGenre
Goal: Automatically collect data for 100 movies
In the project notebook, you will automatically collect movie data from Wikipedia. You will implement the following (a starter sketch follows the list):
- Crawl 100 movies using Wikipedia API
- Auto-extract director/actor names with spaCy NER
- Extract "directed", "actedIn" relationships with LLM
- Entity Resolution: merge "Chris Nolan" = "Christopher Nolan"
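As a starting point, here is a sketch of the first two steps: fetching a movie summary via the Wikimedia REST API and running spaCy NER over it. The endpoint and User-Agent value are illustrative assumptions; the notebook may use a different client.

```python
import requests
import spacy

nlp = spacy.load("en_core_web_sm")

def fetch_summary(title: str) -> str:
    # The Wikimedia REST API returns a plain-text summary in the "extract" field
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    resp = requests.get(url, headers={"User-Agent": "kg-course-demo/0.1"})
    resp.raise_for_status()
    return resp.json()["extract"]

text = fetch_summary("Inception")
doc = nlp(text)
for ent in doc.ents:
    if ent.label_ in ("PERSON", "DATE", "ORG"):
        print(ent.text, ent.label_)
```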
What you'll build by Week 9: An AI agent that answers "Recommend sci-fi movies like Nolan's style" by reasoning over director-genre-rating relationships in the knowledge graph.
Practice Notebook
For deeper exploration of the theory, the practice notebook covers additional topics:
- Fine-tuning custom spaCy NER models
- GPT-4 prompt engineering techniques
- Embedding-based Entity Resolution
- Multi-source data integration strategies
Interview Questions
What are the main challenges in Named Entity Recognition?
Key Challenges:
- Ambiguity: "Apple" (company vs fruit), "Jordan" (person vs country)
- Nested Entities: "New York Times" contains "New York"
- Out-of-Vocabulary: New entities not in training data
- Domain Adaptation: Models trained on news may fail on medical text
- Multilinguality: Different entity structures across languages
- Entity Boundaries: Determining where an entity starts/ends
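To see the ambiguity challenge concretely, run the same surface form through spaCy in different contexts (a small demo; exact predictions vary by model version):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

for sentence in ["Apple unveiled a new iPhone.",
                 "She ate an apple for lunch."]:
    doc = nlp(sentence)
    print(sentence, "->", [(e.text, e.label_) for e in doc.ents])

# Typically "Apple" is tagged ORG in the first sentence, while the
# lowercase "apple" in the second is not recognized as an entity.
```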
Premium Content
Want complete solutions with detailed explanations and production-ready code?
Check out the Ontology & Knowledge Graph Cookbook Premium for:
- Complete notebook solutions with step-by-step explanations
- Real-world case studies and best practices
- Interview preparation materials
- Production deployment guides
Next Steps
In Week 5: Neo4j, you will learn about the Labeled Property Graph model and how to work with Neo4j.