Week 4: Knowledge Extraction
Learning Objectives
This week covers techniques for extracting structured knowledge from unstructured text. You will learn about Named Entity Recognition (NER), Relation Extraction (RE), Coreference Resolution, and Entity Resolution.
1. Introduction to Knowledge Extraction
Structuring the Unstructured: "Documents to Data"
Most of the world's knowledge is locked in unstructured text—PDFs, emails, news articles. A Knowledge Graph can't directly "read" a PDF.
Knowledge Extraction is the pipeline that converts free text into structured Triples.
- Raw Text: "Apple Inc. was founded by Steve Jobs."
- Structured Data:
(Apple Inc., founded_by, Steve Jobs)
In effect, it turns reading into database records.
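To make the target representation concrete, here is a minimal rdflib sketch (the namespace URI is illustrative) that records this triple in a graph:

```python
from rdflib import Graph, Namespace

# Illustrative namespace for the demo triple
EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.Apple_Inc, EX.founded_by, EX.Steve_Jobs))
print(g.serialize(format="turtle"))
```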
The Knowledge Extraction Pipeline
Raw Text → Preprocessing → NER → Coreference → RE → Entity Resolution → KG

| Stage | Output |
|---|---|
| Preprocessing | Tokenized, cleaned text |
| NER | Entity mentions with types |
| Coreference | Resolved entity references |
| Relation Extraction | Entity-relation triples |
| Entity Resolution | Linked, deduplicated entities |
2. Named Entity Recognition (NER)
NER identifies and classifies named entities in text into predefined categories.
Common Entity Types
| Type | Examples |
|---|---|
| PERSON | John Smith, Marie Curie |
| ORGANIZATION | Google, United Nations |
| LOCATION | New York, Mount Everest |
| DATE | January 2024, yesterday |
| MONEY | $50 million, 100 euros |
NER with spaCy
```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Apple Inc. was founded by Steve Jobs in Cupertino, California."
doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")

# Output:
# Apple Inc.: ORG
# Steve Jobs: PERSON
# Cupertino: GPE
# California: GPE
```

Fine-tuning NER Models
```python
import spacy
from spacy.training import Example

# Create training data: (text, {"entities": [(start, end, label)]})
TRAIN_DATA = [
    ("SNOMED CT is a medical terminology",
     {"entities": [(0, 9, "TERMINOLOGY")]}),
    ("The Gene Ontology describes gene functions",
     {"entities": [(4, 17, "ONTOLOGY")]}),
]

# Add custom entity labels
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("TERMINOLOGY")
ner.add_label("ONTOLOGY")

# Train the model (in spaCy v3, initialize() replaces the
# deprecated begin_training())
optimizer = nlp.initialize()
for epoch in range(10):
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)
```

3. Relation Extraction
Relation Extraction identifies semantic relationships between entities.
Types of Relations
| Relation Type | Example |
|---|---|
| Employment | (Steve Jobs, founded, Apple) |
| Location | (Apple, headquartered_in, Cupertino) |
| Part-Of | (California, part_of, USA) |
| Temporal | (Event, occurred_on, Date) |
Pattern-Based Extraction
```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Pattern: a PERSON token, a form of "found", then an ORG token.
# Each pattern entry matches a single token, so a multi-token name
# like "Steve Jobs" matches on its final token ("Jobs").
pattern = [
    {"ENT_TYPE": "PERSON"},
    {"LEMMA": "found"},
    {"ENT_TYPE": "ORG"}
]
matcher.add("FOUNDED_BY", [pattern])

doc = nlp("Steve Jobs founded Apple in 1976.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(f"Found relation: {span.text}")
```

Neural Relation Extraction
```python
from transformers import pipeline

# REBEL is a seq2seq model that generates triples as text, so it is loaded
# via text2text-generation (there is no "relation-extraction" pipeline task)
extractor = pipeline("text2text-generation",
                     model="Babelscape/rebel-large",
                     tokenizer="Babelscape/rebel-large")

text = "Albert Einstein developed the theory of relativity at ETH Zurich."
# Decode raw token ids so the special markers (<triplet>, <subj>, <obj>) survive
output = extractor(text, return_tensors=True, return_text=False)
decoded = extractor.tokenizer.batch_decode(
    [output[0]["generated_token_ids"]])[0]
print(decoded)
```
4. Coreference Resolution
Coreference Resolution identifies all expressions referring to the same entity.
Example
Input: "John bought a car. He drives it every day."
Output: "John bought a car. [John] drives [the car] every day."Using neuralcoref
```python
import spacy
import neuralcoref  # note: neuralcoref only supports spaCy 2.x

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

text = "Marie Curie won the Nobel Prize. She was the first woman to do so."
doc = nlp(text)

# Get resolved text
print(doc._.coref_resolved)
# "Marie Curie won the Nobel Prize. Marie Curie was the first woman to do so."

# Access coreference clusters
for cluster in doc._.coref_clusters:
    print(f"Cluster: {cluster.main} -> {cluster.mentions}")
```
5. Entity Resolution (Entity Linking)
Entity Resolution (also called Entity Linking) links each entity mention to its entry in a knowledge base.
The Challenge
"Apple" → Apple Inc. (company) OR Apple (fruit)?
"Washington" → Washington D.C. OR George Washington OR Washington State?Entity Linking with spaCy
```python
import spacy

# Requires the spacy-entity-linker package, which registers the
# "entityLinker" component and links mentions to Wikidata
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("entityLinker", last=True)

text = "Einstein worked at Princeton University."
doc = nlp(text)

# spacy-entity-linker exposes linked entities on the Doc
for linked in doc._.linkedEntities:
    print(f"{linked.get_span().text} -> Q{linked.get_id()}")
```

Custom Entity Resolution
```python
from rapidfuzz import fuzz, process

# Knowledge base entries
KB = {
    "Q937": "Albert Einstein",
    "Q312": "Apple Inc.",
    "Q89": "Apple (fruit)",
    "Q61": "Washington, D.C."
}

def resolve_entity(mention, candidates, threshold=80):
    """Find the best matching KB entry for a mention."""
    # token_set_ratio scores a partial mention like "Einstein" against
    # "Albert Einstein" at 100; token_sort_ratio would fall below the
    # threshold on this example
    matches = process.extract(
        mention,
        candidates,  # passing a mapping makes rapidfuzz return the key
        scorer=fuzz.token_set_ratio,
        limit=3
    )
    name, score, kb_id = matches[0]
    if score >= threshold:
        return kb_id, name, score
    return None, None, 0

# Example usage
entity_id, name, score = resolve_entity("Einstein", KB)
print(f"Resolved to: {name} ({entity_id}) with score {score}")
# Resolved to: Albert Einstein (Q937) with score 100.0
```
6. Building a Knowledge Extraction Pipeline
Complete Pipeline Example
```python
import spacy
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

class KnowledgeExtractor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_lg")
        self.graph = Graph()
        self.EX = Namespace("http://example.org/")

    def extract(self, text):
        doc = self.nlp(text)
        # Extract entities, remembering which token belongs to which mention
        # so multi-token entities like "Beats Electronics" resolve correctly
        entities = {}
        token_to_ent = {}
        for ent in doc.ents:
            entity_uri = self.EX[ent.text.replace(" ", "_")]
            entities[ent.text] = entity_uri
            for token in ent:
                token_to_ent[token.i] = ent.text
            # Add entity to graph
            self.graph.add((entity_uri, RDF.type, self.EX[ent.label_]))
            self.graph.add((entity_uri, RDFS.label, Literal(ent.text)))
        # Extract relations (simplified subject-verb-object patterns)
        for token in doc:
            if token.dep_ == "nsubj" and token.head.pos_ == "VERB":
                subj = token_to_ent.get(token.i)
                pred = token.head.lemma_
                for child in token.head.children:
                    if child.dep_ == "dobj":
                        obj = token_to_ent.get(child.i)
                        self._add_relation(subj, pred, obj, entities)
        return self.graph

    def _add_relation(self, subj, pred, obj, entities):
        if subj in entities and obj in entities:
            pred_uri = self.EX[pred]
            self.graph.add((entities[subj], pred_uri, entities[obj]))

# Usage
extractor = KnowledgeExtractor()
text = "Apple acquired Beats Electronics for $3 billion."
graph = extractor.extract(text)
print(graph.serialize(format="turtle"))
```

Project: Movie Recommendation Knowledge Graph
Progress
| Week | Topic | Project Milestone |
|---|---|---|
| 1 | Ontology Introduction | Movie domain design completed |
| 2 | RDF & RDFS | 10 movies converted to RDF |
| 3 | OWL & Reasoning | Inference rules applied |
| 4 | Knowledge Extraction | Collect movie information from Wikipedia |
| 5 | Neo4j | Store in graph DB and query |
| 6 | GraphRAG | Natural language queries |
| 7 | Ontology Agent | Automatic updates for new movies |
| 8 | Domain Extension | Medical/Legal/Finance cases |
| 9 | Service Deployment | API + Dashboard |
Week 4 Milestone: Automatically Extracting Movie Information from Wikipedia
This week, you will automate the data collection process. Instead of manually entering 10 movies, you will extract movie information from Wikipedia and IMDb.
Extraction Pipeline:
Wikipedia Text
  ↓ NER (Named Entity Recognition)
[Christopher Nolan] [Inception] [2010]
  ↓ Relation Extraction
(Nolan) --directed--> (Inception)
  ↓ Entity Resolution (Entity Linking)
Nolan = dbpedia:Christopher_Nolan
  ↓ RDF Triple Generation

Information to Extract:
- Movies: title, release date, genre, runtime
- People: name, birth date, role (director/actor)
- Relationships: directed, actedIn, hasGenre
Goal: Automatically collect data for 100 movies
In the project notebook, you will automatically collect movie data from Wikipedia. You will implement the following (a starter sketch follows the list):
- Crawl 100 movies using Wikipedia API
- Auto-extract director/actor names with spaCy NER
- Extract "directed", "actedIn" relationships with LLM
- Entity Resolution: merge "Chris Nolan" = "Christopher Nolan"
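As a starting point, here is a sketch of the first two steps: fetching a movie summary via the Wikimedia REST API and running spaCy NER over it. The endpoint and User-Agent value are illustrative assumptions; the notebook may use a different client.

```python
import requests
import spacy

nlp = spacy.load("en_core_web_sm")

def fetch_summary(title: str) -> str:
    # The Wikimedia REST API returns a plain-text summary in the "extract" field
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    resp = requests.get(url, headers={"User-Agent": "kg-course-demo/0.1"})
    resp.raise_for_status()
    return resp.json()["extract"]

text = fetch_summary("Inception")
doc = nlp(text)
for ent in doc.ents:
    if ent.label_ in ("PERSON", "DATE", "ORG"):
        print(ent.text, ent.label_)
```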
What you'll build by Week 9: An AI agent that answers "Recommend sci-fi movies like Nolan's style" by reasoning over director-genre-rating relationships in the knowledge graph.
Practice Notebook
For deeper exploration of the theory, the practice notebook covers additional topics:
- Fine-tuning custom spaCy NER models
- GPT-4 prompt engineering techniques
- Embedding-based Entity Resolution
- Multi-source data integration strategies
Interview Questions
What are the main challenges in Named Entity Recognition?
Key Challenges:
- Ambiguity: "Apple" (company vs fruit), "Jordan" (person vs country)
- Nested Entities: "New York Times" contains "New York"
- Out-of-Vocabulary: New entities not in training data
- Domain Adaptation: Models trained on news may fail on medical text
- Multilinguality: Different entity structures across languages
- Entity Boundaries: Determining where an entity starts/ends
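To see the ambiguity challenge concretely, run the same surface form through spaCy in different contexts (a small demo; exact predictions vary by model version):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

for sentence in ["Apple unveiled a new iPhone.",
                 "She ate an apple for lunch."]:
    doc = nlp(sentence)
    print(sentence, "->", [(e.text, e.label_) for e in doc.ents])

# Typically "Apple" is tagged ORG in the first sentence, while the
# lowercase "apple" in the second is not recognized as an entity.
```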
Premium Content
Want complete solutions with detailed explanations and production-ready code?
Check out the Ontology & Knowledge Graph Cookbook Premium for:
- Complete notebook solutions with step-by-step explanations
- Real-world case studies and best practices
- Interview preparation materials
- Production deployment guides
Next Steps
In Week 5: Neo4j, you will learn about the Labeled Property Graph model and how to work with Neo4j.