Understanding Search: From Keywords to Knowledge Graphs
Search systems were once the precise, deterministic gateways to structured human knowledge. Today, they often behave more like probabilistic guessing engines—sometimes brilliant, sometimes incoherent, and frequently confident but wrong.
This shift wasn’t an accident; it was an architectural choice. We traded exactness for semantic flexibility, and in doing so, we broke the structural scaffolding that makes information reliable.
This article explains why search broke—from the perspective of information retrieval (IR) engineering—and why Knowledge Graphs are emerging not just as an enhancement, but as the necessary foundation for the next generation of AI systems.
The Age of the Catalog: Early Search Foundations
In the 1990s, the web was effectively a chaotic, ever-expanding public library. To manage it, early search engines like AltaVista, Yahoo, and Excite relied on inverted indexes.
An inverted index is a straightforward data structure: a map of Token → List[Documents]. If you searched for “Apple,” the system looked up the token and returned every document containing it. This was literal matching. It had no concept of semantics, intent, or disambiguation. It couldn’t distinguish between Apple the fruit, Apple the computer company, or Apple Records.
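A minimal sketch of this structure in Python makes the limitation obvious: the index maps lowercased tokens to document IDs, so every sense of “apple” collapses into one posting list.

```python
# Minimal inverted index: token -> set of document IDs.
# Toy documents; a real engine would also tokenize and normalize properly.
from collections import defaultdict

def build_index(docs):
    """docs: dict of doc_id -> text. Returns token -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

docs = {
    1: "Apple released a new computer",   # Apple the company
    2: "The apple fell from the tree",    # apple the fruit
}
index = build_index(docs)
# Both documents match the token "apple": literal matching, no disambiguation.
print(sorted(index["apple"]))  # [1, 2]
```

Both senses land in the same posting list; nothing in the data structure can tell them apart.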
The PageRank Revolution
The first major paradigm shift came with PageRank.
Larry Page and Sergey Brin realized that the web wasn’t just a collection of documents; it was a graph. They viewed hyperlinks as directed edges and treated every link as a vote of authority. Mathematically, PageRank treats the web as a massive Markov chain. The rank of a page is effectively the stationary distribution of a random walk across the graph.
This moved search from simple term frequency (how often a word appears) to link-based authority estimation (how important the page is). This structural approach dominated IR systems for two decades because it successfully proxied “quality” through “connectivity.”
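The stationary distribution can be computed by power iteration. The sketch below uses a toy three-page link graph (the structure and damping factor follow the standard formulation; the graph itself is invented for illustration):

```python
# Power iteration for PageRank on a toy link graph.
# links: node -> list of outbound-link targets (directed edges).
def pagerank(links, damping=0.85, iterations=50):
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1 - damping) / n for node in nodes}
        for node, targets in links.items():
            if targets:
                # Each page splits its rank evenly among its outbound links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: distribute its rank evenly over all pages.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
# C receives "votes" from both A and B, so it ends up with the highest rank.
```

Each iteration redistributes rank along the edges; after enough iterations the values converge to the random walk’s stationary distribution.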
But users evolved faster than the infrastructure.
The Era of Vectors (and Their Limitations)
By the 2010s, user behavior shifted. Queries evolved from keyword fragments like weather paris to complex, intent-driven questions:
“Is it safe to travel to Paris this weekend given the protests?”
Keyword search and link analysis failed here. A keyword search for “safe” and “protests” might return articles from 2019 because the tokens matched, even if the context was obsolete.
Enter Semantic Search
To solve this, the industry pivoted to embedding-based semantic search.
We began using models (like BERT and later SentenceTransformers) to compress sentences and documents into high-dimensional vectors. Retrieval became a geometry problem: we measured the cosine similarity between the query vector and the document vectors in that space. We scaled this with Approximate Nearest Neighbor (ANN) search, using algorithms like HNSW and libraries like FAISS and ScaNN.
This felt genuinely transformative—it allowed systems to retrieve semantically related passages even if they didn’t share a single keyword.
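Stripped of the ANN machinery, the core retrieval step is just a similarity ranking. The sketch below uses tiny hand-made vectors as stand-ins for real embeddings (a production system would get these from an embedding model and search them with an ANN index):

```python
# Brute-force semantic retrieval by cosine similarity.
# The embeddings here are illustrative toy vectors, not model outputs.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc_vectors = {
    "climate": [0.9, 0.1, 0.2],
    "agriculture": [0.8, 0.2, 0.3],
    "football": [0.1, 0.9, 0.1],
}
query = [0.9, 0.1, 0.2]  # a query "about climate"

# Rank documents by similarity to the query; no keyword overlap required.
best = max(doc_vectors, key=lambda d: cosine(query, doc_vectors[d]))
```

Note what the ranking gives us: “climate” and “agriculture” both score high because they are *near* each other in the space, but nothing in these numbers says *why* they are related.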
The Critical Weakness
However, embeddings introduced a subtle but critical structural weakness:
Embeddings encode proximity, not relationships.
Vectors operate in a metric space, not a relational graph. They are excellent at determining that “climate change” and “agriculture” are related concepts (they are close in vector space). But they are blind to the logic of that relationship. They cannot explicitly encode:
- Causality (Does X cause Y?)
- Hierarchy (Is X a type of Y?)
- Temporal dependency (Did X happen before Y?)
We built systems that could find things that sounded similar, but we lost the ability to reason about how they were connected.
The Great Shredder: RAG and Context Fragmentation
When Large Language Models (LLMs) arrived, we attempted to bridge this gap with Retrieval-Augmented Generation (RAG). The idea was simple: retrieve relevant data and feed it to the LLM as context.
However, the engineering implementation of RAG introduced a fundamental flaw: Context Fragmentation.
The Chunking Problem
To build vector indexes, we typically split long documents into fixed-size fragments (e.g., 300–1200 tokens). We call this “chunking,” but from an information theory perspective, it is a lossy compression strategy.
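The mechanics are trivial, which is part of the problem. A naive fixed-size chunker looks like this (sizes are illustrative; real pipelines often add overlap or split on sentence boundaries, but the boundary-severing effect remains):

```python
# Naive fixed-size chunking: split a token list into fragments.
# Any reference, definition, or causal link that spans a chunk
# boundary is silently severed.
def chunk(tokens, size=300, overlap=0):
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

doc_tokens = ["tok"] * 1000  # stand-in for a 1000-token document
chunks = chunk(doc_tokens, size=300)
# 1000 tokens -> chunks of 300, 300, 300, and 100 tokens.
```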
```mermaid
graph LR
    Doc[Structured Document] -->|Chunking| C1[Chunk 1]
    Doc -->|Chunking| C2[Chunk 2]
    Doc -->|Chunking| C3[Chunk 3]
    C1 -.->|Loss of Context| C2
    C2 -.->|Loss of Logic| C3
    style Doc fill:#f9f,stroke:#333,stroke-width:2px
    style C1 fill:#eee,stroke:#333
    style C2 fill:#eee,stroke:#333
    style C3 fill:#eee,stroke:#333
```
Chunking severs:
- Cross-references (Section A referring to Section F)
- Definitions (A term defined in the intro is used undefined in the conclusion)
- Causal chains (The cause is in Chunk 1, the effect is in Chunk 3)
A RAG pipeline retrieves a handful of these disconnected chunks and passes them to an LLM, expecting the model to reconstruct the global context. The result is often hallucinated reasoning or incoherent synthesis.
The failure wasn’t in the LLM; it was in the knowledge representation pipeline. We destroyed the map and expected the model to navigate the territory.
The Return to Structure: Knowledge Graphs
Where embeddings excel at capturing similarity, Knowledge Graphs excel at preserving structure.
A Knowledge Graph models reality strictly as:
- Nodes: Entities, events, and concepts (e.g., “Argentina”, “Drought”)
- Edges: Explicit relationships (e.g., “CAUSES”, “LOCATED_IN”)
- Properties: Metadata attached to nodes/edges (e.g., timestamps, confidence scores)
- Constraints: Domain rules that govern logic
This structure supports multi-hop retrieval—the ability to traverse a chain of facts to answer a complex question.
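In code, the model is just typed edges with metadata attached. A minimal property-graph sketch (entities and relations taken from the example below; the property values are invented for illustration):

```python
# Minimal property-graph representation: typed edges with metadata.
# Nodes are strings; edges carry a relation type and properties.
edges = [
    {"src": "Drought", "rel": "LOCATED_IN", "dst": "Argentina",
     "props": {"confidence": 0.9}},
    {"src": "Drought", "rel": "CAUSES_DECREASE_IN", "dst": "Soybean Supply",
     "props": {"confidence": 0.8}},
]

def neighbors(node):
    """All outgoing (relation, target) pairs for a node."""
    return [(e["rel"], e["dst"]) for e in edges if e["src"] == node]

# Unlike a vector, each edge states *how* two entities are connected.
print(neighbors("Drought"))
```

A graph database (Neo4j, for instance) adds indexing, constraints, and a query language on top of exactly this shape of data.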
Example: Relational Reasoning
Consider the query:
“How do weather patterns in South America affect US markets?”
A vector search might fail because “weather in Argentina” and “US inflation” rarely appear in the same paragraph (chunk). A graph system, however, can traverse the explicit path:
Drought in Argentina
→ CAUSES_DECREASE_IN → Soybean Supply
→ IMPACTS → Global Export Prices
→ AFFECTS → US Inflation
This traversal isn’t a guess; it’s a logical deduction derived from the data topology.
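The traversal above is an ordinary path search. A breadth-first sketch over the same toy relations (a production system would express this in a graph query language such as Cypher rather than hand-rolled BFS):

```python
# Multi-hop retrieval as breadth-first path search over explicit relations.
from collections import deque

graph = {
    "Drought in Argentina": [("CAUSES_DECREASE_IN", "Soybean Supply")],
    "Soybean Supply": [("IMPACTS", "Global Export Prices")],
    "Global Export Prices": [("AFFECTS", "US Inflation")],
}

def find_path(start, goal):
    """Return the first relation-labeled path from start to goal, or None."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [f"-{rel}->", nxt]))
    return None

path = find_path("Drought in Argentina", "US Inflation")
# Every hop in the returned path is an explicit, labeled fact.
```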
Graph algorithms—such as PageRank (for importance), Betweenness Centrality (for influence), and Shortest Path (for connection)—allow us to reason over the data, rather than just retrieving it.
The Next Chapter: GraphRAG
The emerging architecture for next-generation search is GraphRAG. It is not a replacement for vector search, but a unification of methods.
It combines:
- Vectors for broad, fuzzy semantic recall (“Find concepts related to agriculture”)
- Graphs for precise, relational traversal (“Follow the supply chain dependencies”)
- LLMs for synthesis and natural language explanation
This hybrid approach solves the “chunking artifact” problem. By linking chunks back to their parent entities in a graph, we preserve the global context even when retrieval is local.
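The two-stage flow can be sketched in a few lines: vector similarity picks the entry point, then graph traversal expands it into connected context. Everything here (the toy embeddings, the adjacency lists, the dot-product recall) is an illustrative assumption, not a reference implementation:

```python
# Hybrid GraphRAG-style retrieval sketch:
#   1. fuzzy vector recall selects a seed entity,
#   2. graph expansion pulls in explicitly linked entities.
def hybrid_retrieve(query_vec, embeddings, graph, hops=1):
    # Step 1: vector recall (toy dot product instead of a real ANN index).
    seed = max(
        embeddings,
        key=lambda e: sum(q * v for q, v in zip(query_vec, embeddings[e])),
    )
    # Step 2: graph expansion up to `hops` edges away from the seed.
    context = {seed}
    frontier = {seed}
    for _ in range(hops):
        frontier = {dst for node in frontier for dst in graph.get(node, [])}
        context |= frontier
    return context

embeddings = {"Drought": [1.0, 0.0], "Football": [0.0, 1.0]}
graph = {"Drought": ["Soybean Supply"], "Soybean Supply": ["Export Prices"]}
context = hybrid_retrieve([0.9, 0.1], embeddings, graph, hops=2)
# The fuzzy match lands on "Drought"; the graph supplies the supply chain.
```

The seed comes from geometry; everything else in the context set comes from structure. That division of labor is the whole point of the hybrid.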
The Evolution of Search
Search is evolving through distinct engineering eras:
| Era | Technique | Core Limitation |
|---|---|---|
| 1990s | Inverted Indexes | Literal matching; no semantics |
| 2000s | Link Analysis | Authority-focused; blindly trusts links |
| 2010s | Embeddings | Encodes proximity, but ignores structure |
| 2020s | Naive RAG | Context fragmentation via chunking |
| 2025+ | GraphRAG | Restores structure and reasoning |
Search broke when we optimized purely for matching (finding words) instead of mapping (understanding worlds). It will be fixed by rebuilding the structural foundation underlying our data.
References
- Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab.
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS. (The foundational RAG paper)
- Edge, D., et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Microsoft Research. (The paper defining modern GraphRAG)
- Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS. (For context on embedding limitations)
- Liu, N., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. (Explaining why larger context windows don’t fix structure)