Intro to Neo4j (Graph Databases)

Written on October 26, 2024

A primer on Graph Databases with Neo4j.

What are Graph Databases?

Graph databases are specialized database systems designed to store and process data in terms of vertices (nodes) and edges (relationships). Unlike traditional databases, graph databases explicitly store relationships between data points, making them ideal for handling highly connected data.

Key Use Cases

Graph databases excel in scenarios where relationships between entities are as important as the entities themselves:

Social networks (friend connections, interactions)
Fraud detection (identifying suspicious patterns)
Knowledge graphs (connecting related concepts)
Recommendation engines (product suggestions)
Supply chain optimization (logistics networks)

Comparison with Traditional Databases

Limitations of SQL Databases

Traditional SQL databases face several challenges when handling graph-like data:

Implicit Relationships: Connections between data must be established through foreign keys and joins
Performance Issues: Complex queries requiring multiple joins become increasingly inefficient
Schema Rigidity: Different types of nodes often require separate tables, leading to a complex schema
Query Complexity: Graph traversal operations require multiple JOIN operations

Limitations of NoSQL Databases

Document and key-value stores also struggle with graph data:

Multiple Lookups: Traversing relationships requires multiple separate queries
Limited Relationship Modeling: Relationships are not first-class citizens
Complex Graph Operations: Operations like shortest path algorithms are difficult to implement efficiently

Neo4j Architecture

Key Concepts

Neo4j is built from the ground up to handle graph data efficiently through:

Native Graph Processing: Optimized for traversing relationships
Index-free Adjacency: Each node maintains direct references to its neighbors
Query Performance: Query time is proportional to the searched subgraph, not the total graph size

Storage Architecture

Neo4j’s storage system is organized into specialized store files:

Node Storage

Fixed-size records for fast lookups
Structure (15 bytes per record):
- In-use flag (1 byte)
- Relationship pointer (4 bytes)
- Property pointer (4 bytes)
- Label pointer (5 bytes)
- Flags (1 byte)

Relationship Storage

Fixed-size records containing:
- Start and end node references
- Relationship type pointer
- Next/previous relationship pointers for both nodes
- Chain position flags

This can be thought of as maintaing two separate doubly linked lists for each relationship. To traverse you start with a node which contains the reference to its first relationship.

Database Properties

Caching: Uses LRU-K page cache
Storage: Divides store files into discrete regions
Page Management: Fixed number of regions per store file
Eviction Policy: Least frequently used with page popularity consideration

Working with Neo4j

API Layers

Neo4j provides multiple ways to interact with the database:

Cypher: Declarative query language for graph operations
Kernel API: Low-level access to database transactions
Core API: Imperative Java API for direct graph manipulation
Traversal Framework: Declarative Java API for graph traversal

Code Example: Movie Database

// Create movie nodes
CREATE (inception:Movie {
    title: "Inception",
    released: 2010,
    tagline: "Your mind is the scene of the crime"
});

// Create person nodes
CREATE (leonardo:Person {
    name: "Leonardo DiCaprio",
    born: 1974
});

// Create relationships
MATCH (leonardo:Person {name: "Leonardo DiCaprio"})
MATCH (inception:Movie {title: "Inception"})
CREATE (leonardo)-[:ACTED_IN {roles: ["Cobb"]}]->(inception);

// Query directors and their movies
MATCH (director:Person)-[:DIRECTED]->(movie:Movie)
RETURN director.name, movie.title;

Future Exploration Areas

Knowledge Graphs: Neo4j’s application in AI and knowledge representation
Facebook’s TAO: Understanding large-scale graph storage systems
Core Database Concepts:
- Transaction management and ACID properties
- Write-ahead logging (WAL)
- Lock management
- Replication and availability strategies

Resources

Neo4j O’Reilly book
Neo4j Documentation
Neo4j GitHub Repository
- Neo4j node disk format code