Database sharding is a technique to scale out databases by breaking them into smaller, more manageable pieces called shards. It’s particularly useful for applications that need to handle large volumes of data and high throughput.
Let’s see how we can implement database sharding in Rust! 🦀
Sharding Schemes
There are several sharding schemes, each with its own advantages and use cases. Let’s explore the most common ones, each with a bare-bones code snippet:
1. Key-Based (Hash-Based) Sharding
In key-based sharding, a shard is determined by applying a deterministic hash function to a sharding key associated with each record. The hash output, taken modulo the number of shards, maps each key to a shard. This approach gives an even distribution of data across shards, provided the hash function spreads keys well.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
fn hash_based_shard<T: Hash>(key: &T, number_of_shards: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() as usize) % number_of_shards
}
// Usage
let shard_id = hash_based_shard(&"user123", 10);
println!("Shard ID for 'user123': {}", shard_id);
- Pros: Even distribution of data, simplicity.
- Cons: Rebalancing data can be challenging when adding or removing nodes.
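The rebalancing drawback comes from the modulo: changing the shard count remaps almost every key. Consistent hashing is the usual mitigation, since adding a shard only moves the keys nearest its points on the ring. A minimal sketch (the `HashRing` type and shard names here are illustrative, not a standard API):

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// A minimal consistent-hash ring: shards are placed at points on a ring
/// of u64 hash values, and each key maps to the first shard point at or
/// after its own hash (wrapping around at the end).
struct HashRing {
    ring: BTreeMap<u64, String>, // hash point -> shard name
}

impl HashRing {
    fn new() -> Self {
        Self { ring: BTreeMap::new() }
    }

    fn add_shard(&mut self, shard: &str) {
        // A few virtual nodes per shard smooth out the distribution.
        for vnode in 0..3 {
            let point = hash_of(&format!("{shard}-{vnode}"));
            self.ring.insert(point, shard.to_string());
        }
    }

    fn shard_for<T: Hash>(&self, key: &T) -> Option<&str> {
        let h = hash_of(key);
        // First point clockwise from the key's hash, wrapping to the start.
        self.ring
            .range(h..)
            .next()
            .or_else(|| self.ring.iter().next())
            .map(|(_, shard)| shard.as_str())
    }
}

fn main() {
    let mut ring = HashRing::new();
    ring.add_shard("shard-a");
    ring.add_shard("shard-b");
    ring.add_shard("shard-c");
    println!("'user123' -> {:?}", ring.shard_for(&"user123"));
}
```

Removing a shard is symmetric: delete its points and only the keys that mapped to them move to the next point on the ring.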
2. Range-Based Sharding
Range-based sharding involves dividing data into shards based on ranges of a certain key. Each shard holds the data for a specific range of values. For example, in a user database, one shard might hold users with IDs from 1 to 1000, another from 1001 to 2000, and so on.
fn range_based_shard(key: i32, range_size: i32, number_of_shards: usize) -> usize {
    // Assumes a non-negative key; each shard covers `range_size` consecutive keys,
    // with ranges wrapping around the available shards.
    ((key / range_size) as usize) % number_of_shards
}
// Usage
let shard_id = range_based_shard(12345, 1000, 10);
println!("Shard ID for key 12345: {}", shard_id);
- Pros: Easy to implement, efficient for range queries.
- Cons: Can lead to uneven data and load distribution if the data isn’t uniformly distributed.
3. Directory-Based Sharding
Directory-based sharding uses a lookup table to keep track of which shard holds which data. When a query comes in, the system consults the lookup table to determine where to route the query.
use std::collections::HashMap;
struct DirectorySharder {
    directory: HashMap<String, usize>, // Maps a key to a shard ID
}

impl DirectorySharder {
    fn new() -> Self {
        Self {
            directory: HashMap::new(),
        }
    }

    fn add_key(&mut self, key: &str, shard_id: usize) {
        self.directory.insert(key.to_string(), shard_id);
    }

    fn get_shard(&self, key: &str) -> Option<usize> {
        self.directory.get(key).cloned()
    }
}
// Usage
let mut sharder = DirectorySharder::new();
sharder.add_key("user123", 1);
println!("Shard ID for 'user123': {:?}", sharder.get_shard("user123"));
- Pros: Flexibility in data distribution, easy to add new shards.
- Cons: The lookup table can become a bottleneck if not managed properly.
4. Geographic Sharding
Geographic sharding involves distributing data based on geographic locations. This can be particularly useful for services that are region-specific and can significantly reduce latency by locating data closer to its users.
fn geographic_shard(region: &str) -> usize {
    match region {
        "North America" => 0,
        "Europe" => 1,
        "Asia" => 2,
        _ => usize::MAX, // Unknown or fallback shard
    }
}
// Usage
let shard_id = geographic_shard("Europe");
println!("Shard ID for Europe: {}", shard_id);
- Pros: Reduced latency for geographically distributed applications, improved local data compliance.
- Cons: Complexity in managing data consistency across regions.
5. Vertical Sharding
Vertical sharding, also known as functional partitioning, involves splitting a database into shards based on features or services. For example, user-related data might be stored in one shard, while product-related data might be stored in another.
fn vertical_shard(data_type: &str) -> usize {
    match data_type {
        "User Data" => 0,
        "Order Data" => 1,
        "Product Data" => 2,
        _ => usize::MAX, // Fallback shard for unknown data types
    }
}
// Usage
let shard_id = vertical_shard("Order Data");
println!("Shard ID for Order Data: {}", shard_id);
- Pros: Isolation of workloads, potential for performance optimization.
- Cons: Can lead to data duplication and complicates transactions that span multiple shards.
6. Tenant-Based Sharding (Multi-Tenancy)
In multi-tenant applications, data from different tenants (customers, organizations) is stored in separate shards. Each tenant’s data is isolated and can be managed independently.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn tenant_based_shard(tenant_id: &str, number_of_shards: usize) -> usize {
    // Simple hash-based approach for the tenant ID
    let mut hasher = DefaultHasher::new();
    tenant_id.hash(&mut hasher);
    (hasher.finish() as usize) % number_of_shards
}
// Usage
let shard_id = tenant_based_shard("tenant123", 10);
println!("Shard ID for tenant 'tenant123': {}", shard_id);
- Pros: Data isolation, scalability per tenant, easier backup and restore.
- Cons: Overhead in managing multiple tenants, potential underutilization of resources.
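All of the schemes above share the same shape: a key goes in, a shard ID comes out. In a real application it can help to hide the chosen scheme behind a common trait so the rest of the code stays agnostic and schemes can be swapped or combined. A minimal sketch (the `ShardRouter` trait and router names are illustrative, not a standard API):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Common interface: every scheme maps a key to a shard ID.
trait ShardRouter {
    fn route(&self, key: &str) -> usize;
}

/// The key-based (hash) scheme behind the trait.
struct HashRouter {
    shards: usize,
}

impl ShardRouter for HashRouter {
    fn route(&self, key: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.shards
    }
}

/// The geographic scheme behind the same trait.
struct RegionRouter;

impl ShardRouter for RegionRouter {
    fn route(&self, key: &str) -> usize {
        match key {
            "North America" => 0,
            "Europe" => 1,
            "Asia" => 2,
            _ => usize::MAX, // Unknown or fallback shard
        }
    }
}

fn main() {
    // Callers depend only on the trait, so the scheme can be swapped freely.
    let routers: Vec<Box<dyn ShardRouter>> = vec![
        Box::new(HashRouter { shards: 10 }),
        Box::new(RegionRouter),
    ];
    for router in &routers {
        println!("shard = {}", router.route("Europe"));
    }
}
```

Trait objects (`Box<dyn ShardRouter>`) keep the routing decision a runtime choice; if the scheme is fixed at compile time, a generic parameter avoids the dynamic dispatch.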


