Database sharding is a technique to scale out databases by breaking them into smaller, more manageable pieces called shards. It’s particularly useful for applications that need to handle large volumes of data and high throughput.
Let’s see how we can implement database sharding in Rust! 🦀
Sharding Schemes
There are several sharding schemes, each with its own advantages and use cases. Let’s explore the most common ones, each with a bare-bones code snippet:
1. Key-Based (Hash-Based) Sharding
In key-based sharding, a shard is determined by applying a deterministic hash function to a sharding key associated with each record. The hash output, taken modulo the number of shards, maps each key to a shard. This approach gives an even distribution of data across shards, provided the hash function spreads keys well.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
fn hash_based_shard<T: Hash>(key: &T, number_of_shards: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() as usize) % number_of_shards
}
// Usage
let shard_id = hash_based_shard(&"user123", 10);
println!("Shard ID for 'user123': {}", shard_id);
- Pros: Even distribution of data, simplicity.
- Cons: Rebalancing data can be challenging when adding or removing nodes.
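The rebalancing drawback comes from the modulo: changing the shard count remaps almost every key. Consistent hashing is the usual mitigation, since adding a shard only moves the keys nearest its points on the ring. A minimal sketch (the `HashRing` type and shard names here are illustrative, not a standard API):

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// A minimal consistent-hash ring: shards are placed at points on a ring
/// of u64 hash values, and each key maps to the first shard point at or
/// after its own hash (wrapping around at the end).
struct HashRing {
    ring: BTreeMap<u64, String>, // hash point -> shard name
}

impl HashRing {
    fn new() -> Self {
        Self { ring: BTreeMap::new() }
    }

    fn add_shard(&mut self, shard: &str) {
        // A few virtual nodes per shard smooth out the distribution.
        for vnode in 0..3 {
            let point = hash_of(&format!("{shard}-{vnode}"));
            self.ring.insert(point, shard.to_string());
        }
    }

    fn shard_for<T: Hash>(&self, key: &T) -> Option<&str> {
        let h = hash_of(key);
        // First point clockwise from the key's hash, wrapping to the start.
        self.ring
            .range(h..)
            .next()
            .or_else(|| self.ring.iter().next())
            .map(|(_, shard)| shard.as_str())
    }
}

fn main() {
    let mut ring = HashRing::new();
    ring.add_shard("shard-a");
    ring.add_shard("shard-b");
    ring.add_shard("shard-c");
    println!("'user123' -> {:?}", ring.shard_for(&"user123"));
}
```

Removing a shard is symmetric: delete its points and only the keys that mapped to them move to the next point on the ring.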
2. Range-Based Sharding
Range-based sharding involves dividing data into shards based on ranges of a certain key. Each shard holds the data for a specific range of values. For example, in a user database, one shard might hold users with IDs from 1 to 1000, another from 1001 to 2000, and so on.
fn range_based_shard(key: i32, range_size: i32, number_of_shards: usize) -> usize {
    // Assumes a non-negative key; each shard covers `range_size` consecutive keys,
    // with ranges wrapping around the available shards.
    ((key / range_size) as usize) % number_of_shards
}
// Usage
let shard_id = range_based_shard(12345, 1000, 10);
println!("Shard ID for key 12345: {}", shard_id);
- Pros: Easy to implement, efficient for range queries.
- Cons: Can lead to uneven data and load distribution if the data isn’t uniformly distributed.
3. Directory-Based Sharding
Directory-based sharding uses a lookup table to keep track of which shard holds which data. When a query comes in, the system consults the lookup table to determine where to route the query.
use std::collections::HashMap;
struct DirectorySharder {
    directory: HashMap<String, usize>, // Maps a key to a shard ID
}

impl DirectorySharder {
    fn new() -> Self {
        Self {
            directory: HashMap::new(),
        }
    }

    fn add_key(&mut self, key: &str, shard_id: usize) {
        self.directory.insert(key.to_string(), shard_id);
    }

    fn get_shard(&self, key: &str) -> Option<usize> {
        self.directory.get(key).cloned()
    }
}
// Usage
let mut sharder = DirectorySharder::new();
sharder.add_key("user123", 1);
println!("Shard ID for 'user123': {:?}", sharder.get_shard("user123"));
- Pros: Flexibility in data distribution, easy to add new shards.
- Cons: The lookup table can become a bottleneck if not managed properly.
4. Geographic Sharding
Geographic sharding involves distributing data based on geographic locations. This can be particularly useful for services that are region-specific and can significantly reduce latency by locating data closer to its users.
fn geographic_shard(region: &str) -> usize {
    match region {
        "North America" => 0,
        "Europe" => 1,
        "Asia" => 2,
        _ => usize::MAX, // Unknown or fallback shard
    }
}
// Usage
let shard_id = geographic_shard("Europe");
println!("Shard ID for Europe: {}", shard_id);
- Pros: Reduced latency for geographically distributed applications, improved local data compliance.
- Cons: Complexity in managing data consistency across regions.
5. Vertical Sharding
Vertical sharding, also known as functional partitioning, involves splitting a database into shards based on features or services. For example, user-related data might be stored in one shard, while product-related data might be stored in another.
fn vertical_shard(data_type: &str) -> usize {
    match data_type {
        "User Data" => 0,
        "Order Data" => 1,
        "Product Data" => 2,
        _ => usize::MAX, // Fallback shard for unknown data types
    }
}
// Usage
let shard_id = vertical_shard("Order Data");
println!("Shard ID for Order Data: {}", shard_id);
- Pros: Isolation of workloads, potential for performance optimization.
- Cons: Can lead to data duplication and complicates transactions that span multiple shards.
6. Tenant-Based Sharding (Multi-Tenancy)
In multi-tenant applications, data from different tenants (customers, organizations) is stored in separate shards. Each tenant’s data is isolated and can be managed independently.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn tenant_based_shard(tenant_id: &str, number_of_shards: usize) -> usize {
    // Simple hash-based approach for the tenant ID
    let mut hasher = DefaultHasher::new();
    tenant_id.hash(&mut hasher);
    (hasher.finish() as usize) % number_of_shards
}
// Usage
let shard_id = tenant_based_shard("tenant123", 10);
println!("Shard ID for tenant 'tenant123': {}", shard_id);
- Pros: Data isolation, scalability per tenant, easier backup and restore.
- Cons: Overhead in managing multiple tenants, potential underutilization of resources.
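All of the schemes above share the same shape: a key goes in, a shard ID comes out. In a real application it can help to hide the chosen scheme behind a common trait so the rest of the code stays agnostic and schemes can be swapped or combined. A minimal sketch (the `ShardRouter` trait and router names are illustrative, not a standard API):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Common interface: every scheme maps a key to a shard ID.
trait ShardRouter {
    fn route(&self, key: &str) -> usize;
}

/// The key-based (hash) scheme behind the trait.
struct HashRouter {
    shards: usize,
}

impl ShardRouter for HashRouter {
    fn route(&self, key: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.shards
    }
}

/// The geographic scheme behind the same trait.
struct RegionRouter;

impl ShardRouter for RegionRouter {
    fn route(&self, key: &str) -> usize {
        match key {
            "North America" => 0,
            "Europe" => 1,
            "Asia" => 2,
            _ => usize::MAX, // Unknown or fallback shard
        }
    }
}

fn main() {
    // Callers depend only on the trait, so the scheme can be swapped freely.
    let routers: Vec<Box<dyn ShardRouter>> = vec![
        Box::new(HashRouter { shards: 10 }),
        Box::new(RegionRouter),
    ];
    for router in &routers {
        println!("shard = {}", router.route("Europe"));
    }
}
```

Trait objects (`Box<dyn ShardRouter>`) keep the routing decision a runtime choice; if the scheme is fixed at compile time, a generic parameter avoids the dynamic dispatch.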


