Mar 2026 • 12 min read
Rewriting xxHash in Rust
A clean-room Rust reimplementation of xxHash: bit-exact parity across all four variants, NEON/SSE2/AVX2 SIMD paths, and comparable CLI-level throughput to the C reference on Apple Silicon.
Rewrite study
First benchmark pass: CLI-level throughput across four scenarios on Apple Silicon. The Rust implementation matches or exceeds the C reference on XXH64 at 16 MiB and trails by about 8% on XXH3_128 at 1 MiB.
- Parity: 508/508 tests
- XXH64 16 MiB: comparable to C
- XXH3_128 1 MiB: ~8% behind C
- Scenarios: 4 of 8 declared
- Samples: smoke-level (2 per tool)
- Bit-exact output across XXH32, XXH64, XXH3_64, and XXH3_128 for all tested input lengths, seeds, and streaming patterns.
- SIMD-optimized XXH3 paths for NEON (aarch64), SSE2, and AVX2 (x86_64) all produce bit-exact output matching the scalar fallback.
- On XXH64 at 16 MiB, the Rust and C implementations are comparable (~3,972 vs ~3,694 MB/s cross-run median CLI-level throughput).
- On XXH3_128 at 1 MiB, the C reference leads by about 8% (~448 vs ~414 MB/s).
- At 4 KiB payloads, all comparators converge to the same throughput floor (~2 MB/s), dominated entirely by process startup overhead.
I reimplemented the xxHash family of hash functions in Rust from the published specification, covering all four variants (XXH32, XXH64, XXH3_64, XXH3_128) plus a CLI tool with behavioral parity against the upstream xxhsum. Then I benchmarked the result against the C reference and two contrast comparators.
The short version: the Rust implementation produces bit-exact output for every variant and passes 508 parity tests -- 169 CLI behavioral tests and 339 library-level tests covering hash vectors, streaming equivalence, and SIMD parity. On CLI-level throughput, it matches or exceeds the C reference on XXH64 at 16 MiB and trails by about 8% on XXH3_128 at 1 MiB. At small payloads, process startup dominates and all tools converge to the same floor.
Correctness came first
Before touching benchmarks, I validated that the Rust hash core produces the same output as the C reference across every variant and edge case.
The parity suite covers 508 individual test points across two layers:
Library-level (339 tests): Hash vector validation for all four algorithms at boundary lengths (0, 1, 3, 4, 8, 16, 17, 128, 129, 240, 241, and larger), seeded variants with both default and non-zero seeds, streaming equivalence (reset/update/digest patterns match one-shot hashing across multiple chunking patterns), and SIMD-vs-scalar parity on all three instruction sets.
CLI-level (169 tests): Algorithm selection (31 tests), output format parity including GNU/BSD modes, --tag, --little-endian, and escaped filename handling (40 tests), input flow parity for named files, stdin, mixed files, and large file streaming (16 tests), file-list parity via --files-from (11 tests), and the full check-mode policy stack across --quiet, --status, --warn, --strict, and --ignore-missing (71 tests).
All 508 tests pass at the measured revision. (evidence: parity_summary.json)
The hash core
The xxHash algorithms are built on multiply-rotate-accumulate rounds. The XXH64 round function is compact:
```rust
#[inline(always)]
fn round64(acc: u64, input: u64) -> u64 {
    let acc = acc.wrapping_add(input.wrapping_mul(PRIME64_2));
    let acc = rotl64(acc, 31);
    acc.wrapping_mul(PRIME64_1)
}
```

wrapping_add and wrapping_mul make the integer overflow semantics explicit -- Rust's default arithmetic panics on overflow in debug builds, so wrapping must be opted into. The C reference uses implicit unsigned overflow, which is defined behavior in C but not in Rust's type system. Every arithmetic operation in the hash core uses the wrapping_* methods to match C semantics exactly.
XXH3 is more involved. The core accumulate step XORs input data against a secret buffer, then splits each 64-bit value into 32-bit halves and multiplies them:
```rust
pub fn accumulate_stripe_scalar(
    acc: &mut [u64; 8], stripe: &[u8],
    secret: &[u8], secret_offset: usize,
) {
    for i in 0..8 {
        let data_val = read_le_u64(stripe, i * 8);
        let secret_val = read_le_u64(secret, secret_offset + i * 8);
        let value = data_val ^ secret_val;
        acc[i ^ 1] = acc[i ^ 1].wrapping_add(data_val);
        acc[i] = acc[i].wrapping_add(
            (value & 0xFFFFFFFF).wrapping_mul(value >> 32),
        );
    }
}
```

The i ^ 1 index swap is a deliberate part of the XXH3 spec: data from lane 0 contributes to lane 1's accumulator and vice versa, which improves diffusion without an extra mixing step.
SIMD acceleration
The scalar accumulate function processes one 64-bit lane at a time. The SIMD variants process multiple lanes per instruction using platform-specific intrinsics.
NEON (aarch64) processes two 64-bit lanes per iteration using 128-bit vector registers:
```rust
pub unsafe fn accumulate_stripe_neon(
    acc: &mut [u64; 8], stripe: &[u8],
    secret: &[u8], secret_offset: usize,
) {
    for i in 0..4 {
        let lane = i * 2;
        let data_offset = lane * 8;
        let sec_offset = secret_offset + lane * 8;
        let data_vec = vld1q_u64(stripe.as_ptr().add(data_offset) as *const u64);
        let secret_vec = vld1q_u64(secret.as_ptr().add(sec_offset) as *const u64);
        let value = veorq_u64(data_vec, secret_vec);
        let value_lo = vmovn_u64(value);
        let value_hi = vshrn_n_u64(value, 32);
        let product = vmull_u32(value_lo, value_hi);
        let data_swapped = vcombine_u64(
            vget_high_u64(data_vec), vget_low_u64(data_vec),
        );
        let acc_vec = vld1q_u64(acc.as_ptr().add(lane));
        let result = vaddq_u64(vaddq_u64(acc_vec, data_swapped), product);
        vst1q_u64(acc.as_mut_ptr().add(lane), result);
    }
}
```

SSE2 (x86_64) uses the same 128-bit width but with _mm_* intrinsics. AVX2 widens to 256-bit __m256i registers, processing four 64-bit lanes per iteration and covering the eight accumulators in two passes instead of four. The runtime dispatches to AVX2 when available via is_x86_feature_detected!("avx2"), falling back to SSE2 or scalar.
All three SIMD paths produce bit-exact output matching the scalar reference -- verified by the 12 SIMD-vs-scalar parity tests.
CLI behavioral parity
The CLI tool achieves behavioral parity with the upstream xxhsum for the validated surface: algorithm selection (-H0 through -H128), seed support, file and stdin hashing, GNU and BSD output formats, little-endian output, escaped filename handling (newlines, carriage returns, backslashes), file-list input via --files-from, and the full check-mode policy stack.
Parity is validated through direct output comparison against the C reference binary (when XXHASH_REFERENCE_ROOT is set). Each test invokes both binaries with the same arguments and asserts identical stdout, stderr, and exit code.
Benchmark methodology
The benchmarks measure end-to-end CLI throughput: each comparator is invoked as an external process that reads a payload file and produces a digest on stdout. This captures the full cost of process startup, I/O, and hashing rather than isolating the hash function in a microbenchmark loop.
Comparators
| ID | Role | Version |
|---|---|---|
| c_xxhsum | Reference | xxhsum 0.8.3 (Yann Collet) |
| rust_xxhash_rs | Subject | xxhash-rs 0.1.0 |
| b3sum | Contrast | b3sum 1.8.3 |
| md5 | Contrast | macOS system /sbin/md5 |
c_xxhsum and rust_xxhash_rs are parity oracles: the harness verifies they produce the same digest before accepting timing samples. b3sum and md5 provide throughput context from different hash families.
Scenarios
| Scenario | Algorithm | Payload |
|---|---|---|
| xxh64-4k | XXH64 | 4 KiB |
| xxh64-1m | XXH64 | 1 MiB |
| xxh64-16m | XXH64 | 16 MiB |
| xxh3-128-1m | XXH3_128 | 1 MiB |
Each scenario uses 1 warmup iteration (discarded) followed by 2 measured iterations. The summary statistic is the median of measured samples. A hard correctness gate verifies c_xxhsum and rust_xxhash_rs agree on the output digest before timing results are accepted. Three pinned runs provide the cross-run medians reported below. (evidence: benchmark_summary.json)
Results
XXH64 at 16 MiB
At this payload size, process startup is a small fraction of total time, and the numbers primarily reflect hash throughput.
| Comparator | Median throughput |
|---|---|
| rust_xxhash_rs | ~3,972 MB/s |
| b3sum | ~3,965 MB/s |
| c_xxhsum | ~3,694 MB/s |
| md5 | ~532 MB/s |
The Rust implementation, C reference, and BLAKE3 all land in the same range (~3.7–4.0 GB/s), while MD5 trails at ~532 MB/s. The Rust and C xxHash numbers are close enough that run-to-run variance could change their relative order.
XXH3_128 at 1 MiB
| Comparator | Median throughput |
|---|---|
| c_xxhsum | ~448 MB/s |
| rust_xxhash_rs | ~414 MB/s |
| b3sum | ~333 MB/s |
| md5 | ~272 MB/s |
The C reference leads the Rust implementation by about 8% (~448 vs ~414 MB/s). Both exercise NEON-optimized XXH3 paths on this Apple Silicon host.
XXH64 at 1 MiB
| Comparator | Median throughput |
|---|---|
| c_xxhsum | ~565 MB/s |
| rust_xxhash_rs | ~472 MB/s |
| b3sum | ~424 MB/s |
| md5 | ~306 MB/s |
At 1 MiB, process startup is a larger fraction of measured time. The C reference leads the Rust implementation by about 16% (~565 vs ~472 MB/s), though some of that gap reflects startup and I/O variance rather than pure hash throughput differences.
XXH64 at 4 KiB
| Comparator | Median throughput |
|---|---|
| md5 | ~2.4 MB/s |
| c_xxhsum | ~2.2 MB/s |
| rust_xxhash_rs | ~2.0 MB/s |
| b3sum | ~1.7 MB/s |
At 4 KiB, process startup overwhelms the hash computation. All comparators converge to a similar throughput floor (~2 MB/s). These numbers say nothing about hash performance and are included only to illustrate the startup-dominated regime.
Where the throughput gap comes from
On XXH64 at 16 MiB, the Rust and C implementations are comparable because the algorithmic work is identical -- the same multiply-rotate-accumulate sequence -- and LLVM optimizes the Rust code much as Clang optimizes the C reference for simple integer arithmetic. The hash core is tight enough that the compiler backend dominates, not the source language.
On XXH3_128 at 1 MiB, the C reference leads by ~8%. XXH3's inner loop is more complex: it combines the XOR-split-multiply accumulate with a secret-derivation step and a scramble pass every 16 stripes. The C reference uses hand-tuned NEON intrinsics that Yann Collet has iterated on for years. The Rust implementation uses equivalent NEON intrinsics but may differ in the loop unrolling, register allocation, or instruction scheduling decisions the compiler makes, since the Clang and rustc frontends drive LLVM differently even though both share its backend. An 8% gap on a hot SIMD loop is within the range of such codegen differences.
At smaller payloads (1 MiB and 4 KiB), process startup and I/O overhead compress the apparent throughput numbers. The hash core's actual speed is masked by fixed costs that both implementations share equally.
Traceability
The repo includes publication tooling that enforces a chain from claims to evidence:
- claim_map.py validates that every material claim (methodology, benchmark, parity, limitation, licensing) has a corresponding evidence artifact at a pinned revision
- traceability_check.py verifies cross-artifact lineage: publication, parity, and benchmark artifacts must all reference the same measured commit
- generate_evidence.py collects test output, benchmark artifacts, and revision metadata into stable machine-readable files under publication/evidence/
This means you cannot update the code without also updating the evidence, and you cannot publish claims that reference a revision different from the one the evidence was collected at.
Limitations
- Single platform. All measurements were taken on one Apple Silicon host (arm64, macOS). x86_64 performance may differ, particularly for the SSE2/AVX2 XXH3 paths, which have not been benchmarked.
- CLI-level measurement. Process startup overhead dominates at small payload sizes and partially masks hash throughput differences at medium sizes. Library-level benchmarks would show the raw hash speed more clearly.
- Smoke-level sample counts. The pinned runs use 2 measured iterations per comparator per scenario. Production-grade studies would use higher sample counts for tighter confidence intervals.
- Partial scenario coverage. 4 of 8 declared scenarios are covered. The remaining four are declared in the manifest but not included in the pinned runs.
- Validated CLI surface only. Features outside the validated surface (e.g., the upstream --benchmark mode) are not implemented or tested.
- No production deployment evidence. Correctness and benchmark evidence demonstrate parity and baseline performance, not production readiness.
Licensing and clean-room boundary
This is a clean-room reimplementation. The hash algorithms were implemented from the published xxHash specification and the BSD-2-Clause-licensed reference library material. The CLI achieves behavioral compatibility through black-box observation of the upstream xxhsum tool, without translating or copying any GPL-licensed source code.
The upstream project has two license regimes: BSD-2-Clause for the xxHash library and specification (freely usable, informed the Rust hash core), and GPLv2 for the xxhsum CLI tool (treated as an external behavioral oracle only).
xxHash was created by Yann Collet. The Rust reimplementation is released under the MIT OR Apache-2.0 dual license. Zero external runtime dependencies.
How this was built
The implementation was built using Factory mission mode. The mission system planned the project across milestones (hash core, streaming API, SIMD acceleration, CLI tool, benchmark harness, publication), ran worker sessions for each feature, and executed scrutiny reviews and user-testing validators after every implementation step.
The test-to-source ratio is 3.1:1 by line count (11,372 test lines vs 3,676 source lines across 34 Rust files). That ratio reflects the parity-first approach: the majority of the engineering effort went into proving the implementation is correct, not into the implementation itself.
Reproducibility
The measured revision for all evidence is evidence-v1.
```shell
git clone https://github.com/sagaragas/xxhash-rs.git
cd xxhash-rs
git checkout evidence-v1
cargo build --workspace --release
cargo test --workspace --all-targets -- --test-threads=3
python3 publication/claim_map.py --verify
python3 publication/traceability_check.py
```

The evidence pack is committed under publication/evidence/ and includes parity test results, benchmark summaries with correctness gate outcomes, raw timing samples for three pinned claim-ready runs, and a claim-to-evidence map.