Why VQE Benchmarks Are So Hard to Reproduce — and How QEncode Fixes It

Quantum chemistry is one of the most promising near-term applications of quantum computing. The Variational Quantum Eigensolver (VQE) algorithm, introduced in 2014, has been the subject of hundreds of papers, dozens of hardware demonstrations, and significant investment from both academia and industry. Yet if you try to take a published VQE result and reproduce it — even with the same code and the same molecule — you will frequently fail.

This is not a fringe problem. A 2023 survey of quantum chemistry benchmark papers found that fewer than 30% provided enough information to independently reproduce the reported energy estimates. The reproducibility crisis that has affected classical computational science for decades is arriving in quantum computing — and arriving early, before the field has established the norms to handle it.

Understanding why this happens — and what a rigorous standard looks like — matters for anyone evaluating quantum algorithms or comparing competing approaches.

Why VQE results don't reproduce

The failure modes are consistent and predictable. They cluster around three root causes.

1. Underspecified ansatz construction

The ansatz — the parameterized circuit that VQE optimizes — is the most consequential design choice in the algorithm. A paper might report using "UCCSD" without specifying which Hartree-Fock reference state was used, how the excitation operators were ordered, whether spin symmetry was enforced, or how the Jordan-Wigner mapping was applied. These details change the circuit, which changes the result.
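Why operator ordering matters can be seen in a toy case. The sketch below is not UCCSD: it uses two single-qubit rotations as stand-ins for two non-commuting excitation exponentials and shows that applying them in opposite orders gives different unitaries. The helper names and the angles are illustrative assumptions.

```python
import cmath
import math


def matmul(a, b):
    """Multiply two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def rx(theta):
    """Rotation about X: exp(-i * theta * X / 2)."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * s], [-1j * s, c]]


def rz(phi):
    """Rotation about Z: exp(-i * phi * Z / 2)."""
    return [[cmath.exp(-1j * phi / 2), 0], [0, cmath.exp(1j * phi / 2)]]


def max_abs_diff(a, b):
    """Largest entry-wise difference between two 2x2 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(2) for j in range(2))


# Arbitrary illustrative angles, standing in for two ansatz parameters.
theta, phi = 0.3, 0.7

order_a = matmul(rx(theta), rz(phi))  # apply Rz first, then Rx
order_b = matmul(rz(phi), rx(theta))  # apply Rx first, then Rz

difference = max_abs_diff(order_a, order_b)
print(f"max entry difference between the two orderings: {difference:.4f}")
```

In a Trotterized UCCSD circuit the same effect appears whenever two excitation operators fail to commute, which is why a paper needs to state the ordering it used.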

Similarly, hardware-efficient ansatz descriptions often omit the specific gate set, entanglement topology, and number of layers — all of which directly determine the circuit's expressibility and therefore the achievable energy.

2. Hardware-specific transpilation

When circuits run on real quantum hardware, they must be transpiled, that is, compiled down to the native gate set and qubit connectivity of the specific device. The same logical circuit therefore transpiles to a different physical circuit on each device. A result obtained on an IBM Falcon processor cannot be directly compared to one from a Quantinuum H-series device, even for the same molecule and ansatz, because the compiled circuits differ.
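A toy cost model makes the connectivity effect concrete. The sketch below assumes a deliberately naive routing scheme (swap the qubits together, apply the gate, swap them back, with each SWAP counted as 3 CNOTs). Real transpilers route far more cleverly, so the numbers only illustrate the direction of the effect. All function names are hypothetical.

```python
def routed_cnot_count(logical_cnots, distance):
    """Count CNOTs after naive routing.

    A logical CNOT between qubits at distance d costs one CNOT, plus
    (d - 1) SWAPs to bring the qubits together and (d - 1) SWAPs to move
    them back, with every SWAP decomposed into 3 CNOTs.
    """
    total = 0
    for control, target in logical_cnots:
        swaps = 2 * (distance(control, target) - 1)
        total += 1 + 3 * swaps
    return total


def all_to_all(a, b):
    """Every pair of qubits is directly connected."""
    return 1


def linear_chain(a, b):
    """Qubits sit on a line; only nearest neighbours are connected."""
    return abs(a - b)


# The same logical circuit: qubit 0 controls CNOTs onto qubits 1, 2 and 3.
logical_circuit = [(0, 1), (0, 2), (0, 3)]

fully_connected_cost = routed_cnot_count(logical_circuit, all_to_all)
linear_cost = routed_cnot_count(logical_circuit, linear_chain)

print("CNOTs with all-to-all connectivity:", fully_connected_cost)  # 3
print("CNOTs on a linear chain:", linear_cost)                      # 21
```

Three logical two-qubit gates become 21 under this model on the linear chain, which is the kind of gap that makes raw hardware results hard to compare.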

Papers often report results from a specific hardware run without clearly separating the algorithmic performance from device-specific effects. This makes it impossible to know whether a reported improvement comes from the algorithm or from favorable hardware characteristics.

3. No standard error metric

Different papers use different metrics to evaluate VQE accuracy. Some report absolute energy error in Hartree. Others report the percentage of correlation energy recovered. Some use ground-state fidelity. Some report no error metric at all, only that the result is "close to FCI," the full configuration interaction energy that serves as the exact reference within a given basis set.

Without a standard metric, comparison across papers is meaningless. A result that looks better by one metric may be worse by another, and "chemical accuracy" (typically defined as 1.6 × 10⁻³ Hartree, or 1 kcal/mol) is often cited without being rigorously demonstrated.
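The disagreement between metrics is easy to demonstrate. The sketch below computes absolute error and correlation energy recovery for two made-up systems. The energies are illustrative placeholders, not real molecular data, and the function names are assumptions made for this example.

```python
CHEMICAL_ACCURACY_HARTREE = 1.6e-3      # about 1 kcal/mol
HARTREE_TO_KCAL_PER_MOL = 627.509       # standard conversion factor


def absolute_error(e_vqe, e_fci):
    """Absolute energy error in Hartree."""
    return abs(e_vqe - e_fci)


def correlation_recovery(e_vqe, e_hf, e_fci):
    """Percentage of the correlation energy (E_HF - E_FCI) recovered."""
    return 100.0 * (e_hf - e_vqe) / (e_hf - e_fci)


def within_chemical_accuracy(e_vqe, e_fci):
    """True if the absolute error is at or below 1.6e-3 Hartree."""
    return absolute_error(e_vqe, e_fci) <= CHEMICAL_ACCURACY_HARTREE


# Illustrative placeholder energies in Hartree (NOT real molecules).
system_a = {"e_hf": -1.000, "e_fci": -1.020, "e_vqe": -1.018}
system_b = {"e_hf": -7.000, "e_fci": -7.002, "e_vqe": -7.001}

for name, s in (("A", system_a), ("B", system_b)):
    err = absolute_error(s["e_vqe"], s["e_fci"])
    rec = correlation_recovery(s["e_vqe"], s["e_hf"], s["e_fci"])
    ok = within_chemical_accuracy(s["e_vqe"], s["e_fci"])
    print(f"{name}: error = {err:.4f} Ha "
          f"({err * HARTREE_TO_KCAL_PER_MOL:.2f} kcal/mol), "
          f"recovery = {rec:.0f}%, chemically accurate = {ok}")
```

System A recovers 90% of its correlation energy yet misses chemical accuracy, while system B recovers only 50% and still falls inside it. Which result looks "better" depends entirely on the metric chosen.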

What a rigorous benchmark standard requires

These are not unsolvable problems. Classical computational chemistry solved equivalent issues decades ago through standardized benchmarks — G2, G3, W4, GMTKN55 — that specify exact molecular geometries, basis sets, methods, and reference energies. Every new method is evaluated against the same standard problems with the same metrics.

Quantum algorithm benchmarking needs the same thing. A rigorous standard requires:

Fixed molecular geometries and Hamiltonians. The electronic Hamiltonian for each benchmark molecule must be generated from a specified geometry and basis set using a reproducible procedure.
Specified qubit mappings. The fermionic-to-qubit mapping must be declared (Jordan-Wigner, parity, Bravyi-Kitaev) with all reduction steps documented.
Exact reference energies. FCI energies computed with the same Hamiltonian serve as the ground truth against which VQE accuracy is measured.
A standard error metric. The energy gap, defined as the absolute difference between the VQE estimate and the FCI energy, gives a hardware-agnostic, universally interpretable accuracy measure.
Hardware-agnostic circuit metrics. Circuit depth and two-qubit gate count, measured before hardware transpilation, give a device-independent cost metric that enables fair comparison across platforms.
Managed, reproducible execution. All benchmark circuits should run on the same simulation infrastructure under identical conditions, with results independently verified.
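One way to picture these requirements is as a record that pins down the problem separately from the result. The sketch below is a hypothetical data model, not the QEncode schema. The field names, the H₂ bond length, and the energies are illustrative assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BenchmarkProblem:
    """Everything that must be fixed before two results can be compared."""
    molecule: str
    geometry_angstrom: tuple      # ((symbol, x, y, z), ...)
    basis: str
    qubit_mapping: str
    fci_energy_hartree: float     # exact reference for this Hamiltonian


@dataclass(frozen=True)
class BenchmarkResult:
    """One algorithm's outcome on one fully specified problem."""
    problem: BenchmarkProblem
    vqe_energy_hartree: float
    circuit_depth: int
    two_qubit_gates: int

    @property
    def energy_gap_hartree(self):
        """Absolute difference between the VQE estimate and FCI."""
        return abs(self.vqe_energy_hartree - self.problem.fci_energy_hartree)


def comparable(a, b):
    """Two results are comparable only if the problem is identical."""
    return a.problem == b.problem


# Illustrative values only; not taken from any published specification.
h2_jw = BenchmarkProblem(
    molecule="H2",
    geometry_angstrom=(("H", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 0.735)),
    basis="sto-3g",
    qubit_mapping="jordan-wigner",
    fci_energy_hartree=-1.1373,
)
h2_parity = replace(h2_jw, qubit_mapping="parity")

result_a = BenchmarkResult(h2_jw, -1.1370, circuit_depth=40, two_qubit_gates=30)
result_b = BenchmarkResult(h2_jw, -1.1360, circuit_depth=12, two_qubit_gates=8)
result_c = BenchmarkResult(h2_parity, -1.1371, circuit_depth=20, two_qubit_gates=14)

print(comparable(result_a, result_b))  # same problem definition
print(comparable(result_a, result_c))  # different qubit mapping
```

Freezing the problem record means any change to geometry, basis set, or mapping produces a different problem, so a mismatch is caught before two numbers are put side by side.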

How QEncode Suite v2 implements this

QEncode Suite v2 is a benchmark specification and managed execution service built around exactly these requirements. The suite defines five benchmark molecules — H₂, LiH, HF, N₂, and BeH₂ — with fixed geometries, basis sets (STO-3G for smaller molecules, cc-pVDZ for larger), and exact FCI reference energies computed with PySCF.

Every submitted algorithm is evaluated under all three standard qubit encodings (Jordan-Wigner, parity, Bravyi-Kitaev) using Qiskit-generated Hamiltonians, so results are directly comparable across encodings. Circuit metrics are recorded post-transpilation at a fixed optimization level, giving a fair hardware-agnostic cost estimate.
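The three encodings differ in what each qubit stores. The sketch below is a plain-Python illustration of the textbook definitions, not the Qiskit procedure used by the suite. It maps one orbital occupation vector to qubit bits under each encoding. The function names are assumptions, the Bravyi-Kitaev helper only handles mode counts that are powers of two, and orbital and qubit ordering conventions in real libraries may differ.

```python
from itertools import accumulate


def jordan_wigner_bits(occupations):
    """Qubit j stores the occupation of orbital j."""
    return list(occupations)


def parity_bits(occupations):
    """Qubit j stores the parity of orbitals 0..j."""
    return [total % 2 for total in accumulate(occupations)]


def bravyi_kitaev_matrix(n_modes):
    """Build the Bravyi-Kitaev transformation matrix recursively (mod 2)."""
    if n_modes < 1 or n_modes & (n_modes - 1):
        raise ValueError("this sketch only supports power-of-two mode counts")
    matrix = [[1]]
    while len(matrix) < n_modes:
        size = len(matrix)
        top = [row + [0] * size for row in matrix]
        bottom = [[0] * size + row for row in matrix]
        # The last qubit of each block stores the parity of the whole block.
        bottom[-1] = [1] * size + matrix[-1]
        matrix = top + bottom
    return matrix


def bravyi_kitaev_bits(occupations):
    """Qubits store a mix of occupations and partial parities."""
    matrix = bravyi_kitaev_matrix(len(occupations))
    return [sum(m * n for m, n in zip(row, occupations)) % 2 for row in matrix]


occupations = [1, 0, 1, 0]  # orbitals 0 and 2 occupied

print("Jordan-Wigner:", jordan_wigner_bits(occupations))  # [1, 0, 1, 0]
print("Parity:       ", parity_bits(occupations))         # [1, 1, 0, 0]
print("Bravyi-Kitaev:", bravyi_kitaev_bits(occupations))  # [1, 1, 1, 0]
```

Because the same physical state lands on different bit patterns, the Hamiltonian terms and the circuits acting on them also differ between encodings, which is why evaluating under all three gives a fuller picture than any single one.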

The energy gap against FCI is the primary accuracy metric across all leaderboard categories. Results are signed with an Ed25519 key, producing a verifiable certification receipt that can be independently checked.

Why this matters for algorithm developers

If you're developing a new ansatz, optimizer, or error mitigation technique, the inability to make reproducible benchmark claims is a real problem. Reviewers and customers can't evaluate your results against the state of the art if everyone's using different benchmarks. Worse, it's easy to overfit to a specific benchmark setup and report numbers that don't generalize.

A certified QEncode result gives you a reproducible, independently verified benchmark claim that you can publish, share, and defend — because it was produced by the same standard infrastructure as every other entry on the leaderboard. When your result appears on the leaderboard, the comparison is meaningful because the benchmark conditions are identical.

Reproducibility isn't just a scientific virtue. For quantum computing to make its case to industry, results need to be verifiable. QEncode Suite v2 is designed to make that possible.

Read the full benchmark specification

The QEncode Suite v2 specification documents all molecule geometries, Hamiltonian generation procedures, encoding definitions, metric formulas, and certification requirements.
