Architecture & Differences from R GenomicSEM

This document explains why the Rust port runs faster than R GenomicSEM and where its algorithms or numerical outputs diverge from R.

It is written for two audiences:

Users deciding whether to switch from R GenomicSEM to the Rust port (gsemr R package, genomicsem Python package, or gsem CLI), and who need to know which results will be numerically identical to R and which will not.
Maintainers tracking design decisions and their numerical consequences across releases.

For a per-parameter compatibility reference (“does userGWAS(MPI=TRUE) work? is fix_measurement supported?”), see API_COMPAT.md.

Tests in crates/gsem/tests/r_validation.rs are the source of truth for every parity claim here. If this document contradicts the tests, trust the tests and file an issue.

1. What this is

A from-scratch reimplementation of R GenomicSEM in Rust. No FFI to R or lavaan, no BLAS/LAPACK dependency, no shared numerical code. Dense linear algebra runs against faer; parallelism uses rayon with per-call local thread pools.

Four crates implement the pipeline:

gsem-matrix — vech, near-PD projection, PSD smoothing
gsem-ldsc — LD Score Regression, block jackknife, HDL
gsem-sem — SEM engine: partable, implied covariance, L-BFGS DWLS/ML, sandwich SEs, fit indices
gsem — I/O, munge, sumstats merge, per-SNP GWAS pipelines, CLI

Three frontends sit on top of the same core:

gsemr — R package via extendr
genomicsem — Python package via pyo3 + maturin
gsem — CLI binary

The core runs the same code path regardless of frontend. The frontends differ only in how inputs are passed in and how results are serialized back (R data frame via JSON, Python objects via PyO3, or TSV to stdout).

Numerical methodology matches R GenomicSEM. The places where outputs do not match R byte-for-byte — sandwich vs information-matrix SEs, L-BFGS vs nlminb local minima on Heywood cases, the commonfactorGWAS parameterization — are enumerated in §3 with their justification.

2. Why it’s faster

2.1 No R interpreter and no lavaan per-SNP

The single biggest userGWAS speedup comes from eliminating lavaan::sem() from the per-SNP inner loop. R userGWAS calls lavaan::sem() once for every SNP. Each call:

re-parses the lavaan model syntax string,
rebuilds an lavaanModel S4 object and its parameter table,
constructs and validates model matrices from scratch,
runs the optimizer, and
runs standard-error post-processing.

Most of that work is invariant across SNPs — the model structure never changes — but the R/S4 machinery has no way to cache it, so it pays the cost per SNP.

The Rust port parses the model syntax once at the top of run_user_gwas, builds a ParTable once, and reuses it across every SNP (crates/gsem/src/gwas/user_gwas.rs:127). Only the per-SNP covariance matrices (S_full, V_full) are rebuilt per iteration, and only the numerically-essential parts of the model struct are re-materialized.

2.2 `fix_measurement` baseline pinning

When fix_measurement=TRUE (the default in both R and Rust), the measurement-model parameters are fit once on the no-SNP model and then held constant per SNP. R does this at the lavaan level: it still goes through a full lavaan::sem() call per SNP with fixed measurement parameters. The per-SNP work saved is optimizer iterations, not lavaan overhead.

The Rust port pins the baseline parameters directly in the ParTable (crates/gsem/src/gwas/user_gwas.rs:131-212), so the per-SNP fit is a tiny DWLS problem over only the SNP-related free parameters. The full sandwich standard-error computation still uses the entire parameter set (via a rebuilt “sandwich model” at :347-409) to match R’s SE behavior, but the optimization itself runs against the minimum free-parameter set. This is validated against R’s per-SNP numerics by test_user_gwas_per_snp_match_r.

2.3 `faer` linear algebra with SIMD

faer is a pure-Rust dense linear-algebra library with SIMD-accelerated kernels and parallel factorizations. It has no BLAS dependency, so there is no system-BLAS variability in results — a faer build will produce byte-identical output across platforms that share the same SIMD width. For the matrix sizes this library works with (single-digit to hundreds of rows), faer’s SIMD kernels are consistently faster than BLAS DGEMM dispatch, and the factorization routines (partial_piv_lu, self_adjoint_eigen) avoid repeated allocation overhead that shows up in tight loops.

2.4 Rayon thread pools instead of process clusters

The Rust port uses rayon for parallelism. Every user-facing function that wants threads builds its own local rayon thread pool for the duration of the call (crates/gsem/src/gwas/user_gwas.rs:214-246), rather than relying on a global thread pool. This means:

parallel=FALSE / cores=1 fully disables threading for that call without affecting other concurrent calls.
There is no cluster startup cost — the thread pool spins up in microseconds.
Workers share process memory, so there is no serialization overhead when combining results. At the end of the parallel loop rayon just collects the Vec<SnpResult> directly.

By contrast, R’s foreach + doParallel + makeCluster path:

starts fresh R worker processes (fork on Unix, spawn on Windows),
serializes the full worker environment for each chunk,
calls rbind on data frames from each worker to reassemble the result.

For a small GWAS the cluster overhead is a significant fraction of runtime; the MPI branch in userGWAS.R exists specifically because the authors expected large runs to go to distributed clusters.

2.5 Shared state, no data cloning on hot paths

The per-SNP loop borrows slices rather than cloning per-SNP Vec<f64> arrays (see commit 65974df). At 3 traits × 1.2M SNPs this eliminates about 60 MB of redundant heap data held resident during a GWAS run; at 10M SNPs it is ~500 MB. Small changes like this matter in aggregate because they reduce allocator pressure during the parallel inner loop, which is where rayon worker scheduling and memory locality start to determine end-to-end throughput.

2.6 Performance numbers

Measured on the PGC Anxiety / OCD / PTSD summary statistics (3 traits, ~4.9 million merged SNPs) with BENCH_CORES=4 on an Apple-silicon laptop. Lower is better; last column is R ÷ Rust.

Function	R GenomicSEM	Rust port	Speedup
`munge` (3 traits)	104.1s	21.1s	5×
`ldsc`	10.5s	1.8s	6×
`sumstats` (3 traits merged)	135.9s	24.1s	6×
`simLDSC`	54.3s	21.1s	2.6×
`commonfactor` (DWLS)	0.11s	<1ms	~50×
`usermodel` (DWLS, 1-factor)	0.06s	<1ms	~30×
`rgmodel`	0.23s	<1ms	~100×
`paLDSC` (r=100, p=0.95)	0.61s	<5ms	~200×
`userGWAS` (N=1 000)	6.3s	0.02s	373×
`userGWAS` (N=5 000)	36.9s	0.07s	567×
`userGWAS` (N=10 000)	89.7s	0.12s	748×
`userGWAS` (N=4 936 648, full)	not attempted	57.4s	—

The userGWAS rows are end-to-end per-SNP fits (not just the rayon loop). R’s fixed lavaan-per-SNP overhead scales linearly with N — ~9 ms per SNP — while Rust’s scales sub-linearly once the thread pool saturates, so the gap widens with bench size:

SNP count	R wall time	Rust wall time	Gap
1 000	6.3s	0.02s	373×
5 000	36.9s	0.07s	567×
10 000	89.7s	0.12s	748×
4.9M (full PGC)	—	57.4s	—

R GenomicSEM recommends MPI at full-scale GWAS and the upstream tutorial doesn’t attempt it on a single machine. gsemr’s 57-second full-scale run is on a single laptop with four worker threads.

The Python (genomicsem) and CLI (gsem) bindings measure within a few percent of the R-side Rust numbers on all operations — the Rust core is shared, so the binding layer adds only process-startup / GIL handoff overhead.

See bench/benchmark_perf.R for the reproducible benchmark harness and bench/benchmark_plots.pdf for the full cross-impl comparison including CLI and Python rows. The bench does full tolerance-based equivalence checking (R vs Rust) alongside timing on every function.

3. Algorithmic and numerical differences from R GenomicSEM

For the bulk of the pipeline — LDSC, the S/V/I matrices, single-factor and multi-factor SEM, usermodel, and userGWAS per-SNP point estimates — the Rust port produces results identical to R GenomicSEM within the numerical tolerances listed in §5. The differences below are the places where agreement is not byte-for-byte, and the reasoning for each choice.

3.1 SEM optimizer: L-BFGS (Rust) vs nlminb (R / lavaan)

R GenomicSEM uses lavaan::sem(), which defaults to nlminb, a trust-region quasi-Newton optimizer. The Rust port uses a hand-rolled L-BFGS implementation with projected lower bounds (see crates/gsem-sem/src/estimator.rs:102-320). L-BFGS was chosen because it has lower per-iteration cost, no dense Hessian storage (important when k grows), and is straightforward to run in parallel across independent per-SNP problems without shared state.

What this means for numerics:

On well-conditioned problems both optimizers reach the same global minimum to many decimal places. The R-parity tests test_sem_estimates_match_r and test_sem_2factor_all_params_match_r enforce agreement at per-parameter differences below 0.05 and chi-square agreement below 1e-4. These pass consistently.
On non-convex objectives (for example, a Heywood case where a residual variance wants to go negative), the two optimizers can land at different local minima. This is documented in §3.5.
L-BFGS with projected bounds does not implement a full active-set strategy. Parameters whose lower bounds are binding at the optimum are clamped; if the objective surface is genuinely worse without clamping, the reported estimate is the clamped one. lavaan’s nlminb behavior in the same situation is determined by its trust-region update and can differ.

3.2 Standard errors: sandwich (Rust) vs information-matrix (R / lavaan)

The Rust port always reports sandwich (robust) standard errors for SEM and per-SNP GWAS parameters. The implementation is at crates/gsem-sem/src/sandwich.rs:

bread = (Delta' * W * Delta)^(-1)
lettuce = W * Delta
Omega = bread * lettuce' * V * lettuce * bread
SE = sqrt(diag(Omega))

where Delta is the Jacobian of the implied covariance with respect to free parameters, W is the DWLS weight matrix, and V is the sampling covariance matrix of the observed S-vector.

R GenomicSEM passes through lavaan’s default se = "standard", which reports information-matrix SEs — essentially the bread term alone, without the V correction.

What this means for numerics:

On well-conditioned SEM problems (commonfactor, usermodel) the two agree closely, typically within a few percent.
On degenerate per-SNP GWAS surfaces (sample.nobs=2 in lavaan terms) they can differ by tens of percent. The Rust values are the correct robust SEs in the Huber sandwich sense; the R values are smaller because they don’t account for the sampling variation of the input covariance. Point estimates and z-statistics have the same interpretation in both packages, but a per-SNP SE column will not numerically match R.
The sandwich-SE code path is also the reason the Rust commonfactorGWAS and userGWAS paths are robust to sumstats differences that make R’s information-matrix SEs unstable. It is not an optional feature that can be switched off — the robust SEs are always computed because the full V is needed for the chi-square anyway.

3.3 `commonfactorGWAS` parameterization

This is the largest structural difference between the two packages.

R GenomicSEM::commonfactorGWAS runs lavaan internally with marker-indicator identification: the first loading is fixed to 1 and the factor variance is left free. On the degenerate sample.nobs=2 per-SNP surface, lavaan then appears to free F1~~F1 jointly with F1~SNP per SNP, producing a two-dimensional per-SNP optimization.

The Rust commonfactorGWAS defaults to fixed-variance identification (Identification::FixedVariance): the factor variance is fixed to 1 and all loadings are free. This matches R’s userGWAS per-SNP estimates exactly, which is validated by test_user_gwas_per_snp_match_r and test_commonfactor_gwas_per_snp_match_r.

An opt-in Identification::MarkerIndicator mode is available for users who want R-style anchoring. Its output is mathematically consistent with the FixedVariance path under the standard loading-rescaling identity, but it does not numerically match R’s commonfactorGWAS per-SNP estimates, because the per-SNP free-parameter set differs.

The underlying reason is that R’s own commonfactorGWAS and userGWAS outputs disagree in sign for roughly 80% of SNPs on the same data — the two parameterizations are not numerically equivalent on this kind of surface, even within R GenomicSEM. There is no single “correct” answer to match, so the Rust port defaults to the parameterization that matches R’s userGWAS.

Recommendation:

For parity with R userGWAS per-SNP estimates: use either Rust userGWAS or Rust commonfactorGWAS (default).
For parity with R commonfactorGWAS per-SNP signs and magnitudes specifically: no exact replacement is currently provided. The Identification::MarkerIndicator opt-in is numerically consistent with the lavaan marker-indicator model but not with lavaan’s per-SNP sample.nobs=2 behavior.

3.4 `userGWAS` fix_measurement strategy

Both packages support fix_measurement as a runtime speedup, but they implement it differently (see §2.2 above for the performance implication).

Numerically: when fix_measurement=TRUE, the Rust userGWAS matches R’s per-SNP point estimates exactly (within DWLS optimizer tolerance), as enforced by test_user_gwas_per_snp_match_r. The measurement parameters are identical, the per-SNP fits are against the same reduced free-parameter set, and the optimizer converges to the same point. This is the main per-SNP parity story for the Rust port.

When fix_measurement=FALSE the situation is the same as for general SEM fits: well-conditioned problems agree to many decimals, and Heywood-case problems can diverge at the optimizer-choice level (§3.1).

3.5 Heywood cases (negative residual variances)

Both packages allow negative residual variances by default — neither bounds theta away from zero in the SEM model. This is lavaan’s default behavior, and the Rust port preserves it so that users porting existing R scripts get matching results.

On a Heywood surface, two different optimizers (nlminb vs L-BFGS) can find different local minima. The Rust single-factor SEM and usermodel test fixtures cover both normal and Heywood situations and pass within tolerance; per-SNP fits on top of a Heywood baseline are more sensitive, which is one of the factors behind the commonfactorGWAS story in §3.3.

3.6 Near-PD smoothing

When the per-SNP S_full or V_full matrix is not positive definite, both R and Rust project it back to the nearest PSD matrix before the per-SNP fit. The Rust port uses a near_pd implementation in crates/gsem-matrix/ that matches R’s Matrix::nearPD to 1e-7 (test_near_pd_matches_r).

Smoothing is gated by the smooth_check parameter in userGWAS and commonfactorGWAS. When enabled, a log warning is emitted per SNP that needed smoothing so users can distinguish pathological inputs from numerical noise.

3.7 Genomic control modes

The Rust port supports three genomic control modes, matching R:

Standard (default) — divides the Z statistic by sqrt(i_ld_diag), which is the canonical LDSC GC adjustment.
Conservative — divides by i_ld_diag directly (a more aggressive shrinkage).
None — no adjustment.

String parsing accepts R’s spellings ("standard", "conserv", "conservative", "none", "off"). Unknown values fall back to Standard with no error, matching R’s permissive behavior.

Enforced parity: test_v_snp_standard_matches_r, test_v_snp_conservative_matches_r, test_v_snp_none_matches_r, test_z_pre_matches_r — all hold to 1e-12 / 1e-10.

3.8 Q_SNP heterogeneity statistic

The Rust port computes Q_SNP via eigendecomposition of the per-SNP V matrix (see crates/gsem/src/gwas/q_snp.rs):

eta = vech(S_subset - Sigma_subset)
Q_SNP = sum_i (u_i' * eta)^2 / lambda_i

where (u_i, lambda_i) are eigenvectors/eigenvalues of V_subset. This is the same formula used by GenomicSEM::userGWAS under the hood, so the outputs should match R to numerical tolerance on well-conditioned inputs. There is no dedicated Q_SNP R-parity test in the current test suite — add one if you rely on exact Q_SNP parity.

4. Memory profile at scale

For a typical 3-trait GWAS run, the Rust port uses 1.5-2 GB peak memory on 1.2M SNPs and finishes in a few minutes. R GenomicSEM on the same input takes hours on a single machine, which is why the R package documentation recommends MPI for anything larger than a few hundred thousand SNPs.

What grows with N_SNPs:

The sumstats input (MergedSumstats) — loaded fully into memory before the GWAS loop starts. Stored as a Structure-of-Arrays (parallel Vec<String> for rsIDs, Vec<u8> for A1/A2, flat Vec<f64> of length n * k for the beta/se matrices) to keep the per-SNP footprint near ~98 bytes. ~480 MB at 4.94M SNPs × 3 traits; ~100 MB at 1.2M SNPs × 3 traits.
The per-SNP results vector (Vec<SnpResult>) — held resident for the entire run, then copied into the frontend’s native result type (R list of column vectors, Python dict of NumPy arrays, or TSV on disk). ~400 MB at 1.2M SNPs × 3 traits.
Both the R and Python bindings now return per-SNP results as columnar containers (a named R list or a Python dict of NumPy arrays), allocating roughly 1× the final size with no intermediate JSON buffer. The earlier 3× JSON-encoding peak that used to dominate the R/Python hot path is gone.

What grows with k (traits):

Per rayon worker, a small S_full ((k+1)×(k+1)) and two larger kstar×kstar matrices (V_full and w_diag where kstar = (k+1)(k+2)/2). At k=3 this is well under 1 KB per worker. At k=20 it is ~2 MB per worker. At k=50 it is ~60 MB per worker, which starts to dominate memory on 32-thread boxes.
The w_diag matrix is currently stored as a dense matrix even though only its diagonal is used. This is the largest known opportunity for scaling to high-k GWAS — replacing it with a diagonal type would save both memory and sandwich-SE compute time. Not yet done.

The practical ceiling on the bindings is the resident Vec<SnpResult> plus whatever copy the frontend makes to build its return value. For a standard 3-trait × 1.2M SNP GWAS this is well under 2 GB. For 10M+ SNPs on a laptop, prefer the gsem CLI, which writes TSV directly and never materializes the full result in memory.

Streaming output from the bindings is a planned improvement but not currently implemented.

5. Numerical parity: what the test suite enforces

Every claim in §3 is backed by a test in crates/gsem/tests/r_validation.rs. The tests compare Rust output against an R GenomicSEM run on fixed fixture data (serialized as JSON) and enforce the following tolerances:

Component	Tolerance	Test function
`near_pd` (matrix smoothing)	1e-7	`test_near_pd_matches_r`
`vech` / `vech_reverse` (utilities)	1e-15	`test_vech_3x3_matches_r`, `test_vech_4x4_matches_r`
`cov_to_cor`	1e-12	`test_cov_to_cor_matches_r`
V_SNP construction (all 3 GC modes)	1e-12	`test_v_snp_standard_matches_r`, `_conservative_`, `_none_`
`S_Full` construction	1e-12	`test_s_full_matches_r`
`V_Full` construction	1e-10	`test_v_full_matches_r`
Z-statistic pre-GC adjustment	1e-10	`test_z_pre_matches_r`
V matrix reordering	1e-12	`test_v_reorder_matches_r`
1-factor SEM point estimates	0.05	`test_sem_estimates_match_r`
1-factor SEM model χ²	1e-4	`test_sem_estimates_match_r`
1-factor SEM SRMR	1e-6	`test_sem_estimates_match_r`
2-factor SEM all parameters	0.05	`test_sem_2factor_all_params_match_r`
2-factor SEM model χ²	1e-4	`test_sem_2factor_all_params_match_r`
`commonfactor` point estimates	0.02	`test_commonfactor_matches_r`
`commonfactor` SEs	0.01	`test_commonfactor_matches_r`
`commonfactor` CFI	0.01	`test_commonfactor_matches_r`
`commonfactor` SRMR	1e-6	`test_commonfactor_matches_r`
GWAS baseline (no-SNP) fit	0.002	`test_gwas_baseline_commonfactor_match_r`
`commonfactorGWAS` per-SNP estimates	0.01	`test_commonfactor_gwas_per_snp_match_r`
`commonfactorGWAS` marker-indicator mode	0.01	`test_commonfactor_gwas_marker_indicator_matches_r`
`userGWAS` per-SNP estimates	0.01	`test_user_gwas_per_snp_match_r`
`userGWAS` per-SNP χ²	0.5	`test_user_gwas_per_snp_match_r`

All 18 tests pass on the current tree. Pure linear-algebra utilities (vech, cov_to_cor, matrix construction) are enforced at 1e-10 or tighter because these are pure math with no estimation involved. SEM estimates and per-SNP GWAS estimates are enforced looser (0.01-0.05) because the optimizer and parameterization differences in §3.1 and §3.3 put a floor on how tight the agreement can be.

If you need tighter parity for a specific component, run the relevant test locally with the fixture you care about and open an issue with the observed discrepancy.

README.md — install instructions and usage examples
API_COMPAT.md — parameter-level compatibility reference (which R GenomicSEM arguments are implemented)
CONTRIBUTING.md — dev setup and contribution guide
crates/gsem/tests/r_validation.rs — the R-parity test suite
bench/benchmark_perf.R — the reproducible benchmark harness

This page is generated from ARCHITECTURE.md in the repository — edit it there.