CUSP Reproducibility Audit
Date: 2026-04-10
Status: Repo-structure audit plus initial observation-build execution audit for Phase 3
Scope
This is an initial source-by-source reproducibility audit based on the current repository contents.
It is not yet a full execution audit. In particular, this pass does not prove that every source-specific processing script runs cleanly end to end in the current environment. It audits what is present in the repo now so we can identify the largest reproducibility gaps before rebuilding artifacts.
This pass used:
- the internal source-summary table as the current included-source list for the observation-level release
- source directories under `data/`
- recursive checks for:
  - a `processed_*.csv` output
  - a processing script with a name containing `process`, `Processed`, or `Porcessed`
  - obvious local documentation artifacts such as `readme`, `metadata`, `guide`, `manifest`, or `science-metadata` files
For this audit, "documentation artifact present" is only a proxy for manual-step documentation. A source can have metadata and still be missing a clear rebuild recipe.
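The recursive checks listed above can be sketched as a small directory scan. This is a minimal illustration rather than the audit's actual tooling: the function names and the result keys are hypothetical, and it assumes processing scripts are `.py` files.

```python
from pathlib import Path

# Name fragments taken from the checks described in this audit.
SCRIPT_HINTS = ("process", "Processed", "Porcessed")
DOC_HINTS = ("readme", "metadata", "guide", "manifest", "science-metadata")


def audit_source_dir(source_dir: Path) -> dict:
    """Report which reproducibility artifacts exist under one source directory."""
    files = [p for p in sorted(source_dir.rglob("*")) if p.is_file()]
    return {
        "source": source_dir.name,
        # A processed_*.csv output anywhere below the source directory.
        "processed_csv": any(
            p.name.startswith("processed_") and p.suffix == ".csv" for p in files
        ),
        # Assumption: processing scripts are Python files.
        "script": any(
            p.suffix == ".py" and any(h in p.name for h in SCRIPT_HINTS)
            for p in files
        ),
        # Only a proxy: a metadata file does not guarantee a rebuild recipe.
        "docs": any(any(h in p.name.lower() for h in DOC_HINTS) for p in files),
    }


def audit_data_root(data_root: Path) -> list[dict]:
    """Audit every source directory under data_root in sorted order."""
    return [audit_source_dir(d) for d in sorted(data_root.iterdir()) if d.is_dir()]
```

Calling `audit_data_root(Path("data"))` would return one dictionary per source directory, which can then be compared against the included-source list.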
Headline Findings
- Current included release sources: `50`
- Included release sources with a checked-in `processed_*.csv`: `50 / 50`
- Included release sources with a checked-in processing script: `50 / 50`
- Included release sources with an obvious local reproducibility-documentation artifact: `16 / 50`
- Included release sources that likely still need explicit manual-step documentation: `34 / 50`
- Processed-but-not-included sources: `2`
  - `Beer_etal_2013`
  - `Sadeghi_etal_2023`
- Script-but-no-processed-output source: `1`
  - `Yi_etal_2020_ABoVE`
- Known skipped source with no current processed output in repo:
  - `Wilcox_2015`
Provisional Interpretation
- `Level A` reproducibility looks plausible for the current release model because every currently included source has both a processed CSV and a checked-in source-specific processing script.
- `Level B` reproducibility is not yet release-ready because most included sources still do not have an obvious local manual-step document or rebuild note.
- The biggest Phase 3 gap is not missing scripts. It is missing explicit per-source rebuild documentation and then verifying that those scripts still run successfully.
Initial Supported Combine Execution Audit
Date run: 2026-04-10
Execution target:
- `python cusp/combine_data.py`
Execution method:
- run in an isolated audit copy under `/tmp` so the checked-in release artifacts in the repo were not overwritten
Environment notes:
- `python` available from the current conda-based environment
- verified imports before the run:
  - `pandas`
  - `geopandas`
  - `numpy`
Artifacts checked:
- working observation table
- the internal source-summary table
Headline result:
- the observation build completed successfully in the isolated audit copy
- both checked-in build artifacts were reproduced semantically from the current processed source tables
- the rebuilt files were not byte-for-byte identical to the checked-in files
Observed rebuild results:
- rebuilt working observation table
  - `251,935` rows
  - `47` columns
  - `50` unique `source` values
  - date range `1962-08-15` to `2024-10-03`
- rebuilt the source-summary table
  - `50` rows
  - `6` columns
  - `50` unique `source` values
What differed from the checked-in files:
- working observation table
  - column order changed
  - row order changed
  - some numeric-looking values were written with different string formatting such as `25` vs `25.0`
- source-summary table
  - source row order changed
  - values matched after canonical sorting and numeric normalization
What matched after canonical comparison:
- working observation table
  - same column set
  - same per-source row counts
  - same date range
  - semantically equal after canonicalizing column order, row order, nulls, and numeric formatting
- source-summary table
  - same column set
  - same integer fields
  - same bounding-box areas within a tight numeric tolerance
  - semantically equal after sorting by `source` and canonicalizing numeric formatting
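The canonical comparison described above can be sketched with the standard library alone. This is a minimal illustration, not the audit's actual harness: the function names are hypothetical, the set of markers treated as null is an assumption, and the sketch does not apply the numeric tolerance used for bounding-box areas.

```python
import csv
import io


def canonical_value(text: str) -> str:
    """Normalize one CSV cell so that 25 and 25.0 compare equal."""
    stripped = text.strip()
    # Assumption: these markers all mean "missing".
    if stripped in ("", "NA", "NaN", "nan"):
        return ""
    try:
        number = float(stripped)
    except ValueError:
        return stripped  # dates and free text are left untouched
    return repr(number)


def canonical_rows(csv_text: str) -> tuple[list[str], list[tuple[str, ...]]]:
    """Return sorted column names and sorted, normalized rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = sorted(reader.fieldnames or [])
    rows = sorted(
        tuple(canonical_value(row[col] or "") for col in columns) for row in reader
    )
    return columns, rows


def semantically_equal(first_csv: str, second_csv: str) -> bool:
    """True when two CSV texts match after canonicalization."""
    return canonical_rows(first_csv) == canonical_rows(second_csv)
```

One caveat of this approach: identifier columns that look numeric (for example zero-padded site codes) would also be normalized as numbers, so a real check may need a per-column list of fields to leave as text.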
Interpretation:
- the current observation-build path is functionally reproducible for the working observation table and source-summary table
- the current observation-build path is not yet deterministic at the file-layout level
- the most likely causes are:
  - unsorted source discovery via `os.listdir(...)`
  - column-order drift caused by concatenating wide source tables in source-discovery order
  - mixed-type column formatting drift during CSV round-tripping
Release implication:
- this is good enough to keep moving through Phase 3
- before release, the observation-build path should be hardened so official observation-level artifacts rebuild deterministically, not just semantically
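The first two suspected causes both come from depending on filesystem listing order. A minimal sketch of removing that dependency follows; the helper names are hypothetical, and sorting columns alphabetically is only one possible stable order, not necessarily the one the combine script should adopt.

```python
import os


def discover_sources(data_root: str) -> list[str]:
    """Return source directory names in sorted order.

    os.listdir() makes no ordering guarantee, so sorting makes the
    discovery order independent of the filesystem.
    """
    return sorted(
        name
        for name in os.listdir(data_root)
        if os.path.isdir(os.path.join(data_root, name))
    )


def stable_column_order(tables: list[list[str]]) -> list[str]:
    """Union of all column names, sorted (an assumed canonical order)."""
    return sorted({column for columns in tables for column in columns})
```

With both helpers in place, the output layout no longer changes when the same source tables are discovered in a different order.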
Pilot Source-Script Execution Audit
Date run: 2026-04-10
Execution method:
- run selected `process_*.py` scripts in isolated audit copies under `/tmp` so the checked-in `processed_*.csv` artifacts in the repo were not overwritten
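The isolated-copy method can be sketched as follows. This is an illustrative harness under stated assumptions, not the one used for the audit: the function name and the `cusp_audit_` prefix are hypothetical, and it copies the whole repository so that scripts resolving paths from their own location still find their inputs.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_isolated_copy(repo_root: Path, script_relpath: str):
    """Copy the repo to a temp directory and run one script inside the copy.

    Returns the CompletedProcess and the path of the copy, so the rebuilt
    outputs can be compared with the checked-in files afterwards.
    """
    workdir = Path(tempfile.mkdtemp(prefix="cusp_audit_"))
    copy_root = workdir / "repo"
    shutil.copytree(repo_root, copy_root)
    result = subprocess.run(
        [sys.executable, script_relpath],
        cwd=copy_root,          # relative paths resolve inside the copy
        capture_output=True,    # keep stdout/stderr for warning review
        text=True,
    )
    return result, copy_root
```

Captured `stderr` is worth keeping alongside the exit code, since several later findings in this audit are warnings rather than failures.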
Pilot sources audited:
- `Koyukuk_2018`
- `Cable_2017`
- `Pastick`
- `Moore_et_al_2025`
- `Wagner_2019`
Headline result:
- `0 / 5` pilot scripts rebuilt successfully in the current environment without intervention
- the failures were informative and fell into a few clear categories rather than looking random
Failure categories observed:
- current-pandas script breakage
  - `Koyukuk_2018`
- path-portability bug in the script
  - `Cable_2017`
- mixed-type handling bug in the script
  - `Pastick`
- missing raw input data in the repo
  - `Moore_et_al_2025`
- environment dependency missing from the current runtime
  - `Wagner_2019`
Per-source notes:
- `Koyukuk_2018`
  - failed because chained assignment is used to replace `Y`/`N` in `pf_observed`
  - this breaks under the current pandas copy-on-write behavior and string dtype handling
- `Cable_2017`
  - failed because `process_network()` reads `network_sampling_sites.csv` as a bare relative path
  - the rest of the script uses `_ROOT_DIR`, so this looks like a straightforward portability bug rather than a missing-data issue
- `Pastick`
  - failed because the script assumes non-integer `pf_observed` values are strings and calls `.lower()` on a float/NA value
- `Moore_et_al_2025`
  - failed because the required raw input `ABoVE_Soil_ThawDepth_Moisture_Validation_V2.csv` is not present in the repo
  - this is a real reproducibility-input gap, not just a script quirk
- `Wagner_2019`
  - failed because `openpyxl` is not available in the current environment even though the script reads `.xlsx` files
  - this indicates environment drift between the documented environment and the environment actually used for the audit
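The path-portability failure noted for `Cable_2017` comes from a bare relative path being resolved against whatever the working directory happens to be. A minimal sketch of the usual remedy is below; `resolve_input_path` is a hypothetical helper, and `root_dir` stands in for a script-level constant such as `_ROOT_DIR`.

```python
from pathlib import Path


def resolve_input_path(path, root_dir) -> Path:
    """Resolve a possibly relative input path against a fixed root.

    Absolute paths are returned unchanged; relative paths are anchored at
    root_dir instead of the current working directory.
    """
    candidate = Path(path)
    if candidate.is_absolute():
        return candidate
    return Path(root_dir) / candidate
```

Anchoring at a fixed root makes the script behave the same whether it is launched from the repo root, from its own directory, or from an audit copy.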
Interpretation:
- the repo-level combine path is in better shape than the source-level rebuild path
- at least some currently included sources will require:
- script fixes for compatibility and portability
- explicit restoration/documentation of missing raw inputs
- environment validation against the documented dependency set
Release implication:
- Phase 3 should continue, but we should not yet claim that the observation-level release is source-by-source rebuildable from the current repo state
- the manifest should now be treated as a live blocker register rather than just an inventory
Positive-Control Rechecks After Targeted Script Fixes
Date run: 2026-04-10
Sources rechecked:
- `Koyukuk_2018`
- `Cable_2017`
- `Pastick`
- `Wagner_2019`
- `Moore_et_al_2025`
Result:
- all five sources rebuilt successfully in isolated audit copies after targeted script fixes, environment correction, or restoring a missing local input
- all five rebuilt `processed_*.csv` artifacts matched the checked-in outputs semantically, with `Pastick` requiring a small coordinate-rounding tolerance because of reprojection precision drift
- none of the rebuilt CSVs was byte-for-byte identical to the checked-in file
Per-source notes:
- `Koyukuk_2018`
  - fixed by replacing chained assignment on `pf_observed` with a pandas-safe normalization path
  - rebuilt output had the same row count (`372`) and matched semantically
- `Cable_2017`
  - fixed by resolving the network CSV path via `_ROOT_DIR` when a non-absolute path is passed
  - rebuilt output had the same row count (`19`) and matched semantically
  - date-parsing warnings are still emitted and should be cleaned up later for a quieter release workflow
- `Pastick`
  - fixed by normalizing mixed `pf_observed` encodings robustly across the projected-site shapefiles
  - rebuilt output had the same row count (`8,012`) and matched the checked-in file after applying a modest coordinate-rounding tolerance
  - the remaining drift appears to be tiny reprojection precision noise rather than a substantive data change
- `Wagner_2019`
  - no script patch was needed after the environment was corrected
  - the earlier failure was due to missing `openpyxl` in the shell environment used for the first audit pass
  - rerunning explicitly inside the `cusp2` environment succeeded, and the rebuilt output had the same row count (`143`) and matched semantically
- `Moore_et_al_2025`
  - no script patch was needed once the raw input file was available locally
  - rerunning explicitly inside the `cusp2` environment succeeded, and the rebuilt output had the same row count (`201,305`) and matched semantically
  - the raw input CSV is currently a local/external dependency rather than an in-repo source artifact because it is `111 MB` and intentionally gitignored
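The mixed-encoding problem behind the `Pastick` fix (string flags, integers, and float/NA values in the same column) can be illustrated with a type-aware normalizer. This is a sketch only: the function name, the accepted spellings, and the 1/0/None target encoding are assumptions, not the encoding the actual script uses.

```python
import math

# Assumed spellings; the real source files may use others.
_TRUE_TEXT = {"y", "yes", "true", "1"}
_FALSE_TEXT = {"n", "no", "false", "0"}


def normalize_pf_observed(value):
    """Map a mixed-type flag to 1, 0, or None without assuming it is a string."""
    # Missing values arrive as None or float NaN, so check before .lower().
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    if isinstance(value, (bool, int, float)):
        if value in (0, 1):
            return int(value)
        raise ValueError(f"unexpected numeric pf_observed value: {value!r}")
    text = str(value).strip().lower()
    if text == "":
        return None
    if text in _TRUE_TEXT:
        return 1
    if text in _FALSE_TEXT:
        return 0
    raise ValueError(f"unexpected pf_observed value: {value!r}")
```

Raising on unrecognized values, rather than silently returning a null, is a deliberate choice here so that a new encoding in the raw data surfaces as a visible rebuild failure.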
Interpretation:
- we now have positive-control evidence that at least some included source scripts can be brought into a release-ready state with relatively small fixes
- the remaining source-level audit work should continue to distinguish:
- simple script-compatibility fixes
- environment/dependency fixes
- missing-input-data blockers
Additional Source-Workflow Batch Audit
Date run: 2026-04-10
Batch audited:
- `Douglas_Koyukuk_2022`
- `James_2019`
- `James_2020`
- `Hanston_etal_2024`
- `Bakian_Dogaheh_2020`
Headline result:
- `5 / 5` sources in this batch are now execution-verified as semantic rebuilds
- `3 / 5` succeeded on the first `cusp2` audit run:
  - `James_2020`
  - `Hanston_etal_2024`
  - `Bakian_Dogaheh_2020`
- `2 / 5` needed small pandas-compatibility fixes before succeeding:
  - `Douglas_Koyukuk_2022`
  - `James_2019`
Per-source notes:
- `Douglas_Koyukuk_2022`
  - initially failed because `pf_observed` normalization used chained assignment
  - after patching to a pandas-safe replacement path, rebuild succeeded and matched semantically
- `James_2019`
  - initially ran but produced semantically wrong `pf_observed` values because chained assignment no longer updated the column under current pandas
  - after patching to `.loc[...]`, rebuild succeeded and matched semantically
- `James_2020`
  - rebuild succeeded and matched semantically
  - the 999/888 thaw-depth sentinels are now treated explicitly as 200 cm / 120 cm observation limits to stay consistent with centimeter-based thaw depths
  - a PROJ database warning was emitted during the run, but the shapefile still read successfully and the workflow completed
- `Hanston_etal_2024`
  - rebuild succeeded and matched semantically
- `Bakian_Dogaheh_2020`
  - rebuild succeeded and matched semantically
Interpretation:
- the next-tier included source workflows are still largely fixable with small compatibility patches rather than deep redesign
- the main recurring code-level risk so far is older pandas idioms that either fail outright or silently stop mutating data under current copy-on-write behavior
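The chained-assignment risk can be illustrated with a small sketch. The data and the `normalize_flags` helper are hypothetical, and the mapping-based assignment shown here is one single-step alternative; the audited scripts were patched with their own replacement paths, including `.loc[...]`.

```python
import pandas as pd

# Fragile pattern, shown as a comment only:
#   df["pf_observed"][df["pf_observed"] == "Y"] = 1
# The first indexing step can return a temporary object, so under
# copy-on-write the assignment never reaches df.


def normalize_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Replace Y/N flags with 1/0 in a single assignment on the frame."""
    out = df.copy()
    # map() builds a whole new column; unmapped values become missing.
    mapped = out["pf_observed"].map({"Y": 1, "N": 0})
    out["pf_observed"] = mapped.astype("Int64")  # nullable integer dtype
    return out


example = pd.DataFrame({"site": ["a", "b", "c"], "pf_observed": ["Y", "N", "Y"]})
normalized = normalize_flags(example)
```

Assigning a complete replacement column also sidesteps dtype conflicts that can arise when integers are written into a string-typed column one subset at a time.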
Additional Source-Workflow Batch Audit II
Date run: 2026-04-10
Batch audited:
- `Daanen_2017`
- `Wang_2018`
- `Brown_etal_2000_calm`
- `Ebel_2018`
- `Holloway_2019`
Headline result:
- `5 / 5` sources in this batch are now execution-verified as semantic rebuilds in `cusp2`
- `4 / 5` succeeded on the first confirmation rerun:
  - `Daanen_2017`
  - `Wang_2018`
  - `Brown_etal_2000_calm`
  - `Ebel_2018`
- `1 / 5` needed a small pandas/dtype cleanup before succeeding:
  - `Holloway_2019`
- `Holloway_2019` is the first source in the recent audit batches to rebuild byte-for-byte identically after patching
Per-source notes:
- `Daanen_2017`
  - rebuild succeeded and matched semantically
  - the rebuilt CSV was not byte-identical, but no substantive differences were detected
- `Wang_2018`
  - rebuild succeeded and matched semantically
  - the rebuilt CSV was not byte-identical, but no substantive differences were detected
- `Brown_etal_2000_calm`
  - rebuild succeeded and matched semantically
  - the invalid-escape warning paths were cleaned up later the same day
- `Ebel_2018`
  - rebuild succeeded and matched semantically
  - the geometry-assignment warning path was cleaned up later the same day
- `Holloway_2019`
  - initially failed because the year-specific `pf_observed` and `pf_depth` cleanup no longer played nicely with current pandas/dtype behavior
  - after normalizing those conversions explicitly, rebuild succeeded
  - the rebuilt CSV was byte-for-byte identical to the checked-in output
Interpretation:
- the source-processing layer continues to look recoverable with relatively small fixes rather than major redesign
- warning cleanup is now becoming a more prominent next-tier task once outright execution failures are removed
- older pandas-style implicit conversions remain the main source of real breakage
Additional Source-Workflow Batch Audit III
Date run: 2026-04-10
Batch audited:
- `Bonaventure_Whati`
- `Jones_2025`
- `Jones_Jones_2025`
- `Jorgenson_Kanevskiy_2022_Gosling`
- `Jorgenson_Kanevskiy_2022_Jago`
Headline result:
- `5 / 5` sources in this batch are execution-verified as clean rebuilds in `cusp2`
- all five rebuilt outputs were byte-for-byte identical to the checked-in CSVs
- no warnings were emitted during these runs
Per-source notes:
- `Bonaventure_Whati`
  - rebuild succeeded
  - rebuilt output was byte-identical to the checked-in CSV
- `Jones_2025`
  - rebuild succeeded
  - rebuilt output was byte-identical to the checked-in CSV
- `Jones_Jones_2025`
  - rebuild succeeded
  - rebuilt output was byte-identical to the checked-in CSV
- `Jorgenson_Kanevskiy_2022_Gosling`
  - rebuild succeeded
  - rebuilt output was byte-identical to the checked-in CSV
- `Jorgenson_Kanevskiy_2022_Jago`
  - rebuild succeeded
  - rebuilt output was byte-identical to the checked-in CSV
Interpretation:
- some of the newer included source workflows are already in very strong shape for release
- this batch increases confidence that not all remaining Phase 3 work will require code fixes
- the highest-value next step is to keep pushing through included sources that look similarly likely to be clean or near-clean
Additional Source-Workflow Batch Audit IV
Date run: 2026-04-10
Batch audited:
- `Petrone_etal_2016`
- `Scheer_etal_2023`
- `Schwenk_PFRR`
- `Seward_2022`
- `Jorgenson_Kanevskiy_2025`
Headline result:
- `5 / 5` sources in this batch are now execution-verified as semantic rebuilds in `cusp2`
- `4 / 5` rebuilt cleanly without code changes:
  - `Petrone_etal_2016`
  - `Scheer_etal_2023`
  - `Schwenk_PFRR`
  - `Seward_2022`
- `1 / 5` needed a small pandas-compatibility patch before succeeding:
  - `Jorgenson_Kanevskiy_2025`
- none of these rebuilt files was byte-for-byte identical
Per-source notes:
- `Petrone_etal_2016`
  - rebuild succeeded and matched semantically
  - no warnings were emitted
- `Scheer_etal_2023`
  - rebuild succeeded and matched semantically
  - no warnings were emitted
- `Schwenk_PFRR`
  - rebuild succeeded and matched semantically
  - no warnings were emitted
  - the initial failed audit attempt was only an audit-harness path mistake, not a source-script problem
- `Seward_2022`
  - rebuild succeeded and matched semantically
  - no warnings were emitted
- `Jorgenson_Kanevskiy_2025`
  - initially failed because the script assumed grouping columns were still present inside `groupby.apply()`
  - after patching the summarization function to recover group keys safely, rebuild succeeded and matched semantically
  - the later mixed-type CSV read warning and chained-assignment warning paths were cleaned up later the same day
Interpretation:
- the remaining included-source audit set continues to split into two manageable categories:
- workflows that are already reproducible but not byte-deterministic
- workflows that need small pandas-compatibility updates
- `groupby.apply()` behavior under current pandas is now another recurring compatibility theme to watch for in older scripts
Additional Source-Workflow Batch Audit V
Date run: 2026-04-10
Batch audited:
- `Minsley_2015`
- `Minsley_2017`
- `Minsley_2021`
- `Obu_etal_2016`
- `Natali_2023`
Headline result:
- `5 / 5` sources in this batch are execution-verified as semantic rebuilds in `cusp2`
- none of the rebuilt files was byte-for-byte identical
- `3 / 5` ran cleanly with no warnings:
  - `Minsley_2021`
  - `Obu_etal_2016`
  - `Natali_2023`
- `2 / 5` emitted warnings but still rebuilt semantically:
  - `Minsley_2015`
  - `Minsley_2017`
Per-source notes:
- `Minsley_2015`
  - rebuild succeeded and matched semantically
  - the later docstring escape issue and benign `openpyxl` workbook warning were cleaned up later the same day
- `Minsley_2017`
  - rebuild succeeded and matched semantically
  - the later date-parsing warning path was cleaned up by specifying the input format explicitly
- `Minsley_2021`
  - rebuild succeeded and matched semantically
  - no warnings were emitted
- `Obu_etal_2016`
  - rebuild succeeded and matched semantically
  - no warnings were emitted
- `Natali_2023`
  - rebuild succeeded and matched semantically
  - no warnings were emitted
Interpretation:
- the Minsley-family workflows are broadly reproducible already, with only warning cleanup standing between them and quieter release-grade execution
- warning-only issues are now becoming common enough that Phase 3 should start distinguishing:
- semantic rebuild success
- warning cleanup needed
- code patch required
Additional Source-Workflow Batch Audit VI
Date run: 2026-04-10
Batch audited:
- `Chapin_2025`
- `Kling_2025`
- `Ruess_2025`
- `Sadeghi_etal_2023`
- `Talucci_2024`
Headline result:
- `5 / 5` sources in this batch are now execution-verified as semantic rebuilds in `cusp2`
- `3 / 5` rebuilt semantically without code changes:
  - `Chapin_2025`
  - `Ruess_2025`
  - `Sadeghi_etal_2023`
- `2 / 5` needed small compatibility/path fixes before succeeding:
  - `Kling_2025`
  - `Talucci_2024`
- none of the rebuilt files was byte-for-byte identical
Per-source notes:
- `Chapin_2025`
  - rebuild succeeded and matched semantically
  - no warnings were emitted
- `Kling_2025`
  - initially failed because the script resolved its default CSV and metadata paths relative to the repo root instead of `data/Kling_2025`
  - after patching default path resolution, rebuild succeeded and matched semantically
  - no warnings were emitted
- `Ruess_2025`
  - rebuild succeeded and matched semantically
  - no warnings were emitted
- `Sadeghi_etal_2023`
  - rebuild succeeded and matched semantically
  - no warnings were emitted
- `Talucci_2024`
  - initially failed under current pandas behavior because a `groupby.apply()` filtering step dropped the grouping columns used later in the script
  - after replacing that step with a transform-based mask, rebuild succeeded and matched semantically
  - no warnings were emitted
Interpretation:
- the newer source workflows continue to confirm the main Phase 3 pattern:
- many scripts are already semantically reproducible
- the remaining breakages are usually small path or pandas-compatibility issues
- `groupby.apply()` and path resolution are now the two clearest recurring code-level themes to clean up proactively
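The transform-based mask mentioned for `Talucci_2024` can be sketched as below. The filtering criterion (a minimum number of rows per group), the function name, and the example columns are all hypothetical; only the general technique of replacing a `groupby.apply()` filter with a per-row mask comes from the audit notes.

```python
import pandas as pd


def keep_groups_with_min_rows(df: pd.DataFrame, key: str, min_rows: int) -> pd.DataFrame:
    """Keep rows whose group has at least min_rows rows.

    transform("size") returns one value per original row, so the result
    can be used as a boolean mask and every column, including the
    grouping column, is preserved.
    """
    group_sizes = df.groupby(key)[key].transform("size")
    return df[group_sizes >= min_rows]


example = pd.DataFrame({"site": ["a", "a", "b"], "depth": [1, 2, 3]})
filtered = keep_groups_with_min_rows(example, "site", 2)
```

Because the mask is applied to the original frame, later steps that rely on the grouping column keep working regardless of how `groupby.apply()` treats group keys in a given pandas version.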
Additional Source-Workflow Batch Audit VII
Date run: 2026-04-10
Batch audited:
- `Langer_etal_2020`
- `Patton_2021`
- `Peirce_2020`
- `Zhang_2019`
- `Zhao_2021`
Headline result:
- `5 / 5` sources in this batch are execution-verified as semantic rebuilds in `cusp2`
- none of the rebuilt files was byte-for-byte identical
- no warnings were emitted during these runs
Per-source notes:
- `Langer_etal_2020`
  - rebuild succeeded and matched semantically
- `Patton_2021`
  - rebuild succeeded and matched semantically
- `Peirce_2020`
  - rebuild succeeded and matched semantically
- `Zhang_2019`
  - rebuild succeeded and matched semantically
- `Zhao_2021`
  - rebuild succeeded and matched semantically
Interpretation:
- this batch is another strong sign that a large fraction of the remaining included sources are already reproducible enough for release once the tracker is caught up
- the remaining queue is increasingly concentrated in the older and more idiosyncratic workflows rather than the recent additions
Additional Source-Workflow Batch Audit VIII
Date run: 2026-04-10
Batch audited:
- `Hollingsworth_2005`
- `Jafarov_2016`
- `Kling_2016`
- `Myers-Smith_2005`
- `Whitley_2018`
Headline result:
- `5 / 5` sources in this batch are execution-verified as semantic rebuilds in `cusp2`
- none of the rebuilt files was byte-for-byte identical
- no warnings were emitted during these runs
Per-source notes:
- `Hollingsworth_2005`
  - rebuild succeeded and matched semantically
- `Jafarov_2016`
  - rebuild succeeded and matched semantically
- `Kling_2016`
  - rebuild succeeded and matched semantically
- `Myers-Smith_2005`
  - rebuild succeeded and matched semantically
- `Whitley_2018`
  - rebuild succeeded and matched semantically
Interpretation:
- even the older, messier-looking workflows are still frequently reproducible enough to keep in scope for v1
- the main residual work is increasingly about determinism, warning cleanup, and documentation rather than basic executability
Additional Source-Workflow Batch Audit IX
Date run: 2026-04-10
Batch audited:
- `Selawik`
- `Seward`
- `Smith_Burgess_2000`
- `Smith_Burgess_2002`
- `Walker_2022`
Headline result:
- `5 / 5` sources in this batch are execution-verified as semantic rebuilds in `cusp2`
- none of the rebuilt files was byte-for-byte identical
- no warnings were emitted during these runs
Per-source notes:
- `Selawik`
  - rebuild succeeded and matched semantically
- `Seward`
  - rebuild succeeded and matched semantically
- `Smith_Burgess_2000`
  - rebuild succeeded and matched semantically
- `Smith_Burgess_2002`
  - rebuild succeeded and matched semantically
- `Walker_2022`
  - rebuild succeeded and matched semantically
Interpretation:
- all currently included observation-level release sources have now been execution-verified as semantic rebuilds in isolated `cusp2` runs
- Phase 3 has shifted from “can these source workflows run?” to:
- how deterministic do we need the outputs to be?
- which warning paths should be cleaned up before release?
- how do we close the per-source manual-step documentation gaps?
Warning Cleanup And Deterministic Combine Audit
Date run: 2026-04-10
Scope:
- warning-heavy source workflows:
  - `Brown_etal_2000_calm`
  - `Cable_2017`
  - `Ebel_2018`
  - `Jorgenson_Kanevskiy_2025`
  - `Minsley_2015`
  - `Minsley_2017`
- combine-path determinism:
  - `cusp/combine_data.py`
Results:
- the warning-heavy scripts above were patched and rerun in `cusp2`
- all six now rerun without the tracked Python/pandas/GeoPandas warnings
- `James_2020` remains the notable residual warning path, but it is environment-level:
  - `ERROR 1: PROJ: proj_create_from_database: Open of .../share/proj failed`
- `cusp/combine_data.py` was updated to:
  - sort source discovery deterministically
  - concatenate with stable indexing
  - sort final observation rows by stable keys
  - sort source-summary rows by `source`
- two consecutive `cusp2` runs now produce identical hashes:
  - working observation table: `cf1f81cacdc0f1fb294043bf3fca444f147785c5d7626ff99c9ca9f32af6f109`
  - source-summary table: `02ef2c1555d06e93c98e8e393129d1911f767c36146c211b59c7db287a60e688`
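The 64-character hex digests above have the length of SHA-256 output, so the byte-level check between two runs can be sketched as follows. The algorithm is inferred from the digest length rather than stated in the audit, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Chunked reads keep memory flat for large observation tables.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def runs_are_byte_identical(first: Path, second: Path) -> bool:
    """True when two rebuilt artifacts have the same digest."""
    return sha256_of_file(first) == sha256_of_file(second)
```

Recording the digest of each official artifact alongside the release would let later rebuilds be checked against a fixed reference rather than only against each other.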
Interpretation:
- the main Phase 3 residuals are now:
- per-source manual-step documentation
- remaining source-specific byte-level nondeterminism
  - deferred source-policy questions such as the direct-observation status of `Sadeghi_etal_2023`
- the observation-build path is now strong enough to treat as byte-deterministic for repeated rebuilds in the audited environment
Included Release Sources With Script And Obvious Local Documentation Artifact
- `Cable_2017`: script present; metadata artifact present
- `Daanen_2017`: script present; metadata artifact present
- `Douglas_Koyukuk_2022`: script present; readme present
- `Hanston_etal_2024`: script present; readme present
- `Jones_2025`: script present; metadata artifact present
- `Jones_Jones_2025`: script present; metadata artifact present
- `Jorgenson_Kanevskiy_2022_Gosling`: script present; metadata artifact present
- `Jorgenson_Kanevskiy_2022_Jago`: script present; metadata artifact present
- `Koyukuk_2018`: script present; readme present
- `Pastick`: script present; metadata artifact present
- `Schwenk_PFRR`: script present; readme present
- `Seward_2022`: script present; bag/metadata artifacts present
- `Wang_2018`: script present; metadata artifact present
- `Petrone_etal_2016`: script present; metadata artifact present
- `Jorgenson_Kanevskiy_2025`: script present; metadata artifact present
- `Pawley_2018`: script present; metadata artifact present
Included Release Sources With Script But No Obvious Manual-Step Documentation Artifact Yet
- `Bakian_Dogaheh_2020`: script present; add explicit rebuild/manual-step note
- `Bonaventure_Whati`: script present; add explicit rebuild/manual-step note
- `Brown_etal_2000_calm`: script present; add explicit rebuild/manual-step note
- `Chapin_2025`: script present; add explicit rebuild/manual-step note
- `Ebel_2018`: script present; add explicit rebuild/manual-step note
- `Hollingsworth_2005`: script present; add explicit rebuild/manual-step note
- `Holloway_2019`: script present; add explicit rebuild/manual-step note
- `Jafarov_2016`: script present; add explicit rebuild/manual-step note
- `James_2019`: script present; add explicit rebuild/manual-step note
- `James_2020`: script present; add explicit rebuild/manual-step note
- `Kling_2016`: script present; add explicit rebuild/manual-step note
- `Kling_2025`: script present; add explicit rebuild/manual-step note
- `Langer_etal_2020`: script present; add explicit rebuild/manual-step note
- `Minsley_2015`: script present; add explicit rebuild/manual-step note
- `Minsley_2017`: script present; add explicit rebuild/manual-step note
- `Minsley_2021`: script present; add explicit rebuild/manual-step note
- `Moore_et_al_2025`: script present; add explicit rebuild/manual-step note
- `Myers-Smith_2005`: script present; add explicit rebuild/manual-step note
- `Natali_2023`: script present; add explicit rebuild/manual-step note
- `Obu_etal_2016`: script present; add explicit rebuild/manual-step note
- `Patton_2021`: script present; add explicit rebuild/manual-step note
- `Peirce_2020`: script present; add explicit rebuild/manual-step note
- `Ruess_2025`: script present; add explicit rebuild/manual-step note
- `Scheer_etal_2023`: script present; add explicit rebuild/manual-step note
- `Selawik`: script present; add explicit rebuild/manual-step note
- `Seward`: script present; add explicit rebuild/manual-step note
- `Smith_Burgess_2000`: script present; add explicit rebuild/manual-step note
- `Smith_Burgess_2002`: script present; add explicit rebuild/manual-step note
- `Talucci_2024`: script present; add explicit rebuild/manual-step note
- `Wagner_2019`: script present; add explicit rebuild/manual-step note
- `Walker_2022`: script present; add explicit rebuild/manual-step note
- `Whitley_2018`: script present; add explicit rebuild/manual-step note
- `Zhang_2019`: script present; add explicit rebuild/manual-step note
- `Zhao_2021`: script present; add explicit rebuild/manual-step note
Included Release Sources With Naming Or Traceability Risks Worth Cleaning Up
- script and processed-output filenames have now been standardized to lowercase `process_<source>.py` and `processed_<source>.csv` conventions across the included workflows
- `Yi_etal_2020_ABoVE` still has a source-name mismatch between the directory and the internal `source` variable, so that one remains a real traceability cleanup item
Excluded Or Incomplete Source Directories
- `Beer_etal_2013`: processed CSV and script are present, but `cusp/combine_data.py` currently skips it because it is an interpolated map product with no dates
- `Chen_2015`: removed from `data/` after duplicate review; retained in the master bibliography as a bibliographic-only source for synthesis traceability
- `Sadeghi_etal_2023`: processed CSV and script are present, but the source is currently excluded while its direct-observation status is reviewed
- `Yi_etal_2020_ABoVE`: processing script is present but no processed CSV is checked in; `cusp/combine_data.py` notes that it needs online processing because it is too large to load directly
- `Wilcox_2015`: currently skipped in `cusp/combine_data.py`; source files are present, but there is no checked-in processed CSV and the skip note says there are no lat/lon data for observations
Recommended Next Steps For Phase 3
- Add a short per-source rebuild note or header-standard metadata for the 34 included sources that currently have no obvious manual-step documentation artifact.
- Decide whether those per-source notes live in each source directory, in a central manifest, or both.
- Decide which source-level outputs need byte-for-byte determinism versus semantic-stability guarantees only.
- Revisit deferred sources like `Sadeghi_etal_2023` once the source-scope decision is made.
- Create or refine scripted checks that distinguish:
  - clean rebuild
  - semantic rebuild with accepted byte drift
  - warning-only rebuild
  - deferred / external-dependency workflow
- Clean up the script/source naming mismatches before automating rebuild checks.
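The scripted check that distinguishes rebuild outcomes could start from a small classifier like the one below. This is a sketch under stated assumptions: the names are hypothetical, the `FAILED` category is added here for completeness and is not one of the four listed above, and the precedence between categories is a choice the project would still need to confirm.

```python
from enum import Enum


class RebuildStatus(Enum):
    CLEAN = "clean rebuild"
    SEMANTIC = "semantic rebuild with accepted byte drift"
    WARNING_ONLY = "warning-only rebuild"
    DEFERRED = "deferred / external-dependency workflow"
    FAILED = "failed rebuild"  # assumption: not in the list above


def classify_rebuild(
    *,
    ran: bool,
    byte_identical: bool,
    semantically_equal: bool,
    warnings: list,
    needs_external_input: bool = False,
) -> RebuildStatus:
    """Map the observations from one audit run to a single status."""
    if needs_external_input:
        return RebuildStatus.DEFERRED
    if not ran or not semantically_equal:
        return RebuildStatus.FAILED
    if warnings:
        # Assumed precedence: warnings outrank byte-level cleanliness.
        return RebuildStatus.WARNING_ONLY
    if byte_identical:
        return RebuildStatus.CLEAN
    return RebuildStatus.SEMANTIC
```

Keeping the inputs as plain booleans means the same function can be fed by any harness that records exit status, comparison results, and captured warnings.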