Observation Build Validation¶
Scope¶
This document records a manual QA pass on the rebuilt observation-level artifacts:
- the working observation table exported later as cusp_vX.Y.csv
- the internal all-fields review table
- the internal source-summary table
- the internal source-reference crosswalk
The goal of this pass was to validate that the current observation-build path produces a structurally sound observation-level release candidate before any aggregation work begins.
Environment¶
- combine rebuild executed in cusp2
- validation queries executed in cusp2
- source-processing layer treated as the current accepted input set, with deferred sources still excluded from the release set
Current Rebuild Snapshot¶
- working observation table
  - rows: 249,012
  - columns: 11
  - unique sources: 50
  - date range: 1961-09-01 to 2024-10-03
- internal all-fields review table
  - rows: 249,012
  - preserves source-specific/wide fields for provenance and review
- internal source-summary table
  - rows: 50
  - unique sources: 50
- internal source-reference crosswalk
  - rows: 50
  - one row per included source
  - filtered from cusp_sources_bibtex.csv to the current included-source set
- internal observation release manifest
  - generated from the build path
  - includes row counts, source counts, date range, file sizes, hashes, and generation timestamp for the observation-level artifacts
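The manifest fields listed above could be assembled along the following lines. This is a minimal sketch, not the code in build.py: the helper name manifest_entry, the column names source and date, the dictionary layout, and the choice of SHA-256 are all assumptions made for illustration.

```python
import csv
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def manifest_entry(path, source_col="source", date_col="date"):
    """Summarize one CSV artifact (hypothetical helper, not the build.py API)."""
    with open(path, "rb") as fh:
        # SHA-256 is an assumption; the document only says "hashes".
        digest = hashlib.sha256(fh.read()).hexdigest()
    with open(path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    # YYYY-MM-DD strings sort chronologically as plain text.
    dates = sorted(r[date_col] for r in rows if r.get(date_col))
    return {
        "file": os.path.basename(path),
        "rows": len(rows),
        "unique_sources": len({r[source_col] for r in rows}),
        "date_min": dates[0] if dates else None,
        "date_max": dates[-1] if dates else None,
        "size_bytes": os.path.getsize(path),
        "sha256": digest,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }


# Demonstration on a tiny file with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_observations.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["source", "date"])
        writer.writerow(["Source_A", "2001-07-15"])
        writer.writerow(["Source_B", "1999-08-01"])
    entry = manifest_entry(demo_path)
```

Comparing the sha256 value of two consecutive rebuilds is also one simple way to confirm that an artifact rebuilds deterministically.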
Checks That Passed¶
- required observation-level columns are present in the working observation table
- the working observation table now contains only the canonical observation-level fields:
  - cusp_obs_id
  - source
  - site_id
  - lat
  - lon
  - date
  - pf_observed
  - thaw_depth
  - pf_depth
  - obs_limit
  - method
- cusp_obs_id is now present, non-null, and unique across the canonical observation table
- pf_observed contains only 0 and 1 when non-null
- dates are parseable across the full table
- longitude and latitude ranges are within global bounds
- source coverage matches the current included-source set
- the working observation table and internal source-summary table rebuild deterministically in the audited environment
- the source-reference crosswalk rebuilds cleanly and has one unique row per included source
- the observation release manifest is now generated automatically by build.py
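Several of the structural checks above can be expressed as a short script. The sketch below is illustrative only: check_rows is a hypothetical helper, rows are assumed to be already-typed dictionaries, dates are assumed to be ISO YYYY-MM-DD strings, and the sample values (including the method value) are made up.

```python
from datetime import date

# The eleven canonical observation-level fields named in this document.
REQUIRED = ["cusp_obs_id", "source", "site_id", "lat", "lon", "date",
            "pf_observed", "thaw_depth", "pf_depth", "obs_limit", "method"]


def check_rows(rows):
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []
    ids = [r.get("cusp_obs_id") for r in rows]
    if any(i in (None, "") for i in ids):
        failures.append("cusp_obs_id has nulls")
    if len(set(ids)) != len(ids):
        failures.append("cusp_obs_id is not unique")
    for n, r in enumerate(rows):
        if set(r) != set(REQUIRED):
            failures.append(f"row {n}: unexpected column set")
        if r.get("pf_observed") not in (None, 0, 1):
            failures.append(f"row {n}: pf_observed is not 0/1")
        try:
            date.fromisoformat(r["date"])
        except (KeyError, TypeError, ValueError):
            failures.append(f"row {n}: date is not parseable")
        lat, lon = r.get("lat"), r.get("lon")
        if lat is None or lon is None or not (-90 <= lat <= 90 and -180 <= lon <= 180):
            failures.append(f"row {n}: coordinates missing or out of global bounds")
    return failures


# Made-up rows: one clean, one that violates several checks at once.
good = {"cusp_obs_id": "obs-1", "source": "Source_A", "site_id": "s1",
        "lat": 64.8, "lon": -147.7, "date": "2001-07-15", "pf_observed": 1,
        "thaw_depth": 55.0, "pf_depth": None, "obs_limit": None, "method": "probe"}
bad = dict(good, pf_observed=2, lat=95.0, date="2001-13-01")  # reuses obs-1 on purpose

clean_report = check_rows([good])
dirty_report = check_rows([good, bad])
```

Returning messages instead of raising lets one pass report every failed check rather than stopping at the first.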
Findings That Need Cleanup Before Release¶
1. A small number of citation metadata fields still need cleanup in the source-reference crosswalk¶
The crosswalk itself is structurally correct, but two sources still need citation metadata attention:
- Bonaventure_Whati: missing title
- Pastick: currently missing author, year, and title
These are documentation/citation cleanup items rather than observation-build failures, but they should be resolved before treating the crosswalk as release-ready.
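A check for this kind of gap can be kept very small. The following sketch assumes the crosswalk has columns named source, author, year, and title; the helper name and the sample rows are hypothetical placeholders, not actual crosswalk content.

```python
CITATION_FIELDS = ("author", "year", "title")  # fields named in this finding


def missing_citation_fields(crosswalk_rows):
    """Map each source to the citation fields that are empty or absent."""
    gaps = {}
    for row in crosswalk_rows:
        missing = [f for f in CITATION_FIELDS
                   if not str(row.get(f) or "").strip()]
        if missing:
            gaps[row["source"]] = missing
    return gaps


# Placeholder rows; not actual crosswalk content.
demo_crosswalk = [
    {"source": "Source_A", "author": "Author One", "year": "2001", "title": "A title"},
    {"source": "Source_B", "author": "Author Two", "year": "2010", "title": ""},
    {"source": "Source_C", "author": None, "year": None, "title": None},
]
gaps = missing_citation_fields(demo_crosswalk)
```

An empty result would indicate that every included source has all three fields populated.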
2. A small number of records are still deleted for missing coordinates¶
The canonical observation table has no missing coordinates after hard deletion. Current missing-coordinate deletion-log records are concentrated in a few sources:
- Minsley_2015: 15
- Zhao_2021: 12
- Hollingsworth_2005: 6
- Ruess_2025: 6
These do not look like widespread combine failures; they appear to be source-specific metadata gaps. They should be reviewed source by source and either:
- filled from source materials,
- explicitly accepted as coordinate-missing records, or
- excluded from the public observation-level release if coordinates are deemed required.
3. Some records still have missing site_id¶
Missing site_id values are concentrated in:
- Pawley_2018: 9,308
- Bonaventure_Whati: 145
- Koyukuk_2018: 56
- Douglas_Koyukuk_2022: 45
- Brown_etal_2000_calm: 38
Pawley_2018 is expected to have missing site_id values because the source
does not provide row-level site identifiers, and the processing script does not
assign synthetic IDs.
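Per-source counts like the ones above can be produced with a simple tally. This is a sketch under the assumption that rows are dictionaries with source and site_id keys and that a missing site_id is stored as None or an empty string; the helper name and sample rows are made up.

```python
from collections import Counter


def missing_site_id_by_source(rows):
    """Count rows with a missing site_id per source, largest count first."""
    counts = Counter(r["source"] for r in rows
                     if r.get("site_id") in (None, ""))
    return counts.most_common()


# Made-up rows for illustration.
demo_rows = [
    {"source": "Source_A", "site_id": None},
    {"source": "Source_A", "site_id": ""},
    {"source": "Source_A", "site_id": None},
    {"source": "Source_B", "site_id": None},
    {"source": "Source_C", "site_id": "site-7"},
]
missing_counts = missing_site_id_by_source(demo_rows)
```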
Findings That Look Diagnostic Rather Than Blocking¶
Duplicate-key groups are common in a few sources¶
Using the key:
- source
- site_id
- date
- lat
- lon
there are 14,999 rows participating in duplicate-key groups. These are
dominated by:
- Brown_etal_2000_calm
- Natali_2023
- Jafarov_2016
- Bakian_Dogaheh_2020
Inspection suggests these are often repeated observations at the same site/date or multiple values recorded under the same site/date identifier, not obviously accidental duplicated rows introduced by the combine step. This should remain a diagnostic QA check, but it is not currently being treated as a release blocker by itself.
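The count of rows participating in duplicate-key groups can be reproduced roughly as follows. The key columns come from the list above; the helper name, the in-memory row format, and the sample values are assumptions.

```python
from collections import Counter

KEY = ("source", "site_id", "date", "lat", "lon")


def key_of(row):
    """Build the duplicate-detection key for one row."""
    return tuple(row.get(k) for k in KEY)


def duplicate_key_rows(rows):
    """Return every row whose key is shared with at least one other row."""
    sizes = Counter(key_of(r) for r in rows)
    return [r for r in rows if sizes[key_of(r)] > 1]


# Made-up rows: three measurements share one key, one row is unique.
shared = {"source": "Source_A", "site_id": "s1", "date": "2001-07-15",
          "lat": 64.8, "lon": -147.7}
demo_rows = [
    dict(shared, thaw_depth=50.0),
    dict(shared, thaw_depth=55.0),
    dict(shared, thaw_depth=61.0),
    {"source": "Source_B", "site_id": "s9", "date": "2001-07-15",
     "lat": 68.6, "lon": -149.6, "thaw_depth": 40.0},
]
participating = duplicate_key_rows(demo_rows)
by_source = Counter(r["source"] for r in participating)
```

Because the key excludes the measured values, repeated observations with different thaw depths are counted here even though they are not exact duplicate rows, which is consistent with treating this as a diagnostic rather than a blocker.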
Swapped-coordinate heuristic is almost entirely a Brown_etal_2000_calm issue¶
The simple swapped-lat/lon heuristic flags 442 rows, all from
Brown_etal_2000_calm. Given the global historical scope of that source, these
are more likely to be heuristic false positives than actual swapped coordinates.
This should remain an audit output, not an automatic blocker.
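This document does not define the heuristic, so the sketch below shows only one plausible form of it: flag a row when its coordinates fall outside an expected bounding box but the swapped pair falls inside. The function name and the default bounding box are assumptions chosen for illustration, not the values used in the audit.

```python
def looks_swapped(lat, lon, lat_range=(45.0, 90.0), lon_range=(-180.0, 180.0)):
    """Flag a coordinate pair that only fits the expected box after swapping.

    The default box (a high-latitude band) is a hypothetical choice.
    """
    def inside(a, b):
        return (lat_range[0] <= a <= lat_range[1]
                and lon_range[0] <= b <= lon_range[1])

    return not inside(lat, lon) and inside(lon, lat)


# A high-latitude pair fits the box as given, so it is not flagged.
in_box = looks_swapped(64.8, -147.7)

# A mid-latitude pair is outside the box, but its swap fits, so it is flagged
# even though the coordinates may be perfectly valid.
mid_latitude = looks_swapped(35.0, 75.0)
```

The second case illustrates how a source with sites outside the expected region can trigger the flag without any coordinates actually being swapped.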
The source summary and source-reference crosswalk serve different roles¶
The source-summary artifact is a compact per-source QA summary. The source-reference crosswalk is the citation-facing one-row-per-source artifact filtered to the included release set.
Recommended Next Fixes¶
- Decide whether missing coordinates are acceptable in the public observation-level release.
- Add stable site_id values where feasible, especially for sources where the source clearly provides one or a synthetic transect/site identifier is appropriate, but treat remaining missing-site_id cases as warning-level issues rather than release blockers.
Status After Upstream QA Push¶
After pushing a substantial amount of QA/QC back into the individual
process_<source>.py scripts, the current observation-level build state is much
cleaner:
- no remaining missing_method flags
- no remaining unsupported method values in the canonical observation table
- no remaining zero_obs_limit flags
- no remaining (0,0) coordinate rows in the built observation table
- missing_site_id is no longer emitted as a build-level QC flag
- missing site_id remains accepted as a non-blocking source-level limitation where the original source does not provide one
Current remaining hard deletions are dominated by:
- source-level duplicate groups in Brown_etal_2000_calm, Jafarov_2016, and Bakian_Dogaheh_2020
- missing-coordinate rows in Minsley_2015, Zhao_2021, Hollingsworth_2005, and Ruess_2025
The duplicate-heavy sources are currently deferred for later source-level review rather than treated as public release blockers.
Interim Validation Verdict¶
The rebuilt observation-level bundle is structurally sound and suitable to use as the basis for continued release cleanup. The remaining issues are no longer combine-step failures. They are a short list of source-level cleanup items and citation-metadata gaps that should be resolved before treating the full observation-level release bundle as final.
Current Observation Build Behavior¶
The observation build path is now implemented in cusp/build.py.
cusp/combine_data.py is now only a compatibility wrapper around that logic.
The current behavior of the build path is:
- rebuild raw all-fields observations from the processed source tables
- normalize method into the controlled release vocabulary where possible
- write a canonical working observation table that contains only the required core columns
- write an all-fields table as the wide/provenance-preserving version
- write a source-reference crosswalk as the one-row-per-source citation mapping artifact
- write an observation release manifest as the observation-level artifact inventory and checksum manifest
- delete rows with:
  - missing coordinates
  - missing pf_observed
  - (0,0) coordinates
  - exact duplicates across the canonical required fields
- write a deletion log to record hard deletions and reasons
- write a QC flag log to record non-deletion issues such as:
  - missing method
  - obs_limit = 0
The current build assumption is that missing site_id is acceptable in the
canonical observation table, but missing coordinates and missing pf_observed
are not.
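The split between hard deletions and QC flags described above can be sketched as a single pass over the rows. This is not the logic in cusp/build.py: the function name, the deletion reason strings, and the decision to leave cusp_obs_id out of the exact-duplicate comparison are assumptions. The flag names missing_method and zero_obs_limit are the ones used elsewhere in this document.

```python
CANONICAL = ["cusp_obs_id", "source", "site_id", "lat", "lon", "date",
             "pf_observed", "thaw_depth", "pf_depth", "obs_limit", "method"]


def apply_release_rules(rows):
    """Return (kept_rows, deletion_log, qc_flags) for a list of row dicts."""
    kept, deletion_log, qc_flags, seen = [], [], [], set()
    for r in rows:
        lat, lon = r.get("lat"), r.get("lon")
        # Assumption: the ID is excluded when comparing for exact duplicates.
        key = tuple(r.get(c) for c in CANONICAL if c != "cusp_obs_id")
        if lat is None or lon is None:
            reason = "missing_coordinates"
        elif r.get("pf_observed") is None:
            reason = "missing_pf_observed"
        elif lat == 0 and lon == 0:
            reason = "zero_zero_coordinates"
        elif key in seen:
            reason = "exact_duplicate"
        else:
            reason = None
        if reason:
            deletion_log.append({"cusp_obs_id": r.get("cusp_obs_id"), "reason": reason})
            continue
        seen.add(key)
        kept.append(r)  # a missing site_id does not block the row
        if r.get("method") in (None, ""):
            qc_flags.append({"cusp_obs_id": r.get("cusp_obs_id"), "flag": "missing_method"})
        if r.get("obs_limit") == 0:
            qc_flags.append({"cusp_obs_id": r.get("cusp_obs_id"), "flag": "zero_obs_limit"})
    return kept, deletion_log, qc_flags


# Made-up rows exercising each rule once.
base = {"cusp_obs_id": "obs-1", "source": "Source_A", "site_id": None,
        "lat": 64.8, "lon": -147.7, "date": "2001-07-15", "pf_observed": 1,
        "thaw_depth": 55.0, "pf_depth": None, "obs_limit": None, "method": "probe"}
demo_rows = [
    base,
    dict(base, cusp_obs_id="obs-2"),                      # exact duplicate
    dict(base, cusp_obs_id="obs-3", lat=None),            # missing coordinate
    dict(base, cusp_obs_id="obs-4", lat=0.0, lon=0.0),    # (0,0) coordinates
    dict(base, cusp_obs_id="obs-5", pf_observed=None),    # missing pf_observed
    dict(base, cusp_obs_id="obs-6", date="2002-07-15", method=None, obs_limit=0),
]
kept, deletion_log, qc_flags = apply_release_rules(demo_rows)
```

Flagged rows stay in the kept set, while deleted rows appear only in the deletion log together with the reason for their removal.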
QA/QC Boundary¶
The intended long-term boundary is:
- process_<source>.py handles source-specific interpretation, source-level QA, sentinel handling, units, date assumptions, and within-source deduplication
- build.py handles cross-source consistency, canonical output shaping, global duplicate detection, and explicit release deletion/flag logs
This means contributor pull requests should ideally arrive with source-level judgment already encoded in the processing script, rather than relying on the final observation build step for source-specific interpretation.