Aggregation Validation
Scope
This document records the first rebuild of the default 30m aggregation
workflow from the canonical observation-level table. The 30m aggregation is a
reproducible derivative, not an official versioned release artifact for v1.
Artifacts produced by python -m cusp.aggregate:
- data/aggregated_30m.csv
- data/aggregated_30m_membership.csv
- data/aggregated_30m_qc_flags.csv
- data/aggregated_30m_excluded_rows.csv
- data/aggregated_30m.gpkg
- data/aggregated_30m_manifest.json
Current Default Aggregation Behavior
The current aggregation path:
- reads CUSP observation rows from the working observation table
- requires deterministic cusp_obs_id values from the observation build
- assigns observations to deterministic projected grid cells in EPSG:3413
- exports the public aggregated artifacts back out in EPSG:4326 / WGS84 where geometry is written
- uses a 30 m cell size for the default 30m workflow
- separates aggregation groups by calendar year
- within each spatial cell-year group, links observations into temporal groups using a symmetric 31-day forward/backward rule
  - this corresponds to a 62-day total temporal window, implemented as a 31-day linkage threshold between neighboring observations in the same cell-year sequence
- aggregates across sources rather than restricting to within-source groups
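The spatial and calendar-year steps above can be sketched as follows. This is a minimal illustration, not the cusp.aggregate implementation: it assumes coordinates are already projected to EPSG:3413 metres, that the grid is anchored at the projection origin, and that each observation is a dict with hypothetical keys x, y, and year alongside cusp_obs_id.

```python
import math
from collections import defaultdict

CELL_SIZE_M = 30.0  # cell size of the default 30m workflow


def cell_index(x_m, y_m, cell_size=CELL_SIZE_M):
    """Return the (column, row) index of the grid cell containing a projected point.

    Assumption: the grid is anchored at (0, 0) in the projected CRS.
    floor() keeps cell edges consistent for negative coordinates.
    """
    return (math.floor(x_m / cell_size), math.floor(y_m / cell_size))


def group_by_cell_year(observations):
    """Group observation IDs by (column, row, year)."""
    groups = defaultdict(list)
    for obs in observations:
        key = (*cell_index(obs["x"], obs["y"]), obs["year"])
        groups[key].append(obs["cusp_obs_id"])
    return dict(groups)


# Hypothetical observations with projected coordinates in metres.
observations = [
    {"cusp_obs_id": "a", "x": 12.0, "y": 44.0, "year": 2019},
    {"cusp_obs_id": "b", "x": 29.9, "y": 31.0, "year": 2019},
    {"cusp_obs_id": "c", "x": 29.9, "y": 31.0, "year": 2020},
    {"cusp_obs_id": "d", "x": -0.1, "y": 31.0, "year": 2019},
]
cell_year_groups = group_by_cell_year(observations)
```

Observations a and b share a cell and a year, c falls in the same cell but a different year, and d lands in the neighboring cell, so the sketch yields three cell-year groups.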
Current Rebuild Snapshot
- aggregated_30m.csv
  - rows: 27,691
  - columns: cusp_30m_id, year, date, lat, lon, pf_observed, thaw_depth, pf_depth, obs_limit, method, aggregated_sources, n_grouped
- aggregated_30m_membership.csv
  - rows: 249,012
  - unique aggregated groups: 27,691
  - unique member observations: 249,012
- aggregated_30m_excluded_rows.csv
  - rows: 0
- aggregated_30m_qc_flags.csv
  - rows: 1,349
- aggregated_30m.gpkg
  - CRS: EPSG:4326
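The snapshot counts imply two invariants: every aggregated row has members (27,691 groups on both sides), and every member observation appears exactly once (249,012 rows and 249,012 unique observations). A minimal sketch of such a check follows. It assumes the membership table carries cusp_30m_id and cusp_obs_id columns, which this document does not confirm, and it works on rows already loaded as dicts (for example via csv.DictReader).

```python
def check_snapshot(aggregated_rows, membership_rows):
    """Return a dict of pass/fail results for two membership invariants.

    Assumption: both tables expose a "cusp_30m_id" column and the
    membership table exposes a "cusp_obs_id" column.
    """
    group_ids = {row["cusp_30m_id"] for row in aggregated_rows}
    member_group_ids = {row["cusp_30m_id"] for row in membership_rows}
    member_obs_ids = [row["cusp_obs_id"] for row in membership_rows]
    return {
        # Every aggregated group has members, and no membership row is orphaned.
        "groups_match": group_ids == member_group_ids,
        # No observation is assigned to more than one group.
        "each_obs_once": len(member_obs_ids) == len(set(member_obs_ids)),
    }


# Hypothetical miniature tables.
aggregated = [{"cusp_30m_id": "g1"}, {"cusp_30m_id": "g2"}]
membership = [
    {"cusp_30m_id": "g1", "cusp_obs_id": "a"},
    {"cusp_30m_id": "g1", "cusp_obs_id": "b"},
    {"cusp_30m_id": "g2", "cusp_obs_id": "c"},
]
snapshot_ok = check_snapshot(aggregated, membership)
```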
Output Semantics
- cusp_30m_id is deterministic and derived from the sorted set of member cusp_obs_id values.
- year is explicit in the output even though the public-facing artifact name is 30m.
- date is currently the latest observation date within the aggregated spatial-temporal group.
- pf_observed is currently the mean of the retained 0/1 observations, so mixed groups yield fractional values between 0 and 1, while retaining the field name pf_observed.
- method is preserved when all retained observations in the group share one method value; heterogeneous groups are labeled mixed, while truly unknown source-level methods can still remain unknown.
- aggregated_sources records the unique contributing source values for each aggregated row so downstream users can trace citation provenance.
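The field rules above can be illustrated with a small sketch. It is not the project's code: the SHA-1 hash and 16-character truncation used for cusp_30m_id are assumptions (the document only says the ID is derived from the sorted member IDs), as are the list representation of aggregated_sources and the reading of n_grouped as the member count.

```python
import hashlib
from datetime import date


def aggregate_group(members):
    """Collapse one spatial-temporal group of observations into one output row."""
    ids = sorted({m["cusp_obs_id"] for m in members})
    # Assumption: hash the sorted member IDs so the result ignores input order.
    digest = hashlib.sha1("|".join(ids).encode("utf-8")).hexdigest()[:16]
    latest = max(m["date"] for m in members)
    methods = {m["method"] for m in members}
    return {
        "cusp_30m_id": digest,
        "year": latest.year,
        "date": latest,  # latest observation date in the group
        "pf_observed": sum(m["pf_observed"] for m in members) / len(members),
        "method": methods.pop() if len(methods) == 1 else "mixed",
        "aggregated_sources": sorted({m["source"] for m in members}),
        "n_grouped": len(members),  # assumption: count of member observations
    }


# Hypothetical group with mixed presence values, methods, and sources.
members = [
    {"cusp_obs_id": "obs-2", "date": date(2019, 7, 20), "pf_observed": 1,
     "method": "probe", "source": "src_b"},
    {"cusp_obs_id": "obs-1", "date": date(2019, 7, 5), "pf_observed": 0,
     "method": "borehole", "source": "src_a"},
]
row = aggregate_group(members)
```

Because the ID depends only on the sorted member set, passing the same members in a different order yields the same cusp_30m_id.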
Current QC Flag Counts
- mixed_pf_observed: 615
- mixed_method: 329
- multi_date_window: 309
- mixed_source: 96
These are audit outputs, not automatic blockers.
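The four counts sum to 1,349, which matches the row count of aggregated_30m_qc_flags.csv and suggests one row per group-flag pair. The sketch below shows how such flags could be derived for a single group. The flag conditions are inferred from the flag names alone and are assumptions, not the definitions used by cusp.aggregate.

```python
from datetime import date


def qc_flags(members):
    """Return the audit flags raised by one aggregated group.

    Assumption: each flag fires when the group holds more than one
    distinct value of the corresponding field.
    """
    checks = [
        ("mixed_pf_observed", "pf_observed"),
        ("mixed_method", "method"),
        ("multi_date_window", "date"),
        ("mixed_source", "source"),
    ]
    return [flag for flag, field in checks
            if len({m[field] for m in members}) > 1]


# Hypothetical group: two dates and two presence values, one method, one source.
members = [
    {"pf_observed": 1, "method": "probe", "date": date(2019, 7, 5), "source": "src_a"},
    {"pf_observed": 0, "method": "probe", "date": date(2019, 7, 20), "source": "src_a"},
]
flags = qc_flags(members)
```

The flags are returned for review rather than used to drop the group, consistent with their role as audit outputs.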
Interpretation Notes
The current temporal rule is meant to prevent observations from very different parts of the thaw season from collapsing together just because they share a location-year cell.
This means the aggregation product is not a simple "all observations within 30 m and year" collapse. It is a spatial-plus-temporal aggregation intended to be more suitable for active-layer style modeling and comparison workflows.
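A minimal sketch of the temporal linkage shows why a cell-year can yield more than one aggregated row. It assumes the 31-day threshold is inclusive and that linkage is applied between consecutive dates in the sorted cell-year sequence; the function name and sample dates are hypothetical.

```python
from datetime import date, timedelta

LINK_THRESHOLD = timedelta(days=31)  # linkage threshold between neighboring observations


def temporal_groups(dates, threshold=LINK_THRESHOLD):
    """Split the dates of one cell-year into temporal groups.

    A date joins the current group when it falls within `threshold` of the
    previous date in sorted order; otherwise it starts a new group.
    Assumption: a gap of exactly 31 days still links.
    """
    groups = []
    for d in sorted(dates):
        if groups and d - groups[-1][-1] <= threshold:
            groups[-1].append(d)
        else:
            groups.append([d])
    return groups


# Hypothetical cell-year with early-season and late-season visits.
cell_year_dates = [
    date(2019, 9, 30),
    date(2019, 6, 1),
    date(2019, 6, 20),
    date(2019, 9, 15),
]
groups = temporal_groups(cell_year_dates)
```

Here the June visits link together (19 days apart) and the September visits link together (15 days apart), but the 87-day gap between them keeps early-season and late-season observations in separate groups.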
Legacy Artifact Cleanup
The legacy aggregation CSVs have been removed from the repo:
- aggregated_10000m_noyear.csv
- aggregated_1000m_year.csv
- aggregated_100m_year.csv
- aggregated_30m_year.csv
- aggregated_5000m_year.csv
- aggregated_500m_noyear.csv
Remaining Questions To Confirm
No open CRS decision remains for v1:
- aggregation distance is computed in projected EPSG:3413
- exported geometries remain in user-facing EPSG:4326