Aggregation Guide¶

CUSP observations are densely sampled in some places and much sparser elsewhere. Aggregation is useful when a model assumes that observations are independent, when you want to reduce the influence of dense local sampling, or when you plan to join CUSP to environmental layers that are much coarser than individual field points.

The CUSP aggregation tool groups nearby observations within a chosen spatial and temporal window. The default settings are 30 m and 31 days, but you can set whatever distance and time limits are appropriate for your analysis.

What Aggregation Does¶

The aggregation workflow:

starts from a CUSP observation table
groups observations that fall in the same projected grid cell and date window
keeps annual separation so records from different years are not grouped together
allows grouping across sources
preserves provenance through a membership table
sets aggregated pf_observed to the mean of member 0/1 values
sets aggregated method to mixed when multiple methods are present
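
The last three rules can be illustrated with a minimal Python sketch. It is not the CUSP implementation: the `group_id` field, the example method names, and the reading of `n_grouped` as the member count are assumptions made for illustration. Only `cusp_obs_id`, `pf_observed`, `method`, and the `mixed` label come from this guide.

```python
from statistics import mean


def summarize_group(group_id, members):
    """Collapse member observations into one aggregated record.

    `members` is a list of dicts with cusp_obs_id, pf_observed (0 or 1),
    and method. `group_id` is a hypothetical identifier for this sketch.
    """
    methods = {m["method"] for m in members}
    record = {
        "group_id": group_id,
        # Assumption: n_grouped counts the member observations.
        "n_grouped": len(members),
        # Mean of 0/1 labels: 0 or 1 when all members agree, a fraction otherwise.
        "pf_observed": mean(m["pf_observed"] for m in members),
        "method": next(iter(methods)) if len(methods) == 1 else "mixed",
    }
    # Membership rows keep provenance: one row per original observation.
    membership = [
        {"cusp_obs_id": m["cusp_obs_id"], "group_id": group_id} for m in members
    ]
    return record, membership


# Hypothetical members; the method names are placeholders.
members = [
    {"cusp_obs_id": "obs-1", "pf_observed": 1, "method": "probe"},
    {"cusp_obs_id": "obs-2", "pf_observed": 0, "method": "borehole"},
    {"cusp_obs_id": "obs-3", "pf_observed": 1, "method": "probe"},
]
record, membership = summarize_group("g-001", members)
```

Here the aggregated `pf_observed` is 2/3 and `method` is `mixed`, which is why a fractional `pf_observed` is a quick signal that a group contains conflicting permafrost labels.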

Important default settings:

Setting	Default	Meaning
Distance threshold	`30 m`	Observations are grouped within projected 30 m grid cells unless you pass a different `--distance-m` value.
Temporal linkage	`31 days`	Within the same year and grid cell, observations can be linked when neighboring observation dates are no more than 31 days apart.
Effective total window	up to `62 days`	A group can include observations up to 31 days before and 31 days after its representative date.
Annual separation	preserved	Observations from different calendar years are not grouped together.
Grouping projection	`EPSG:3413`	Spatial grouping is computed in a projected Arctic coordinate system.
Output coordinates	`EPSG:4326`	Aggregated latitude and longitude are exported in WGS84.
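
To make the distance and temporal settings concrete, here is a rough sketch of grid-cell and date-gap grouping. It assumes coordinates are already projected to metres, that grid cells are aligned to the projection origin, and that linkage is a simple chain over sorted dates. The dictionary keys and the function name are hypothetical, and the sketch does not try to reproduce how the tool picks a representative date or bounds the total window.

```python
import math
from collections import defaultdict
from datetime import date


def assign_groups(observations, distance_m=30.0, link_days=31):
    """Return lists of observation ids that share a cell, year, and date chain.

    Each observation is a dict with hypothetical keys: id, x_m and y_m
    (projected metres, e.g. EPSG:3413), and obs_date (datetime.date).
    """
    buckets = defaultdict(list)
    for obs in observations:
        cell = (
            math.floor(obs["x_m"] / distance_m),
            math.floor(obs["y_m"] / distance_m),
        )
        # Keying on the year keeps different calendar years apart.
        buckets[(cell, obs["obs_date"].year)].append(obs)

    groups = []
    for members in buckets.values():
        members.sort(key=lambda o: o["obs_date"])
        current = [members[0]]
        for prev, nxt in zip(members, members[1:]):
            if (nxt["obs_date"] - prev["obs_date"]).days <= link_days:
                current.append(nxt)  # neighboring dates are close enough to link
            else:
                groups.append(current)  # gap too large: close group, start another
                current = [nxt]
        groups.append(current)
    return [[o["id"] for o in g] for g in groups]


obs = [
    {"id": "a", "x_m": 10.0, "y_m": 10.0, "obs_date": date(2020, 7, 1)},
    {"id": "b", "x_m": 20.0, "y_m": 5.0, "obs_date": date(2020, 7, 20)},
    {"id": "c", "x_m": 12.0, "y_m": 8.0, "obs_date": date(2020, 9, 15)},
    {"id": "d", "x_m": 12.0, "y_m": 8.0, "obs_date": date(2021, 7, 1)},
    {"id": "e", "x_m": 40.0, "y_m": 8.0, "obs_date": date(2020, 7, 1)},
]
groups = assign_groups(obs)
```

With the defaults, `a` and `b` share a group, `c` is split off by a 57-day gap, `d` is separated by year, and `e` falls in a different 30 m cell. Passing `distance_m=100` pulls `e` into the same group as `a` and `b`.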

Run The Default Aggregation¶

python -m cusp.aggregate
python -m cusp.qc validate-aggregated

Important Options¶

See all options with:

python -m cusp.aggregate --help

Common options:

Option	What it controls
`--input`	Observation-level table to aggregate.
`--output`	Aggregated CSV to write.
`--membership-output`	Table linking each original `cusp_obs_id` to an aggregated group.
`--flags-output`	QC flags for mixed sources, mixed methods, mixed permafrost labels, and similar checks.
`--excluded-output`	Rows skipped by the aggregation workflow.
`--gpkg-output`	GeoPackage export of aggregated points.
`--manifest-output`	Parameters, row counts, hashes, and run metadata.
`--distance-m`	Spatial grouping threshold in meters. The default is `30`.
`--temporal-link-days`	Temporal linkage threshold in days. The default is `31`.

Example: Custom Aggregation¶

python -m cusp.aggregate \
  --input exports/latest/cusp_v1.0.csv \
  --output runs/examples/aggregated_100m_example.csv \
  --membership-output runs/examples/aggregated_100m_example_membership.csv \
  --flags-output runs/examples/aggregated_100m_example_qc_flags.csv \
  --excluded-output runs/examples/aggregated_100m_example_excluded_rows.csv \
  --gpkg-output runs/examples/aggregated_100m_example.gpkg \
  --manifest-output runs/examples/aggregated_100m_example_manifest.json \
  --distance-m 100 \
  --temporal-link-days 14

If you publish or share a custom aggregation, name it clearly so other users can distinguish it from the original CUSP release table.
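
If you run several custom aggregations, it can help to generate the command from the parameters so that every output file carries the distance and temporal settings in its name. The sketch below only assembles the flags listed in this guide. The helper name and the `aggregated_<distance>m_<days>d` naming pattern are illustrative choices, not a CUSP convention.

```python
import shlex
from pathlib import PurePosixPath


def build_aggregate_command(input_csv, out_dir, distance_m, temporal_link_days):
    """Build an argv list for `python -m cusp.aggregate` with labelled outputs."""
    # Hypothetical naming pattern that encodes both thresholds.
    stem = f"aggregated_{distance_m}m_{temporal_link_days}d"
    out = PurePosixPath(out_dir)
    return [
        "python", "-m", "cusp.aggregate",
        "--input", str(input_csv),
        "--output", str(out / f"{stem}.csv"),
        "--membership-output", str(out / f"{stem}_membership.csv"),
        "--flags-output", str(out / f"{stem}_qc_flags.csv"),
        "--excluded-output", str(out / f"{stem}_excluded_rows.csv"),
        "--gpkg-output", str(out / f"{stem}.gpkg"),
        "--manifest-output", str(out / f"{stem}_manifest.json"),
        "--distance-m", str(distance_m),
        "--temporal-link-days", str(temporal_link_days),
    ]


cmd = build_aggregate_command("exports/latest/cusp_v1.0.csv", "runs/examples", 100, 14)
print(shlex.join(cmd))
# To execute instead of printing: subprocess.run(cmd, check=True)
```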

When To Use Custom Aggregation¶

Custom aggregation runs are useful for:

sensitivity analysis
testing alternate model input density
evaluating different spatial thinning choices
matching the approximate scale of environmental covariates
comparing how temporal linkage changes grouped records

They are user-created derivatives unless they are explicitly published as CUSP release files.

Check A Custom Run¶

For non-default outputs, inspect at least:

row count
n_grouped
fraction of mixed-method groups
fraction of mixed-source groups
whether grouped points look spatially reasonable
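
Most of these checks can be scripted. The following sketch assumes the aggregated CSV has an `n_grouped` column and a `method` column that takes the value `mixed` for mixed-method groups; check the header of your own output before relying on it. The sample rows and method names are invented for the example.

```python
import csv
import io
from collections import Counter


def summarize_aggregated(csv_file):
    """Summarize an aggregated CSV opened as a text file object.

    Assumes the table has `n_grouped` and `method` columns.
    """
    rows = list(csv.DictReader(csv_file))
    n_rows = len(rows)
    sizes = Counter(int(r["n_grouped"]) for r in rows)
    n_mixed = sum(1 for r in rows if r["method"] == "mixed")
    return {
        "row_count": n_rows,
        # How many groups have each member count.
        "n_grouped_counts": dict(sorted(sizes.items())),
        "max_n_grouped": max(sizes) if sizes else 0,
        "fraction_mixed_method": n_mixed / n_rows if n_rows else 0.0,
    }


# Invented sample standing in for an aggregated output file.
sample = io.StringIO(
    "n_grouped,method\n"
    "1,probe\n"
    "3,mixed\n"
    "1,borehole\n"
    "2,mixed\n"
)
summary = summarize_aggregated(sample)
```

For a real run, open the file with `open(path, newline="")` and pass the file object in. The fraction of mixed-source groups can be computed the same way once you know which column or QC flag records it in your outputs.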

You may also want to re-sample environmental features for the aggregated table. See GEE feature sampling.