GEE Feature Sampling
The feature sampler adds environmental information to CUSP rows using Google Earth Engine. You might use this tool when you want model inputs, spatial context, or comparison variables such as climate, terrain, soils, or surface water occurrence at CUSP observation locations.
For contributor-oriented instructions on extending the sampler, see Adding new GEE features.
Run Feature Sampling
    python -m cusp.features --input exports/latest/cusp_v1.0.csv
The sampler writes:
- a feature table CSV
- a companion JSON manifest describing the feature set, input table, and sampling configuration
Earth Engine Authentication
Feature sampling does not use a CUSP-owned account or checked-in credentials. Each user should authenticate with their own Google Earth Engine account and run sampling through a Google Cloud / Earth Engine project they own or have permission to use.
One-time local setup:
    earthengine authenticate
Then pass the project explicitly when sampling:
    python -m cusp.features \
        --input exports/latest/cusp_v1.0.csv \
        --output runs/examples/cusp_v1.0_features.csv \
        --manifest runs/examples/cusp_v1.0_features_manifest.json \
        --gee-project <your-earth-engine-project> \
        --resume
Internally, the sampler calls ee.Initialize(project=<your-earth-engine-project>).
If a user has configured a default Earth Engine project outside CUSP, they can
omit --gee-project, but passing it explicitly is clearer and more
reproducible.
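The optional-project behavior described above can be sketched in Python. This is a minimal illustration, not the sampler's actual code: `build_init_kwargs` and `init_earth_engine` are hypothetical helper names, and the only Earth Engine call used is `ee.Initialize(project=...)`.

```python
from typing import Optional


def build_init_kwargs(project: Optional[str]) -> dict:
    """Return keyword arguments for ee.Initialize (hypothetical helper).

    An empty dict means "fall back to the default project configured
    outside CUSP", mirroring what happens when --gee-project is omitted.
    """
    return {"project": project} if project else {}


def init_earth_engine(project: Optional[str] = None) -> None:
    """Initialize Earth Engine with the caller's own project."""
    # Imported lazily so this module loads even without earthengine-api.
    import ee

    ee.Initialize(**build_init_kwargs(project))
```

Calling `init_earth_engine("<your-earth-engine-project>")` after a one-time `earthengine authenticate` pins the project explicitly, which is the more reproducible choice.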
Input Tables
The sampler can read any point-like CUSP table that includes:
- a canonical join ID:
  - `cusp_obs_id` for observation-level tables such as `cusp_vX.Y.csv`
  - `cusp_30m_id` for aggregated tables such as `aggregated_30m.csv`
- `lat`
- `lon`
- either `date` or `year`
For a released feature table, use the observation-level CUSP table so the
result is keyed to cusp_obs_id.
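A quick pre-flight check of a table's header against these requirements might look like the following. It is a sketch under the column names listed above; `missing_requirements` is a hypothetical helper, not part of the `cusp` package.

```python
from typing import Iterable, List

# Canonical join IDs named in this section.
ID_COLUMNS = ("cusp_obs_id", "cusp_30m_id")


def missing_requirements(columns: Iterable[str]) -> List[str]:
    """Return a description of each unmet input requirement (empty if none)."""
    cols = set(columns)
    problems = []
    if not any(c in cols for c in ID_COLUMNS):
        problems.append("one of: " + ", ".join(ID_COLUMNS))
    for name in ("lat", "lon"):
        if name not in cols:
            problems.append(name)
    if not ({"date", "year"} & cols):
        problems.append("one of: date, year")
    return problems
```

For example, a header of `cusp_obs_id, lat, lon, date` passes with an empty list, while a header of `lat, year` reports the missing join ID and `lon`.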
Current Base Feature Set
base_v1 is the default set of environmental features sampled when you do not
request a custom feature list.
| Feature family | Output columns | What it represents |
|---|---|---|
| `soil_texture` | `sand`, `silt`, `clay` | Depth-weighted SoilGrids texture fractions |
| `soil_oc` | `soil_oc` | Depth-weighted SoilGrids soil organic carbon |
| `climate` | `temperature`, `precip` | Antecedent ERA5 temperature and annualized precipitation |
| `swo_landsat` | `swo_landsat` | Long-term Landsat surface-water occurrence |
| `merit_hand` | `merit90_hand` | Height above nearest drainage |
| `terrain` | `slope`, `aspect`, `curvature_6m`, `curvature_10m`, `curvature_14m`, `curvature_18m` | ArcticDEM terrain derivatives |
Feature Sources
| Feature(s) | Earth Engine source | Native resolution | Temporal handling |
|---|---|---|---|
| `slope`, `aspect`, `curvature_*` | `UMN/PGC/ArcticDEM/V4/2m_mosaic` | 2 m mosaic | Static |
| `sand`, `silt`, `clay` | `projects/soilgrids-isric` | 250 m | Static |
| `soil_oc` | `projects/soilgrids-isric/soc_mean` | 250 m | Static |
| `temperature` | `ECMWF/ERA5/MONTHLY` | about 31 km | 20-year antecedent mean through the observation year |
| `precip` | `ECMWF/ERA5/MONTHLY` | about 31 km | 20-year antecedent mean through the observation year, rescaled to annual precipitation |
| `swo_landsat` | `JRC/GSW1_4/MonthlyHistory` | 30 m | 1999-2021 occurrence window |
| `merit90_hand` | `MERIT/Hydro/v1_0_1` | about 90 m | Static |
If a temporal feature only partially overlaps the requested time window, the
sampler uses the available overlap. If there is no overlap, it writes NaN.
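The overlap rule can be illustrated with plain year arithmetic. This sketch assumes the 20-year antecedent window is inclusive and ends at the observation year; the coverage years in the usage note are made-up illustrations, and both function names are hypothetical.

```python
from typing import Optional, Tuple

YearRange = Tuple[int, int]  # inclusive (first_year, last_year)


def antecedent_window(obs_year: int, n_years: int = 20) -> YearRange:
    """Inclusive year range of n_years ending at the observation year."""
    return obs_year - n_years + 1, obs_year


def usable_years(requested: YearRange, available: YearRange) -> Optional[YearRange]:
    """Intersect the requested window with the source's coverage.

    Returns None when the two ranges do not overlap, which corresponds
    to the sampler writing NaN for that row.
    """
    start = max(requested[0], available[0])
    end = min(requested[1], available[1])
    return (start, end) if start <= end else None
```

With a hypothetical source covering 2000 to 2020, an observation from 2015 requests 1996 to 2015 and ends up averaging over 2000 to 2015 only, while an observation from 1969 has no overlap and yields `None`.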
Sampling Defaults
| Setting | Default |
|---|---|
| Sampling mode | direct point sampling |
| Sampling scale | resolved from each Earth Engine image's native projection at runtime |
| Optional sampling buffer | off unless --sample-buffer-m is set |
| Chunk size | 5000 rows per Earth Engine request block |
| Curvature method | LoG |
| Curvature window sizes | 3, 5, 7, 9 |
| Curvature sigma | 1.0 |
| Climate averaging window | 20 years |
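The default window sizes line up with the `curvature_*` column names if each window is measured in ArcticDEM pixels: 3, 5, 7, and 9 pixels at 2 m give 6, 10, 14, and 18 m. That reading is an inference from the tables on this page rather than a documented rule, and `curvature_column` is a hypothetical helper used only to show the arithmetic.

```python
ARCTICDEM_PIXEL_M = 2.0          # native resolution of the 2 m mosaic
WINDOW_SIZES_PX = (3, 5, 7, 9)   # default curvature window sizes


def curvature_column(window_px: int, pixel_m: float = ARCTICDEM_PIXEL_M) -> str:
    """Name a curvature column after the window's ground footprint in metres."""
    footprint_m = window_px * pixel_m
    # The 'g' format drops the trailing '.0' from whole-number footprints.
    return f"curvature_{footprint_m:g}m"


columns = [curvature_column(w) for w in WINDOW_SIZES_PX]
```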
The sampler writes the output CSV and manifest after each completed feature
family. If a long run is interrupted, rerun the same command with --resume to
skip feature columns that are already present in the output.
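The chunking and resume behavior can be sketched as follows. The helper names are hypothetical, and treating a feature family as pending whenever any of its columns is missing is an assumption about the resume granularity, not confirmed behavior.

```python
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple


def row_blocks(n_rows: int, chunk_size: int = 5000) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, stop) row ranges, one per request block."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


def pending_families(
    families: Dict[str, Sequence[str]], existing_columns: Iterable[str]
) -> List[str]:
    """Return the feature families whose columns are not all present yet."""
    present = set(existing_columns)
    return [
        name
        for name, cols in families.items()
        if not all(col in present for col in cols)
    ]
```

A 12,000-row table is split into blocks of 5000, 5000, and 2000 rows. If an interrupted run already wrote `sand`, `silt`, and `clay`, only the remaining families would be sampled on the next `--resume` run.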
Transform And Null Handling
The sampler does not perform model-oriented imputation or scaling. Missing
sampled values remain null/NaN in the feature table.
Current derived-feature behavior:
- SoilGrids texture outputs are depth-weighted sand, silt, and clay fractions.
- SoilGrids organic carbon is depth-weighted across depth bands.
- ERA5 temperature is a 20-year antecedent monthly mean through the observation year.
- ERA5 precipitation is sampled from monthly total precipitation, averaged over the antecedent window, and multiplied by 12 to express an annualized value.
- JRC Global Surface Water monthly classes are converted to a 1999-2021 water occurrence percentage.
- Terrain curvature is derived from ArcticDEM elevation with the configured Laplacian-of-Gaussian settings and window sizes.
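The arithmetic behind three of these derived features can be sketched in plain Python. Several details are assumptions for illustration: the sampler may use different depth bands than the six standard SoilGrids intervals, weighting by layer thickness is one common way to depth-weight, and the occurrence percentage here is simply water months divided by months with a valid observation. All function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Standard SoilGrids depth intervals in cm (assumed; the sampler may use a subset).
SOILGRIDS_DEPTHS_CM = ((0, 5), (5, 15), (15, 30), (30, 60), (60, 100), (100, 200))


def depth_weighted_mean(
    values: Sequence[float],
    depths: Sequence[Tuple[int, int]] = SOILGRIDS_DEPTHS_CM,
) -> float:
    """Average per-layer values, weighting each layer by its thickness."""
    weights = [bottom - top for top, bottom in depths]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def annualized_precip(monthly_totals: Sequence[float]) -> Optional[float]:
    """Mean monthly total over the window, multiplied by 12."""
    if not monthly_totals:
        return None  # no overlap with the source: left as null/NaN
    return 12.0 * sum(monthly_totals) / len(monthly_totals)


def occurrence_percent(water_months: int, valid_months: int) -> Optional[float]:
    """Share of valid monthly observations classified as water, in percent."""
    if valid_months == 0:
        return None  # never observed: left as null/NaN
    return 100.0 * water_months / valid_months
```

For instance, a pixel classified as water in 3 of 12 valid months has an occurrence of 25 percent, and a two-layer profile with values 20 and 10 over 0-5 cm and 5-15 cm averages to about 13.3 because the thicker layer counts twice as much.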
Output Design
The feature table keeps the canonical join column plus standard point identity fields when available:
- canonical ID column
- `date`
- `year`
- `lat`
- `lon`
- sampled features
This keeps the feature table joinable without forcing users to carry the full observation or aggregation table around while exploring models.
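Joining the feature table back onto an observation table only needs the canonical ID. The sketch below uses plain dictionaries to stay dependency-free; the `label` column and all values are made-up placeholders, and `left_join` is a hypothetical helper.

```python
from typing import Dict, List

Row = Dict[str, str]


def left_join(observations: List[Row], features: List[Row], key: str = "cusp_obs_id") -> List[Row]:
    """Attach sampled features to each observation row by canonical ID.

    Observations without a matching feature row are kept unchanged, and
    observation values win when both tables share a column name.
    """
    features_by_id = {row[key]: row for row in features}
    joined = []
    for obs in observations:
        sampled = features_by_id.get(obs[key], {})
        joined.append({**sampled, **obs})
    return joined


# Placeholder rows for illustration only.
observations = [
    {"cusp_obs_id": "a1", "label": "x"},
    {"cusp_obs_id": "b2", "label": "y"},
]
features = [{"cusp_obs_id": "a1", "lat": "70.1", "lon": "-148.5", "slope": "2.4"}]
joined = left_join(observations, features)
```

With pandas, the equivalent is `observations.merge(features, on="cusp_obs_id", how="left")`.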
Why Buffering Is Optional
Earth Engine can sample points directly, and that is the default behavior.
An optional sampling buffer still exists because some environmental layers may benefit from neighborhood summaries rather than exact point intersections, for example:
- coarse-resolution climate or water layers
- slight point-location uncertainty relative to raster resolution
- use cases where a local mean is more meaningful than one intersecting pixel
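The difference between the two modes can be shown on a toy raster. This is a pure-Python illustration of the idea, not how the sampler or Earth Engine computes a buffered value; the grid, the radius in pixels, and both function names are assumptions.

```python
import math
from typing import List

Grid = List[List[float]]


def point_value(grid: Grid, row: int, col: int) -> float:
    """Direct point sampling: the value of the single intersecting pixel."""
    return grid[row][col]


def buffered_mean(grid: Grid, row: int, col: int, radius_px: float) -> float:
    """Mean of all pixels whose centres lie within radius_px of the point."""
    values = [
        grid[r][c]
        for r in range(len(grid))
        for c in range(len(grid[0]))
        if math.hypot(r - row, c - col) <= radius_px
    ]
    return sum(values) / len(values)


# A single anomalous pixel surrounded by uniform values.
toy = [
    [1.0, 1.0, 1.0],
    [1.0, 9.0, 1.0],
    [1.0, 1.0, 1.0],
]
direct = point_value(toy, 1, 1)
smoothed = buffered_mean(toy, 1, 1, radius_px=1.0)
```

Here the direct sample returns the outlier itself, while the one-pixel buffer averages it with its four nearest neighbours, which is the kind of neighborhood summary the optional buffer is meant to provide.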
Adding Features
To add your own feature or propose a new default feature for CUSP, see Adding new GEE features.
Current Limitations
- routine tests cover package behavior, registry resolution, input normalization, and output joins
- the release gate can run a small live Earth Engine smoke test when `--gee-project` is supplied
- Landsat spectral bundles are not yet part of `base_v1`
- the sampler assumes the CUSP conda environment has `earthengine-api`, `geemap`, and `geopandas` available