Adding New Data To CUSP
CUSP maintainers will add datasets as they have time. Contributors are also welcome to add datasets directly by preparing the data and opening a pull request.
Git Workflow
- Fork or clone the CUSP repository.
- Create a branch for your dataset.
- Add the source files, processing script, and processed table under data/<Source_Key>/.
- Rebuild and check CUSP locally.
- Open a pull request and describe what was added.
git checkout -b add-example-2026
What To Add
For a new source called Example_2026, create:
data/
  Example_2026/
    raw source files...
    process_example_2026.py
    processed_example_2026.csv
Use the source directory name as the canonical source_key.
Step 1: Create The Process Script
The processing script's filename must be lowercase and start with process_:
data/Example_2026/process_example_2026.py
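The naming convention can be sketched in a few lines of Python. This is only an illustration, not part of CUSP: the function name expected_paths and its data_root parameter are hypothetical, and the rule it encodes (lowercase the source key, then prefix it with process_ or processed_) is inferred from the Example_2026 layout on this page.

```python
from pathlib import Path


def expected_paths(source_key: str, data_root: str = "data") -> dict[str, Path]:
    """Return the script and CSV paths implied by a source key (hypothetical helper)."""
    source_dir = Path(data_root) / source_key
    lower = source_key.lower()
    return {
        "source_dir": source_dir,
        "script": source_dir / f"process_{lower}.py",
        "processed_csv": source_dir / f"processed_{lower}.csv",
    }


paths = expected_paths("Example_2026")
print(paths["script"].as_posix())         # data/Example_2026/process_example_2026.py
print(paths["processed_csv"].as_posix())  # data/Example_2026/processed_example_2026.csv
```

Note that the directory keeps the original capitalization of the source key, while both filenames use the lowercase form.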
Step 2: Add The Metadata As A Docstring
Add the metadata as a docstring at the top of the process script. Use the template and field definitions in Process script header guidelines.
If a source needs manual preprocessing, external downloads, or a date assumption, record that in the docstring.
Step 3: Produce The Processed CSV
Your script should write:
data/Example_2026/processed_example_2026.csv
The easiest path is to use the helpers in data_utils.py where they fit, then finish with data_utils.check_columns(df) before writing.
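As a rough illustration of the writing step, the sketch below uses only the Python standard library. It is an assumption-laden stand-in, not the CUSP approach: a real script would more likely build a DataFrame and call data_utils.check_columns(df), whose signature is not shown on this page and is therefore left out here. The helper write_processed, the column order, and the example row are all made up for illustration; the column names come from the Minimum Processed-Table Contract section.

```python
import csv
import tempfile
from pathlib import Path

# Column names from the processed-table contract on this page;
# the order is this sketch's choice, not a documented requirement.
COLUMNS = [
    "lon", "lat", "date", "source", "site_id",
    "pf_observed", "pf_depth", "thaw_depth", "obs_limit", "method",
]


def write_processed(rows: list[dict], out_path: Path) -> None:
    """Write rows with the contract columns; keys absent from a row become empty cells."""
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, restval="")
        writer.writeheader()
        writer.writerows(rows)


# Made-up example row, for illustration only.
rows = [{
    "lon": -147.72, "lat": 64.84, "date": "2026-07-15", "source": "Example_2026",
    "pf_observed": 1, "thaw_depth": 55.0, "obs_limit": 120.0, "method": "unknown",
}]

# A temporary directory keeps the sketch from touching the real data/ tree.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed_example_2026.csv"
    write_processed(rows, out_path)
    header = out_path.read_text(encoding="utf-8").splitlines()[0]

print(header)
```

In a real process script the output path would be data/Example_2026/processed_example_2026.csv rather than a temporary file.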
Minimum Processed-Table Contract
The processed CSV must include these columns:
- lon
- lat
- date
- source
- site_id
- pf_observed
- pf_depth
- thaw_depth
- obs_limit
It should also include:
- method
The build currently fills a missing method column if necessary, but new
contributions should provide it directly whenever possible. If the observation
tool is truly unknown, set:
method = "unknown"
Important expectations:
- lat, lon should be decimal degrees in EPSG:4326
- date should be YYYY-MM-DD
- pf_observed should be integer 0 or 1
- pf_depth, thaw_depth, and obs_limit should be in centimeters
- site_id may be null if the source truly does not provide one
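Some of these expectations can be checked mechanically before writing. The sketch below is a hypothetical pre-check on in-memory rows, not the CUSP validator and not a replacement for data_utils.check_columns(df). The names REQUIRED, is_iso_date, and row_problems are invented here, and the latitude/longitude bounds are an assumption about what counts as valid decimal degrees. Units cannot be verified this way, so the centimeter requirement still needs a manual check against the source.

```python
from datetime import datetime

# Required columns from the contract above.
REQUIRED = ["lon", "lat", "date", "source", "site_id",
            "pf_observed", "pf_depth", "thaw_depth", "obs_limit"]


def is_iso_date(text) -> bool:
    """True only for zero-padded YYYY-MM-DD strings that name a real calendar date."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").strftime("%Y-%m-%d") == text
    except (TypeError, ValueError):
        return False


def row_problems(row: dict) -> list[str]:
    """Return human-readable problems for one row; an empty list means none were found."""
    problems = [f"missing column: {name}" for name in REQUIRED if name not in row]
    if problems:
        return problems
    try:
        lat, lon = float(row["lat"]), float(row["lon"])
        # Assumed bounds for decimal degrees; NaN also fails this comparison.
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append("lat/lon outside decimal-degree range")
    except (TypeError, ValueError):
        problems.append("lat/lon not numeric")
    if not is_iso_date(row["date"]):
        problems.append("date is not YYYY-MM-DD")
    flag = row["pf_observed"]
    if not isinstance(flag, int) or isinstance(flag, bool) or flag not in (0, 1):
        problems.append("pf_observed is not integer 0 or 1")
    return problems


# Made-up rows, for illustration only.
good_row = {"lon": -147.72, "lat": 64.84, "date": "2026-07-15", "source": "Example_2026",
            "site_id": None, "pf_observed": 1, "pf_depth": None,
            "thaw_depth": 55.0, "obs_limit": 120.0}
bad_row = dict(good_row, date="15/07/2026", pf_observed=2)

print(row_problems(good_row))
print(row_problems(bad_row))
```

A null site_id passes because the contract allows it when the source provides none.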
Step 4: Resolve Source Interpretation
Your process_<source_key_lower>.py script should handle source-specific interpretation as explicitly as possible, including:
- source-specific sentinel values
- unit conversion
- approximate or campaign-level dates
- method mapping to the CUSP vocabulary
- obvious within-source duplicates
- obvious invalid rows that only the source contributor can interpret correctly
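A few of the items above can be sketched as follows. Everything source-specific here is hypothetical: the sentinel codes, the raw field names (site, thaw_m, tool), the assumption that the source reports depths in meters, the duplicate key, and the mapped term "probe", which is a placeholder and not taken from the CUSP vocabulary. Only the "unknown" fallback comes from this page.

```python
SENTINELS = {-9999, -999}              # hypothetical missing-value codes
METHOD_MAP = {"frost probe": "probe"}  # hypothetical raw label -> placeholder term


def clean_depth_m_to_cm(value):
    """Convert a depth in meters to centimeters; blanks and sentinels become None."""
    if value is None or value == "":
        return None
    number = float(value)
    if number in SENTINELS:
        return None
    return round(number * 100.0, 1)


def interpret(raw_rows):
    """Apply sentinel handling, unit conversion, method mapping, and de-duplication."""
    seen = set()
    out = []
    for raw in raw_rows:
        tool = str(raw.get("tool", "")).strip().lower()
        row = {
            "site_id": raw.get("site") or None,
            "date": raw["date"],
            "thaw_depth": clean_depth_m_to_cm(raw.get("thaw_m")),
            # Unmapped or blank tools fall back to "unknown".
            "method": METHOD_MAP.get(tool, "unknown"),
        }
        # One possible definition of an obvious within-source duplicate.
        key = (row["site_id"], row["date"], row["thaw_depth"])
        if key in seen:
            continue
        seen.add(key)
        out.append(row)
    return out


# Made-up raw records, for illustration only.
raw_rows = [
    {"site": "A1", "date": "2026-07-15", "thaw_m": "0.55", "tool": "Frost Probe"},
    {"site": "A1", "date": "2026-07-15", "thaw_m": "0.55", "tool": "Frost Probe"},
    {"site": "A2", "date": "2026-07-15", "thaw_m": "-9999", "tool": ""},
]
cleaned = interpret(raw_rows)
print(cleaned)
```

Falling back to "unknown" silently can hide unmapped labels, so a real script may prefer to log or raise on labels it does not recognize.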
Step 5: Validate The Metadata
Check that the metadata docstring is parseable and complete:
python -m cusp.generate_process_script_metadata --check --strict data/Example_2026/process_example_2026.py
Step 6: Run The Source Script
python data/Example_2026/process_example_2026.py
Step 7: Rebuild And Validate CUSP
python -m cusp.build
python -m cusp.qc validate-observations
python -m cusp.aggregate
python -m cusp.qc validate-aggregated
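If you prefer to run the four commands as one unit and stop at the first failure, a small wrapper along these lines may help. It is a convenience sketch, not a CUSP tool: run_steps and STEPS are hypothetical names, and the only CUSP-specific content is the module arguments copied from the commands above.

```python
import subprocess
import sys

# Arguments copied from the commands on this page; the interpreter is supplied at run time.
STEPS = [
    ["-m", "cusp.build"],
    ["-m", "cusp.qc", "validate-observations"],
    ["-m", "cusp.aggregate"],
    ["-m", "cusp.qc", "validate-aggregated"],
]


def run_steps(steps, python=sys.executable):
    """Run each step in order; return the args of the first failing step, or None."""
    for args in steps:
        result = subprocess.run([python, *args])
        if result.returncode != 0:
            return args
    return None


# Usage, from the repository root with CUSP importable:
#   failed = run_steps(STEPS)
#   if failed is not None:
#       sys.exit("step failed: python " + " ".join(failed))
```

Using sys.executable keeps every step on the same interpreter and environment as the wrapper itself.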
If your source changes the contents of the official dataset, that change should usually be treated as a new dataset version; see Versioning and exports.
Pull Request Checklist
- create data/<Source_Key>/
- add process_<source_key_lower>.py
- add TOML metadata docstring
- write processed_<source_key_lower>.csv
- keep source-specific interpretation inside the process script
- validate metadata
- run the source script
- rebuild the working observation table
- run QA
Maintainers make final release-clearance decisions. See Source release clearance for the maintainer review model.