Export Formats¶
MetDataPy supports multiple export formats for interoperability and archival.
Parquet Export¶
Parquet is a columnar storage format optimized for analytics and ML workflows.
from metdatapy.io import to_parquet
# Export to Parquet
to_parquet(ws.to_dataframe(), "weather_data.parquet")
# Or directly from WeatherSet
ws.to_dataframe().to_parquet("weather_data.parquet")
Advantages: - Fast read/write performance - Efficient compression - Preserves data types - Wide tool support (pandas, Spark, DuckDB, etc.)
NetCDF Export (CF-Compliant)¶
NetCDF (Network Common Data Form) is the standard format for climate and meteorological data, following CF (Climate and Forecast) Conventions v1.8.
Basic Export¶
# Export to NetCDF
ws.to_netcdf("weather_data.nc")
# Or using the io module
from metdatapy.io import to_netcdf
to_netcdf(ws.to_dataframe(), "weather_data.nc")
With Metadata¶
metadata = {
"title": "Hourly Weather Observations - Station XYZ",
"institution": "University Weather Network",
"source": "Automatic Weather Station",
"history": "Quality controlled and processed with MetDataPy",
"references": "https://doi.org/10.xxxx/xxxxx",
"comment": "Data collected for agricultural research",
}
station_metadata = {
"station_id": "AWS_001",
"station_name": "Central Campus Weather Station",
"lat": 40.7128,
"lon": -74.0060,
"elev_m": 10.0,
}
ws.to_netcdf(
"weather_data.nc",
metadata=metadata,
station_metadata=station_metadata,
)
CF-Compliant Features¶
The NetCDF export includes:
Global Attributes (CF-1.8):
- Conventions: "CF-1.8"
- featureType: "timeSeries"
- title, institution, source, history, references, comment
Time Dimension:
- Standard name: time
- Axis: T
- Stored as naive UTC on disk (NetCDF/xarray do not carry tz info). Any tz-aware input is
first converted to UTC before the timezone is stripped, so the wall-clock instant is
preserved across the round-trip. from_netcdf re-localizes the time index to UTC so the
returned DataFrame is tz-aware.
Variable Attributes:
- standard_name: CF standard names (e.g., air_temperature, wind_speed)
- long_name: Human-readable descriptions
- units: CF-compliant units (e.g., degree_Celsius, m s-1)
- valid_min, valid_max: Physical plausibility ranges
Spatial Coordinates:
- latitude (degrees_north, axis Y)
- longitude (degrees_east, axis X)
- altitude (m, axis Z, positive up)
QC Flags:
- flag_values: "0, 1"
- flag_meanings: "good_data bad_data"
- Stored as int8 for NetCDF compatibility
Compression: - zlib compression (level 4) - float64 → float32 conversion for efficiency
Supported Variables¶
All canonical MetDataPy variables are mapped to CF standard names:
| MetDataPy Variable | CF Standard Name | Units |
|---|---|---|
temp_c |
air_temperature |
degree_Celsius |
rh_pct |
relative_humidity |
percent |
pres_hpa |
air_pressure |
hPa |
wspd_ms |
wind_speed |
m s-1 |
wdir_deg |
wind_from_direction |
degree |
gust_ms |
wind_speed_of_gust |
m s-1 |
rain_mm |
precipitation_amount |
mm |
rain_rate_mmh |
rainfall_rate |
mm h-1 |
solar_wm2 |
surface_downwelling_shortwave_flux_in_air |
W m-2 |
uv_index |
(no CF standard name) | 1 |
dew_point_c |
dew_point_temperature |
degree_Celsius |
vpd_kpa |
(no CF standard name) | kPa |
heat_index_c |
(no CF standard name) | degree_Celsius |
wind_chill_c |
(no CF standard name) | degree_Celsius |
Reading NetCDF Files¶
from metdatapy.io import from_netcdf
# Import NetCDF back to DataFrame
df = from_netcdf("weather_data.nc")
# Create WeatherSet from imported data
ws = WeatherSet(df)
Automatic Conversions:
- time dimension → ts_utc index
- Naive on-disk time → tz-aware UTC DatetimeIndex
- int8 QC flags → boolean
- Coordinate variables (lat/lon/alt) dropped from time series
Validation¶
Verify CF compliance using standard tools:
Or use ncdump to inspect metadata:
Manifest (Planned)¶
Future releases will include manifest.json export with:
- Variable inventory
- Feature engineering provenance
- Scaler parameters
- QC summary statistics
- Pipeline hash for reproducibility
Comparison¶
| Format | Use Case | Pros | Cons |
|---|---|---|---|
| Parquet | ML workflows, analytics | Fast, efficient, wide support | Not self-describing |
| NetCDF | Scientific archival, sharing | CF-compliant, self-describing, metadata-rich | Larger file size, slower I/O |
| CSV | Human inspection, legacy systems | Universal compatibility | Large, no metadata, type ambiguity |
Best Practices¶
- For ML/AI projects: Use Parquet for speed and efficiency
- For scientific publication: Use NetCDF for CF compliance and metadata
- For data sharing: Provide both formats when possible
- For archival: Use NetCDF with comprehensive metadata
- For version control: Use manifest.json (when available) to track provenance