
Bulk STAC item queries with GeoParquet

In addition to its STAC API, the Planetary Computer also provides access to STAC items as GeoParquet datasets. These Parquet datasets are intended for “bulk” workloads, where a search might return a very large number of items or would require many separate queries to get the desired result. In general, the Parquet datasets are produced with a lag relative to what’s available through the STAC API, so most use cases, including those that need recently added assets, should use our STAC API.

This example shows how to load STAC items from a Parquet dataset into a geopandas GeoDataFrame. A similar workflow would be possible with R’s geoarrow package, or any other library that can read GeoParquet.

[1]:
import dask.dataframe as dd
import geopandas
import pandas as pd
import planetary_computer
import pystac_client

# Keep the wide STAC tables readable when printed.
pd.options.display.max_columns = 8

Loading STAC Items

Each STAC collection that provides a GeoParquet dataset has a collection-level asset under the geoparquet-items key.

[2]:
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1/",
    modifier=planetary_computer.sign_inplace,
)


asset = catalog.get_collection("io-lulc-9-class").assets["geoparquet-items"]

df = geopandas.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
df.head()
[2]:
type stac_version stac_extensions id ... end_datetime proj:transform start_datetime io:supercell_id
0 Feature 1.0.0 [https://stac-extensions.github.io/projection/... 12Q-2017 ... 2018-01-01 00:00:00+00:00 [10.0, 0.0, 178910.0, 0.0, -10.0, 2657470.0] 2017-01-01 00:00:00+00:00 12Q
1 Feature 1.0.0 [https://stac-extensions.github.io/projection/... 15R-2017 ... 2018-01-01 00:00:00+00:00 [10.0, 0.0, 194773.70566898846, 0.0, -10.0, 35... 2017-01-01 00:00:00+00:00 15R
2 Feature 1.0.0 [https://stac-extensions.github.io/projection/... 16M-2017 ... 2018-01-01 00:00:00+00:00 [10.0, 0.0, 166023.6435927535, 0.0, -10.0, 999... 2017-01-01 00:00:00+00:00 16M
3 Feature 1.0.0 [https://stac-extensions.github.io/projection/... 20L-2022 ... 2023-01-01 00:00:00+00:00 [10.0, 0.0, 169256.89710350422, 0.0, -10.0, 91... 2022-01-01 00:00:00+00:00 20L
4 Feature 1.0.0 [https://stac-extensions.github.io/projection/... 20M-2019 ... 2020-01-01 00:00:00+00:00 [10.0, 0.0, 166023.6435927521, 0.0, -10.0, 999... 2019-01-01 00:00:00+00:00 20M

5 rows × 18 columns

Now we can, for example, count the items with each proj:epsg code.

[3]:
df["proj:epsg"].value_counts()
[3]:
proj:epsg
32616    60
32638    60
32650    60
32648    60
32617    60
         ..
32714    12
32710    11
32708    10
32711     6
32727     6
Name: count, Length: 120, dtype: int64

Or filter the items to a specific code and plot the footprints.

[4]:
subset = df.loc[df["proj:epsg"] == 32651, ["io:tile_id", "geometry"]]
[5]:
import contextily

ax = subset.plot(figsize=(4, 20), color="none", edgecolor="yellow")
contextily.add_basemap(
    ax, crs=df.crs.to_string(), source=contextily.providers.Esri.NatGeoWorldMap
)

ax.set_axis_off()

Schemas

Each parquet dataset has a unique schema, reflecting the unique properties captured in each collection. But there are some general patterns.

  1. Each dataset has a column for each of the fields required on a STAC item (type, stac_version, stac_extensions, id, geometry, bbox, links, assets, and collection); see the quick check sketched after this list.

  2. All fields under properties are lifted to the top-level, including datetime-related fields like datetime, start_datetime, end_datetime, common metadata (e.g. platform) and extension fields (e.g. proj:bbox, …).

  3. Dynamic datasets, where new items are regularly added, are partitioned by time.
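
For example, here is a minimal sanity check of patterns 1 and 2 against the io-lulc-9-class DataFrame loaded above (the specific lifted fields checked are just examples from this collection):

# The fields required on every STAC item appear as top-level columns (pattern 1).
required = {
    "type", "stac_version", "stac_extensions", "id",
    "geometry", "bbox", "links", "assets", "collection",
}
assert required.issubset(df.columns)

# Properties are lifted to the top level alongside them (pattern 2).
assert {"start_datetime", "end_datetime", "proj:epsg"}.issubset(df.columns)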

Partitioning

Depending on the number of STAC items in the collection and whether or not new items are being added, the Parquet dataset may be split into multiple files by time.

For example, the io-lulc-9-class collection is not partitioned and has just a single file:

[6]:
import adlfs

fs = adlfs.AzureBlobFileSystem(**asset.extra_fields["table:storage_options"])
fs.ls("items/io-lulc-9-class.parquet")  # Not partitioned, single result
[6]:
['items/io-lulc-9-class.parquet']

Compare that to sentinel-2-l2a, which is partitioned by week.

[7]:
fs.ls("items/sentinel-2-l2a.parquet")[:5]
[7]:
['items/sentinel-2-l2a.parquet/part-0001_2015-06-29T10:25:31+00:00_2015-07-06T10:25:31+00:00.parquet',
 'items/sentinel-2-l2a.parquet/part-0002_2015-07-06T10:25:31+00:00_2015-07-13T10:25:31+00:00.parquet',
 'items/sentinel-2-l2a.parquet/part-0003_2015-07-13T10:25:31+00:00_2015-07-20T10:25:31+00:00.parquet',
 'items/sentinel-2-l2a.parquet/part-0004_2015-07-20T10:25:31+00:00_2015-07-27T10:25:31+00:00.parquet',
 'items/sentinel-2-l2a.parquet/part-0005_2015-07-27T10:25:31+00:00_2015-08-03T10:25:31+00:00.parquet']

To work with a partitioned dataset, you can use a library like dask or dask-geopandas.

[8]:
asset = catalog.get_collection("sentinel-2-l2a").assets["geoparquet-items"]

s2l2a = dd.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
s2l2a.head()
[8]:
type stac_version stac_extensions id ... s2:high_proba_clouds_percentage s2:reflectance_conversion_factor s2:medium_proba_clouds_percentage s2:saturated_defective_pixel_percentage
0 Feature 1.0.0 [https://stac-extensions.github.io/eo/v1.0.0/s... S2A_MSIL2A_20150704T101006_R022_T35XQA_2021041... ... 92.546540 0.967449 4.807670 0.0
1 Feature 1.0.0 [https://stac-extensions.github.io/eo/v1.0.0/s... S2A_MSIL2A_20150704T101006_R022_T32TMM_2021041... ... 0.048035 0.967449 0.051376 0.0
2 Feature 1.0.0 [https://stac-extensions.github.io/eo/v1.0.0/s... S2A_MSIL2A_20150704T101006_R022_T32TMN_2021041... ... 0.011238 0.967449 0.022928 0.0
3 Feature 1.0.0 [https://stac-extensions.github.io/eo/v1.0.0/s... S2A_MSIL2A_20150704T101006_R022_T36WWC_2021041... ... 65.812266 0.967449 19.050561 0.0
4 Feature 1.0.0 [https://stac-extensions.github.io/eo/v1.0.0/s... S2A_MSIL2A_20150704T101006_R022_T36WWD_2021041... ... 97.629422 0.967449 1.861097 0.0

5 rows × 42 columns
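
If you want decoded geometries rather than the raw WKB bytes that plain dask.dataframe leaves in the geometry column, dask-geopandas offers the same reading interface and returns a parallel GeoDataFrame (a sketch, assuming dask-geopandas is installed):

import dask_geopandas

# Same asset and credentials as above, but geometries are parsed.
s2l2a_geo = dask_geopandas.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)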

You can perform filtering operations on the entire collection.

[9]:
mask = (s2l2a["eo:cloud_cover"] < 10) & (s2l2a["s2:nodata_pixel_percentage"] > 90)
keep = s2l2a[mask]
keep.head()
[9]:
type stac_version stac_extensions id ... s2:high_proba_clouds_percentage s2:reflectance_conversion_factor s2:medium_proba_clouds_percentage s2:saturated_defective_pixel_percentage
27 Feature 1.0.0 [https://stac-extensions.github.io/eo/v1.0.0/s... S2A_MSIL2A_20150704T101006_R022_T32RMS_2021041... ... 0.000000 0.967449 0.000000 0.0
56 Feature 1.0.0 [https://stac-extensions.github.io/eo/v1.0.0/s... S2A_MSIL2A_20150704T101006_R022_T31PFP_2021041... ... 2.169701 0.967449 1.014810 0.0
68 Feature 1.0.0 [https://stac-extensions.github.io/eo/v1.0.0/s... S2A_MSIL2A_20150704T101006_R022_T31QGU_2021041... ... 0.211487 0.967449 0.171659 0.0
77 Feature 1.0.0 [https://stac-extensions.github.io/eo/v1.0.0/s... S2A_MSIL2A_20150704T101006_R022_T31SGT_2021041... ... 1.710171 0.967449 1.712733 0.0
80 Feature 1.0.0 [https://stac-extensions.github.io/eo/v1.0.0/s... S2A_MSIL2A_20150704T101006_R022_T31QHC_2021041... ... 0.000000 0.967449 0.000000 0.0

5 rows × 42 columns

Computing the results runs in parallel across the partitions. See Scale with Dask for more.
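
For example, selecting a couple of columns before calling compute keeps the materialized result small (a minimal sketch; it still scans every partition):

# Materialize just the ids and cloud cover of the matching items.
result = keep[["id", "eo:cloud_cover"]].compute()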

As mentioned earlier, different collections have different properties, and so their DataFrames have different columns.

[10]:
s2l2a.columns.tolist()
[10]:
['type',
 'stac_version',
 'stac_extensions',
 'id',
 'geometry',
 'bbox',
 'links',
 'assets',
 'collection',
 'datetime',
 'platform',
 'proj:epsg',
 'instruments',
 's2:mgrs_tile',
 'constellation',
 's2:granule_id',
 'eo:cloud_cover',
 's2:datatake_id',
 's2:product_uri',
 's2:datastrip_id',
 's2:product_type',
 'sat:orbit_state',
 's2:datatake_type',
 's2:generation_time',
 'sat:relative_orbit',
 's2:water_percentage',
 's2:mean_solar_zenith',
 's2:mean_solar_azimuth',
 's2:processing_baseline',
 's2:snow_ice_percentage',
 's2:vegetation_percentage',
 's2:thin_cirrus_percentage',
 's2:cloud_shadow_percentage',
 's2:nodata_pixel_percentage',
 's2:unclassified_percentage',
 's2:dark_features_percentage',
 's2:not_vegetated_percentage',
 's2:degraded_msi_data_percentage',
 's2:high_proba_clouds_percentage',
 's2:reflectance_conversion_factor',
 's2:medium_proba_clouds_percentage',
 's2:saturated_defective_pixel_percentage']

Different collections are partitioned at different frequencies, depending on the update cadence, the number of STAC items, and the size of each item. Look for an msft:partition_info field on the asset to check whether the dataset is partitioned. Its partition_frequency is a pandas offset alias.

[11]:
asset.extra_fields["msft:partition_info"]
[11]:
{'is_partitioned': True, 'partition_frequency': 'W-MON'}
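
Since 'W-MON' is a standard pandas offset alias (weekly, anchored on Mondays), you can use it directly with pandas, for example to enumerate the expected partition boundaries over a date range (the range below is just the first few weeks seen in the file listing above):

freq = asset.extra_fields["msft:partition_info"]["partition_frequency"]

# Weekly, Monday-anchored boundaries matching the partition filenames.
pd.date_range("2015-06-29", "2015-08-03", freq=freq)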

Expanding nested fields

STAC items are highly nested data structures, while libraries like pandas are designed mostly for flat, non-nested data. Consider the assets column, where each value is a dictionary mapping asset keys to asset objects (each with an href and other properties).

[12]:
df["assets"].head()
[12]:
0    {'data': {'file:size': 53208880, 'file:values'...
1    {'data': {'file:size': 114187155, 'file:values...
2    {'data': {'file:size': 53981476, 'file:values'...
3    {'data': {'file:size': 165601021, 'file:values...
4    {'data': {'file:size': 97175834, 'file:values'...
Name: assets, dtype: object

The json_normalize function can be used to expand this single column of nested data into many columns, one per nested field:

[13]:
assets = pd.json_normalize(df["assets"].head())
assets
[13]:
data.file:size data.file:values data.href data.raster:bands ... tilejson.href tilejson.roles tilejson.title tilejson.type
0 53208880 [{'summary': 'No Data', 'values': [0]}, {'summ... https://ai4edataeuwest.blob.core.windows.net/i... [{'nodata': 0, 'spatial_resolution': 10}] ... https://planetarycomputer.microsoft.com/api/da... [tiles] TileJSON with default rendering application/json
1 114187155 [{'summary': 'No Data', 'values': [0]}, {'summ... https://ai4edataeuwest.blob.core.windows.net/i... [{'nodata': 0, 'spatial_resolution': 10}] ... https://planetarycomputer.microsoft.com/api/da... [tiles] TileJSON with default rendering application/json
2 53981476 [{'summary': 'No Data', 'values': [0]}, {'summ... https://ai4edataeuwest.blob.core.windows.net/i... [{'nodata': 0, 'spatial_resolution': 10}] ... https://planetarycomputer.microsoft.com/api/da... [tiles] TileJSON with default rendering application/json
3 165601021 [{'summary': 'No Data', 'values': [0]}, {'summ... https://ai4edataeuwest.blob.core.windows.net/i... [{'nodata': 0, 'spatial_resolution': 10}] ... https://planetarycomputer.microsoft.com/api/da... [tiles] TileJSON with default rendering application/json
4 97175834 [{'summary': 'No Data', 'values': [0]}, {'summ... https://ai4edataeuwest.blob.core.windows.net/i... [{'nodata': 0, 'spatial_resolution': 10}] ... https://planetarycomputer.microsoft.com/api/da... [tiles] TileJSON with default rendering application/json

5 rows × 15 columns

And the explode method transforms each element of a list-like value into its own row:

[14]:
assets["data.file:values"]
[14]:
0    [{'summary': 'No Data', 'values': [0]}, {'summ...
1    [{'summary': 'No Data', 'values': [0]}, {'summ...
2    [{'summary': 'No Data', 'values': [0]}, {'summ...
3    [{'summary': 'No Data', 'values': [0]}, {'summ...
4    [{'summary': 'No Data', 'values': [0]}, {'summ...
Name: data.file:values, dtype: object
[15]:
assets["data.file:values"].explode()
[15]:
0               {'summary': 'No Data', 'values': [0]}
0                 {'summary': 'Water', 'values': [1]}
0                 {'summary': 'Trees', 'values': [2]}
0    {'summary': 'Flooded vegetation', 'values': [4]}
0                 {'summary': 'Crops', 'values': [5]}
0            {'summary': 'Built area', 'values': [7]}
0           {'summary': 'Bare ground', 'values': [8]}
0              {'summary': 'Snow/ice', 'values': [9]}
0               {'summary': 'Clouds', 'values': [10]}
0            {'summary': 'Rangeland', 'values': [11]}
1               {'summary': 'No Data', 'values': [0]}
1                 {'summary': 'Water', 'values': [1]}
1                 {'summary': 'Trees', 'values': [2]}
1    {'summary': 'Flooded vegetation', 'values': [4]}
1                 {'summary': 'Crops', 'values': [5]}
1            {'summary': 'Built area', 'values': [7]}
1           {'summary': 'Bare ground', 'values': [8]}
1              {'summary': 'Snow/ice', 'values': [9]}
1               {'summary': 'Clouds', 'values': [10]}
1            {'summary': 'Rangeland', 'values': [11]}
2               {'summary': 'No Data', 'values': [0]}
2                 {'summary': 'Water', 'values': [1]}
2                 {'summary': 'Trees', 'values': [2]}
2    {'summary': 'Flooded vegetation', 'values': [4]}
2                 {'summary': 'Crops', 'values': [5]}
2            {'summary': 'Built area', 'values': [7]}
2           {'summary': 'Bare ground', 'values': [8]}
2              {'summary': 'Snow/ice', 'values': [9]}
2               {'summary': 'Clouds', 'values': [10]}
2            {'summary': 'Rangeland', 'values': [11]}
3               {'summary': 'No Data', 'values': [0]}
3                 {'summary': 'Water', 'values': [1]}
3                 {'summary': 'Trees', 'values': [2]}
3    {'summary': 'Flooded vegetation', 'values': [4]}
3                 {'summary': 'Crops', 'values': [5]}
3            {'summary': 'Built area', 'values': [7]}
3           {'summary': 'Bare ground', 'values': [8]}
3              {'summary': 'Snow/ice', 'values': [9]}
3               {'summary': 'Clouds', 'values': [10]}
3            {'summary': 'Rangeland', 'values': [11]}
4               {'summary': 'No Data', 'values': [0]}
4                 {'summary': 'Water', 'values': [1]}
4                 {'summary': 'Trees', 'values': [2]}
4    {'summary': 'Flooded vegetation', 'values': [4]}
4                 {'summary': 'Crops', 'values': [5]}
4            {'summary': 'Built area', 'values': [7]}
4           {'summary': 'Bare ground', 'values': [8]}
4              {'summary': 'Snow/ice', 'values': [9]}
4               {'summary': 'Clouds', 'values': [10]}
4            {'summary': 'Rangeland', 'values': [11]}
Name: data.file:values, dtype: object
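
The two steps compose naturally: exploding the list column and then normalizing the resulting dictionaries yields one tidy row per class value (a sketch; set_index keeps track of which item each row came from):

values = assets["data.file:values"].explode()

# Expand each class dict into columns, preserving the item index.
classes = pd.json_normalize(values.tolist()).set_index(values.index)
classes.head()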