Announcing the Launch of the AI/ML Enhancement Project for GEP and Urban TEP Exploitation Platforms

simonevaccari · June 13, 2024, 9:34am

AI/ML Enhancement Project - Discovering Labelled EO Data with STAC

Introduction

The use of the SpatioTemporal Asset Catalogs (STAC) format is crucial when it comes to search and discover spatio-temporal datasets, including labelled Earth Observation (EO) data. It allows filtering search results using STAC metadata as query parameters, such as spatial and temporal extents, resolution, and other properties. As well as ensuring that the data becomes discoverable and accessible to other stakeholders (e.g. users, researchers, policymakers, etc), the use of STAC brings several other benefits, including enhancing the reproducibility and transparency of the process and its result.

This post presents User Scenario 4 of the AI/ML Enhancement Project, titled “Alice discovers the labelled EO data”. It demonstrates how the enhancements being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) will support users exploiting STAC format to discover labelled EO data.

To demonstrate these new capabilities defined in this User Scenario, an interactive Jupyter Notebook is used to guide an ML practitioner, such as Alice, in the process of exploiting the STAC format to discover labelled EO data, including:

Understanding the STAC format
Accessing STAC via STAC Browser and STAC API
Connectivity with dedicated S3 storage

Practical examples and commands are displayed to demonstrate how these new capabilities can be used from a Jupyter Notebook.

Understanding STAC

The SpatioTemporal Asset Catalog (STAC) specification was designed to establish a standard, unified language to talk about geospatial data, allowing it to be more easily searchable and queryable. By defining query parameters based on STAC metadata, such as spatial and temporal extents, resolution, and other properties, the user can narrow down a search with only those datasets that align with the specific requirements.

There are four components specifications that together make up the core STAC specification:

STAC Item: the core unit representing a single spatiotemporal asset as a GeoJSON feature with datetime and links.
STAC Catalog: a simple, flexible JSON file of links that provides a structure to organize and browse STAC Items.
STAC Collection: an extension of the STAC Catalog with additional information such as the extents, license, keywords, providers, etc., that describe STAC Items that fall within the Collection.
STAC API: it provides a RESTful endpoint that enables search of STAC Items, specified in OpenAPI, following OGC’s WFS 3.

A STAC Catalog is used to group STAC objects like Items, Collections, and/or even other Catalogs.

Below are shown some commands of the pystac library that can be used to extract information from a STAC Catalog / Item / Collection.

import pystac

# Read STAC Catalog from file and explore High-Level Catalog Information
cat = Catalog.from_file(url)
cat.describe()

# Print some key metadata
print(f"ID: {cat.id}")
print(f"Title: {cat.title or 'N/A'}")
print(f"Description: {cat.description}")

# Access to STAC Child Catalogs and/or Collections
col = [col for col in cat.get_all_collections()]

# Explore STAC Item Metadata
item = cat.get_item(id=<item_id>, recursive=True)

More information can be found in the official STAC documentation.

Accessing STAC via STAC Browser and STAC API

There are two ways to discover STAC data: by using the STAC Browser or by using the STAC API.

Accessing using STAC Browser

The STAC Browser provides a user-friendly graphical interface that facilitates the search and discovery of datasets. A few screenshots of the graphical interface are provided below.

The dedicated STAC Browser app can be launched by the user at login with the option STAC Browser for AI-Extensions STAC API. The STAC Catalog and Collections available on the App Hub project endpoint will be displayed.

After selecting a specific collection, the query parameters can be manually specified with the dedicated widgets in the Filters section (temporal and spatial extents in this case).

The search results are then shown after clicking Submit. In the example screenshot below, it is shown a single STAC Item with its key metadata.

Despite its user-friendly interface, the use of the STAC Browser is quite limited to manual interactions with the user, making it difficult and time consuming when performing multiple searches with different parameters, for example. For this reason, the use of the STAC Browser is primarily designed for manual exploration and is less suited for automated workflows.

Accessing using STAC API

The STAC API allows for programmatic access to data, enabling automation of data discovery, retrieval, and processing workflows. This is particularly useful for integrating STAC data into larger geospatial data processing pipelines or applications…

import requests

# Define payload for token request
payload = {
  "client_id": "ai-extensions",
  "username": "ai-extensions-user",
  "password": os.environ.get("IAM_PASSWORD"),
  "grant_type": "password",
}

auth_url = 'https://iam-dev.terradue.com/realms/ai-extensions/protocol/openid-connect/token'
token = get_token(url=auth_url, **payload)
headers = {"Authorization": f"Bearer {token}"}

Once the authentication credentials are defined, the private STAC Catalog can be accessed and searched using specific query parameters, such as time range, area of interest, and data collection preferences. Subsequently, only the STAC Item(s) that align with the provided criteria is(are) retrieved for the user. This can be achieved with the pystac and pystac_client libraries.

# Import libraries
import pystac
from pystac_client import Client

# Define STAC endpoint and access to the Catalog
stac_endpoint = "https://ai-extensions-stac.terradue.com"
cat = Client.open(stac_endpoint, headers=headers, ignore_conformance=True)

# Define query parameters
start_date = datetime.strptime(“20230601”, '%Y%m%d')
end_date = datetime.strptime(“20230630”, '%Y%m%d')
bbox = [-121.857043 37.853934 -120.608968 38.840424]
tile = “10SFH”

# Query by AOI, start and end date
query_sel = cat.search(
  collections=[“ai-extensions-svv-dataset-labels”],
  datetime=(start_date, end_date),
  bbox=bbox,
)
item = [item for item in query_sel.item_collection() if tile in item.id][0]

# Display Item
display(item)

Connectivity with dedicated S3 storage

Up until now the user accessed the STAC endpoint for exploring the Catalog and its Collections / Items. In this section we describe the process to access the data referenced in the Item’s assets, which are stored in a dedicated S3 bucket.

The AWS S3 configuration settings are defined in a .json file (eg appsettings.json), which is used to create a UserSettings object. This will be used to create a configured S3 client to retrieve an object stored on S3, using boto3 and botocore libraries.

# Import libraries
import botocore, boto3

# Define AWS S3 settings
settings = UserSettings("appsettings.json")
settings.set_s3_environment(<asset_s3_path>)

# Start botocore session
session = botocore.session.Session()

# create client obj
s3_client = session.create_client(
  service_name="s3",
  region_name=os.environ.get("AWS_REGION"),
  use_ssl=True,
  endpoint_url=os.environ.get("AWS_S3_ENDPOINT"),
  aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
  aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
)
parsed = urlparse(geojson_url)

# retrieve bucket name
bucket = parsed.netloc
key = parsed.path[1:]

# retrive the obj which was stored on s3
response = s3_client.get_object(Bucket=bucket, Key=key)

The user can then download locally the file stored on S3 using io library.

import io

geojson_content = io.BytesIO(respond["Body"].read())
fname = './output/downloaded.geojson'

# Save the GeoJSON content to a local file
with open(fname, "wb") as file:
  file.write(geojson_content.getvalue())

The user can also import the downloaded data into this Notebook. In this example, the downloaded .geojson file is loaded and converted into a pandas dataframe.

import geopandas as gpd

# Make geodataframe out of the downloaded .geojson file
gdf = gpd.read_file(fname)
gdf

s3-geodataframe

Conclusion

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project to help a ML practitioner exploiting the STAC format to discover labelled EO data, including:

Understanding the STAC format
Accessing STAC via STAC Browser and STAC API
Connectivity with dedicated S3 storage

Useful links:

The link to the Notebook for User Scenario 4 is: https://github.com/ai-extensions/notebooks/blob/develop/scenario-4/s4-discoveringLabelledEOData.ipynb
Note: access to this Notebook must be granted - please send an email to support@terradue.com with subject “Request Access to s4-discoveringLabelledEOData” and body “Please provide access to Notebook for AI Extensions User Scenario 4”
The user manual of the AI/ML Enhancement Project Platform is available at AI-Extensions Application Hub - User Manual
Project Update “AI/ML Enhancement Project - Progress Update”
User Scenario 1 “AI/ML Enhancement Project - Exploratory Data Analysis”
User Scenario 2 “AI/ML Enhancement Project - Labelling EO Data”
User Scenario 3 “AI/ML Enhancement Project - Describing labelled EO data”