Announcing the Launch of the AI/ML Enhancement Project for GEP and Urban TEP Exploitation Platforms

We are excited to announce the launch of a new project aimed at augmenting the capabilities of two Ellip-powered Exploitation platforms, the Geohazards Exploitation Platform (GEP) and the Urban Thematic Exploitation Platform (U-TEP). The project’s primary objective is to seamlessly integrate an AI/ML processing framework into both platforms to enhance their services and empower service providers to develop and deploy AI/ML models for improved geohazards and urban management applications.

Project Overview
The project will focus on integrating a comprehensive AI/ML processing framework that covers the entire machine learning pipeline, including data discovery, training data, model development, deployment, hosting, monitoring, and visualization. A critical aspect of this project will be the integration of MLOps processes into both GEP and Urban TEP platforms’ service offerings, ensuring the smooth operation of AI-driven applications on the platforms.

GEP and Urban TEP Platforms
GEP is designed to support the exploitation of satellite Earth Observations for geohazards, focusing on mapping hazard-prone land surfaces and monitoring terrain deformation. It offers over 25 services for monitoring terrain motion and critical infrastructures, with more than 2500 registered users actively participating in content creation.

Urban TEP aims to provide end-to-end and ready-to-use solutions for a broad spectrum of users to extract unique information and indicators required for urban management and sustainability. It focuses on bridging the gap between the mass data streams and archives of various satellite missions and the information needs of users involved in urban and environmental science, planning, and policy.

Project Partners
The project brings together a strong partnership of experienced organizations, including Terradue, CRIM, Solenix, and Gisat. These partners have a proven track record in various aspects of Thematic Exploitation Platforms, cloud research platforms, AI/ML applications, and EO data analytics.

Expected Outcomes
Upon successful completion, the project will result in the enhancement of both GEP and Urban TEP platforms and their service offerings. The addition of AI/ML capabilities will empower service providers to develop and deploy AI/ML models, ultimately improving their services and delivering added value to their customers. This enhancement will greatly benefit the GEP and Urban TEP platforms by expanding their capabilities and enabling new AI-driven applications for geohazards and urban management.

Discussion Points:

  1. How do you foresee AI/ML capabilities enhancing the services provided by GEP and Urban TEP?
  2. What challenges do you anticipate in integrating AI/ML processing frameworks into existing platforms?
  3. Which use cases do you believe would benefit the most from the addition of AI/ML capabilities in GEP and Urban TEP?

We encourage you to share your thoughts, ideas, and experiences related to the project. Let’s discuss the potential impact and improvements this project can bring to the GEP and Urban TEP platforms and their user communities.


AI/ML Enhancement Project - Progress Update

Background

One year has passed since the announcement of the AI/ML Enhancement Project launch (see post). This project innovatively integrates cutting-edge Artificial Intelligence (AI) and Machine Learning (ML) technologies into Earth Observation (EO) platforms like Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) through MLOps - the fusion of ML with DevOps principles.

Leveraging these platforms’ extensive EO data usage, the new AI extensions promise enhanced efficiency, accuracy, and functionalities. The integration of these new capabilities unlocks advanced data processing, predictive modelling, and automation, strengthening capabilities in urban management and geohazard assessment.

User Personas, User Scenarios and Showcases

For the project implementation we have identified two types of users:

  • An ML Practitioner, whom we will call “Alice”: an expert in building and training ML models, selecting appropriate algorithms, analysing data, and using ML techniques to solve real-world problems.
  • A Consumer, whom we will call “Eric”: a stakeholder or user (e.g. a business owner, a customer, a researcher) who benefits from, or relies upon, the insights or predictions generated by the ML models to inform their decision-making.

From these users we have derived ten User Scenarios that capture the key activities and goals of these types of users in utilising the service. The user scenarios are:

  • User Scenario 1 - Alice does Exploratory Data Analysis (EDA)
  • User Scenario 2 - Alice labels Earth Observation data
  • User Scenario 3 - Alice describes the labelled Earth Observation data
  • User Scenario 4 - Alice discovers labelled Earth Observation data
  • User Scenario 5 - Alice develops a new Machine Learning model
  • User Scenario 6 - Alice starts a training job on a remote machine
  • User Scenario 7 - Alice describes her trained machine learning model
  • User Scenario 8 - Alice reuses an existing pre-trained model
  • User Scenario 9 - Alice creates a training dataset
  • User Scenario 10 - Eric discovers a model and consumes it

From these user scenarios, three Showcases were selected to develop and apply AI approaches in different contexts, in order to validate and verify the activities of the AI Extensions service:

  • “Urban greenery” showcase: mapping urban greenery with EO data, with a focus on monitoring urban heat patterns and preventing flooding in urban areas.
  • “Informal settlement” showcase: AI approaches in the context of urban management, specifically targeting the challenges posed by informal settlements.
  • “Geohazards - volcanoes” showcase: AI approaches for EO data for monitoring and assessing volcanic hazards.

Project Status

The first release of this project was critical in setting the foundation as it focused on developing a cloud-based environment and related tools that enabled users to work with EO data and data labels. With the successful completion of the second release, the user is now able to build and train ML models with EO data labels effectively.

The project implementation for the User Scenarios focused on developing interactive Jupyter Notebooks that aim to validate and verify all the key requirements of the activities performed in each Scenario.

To date, Jupyter Notebooks for User Scenarios 1 - 5 have been developed and validated.

Upcoming Work

The project’s future phases are eagerly anticipated. Release 3 will focus on enabling users to train their ML models on remote machines, while Release 4 will empower them to execute these models from the stakeholder/end-user Eric’s perspective. This progression underscores a strategic roadmap towards making GEP and U-TEP powerful platforms for data analysis and interpretation using advanced AI techniques.

Dedicated articles will be published in the coming weeks, describing the activities and main outcomes of each Scenario / Notebook, so stay tuned! :wink:


AI/ML Enhancement Project - Exploratory Data Analysis User Scenario

Introduction

Exploratory Data Analysis (EDA) is an essential step in the workflow of a data scientist or machine learning (ML) practitioner. The purpose of EDA is to analyse the data that will be used to train and evaluate ML models. This new capability brought by the AI/ML Enhancement Project will support users of the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) in better understanding the dataset structure and its properties, discovering missing values and possible outliers, and identifying correlations and patterns between features that can be used to tailor and improve model performance.

This post presents User Scenario 1 of the AI/ML Enhancement Project, titled “Alice does Exploratory Data Analysis (EDA)”. For this user scenario, an interactive Jupyter Notebook has been developed to guide an ML practitioner, such as Alice, through implementing EDA on her data. The Notebook first introduces the connectivity with a STAC catalog, interacting with the STAC API to search and access EO data and labels by defining specific query parameters (we will cover this in a dedicated article). Subsequently, the user loads the input dataframe and performs the EDA steps for understanding her data, such as data cleaning, correlation analysis, histogram plotting and feature engineering. Practical examples and commands are displayed to demonstrate how simply this Notebook can be used for this purpose.

Input Dataframe

The input data consists of point data labelled with three classes, with features extracted from Sentinel-2 reflectance bands. Three vegetation indices were also computed from selected bands. The pre-arranged dataframe was loaded using the pandas library and is composed of 13 columns:

  • column CLASSIFICATION: defines the land cover class of each label; the available classes are VEGETATION, NOT_VEGETATED and WATER.
  • columns with reflectance bands: extracted from the spectral bands of six Sentinel-2 scenes: coastal, red, green, blue, nir, nir08, nir09, swir16, and swir22.
  • columns for vegetation indices, calculated from the reflectance bands: ndvi, ndwi1, and ndwi2.

A snapshot of how the dataframe can be loaded and displayed is shown below.

import pandas as pd

# Load the pre-arranged dataframe and display it
dataset = pd.read_pickle('./input/dataframe.pkl')
dataset

This analysis focused on differentiating between “water” and “no-water” labels; a pre-processing operation was therefore performed on the dataframe to re-classify the VEGETATION and NOT_VEGETATED labels as “no-water”. This can be quickly achieved with the command below:

LABEL_NAME = 'water'
# Create a binary label column: 1 = water, 0 = no-water
dataset[LABEL_NAME] = dataset['CLASSIFICATION'].apply(lambda x: 1 if x == 'WATER' else 0)

Data Cleaning

After loading, the user can inspect the dataframe with the pandas method dataset.info(), which gives a quick overview of the data, such as the number of rows and columns and the data types. A further statistical analysis can then be performed for each feature with dataset.describe(), which extracts relevant information including count, mean, min and max, standard deviation, and the 25%, 50% and 75% percentiles.

dataset.info()
dataset.describe()

The user can quickly check whether null values are present in the dataframe. In general, if features with null values are identified, the user should either remove them from the dataframe, or convert or assign them to appropriate values, if known.

dataset[dataset.isnull().any(axis=1)]

Correlation Analysis

The correlation analysis between “water” and “no-water” pixels for all features was performed with the pairplot() function of the seaborn library.

import seaborn as sns

# Pairwise relationships between the numeric features, coloured by the binary label
sns.pairplot(dataset, hue=LABEL_NAME, kind='reg', palette="Set1")

This simple command generates multiple pairwise bivariate distributions of all features in the dataset, with the diagonal plots showing univariate distributions. It displays the relationships for all pairwise combinations of variables in the dataframe as a matrix of plots, as depicted in the figure below (with ‘water’ points shown in blue).

The correlation between variables can also be visually represented by the correlation matrix, generated with the pandas corr() method and rendered with the seaborn heatmap() function (see figure below). Each cell in the matrix represents the correlation coefficient, which quantifies the degree to which two variables are linearly related. Values close to 1 (in yellow) and -1 (in dark blue) respectively represent positive and negative correlations, while values close to 0 represent no correlation. The matrix is highly customisable, with different formats and colour maps available.
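
As a hedged illustration of how such a matrix can be produced (the styling of the figure in the Notebook may differ), the matrix can be computed with pandas and rendered with seaborn:

import matplotlib.pyplot as plt
import seaborn as sns

# Compute the correlation matrix on the numeric columns and plot it as a heatmap
corr_matrix = dataset.select_dtypes(include='number').corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="viridis", vmin=-1, vmax=1)
plt.title("Correlation matrix")
plt.show()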

Distribution Density Histograms

Another good practice is to inspect the distribution density of values for each feature column. The user can target the distribution of specific features in relation to the corresponding “water” label, plot this over the histograms, and save the output figure to file.

import matplotlib.pyplot as plt

# Plot the distribution of each numeric feature (excluding the binary label column)
numeric_cols = dataset.select_dtypes(include='number').drop(columns=[LABEL_NAME]).columns
plt.figure(figsize=(15, 12))
for i, c in enumerate(numeric_cols):
    plt.subplot(4, 3, i + 1)
    sns.histplot(dataset[c], kde=True)   # distplot is deprecated in recent seaborn versions
    plt.title(f'Distribution plot for field: {c}')

plt.tight_layout()
plt.savefig('./distribution_hist.png')   # save the figure once, after all subplots are drawn

Outliers detection

The statistical analysis and histogram plots provide an assessment of the data distribution of each feature. To further analyse the distributions, it is advisable to conduct a dedicated analysis to detect possible outliers in the data. The Tukey IQR method identifies outliers as values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles, i.e. below Q1 - 1.5 IQR or above Q3 + 1.5 IQR. An example of the Tukey IQR method applied to the NDVI index is shown below:

import numpy as np

def find_outliers_tukey(x):
   q1 = np.percentile(x,25)
   q3 = np.percentile(x,75)
   iqr = q3 - q1
   floor = q1 - 1.5*iqr
   ceiling = q3 + 1.5*iqr
   outlier_indices = list(x.index[(x<floor) | (x>ceiling)])
   outlier_values = list(x[outlier_indices])
   return outlier_indices,outlier_values

tukey_indices, tukey_values = find_outliers_tukey(dataset['ndvi'])

Feature engineering and dimensionality reduction

Feature engineering can be used when the available features are not sufficient for training an ML model, for example when only a small number of features is available or they are not representative enough. In such cases, feature engineering can increase and/or improve the representativeness of the dataframe. Here, the PolynomialFeatures class from the sklearn library was used to increase the overall number of features through iterative combinations of the available features. For an algorithm like a Random Forest, where explicit decision rules are learned, adding more features through feature engineering can provide a substantial improvement. Algorithms like convolutional neural networks, however, might not need this, since they can extract patterns directly from the data.

from sklearn.preprocessing import PolynomialFeatures

# Generate interaction features from the numeric columns only
poly = PolynomialFeatures(interaction_only=True, include_bias=False)
new_dataset = pd.DataFrame(poly.fit_transform(dataset.select_dtypes(include='number')))
new_dataset

Principal Component Analysis (PCA), on the other hand, is a technique that transforms a dataset with many features into fewer principal components that best summarise the variance underlying the data. It can be used to extract the principal components from the features so that they can be used in training. PCA is also available in the sklearn Python library.

from sklearn.decomposition import PCA

# All components are kept here; choose a smaller n_components to actually reduce dimensionality
pca = PCA(n_components=len(new_dataset.columns))
X_pca = pd.DataFrame(pca.fit_transform(new_dataset))

Conclusion

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project, with simple steps and commands that an ML practitioner like Alice can take to analyse a dataframe in the preparatory step of an ML application lifecycle. Using this Jupyter Notebook, Alice can iteratively conduct the EDA steps to gain insights and analyse data patterns, calculate statistical summaries, generate histograms or scatter plots, understand correlations between features, and share the results with her colleagues.

Useful links:


AI/ML Enhancement Project - Labelling EO Data User Scenario 2

Introduction

Labelling data is a crucial step in the process of developing supervised Machine Learning (ML) models. It involves the critical task of assigning relevant labels or categories to different features within the data, such as land cover classes (e.g. vegetation, water bodies, urban areas) or other physical characteristics of the Earth’s surface. These labels can be multi-class (e.g. forest, grassland, urban) or binary (e.g. water or non-water).

This post presents User Scenario 2 of the AI/ML Enhancement Project, titled “Alice labels Earth Observation (EO) data”. It demonstrates how the enhancements being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) will support users labelling EO data.

For this User Scenario, an interactive Jupyter Notebook is used to guide an ML practitioner, such as Alice, through the following steps:

  • create data labels, using QGIS Software or a Solara / Leafmap application
  • load Labels and Sentinel-2 data using STAC API
  • sample Sentinel-2 data with Labels and create a dataframe
  • validate the labelled data against the Global Surface Water (GSW) dataset
  • use the dataframe to train a ML model based on a Random Forest classifier
  • perform raster inference on a Sentinel-2 scene to generate a binary water mask

Practical examples and commands are displayed to demonstrate how these new capabilities can be used from a Jupyter Notebook.

Labelling EO data

The process for creating vector (point or polygon) data layers is illustrated with two examples:

  • QGIS Software: a dedicated profile on the App Hub is configured for the user to work with QGIS Software (more details can be found in the App Hub online User Manual). The steps to create new Shapefile Layers, add classification types for each point / polygon, and save the output in GeoJSON format are illustrated with several screenshots.
  • Solara / Leafmap application: an interactive map, built on Solara and Leafmap, has been integrated in the Notebook to give the option to the user to manually create and save labels right from the Notebook itself.

After the annotations are created, either from QGIS or from the Solara / Leafmap interactive map, and saved into a .geojson file, the user can create the STAC Item of the EO labels, and publish it on the STAC endpoint. This is done with the pystac Python library and an interactive form right in the Notebook.
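
For reference, a bare-bones sketch of what such a STAC Item could look like with pystac is given below. The id, file name and extent are placeholders, and User Scenario 3 describes the full procedure used by the project, including the label-specific STAC extension.

from datetime import datetime
import pystac

bbox = [-121.86, 37.85, -120.61, 38.84]          # placeholder extent of the labels
geometry = {
    "type": "Polygon",
    "coordinates": [[
        [bbox[0], bbox[1]], [bbox[2], bbox[1]],
        [bbox[2], bbox[3]], [bbox[0], bbox[3]], [bbox[0], bbox[1]],
    ]],
}

label_item = pystac.Item(
    id="my-labels",                              # placeholder id
    geometry=geometry,
    bbox=bbox,
    datetime=datetime.utcnow(),
    properties={},
)
label_item.add_asset(
    "labels",
    pystac.Asset(href="labels.geojson", media_type="application/geo+json", roles=["data"]),
)
label_item.validate()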

Load Labels and EO data with STAC API

Access to Labels and EO data is provided through the pystac and pystac_client libraries. These libraries enable users to interact with a STAC catalog by defining specific query parameters, such as time range, area of interest, and data collection preferences. Only the STAC Items that align with the provided criteria are then retrieved for the user.

A simplified code snippet for implementing a STAC data search and for displaying the results on an interactive map is given below. An upcoming article, dedicated to the STAC format and data access, will provide more guidance and examples.

Search data using STAC API

# Import libraries
import pystac
from pystac_client import Client

# Access to STAC Catalog
cat = Client.open("https://ai-extensions-stac.terradue.com", ...)

# Define query parameters
start_date = "2023-06-01"
end_date = "2023-06-30"
bbox = [-121.857043, 37.853934, -120.608968, 38.840424]
cloud_cover = 30
tile = "10SFH"

# Search Labels by AOI, start/end date
query_sel = cat.search(
  collections=["ai-extensions-svv-dataset-labels"],
  datetime=(start_date, end_date),
  bbox=bbox,
)

labels = query_sel.item_collection()

# Search EO data (Sentinel-2) by AOI, start/end date, cloud cover and tile number
query_sel = cat.search(
  collections=["sentinel-2-l2a"],
  datetime=(start_date, end_date),
  bbox=bbox,
  query={"eo:cloud_cover": {"lt": cloud_cover}},
)

eo_item = [item for item in query_sel.item_collection() if tile in item.id][0]

Plot Labels and EO data on interactive map

Once the Label data is loaded, it is converted into a geodataframe (gdf) using the geopandas library. The Python library folium is then used to display both the Labels and the EO data on an interactive map.

import folium
from folium import GeoJson, LayerControl, plugins

# Create the map, centred on the area of interest (x, y = centre latitude and longitude)
map = folium.Map(location=[x, y], tiles="OpenStreetMap", zoom_start=9)

# Add Labels to map
map = addPoints2Map(gdf, map)

# Add footprint of EO scene
footprint_eo = folium.GeoJson(eo_item.geometry, style_function=lambda x: {...})
footprint_eo.add_to(map)

# Visualise map
map

Sample EO data with labels

After loading the data, the Notebook continues with a function that iteratively samples the EO data at each labelled point. In addition to sampling a selection of the Sentinel-2 reflectance bands (coastal, red, green, blue, nir, nir08, nir09, swir16, and swir22), three vegetation indices are also calculated (ndvi, ndwi1, and ndwi2). After sampling the EO bands and calculating the vegetation indices, all the data is concatenated into a pandas DataFrame.

import pandas as pd

tmp_gdfs = []
for i, label_item in enumerate(eo_items):
  sampled_data = sample_data(label_item=label_item, common_bands=["coastal", "red", "green", "blue", "nir", "nir08", "nir09", "swir16", "swir22"])
  tmp_gdfs.append(sampled_data)

# Create pandas dataframe
gdf_points = pd.concat(tmp_gdfs)

# Save to file
gdf_points.to_pickle("filename.pkl")

Validation against Reference Dataset

A comparison against another, independent, dataset was performed to show a validation approach of the labelled data. As a validation dataset, we used the Global Surface Water (GSW) dataset, generated by JRC (Citation: Pekel, Jean-François; Cottam, Andrew; Gorelick, Noel; Belward, Alan (2017): Global Surface Water Explorer dataset. European Commission, Joint Research Centre (JRC), http://data.europa.eu/89h/jrc-gswe-global-surface-water-explorer-v1).

The comparison was performed by iterating through the generated labels dataframe and counting the number of points labelled as “water” that were also classified as water in the GSW dataset (i.e. with a pixel occurrence value higher than 80%).
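
A possible way to implement this check is sketched below, assuming the GSW occurrence layer is available as a local GeoTIFF (the file name is a placeholder) and that the labelled points are stored in the gdf_points GeoDataFrame in the same CRS as the raster:

import rasterio

# Select the points labelled as water and collect their coordinates
water_points = gdf_points[gdf_points["CLASSIFICATION"] == "WATER"]
coords = [(geom.x, geom.y) for geom in water_points.geometry]

# Sample the GSW water occurrence layer (values 0-100 %) at each point
with rasterio.open("gsw_occurrence.tif") as src:
    occurrence = [value[0] for value in src.sample(coords)]

confirmed = sum(1 for value in occurrence if value > 80)
print(f"{confirmed}/{len(water_points)} water labels confirmed by GSW (> 80% occurrence)")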

EO labelled data for Supervised ML task

Dataset preparation

The dataframe was prepared for the supervised ML task by converting it into a binary classification dataset (i.e. “water” and “no-water”) and by removing unnecessary columns. Further and more detailed analysis of the dataframe can be performed through Exploratory Data Analysis (EDA); see the recently published article dedicated to EDA for more details and guidance.
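
A possible preparation step is sketched below; the column names are assumptions based on the dataframe created earlier, and the project Notebook may perform this differently:

# Collapse the land-cover classes into a binary label and drop columns not used as features
train_dataset = gdf_points.copy()
train_dataset['CLASSIFICATION'] = train_dataset['CLASSIFICATION'].apply(
    lambda x: 'water' if x == 'WATER' else 'no-water'
)
train_dataset = train_dataset.drop(columns=['geometry'], errors='ignore')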

The dataset was then split into train and test with the dedicated function train_test_split() from the sklearn package.

from sklearn.model_selection import train_test_split

# columns used as features during training
feature_cols = ['coastal','red','green','blue','nir','nir08','nir09','swir16','swir22', 'ndvi', 'ndwi1', 'ndwi2']

# column name for label
LABEL_NAME = 'CLASSIFICATION'

features = train_dataset[feature_cols] # cols for features
label = train_dataset[LABEL_NAME] # col for labels
X_train, X_test, y_train, y_test = train_test_split(
  features, label,
  random_state=42,
  train_size=0.85,
)

ML Model

The ML model developed in this Notebook was a Random Forest classifier using k-fold cross validation. Random Forest is a powerful and versatile supervised ML algorithm that grows and combines multiple decision trees to create a “forest.” It can be used for both classification and regression problems. K-Fold Cross-Validation is a technique used in ML to assess the performance and generalisation ability of a model. The steps involved in the K-Fold Cross-Validation are:

  1. Split the dataset into K subsets, or “folds”.
  2. Train the model K times, each time using K-1 folds for training and the remaining fold for validation.
  3. Repeat this process so that each of the K folds is used exactly once as the validation data.
  4. Average the K results to produce a single estimate of model performance (see the sketch below).
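
The project wraps this procedure in its own Model class (see below); purely as an illustration of the concept, the same idea can be expressed with scikit-learn as follows:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative only: 5-fold cross-validation of a Random Forest on the training data
clf = RandomForestClassifier(n_estimators=200, random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='f1_macro')
print(f"F1 per fold: {scores}, mean: {scores.mean():.3f}")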

In the Notebook, the ML hyperparameters are defined and the model is trained with the project’s helper functions, such as the Model class defined in utils.py.

hyperparameters = {
  'n_estimators': 200,
  'criterion':'gini',
  'max_depth':None,
  'min_samples_split':2,
  'min_samples_leaf':1,
  'min_weight_fraction_leaf':0.0,
  'max_features':'sqrt',
  'max_leaf_nodes':None,
  'min_impurity_decrease':0.0,
  'bootstrap':True,
  'oob_score':False,
  'n_jobs':-1,
  'random_state':42,
  'verbose':0,
  'warm_start':True,
  'class_weight':None,
  'ccp_alpha':0.0,
  'max_samples':None
}

# Create the model object (the Model class is defined in utils.py)
model = Model(hyperparameters)

# training model using k-fold cross validation
estimators = model.training(X=X_train,Y=y_train,folds=5)

Model Evaluation

The model is evaluated on unseen data with the following evaluation metrics:

  • Accuracy: calculated as the ratio of correctly predicted instances to the total number of instances in the dataset
  • Recall: also known as sensitivity or true positive rate, recall is a metric that evaluates the ability of a classification model to correctly identify all relevant instances from a dataset
  • Precision: it evaluates the accuracy of the positive predictions made by a classification model
  • F1-score: it is a metric that combines precision and recall into a single value. It is particularly useful when there is an uneven class distribution (imbalanced classes) and provides a balance between precision and recall
  • Confusion Matrix: it provides a detailed breakdown of the model’s performance, highlighting instances of correct and incorrect predictions.

The code snippet below shows how the model can be evaluated, followed by the output of the evaluation metrics calculated during the process.

# evaluate model
best_model = model.evaluation(estimators,X_test, y_test)

[Figure: output of the evaluation metrics]
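
For reference, the same metrics can be computed directly with scikit-learn. The sketch below assumes best_model behaves like a fitted scikit-learn classifier and that the labels are the binary strings used above; it is not the project's evaluation helper.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Predict on the held-out test set and report the standard classification metrics
y_pred = best_model.predict(X_test)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, pos_label='water'))
print("Recall   :", recall_score(y_test, y_pred, pos_label='water'))
print("F1-score :", f1_score(y_test, y_pred, pos_label='water'))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))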

Other ways to evaluate the ML model are the distribution of the probability of predicted values, the Receiver Operating Characteristic (ROC) Curve, and the analysis of the permutation features importance. All three can be derived and plotted from within the Notebook with one simple line of code.

# Distribution of probability of predicted values
ml_helper.distribution_of_predicted_val(best_model, X_train, X_test)

# ROC Curve
ml_helper.roc(best_model,X_test,y_test)

# Permutation Importance
ml_helper.p_importance(best_model,X_test,y_test,hyperparameters,MODEL_OUTPUT_DIR)

Finally, the best ML model can be saved to a file so that it can be loaded and used in the future. The only prerequisite for applying the ML model is for the input dataset to have the same format as the training dataset described above.

import joblib

# Save the model to file
model_fname = 'best_rf_model.joblib'
joblib.dump(best_model, model_fname)

Raster Inference

Now the user can apply the ML model on a Sentinel-2 image to generate a binary water mask output. After loading the EO data and the ML model into the Notebook, the ML model is applied to make predictions over the entire input EO data. The steps to perform these operations are shown in the simplified code snippet below.

# Import libraries
import numpy as np
import rasterio
from rasterio.features import sieve

# Select EO assets from the loaded Sentinel-2 scene (eo_item)
fileList = {}
for f in eo_item.get_assets():
  if (f in feature_cols) or f == 'scl':
    fileList[f] = eo_item.get_assets()[f].href

# Load the ML model classifier
model = joblib.load(model_fname)

# Make predictions
predictions = ml_helper.readRastersToArray(model, fileList, feature_cols)

# Save predictions
df_predict = pd.DataFrame(predictions.ravel(), columns=['predictions'])
df_predict.to_pickle('prediction.pkl')

# Create binary mask
predictions = df_predict['predictions']
predictions = predictions.to_numpy().reshape((10980, 10980))

# Apply sieve operation to remove connected regions smaller than 1000 pixels
my_array_uint8 = predictions.astype(rasterio.uint8)
sieved = sieve(my_array_uint8, size=1000, connectivity=8)

# Use Scene Classification band to filter out clouds and bad data
with rasterio.open(fileList['scl']) as scl_src:
  scl = scl_src.read(1)
  scl = np.where(~np.isin(scl, [4, 5, 6, 7, 11]), np.nan, scl)
mask_out = np.where(~np.isnan(scl), sieved, np.nan)

# Plot the resulting water mask
import matplotlib.pyplot as plt
plt.imshow(mask_out, interpolation='none'); plt.title("Improved result")

[Figure: raster inference result (binary water mask)]

In the figure above, water bodies are plotted in yellow, non-water pixels in dark blue, and clouds are masked out in white (top-right corner of the image).

Conclusion

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project to help an ML practitioner:

  • create EO data labels, using QGIS Software or a Solara / Leafmap application
  • load Labels and EO data with STAC API
  • sample EO data with Labels and create a dataframe
  • use the dataframe to train a Random Forest classifier
  • perform raster inference on a selected Sentinel-2 scene to generate a binary water mask.

Useful links:

AI/ML Enhancement Project - Describing labelled EO Data with STAC

Introduction

The use of the SpatioTemporal Asset Catalog (STAC) format is crucial when it comes to describing spatio-temporal datasets, including labelled Earth Observation (EO) data. It allows the labelled EO data to be described with standardised sets of metadata that delineate its key properties, such as spatial and temporal extents, resolution, and other pertinent characteristics. The use of STAC brings several benefits, including enhancing the reproducibility and transparency of the process and its results, as well as ensuring that the data becomes discoverable and accessible to other stakeholders (e.g. users, researchers, policymakers).

This post presents User Scenario 3 of the AI/ML Enhancement Project, titled “Alice describes the labelled EO data”. It demonstrates how the enhancements being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) will support users in describing labelled EO data using the STAC format.

To demonstrate these new capabilities defined in this User Scenario, an interactive Jupyter Notebook is used to guide an ML practitioner, such as Alice, in the process of exploiting the STAC format to describe, publish, and search labelled EO data, including:

  • Load labelled EO data (a .geojson file) and display it as a geopandas dataframe
  • Show the labelled EO data on an interactive map
  • Generate a STAC Item and add metadata to it
  • Publish the STAC Item on a dedicated S3 bucket and on the STAC endpoint
  • Search for the STAC Item using the STAC API and query parameters

Practical examples and commands are displayed to demonstrate how these new capabilities can be used from a Jupyter Notebook.

Loading labelled EO data

A .geojson file of the labelled EO data was loaded into the notebook and converted into a geopandas dataframe.

import geopandas as gpd
import geojson

fname = './input/label-S2A_10SFH_20230519_0_L2A.geojson'

with open(fname) as f:
  gj = geojson.load(f)

# Make geodataframe out of the created object
gdf = gpd.read_file(fname)
gdf

[Figure: geodataframe of the labelled EO data]

The Python library folium was then used to display the labelled EO data on an interactive map.

import folium
from folium import GeoJson, LayerControl
import numpy as np

# Get extent and centre of dataframe points (total_bounds = [minx, miny, maxx, maxy])
bbox = gdf.geometry.total_bounds
center_lat, center_lon = np.average(bbox[1::2]), np.average(bbox[::2])

# Create map
map = folium.Map(location=[center_lat, center_lon], tiles="OpenStreetMap", zoom_start=9)

# Add Labels to map
map = addPoints2Map(gdf, map)

# Add layer control
LayerControl().add_to(map)

# Visualise map
map

[Figure: interactive map showing the labelled EO data]

Generate STAC Item

Before creating the STAC Item, the user defines the geometry of the vector data represented by the dataframe.

# Get geometry of dataframe points

label_geom = geojson.Polygon([[
  (bbox[0], bbox[1]),
  (bbox[2], bbox[1]),
  (bbox[2], bbox[3]),
  (bbox[0], bbox[3]),
  (bbox[0], bbox[1])
]])

The user can now create the STAC Item and populate it with relevant information, by exploiting the pystac library.

import pystac
from datetime import datetime

# Creating STAC Item
label_item = pystac.Item(
  id="<label_id>",
  geometry=label_geom,
  bbox=list(bbox),
  datetime=datetime.utcnow(),
  properties={},
)

The user defines a dictionary named label_classes to represent the classes for a classification task. The dictionary contains the class names for various land cover types, such as vegetation, water, clouds, shadows, and more. This mapping can be used to label and categorise data in a classification process.

The user can then apply the label-specific STAC Extension with the defined label classes.

from pystac.extensions.label import LabelExtension, LabelType, LabelClasses
from pystac.extensions.version import ItemVersionExtension

# Define label classes
label_classes = {
  "name": "CLASSIFICATION",
  "classes": [
    "NO_DATA",
    "SATURATED_OR_DEFECTIVE",
    "CAST_SHADOWS",
    "CLOUD_SHADOWS",
    "VEGETATION",
    "NOT_VEGETATED",
    "WATER",
    "UNCLASSIFIED",
    "CLOUD_MEDIUM_PROBABILITY",
    "CLOUD_HIGH_PROBABILITY",
    "THIN_CIRRUS",
    "SNOW or ICE",
  ],
}

# Apply label-specific STAC Extension “LabelExtension” with its related fields
label = LabelExtension.ext(label_item, add_if_missing=True)
label.apply(
   label_description="Land cover labels",
   label_type=LabelType.VECTOR,
   label_tasks=["segmentation", "regression"],
   label_classes=[LabelClasses(label_classes)],
   label_methods=["manual"],
   label_properties=["CLASSIFICATION"],
)

# Add geojson labels
label.add_geojson_labels(f"label-{label_id}.geojson")

# Add version
version = ItemVersionExtension(label_item)
version.apply(version="0.1", deprecated=False)

label_item.stac_extensions.extend(
   ["https://stac-extensions.github.io/version/v1.2.0/schema.json"]
)

Finally, the user validates the created STAC Item.

# Validate STAC Item
label_item.validate()
display(label_item)

Publish the STAC Item

The STAC endpoint and the STAC Collection in which to publish the STAC Item are first defined:

stac_endpoint = "https://ai-extensions-stac.terradue.com"
collection = read_file("input/collection/collection.json")

Subsequently, the STAC Item can be posted on a dedicated S3 bucket.

# Define filename and write locally
out_fname = f"item-label-{label_id}.json"
pystac.write_file(label_item, dest_href=out_fname)


# Define wrapper to write on S3 bucket
wrapper = StarsCopyWrapper()
exit_code, stdout, stderr = (
   wrapper.recursivity()
   .output(f"s3://ai-ext-bucket-dev/svv-dataset/{label_id}")
   .config_file("/etc/Stars/appsettings.json")
   .extract_archive(extract=False)
   .absolute_assets()
   .run(f"file://{os.getcwd()}/{out_fname}")
)

When the STAC Item is posted on S3, it can be published on the dedicated STAC endpoint.

# Define customized StacIO class
StacIO.set_default(CustomStacIO)

# Read catalog.json file posted on S3
catalog_url = f"s3://ai-ext-bucket-dev/svv-dataset/{label_id}/catalog.json"
catalog = read_url(catalog_url)

ingest_items(
   app_host=stac_endpoint,
   items=list(catalog.get_all_items()),
   collection=collection,
   headers=get_headers(),
)

Find STAC Item on STAC Catalog

Once the STAC Item is successfully published on the STAC endpoint, it can be searched using the pystac and pystac_client libraries. These libraries enable users to interact with a STAC catalog by defining specific query parameters, such as time range, area of interest, and data collection preferences. Only the STAC Items that align with the provided criteria are then retrieved for the user.

# Import libraries
import pystac
from pystac_client import Client
from datetime import datetime

# Access to STAC Catalog
cat = Client.open(stac_endpoint, headers=get_headers(), ignore_conformance=True)

# Define query parameters
start_date = datetime.strptime("20230601", '%Y%m%d')
end_date = datetime.strptime("20230630", '%Y%m%d')
bbox = [-121.857043, 37.853934, -120.608968, 38.840424]
tile = "10SFH"

# Query by AOI, start and end date
query_sel = cat.search(
    collections=["ai-extensions-svv-dataset-labels"],
    datetime=(start_date, end_date),
    bbox=bbox,
)
item = [item for item in query_sel.item_collection() if tile in item.id][0]

# Display Item
display(item)

Conclusion

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project to help an ML practitioner exploit the STAC format to describe, publish, and search labelled EO data, including:

  • Load labelled EO data (a .geojson file) and display it as a geopandas dataframe
  • Show the labelled EO data on an interactive map
  • Generate a STAC Item and add metadata to it
  • Publish the STAC Item on a dedicated S3 bucket and on the STAC endpoint
  • Search for the STAC Item using the STAC API and query parameters with pystac

Useful links:

AI/ML Enhancement Project - Discovering Labelled EO Data with STAC

Introduction

The use of the SpatioTemporal Asset Catalog (STAC) format is crucial when it comes to searching for and discovering spatio-temporal datasets, including labelled Earth Observation (EO) data. It allows search results to be filtered using STAC metadata as query parameters, such as spatial and temporal extents, resolution, and other properties. As well as ensuring that the data becomes discoverable and accessible to other stakeholders (e.g. users, researchers, policymakers), the use of STAC brings several other benefits, including enhancing the reproducibility and transparency of the process and its results.

This post presents User Scenario 4 of the AI/ML Enhancement Project, titled “Alice discovers the labelled EO data”. It demonstrates how the enhancements being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) will support users in exploiting the STAC format to discover labelled EO data.

To demonstrate these new capabilities defined in this User Scenario, an interactive Jupyter Notebook is used to guide an ML practitioner, such as Alice, in the process of exploiting the STAC format to discover labelled EO data, including:

  • Understanding the STAC format
  • Accessing STAC via STAC Browser and STAC API
  • Connectivity with dedicated S3 storage

Practical examples and commands are displayed to demonstrate how these new capabilities can be used from a Jupyter Notebook.

Understanding STAC

The SpatioTemporal Asset Catalog (STAC) specification was designed to establish a standard, unified language to talk about geospatial data, allowing it to be more easily searchable and queryable. By defining query parameters based on STAC metadata, such as spatial and temporal extents, resolution, and other properties, the user can narrow down a search with only those datasets that align with the specific requirements.

There are four component specifications that together make up the core STAC specification:

  • STAC Item: the core unit representing a single spatiotemporal asset as a GeoJSON feature with datetime and links.

  • STAC Catalog: a simple, flexible JSON file of links that provides a structure to organize and browse STAC Items.

  • STAC Collection: an extension of the STAC Catalog with additional information such as the extents, license, keywords, providers, etc., that describe STAC Items that fall within the Collection.

  • STAC API: provides a RESTful endpoint that enables the search of STAC Items, specified in OpenAPI and following OGC API - Features (formerly WFS 3).

A STAC Catalog is used to group STAC objects like Items, Collections, and/or even other Catalogs.

Some commands of the pystac library that can be used to extract information from a STAC Catalog / Item / Collection are shown below.

import pystac

# Read STAC Catalog from file and explore High-Level Catalog Information
cat = pystac.Catalog.from_file(url)
cat.describe()

# Print some key metadata
print(f"ID: {cat.id}")
print(f"Title: {cat.title or 'N/A'}")
print(f"Description: {cat.description}")

# Access to STAC Child Catalogs and/or Collections
col = [col for col in cat.get_all_collections()]

# Explore STAC Item Metadata
item = cat.get_item(id=<item_id>, recursive=True)

More information can be found in the official STAC documentation.

Accessing STAC via STAC Browser and STAC API

There are two ways to discover STAC data: by using the STAC Browser or by using the STAC API.

Accessing using STAC Browser

The STAC Browser provides a user-friendly graphical interface that facilitates the search and discovery of datasets. A few screenshots of the graphical interface are provided below.

The dedicated STAC Browser app can be launched by the user at login with the option STAC Browser for AI-Extensions STAC API. The STAC Catalog and Collections available on the App Hub project endpoint will be displayed.

After selecting a specific collection, the query parameters can be manually specified with the dedicated widgets in the Filters section (temporal and spatial extents in this case).

The search results are then shown after clicking Submit. In the example screenshot below, a single STAC Item is shown with its key metadata.

Despite its user-friendly interface, the STAC Browser is limited to manual interaction, which makes performing multiple searches with different parameters difficult and time consuming. For this reason, the STAC Browser is primarily designed for manual exploration and is less suited to automated workflows.

Accessing using STAC API

The STAC API allows for programmatic access to data, enabling automation of data discovery, retrieval, and processing workflows. This is particularly useful for integrating STAC data into larger geospatial data processing pipelines or applications. The snippet below requests an authentication token for accessing the private STAC Catalog.

import os
import requests

# Define payload for token request
payload = {
  "client_id": "ai-extensions",
  "username": "ai-extensions-user",
  "password": os.environ.get("IAM_PASSWORD"),
  "grant_type": "password",
}

auth_url = 'https://iam-dev.terradue.com/realms/ai-extensions/protocol/openid-connect/token'
token = get_token(url=auth_url, **payload)
headers = {"Authorization": f"Bearer {token}"}

Once the authentication credentials are defined, the private STAC Catalog can be accessed and searched using specific query parameters, such as time range, area of interest, and data collection preferences. Only the STAC Items that align with the provided criteria are then retrieved for the user. This can be achieved with the pystac and pystac_client libraries.

# Import libraries
import pystac
from pystac_client import Client
from datetime import datetime

# Define STAC endpoint and access to the Catalog
stac_endpoint = "https://ai-extensions-stac.terradue.com"
cat = Client.open(stac_endpoint, headers=headers, ignore_conformance=True)

# Define query parameters
start_date = datetime.strptime("20230601", '%Y%m%d')
end_date = datetime.strptime("20230630", '%Y%m%d')
bbox = [-121.857043, 37.853934, -120.608968, 38.840424]
tile = "10SFH"

# Query by AOI, start and end date
query_sel = cat.search(
  collections=["ai-extensions-svv-dataset-labels"],
  datetime=(start_date, end_date),
  bbox=bbox,
)
item = [item for item in query_sel.item_collection() if tile in item.id][0]

# Display Item
display(item)

Connectivity with dedicated S3 storage

So far, the user has accessed the STAC endpoint to explore the Catalog and its Collections / Items. In this section we describe the process to access the data referenced in the Item’s assets, which are stored in a dedicated S3 bucket.

The AWS S3 configuration settings are defined in a .json file (e.g. appsettings.json), which is used to create a UserSettings object. This is then used to create a configured S3 client that retrieves an object stored on S3, using the boto3 and botocore libraries.

# Import libraries
import os
import botocore, boto3
from urllib.parse import urlparse

# Define AWS S3 settings
settings = UserSettings("appsettings.json")
settings.set_s3_environment(<asset_s3_path>)

# Start botocore session
session = botocore.session.Session()

# Create client object
s3_client = session.create_client(
  service_name="s3",
  region_name=os.environ.get("AWS_REGION"),
  use_ssl=True,
  endpoint_url=os.environ.get("AWS_S3_ENDPOINT"),
  aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
  aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
)

# Parse the S3 URL of the asset (geojson_url) into bucket name and key
parsed = urlparse(geojson_url)
bucket = parsed.netloc
key = parsed.path[1:]

# Retrieve the object stored on S3
response = s3_client.get_object(Bucket=bucket, Key=key)

The user can then download the file stored on S3 to the local filesystem using the io library.

import io

geojson_content = io.BytesIO(response["Body"].read())
fname = './output/downloaded.geojson'

# Save the GeoJSON content to a local file
with open(fname, "wb") as file:
  file.write(geojson_content.getvalue())

The user can also import the downloaded data into the Notebook. In this example, the downloaded .geojson file is loaded and converted into a geopandas dataframe.

import geopandas as gpd

# Make geodataframe out of the downloaded .geojson file
gdf = gpd.read_file(fname)
gdf

[Figure: geodataframe of the downloaded labelled EO data]

Conclusion

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project to help an ML practitioner exploit the STAC format to discover labelled EO data, including:

  • Understanding the STAC format
  • Accessing STAC via STAC Browser and STAC API
  • Connectivity with dedicated S3 storage

Useful links:

AI/ML Enhancement Project - Developing a new ML model and tracking with MLflow

Introduction

In this scenario, the ML practitioner Alice develops a Convolutional Neural Network (CNN) model for a classification task and employs MLflow for monitoring the ML model development cycle. MLflow is a crucial tool that ensures effective log tracking and preserves key information, including specific code versions, datasets used, and model hyperparameters. By logging this information, the reproducibility of the work drastically increases, enabling users to revisit and replicate past experiments accurately. Moreover, quality metrics such as classification accuracy, loss function fluctuations, and inference time are also tracked, enabling easy comparison between different models.

This post presents User Scenario 5 of the AI/ML Enhancement Project, titled “Alice develops a new ML model”. It demonstrates how the enhancements being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) will support users in developing a new ML model and in using MLflow to track experiments.

These new capabilities are implemented with an interactive Jupyter Notebook to guide an ML practitioner, such as Alice, through the following steps:

  • Data ingestion
  • Design the ML model architecture
  • Train the ML model and fine-tuning
  • Evaluate the ML model performance with metrics such as accuracy, precision, recall, or F1 score, and confusion matrix
  • Check experiments with MLflow

These steps are outlined in the diagram below.

Practical examples and commands are displayed to demonstrate how these new capabilities can be used from a Jupyter Notebook.

Data Ingestion

The training data used for this scenario is the EuroSAT dataset, which is based on ESA’s Sentinel-2 data, covers 13 spectral bands, and consists of 10 classes with a total of 27,000 labelled and geo-referenced images. A separate Notebook was created to build a STAC Catalog, a STAC Collection, and STAC Items for the entire EuroSAT dataset, and to publish these on the STAC endpoint (https://ai-extensions-stac.terradue.com/collections/EUROSAT-Training-Dataset).
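
The pattern used to build such a catalog with pystac is sketched below; the ids, extents, dates and asset paths are placeholders rather than the project's actual values.

from datetime import datetime
import pystac

# Root catalog and a collection describing the dataset
catalog = pystac.Catalog(id="eurosat", description="EuroSAT training dataset")
collection = pystac.Collection(
    id="EUROSAT-Training-Dataset",
    description="Labelled Sentinel-2 patches (10 classes, 13 bands)",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([[-180.0, -90.0, 180.0, 90.0]]),
        temporal=pystac.TemporalExtent([[datetime(2015, 6, 23), None]]),
    ),
)
catalog.add_child(collection)

# One Item per labelled patch (placeholder footprint and asset path)
item = pystac.Item(
    id="eurosat-patch-0001",
    geometry={"type": "Point", "coordinates": [8.68, 50.11]},
    bbox=[8.68, 50.11, 8.68, 50.11],
    datetime=datetime(2017, 1, 1),
    properties={"label": "Forest"},
)
item.add_asset("image", pystac.Asset(href="patch_0001.tif", media_type=pystac.MediaType.GEOTIFF))
collection.add_item(item)

catalog.normalize_and_save("./eurosat-catalog", catalog_type=pystac.CatalogType.SELF_CONTAINED)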

The data ingestion process was implemented with a DataIngestion class, configured with three main components:

  • stac_loader: for fetching the dataset from the STAC endpoint
  • data_splitting: for splitting the dataset into train, test and validation sets with defined percentages (a minimal sketch is given after this list)
  • data_downloader: for downloading the data into the local system.
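
The DataIngestion class is project-specific, but the idea behind data_splitting can be illustrated with a minimal, self-contained sketch; the 70/15/15 percentages and the stand-in item ids are assumptions:

from sklearn.model_selection import train_test_split

# Stand-in for the 27,000 STAC Item ids fetched by stac_loader
item_ids = [f"eurosat-patch-{i:05d}" for i in range(27000)]

# First split off 70% for training, then split the remainder into validation and test
train_ids, rest = train_test_split(item_ids, train_size=0.7, random_state=42)
val_ids, test_ids = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train_ids), len(val_ids), len(test_ids))   # -> 18900 4050 4050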

ML Model Architecture

In this section, the user defines a Convolutional Neural Network (CNN) model with six layers. The first layer serves as the input layer, accepting an image with a defined shape of (13, 64, 64) (i.e. the same shape as the EuroSAT images). The model is designed with four convolutional layers, each employing a relu activation function, a BatchNormalization layer, a 2D MaxPooling operation, and a Dropout layer. The model then includes two Dense layers; a Softmax activation is applied to the last Dense layer, which generates a vector of 10 cells containing the likelihood of each predicted class. The user defines a loss function and an optimizer, and the best model is compiled and saved locally at each epoch based on the improvement in the validation loss. The input parameters defining the ML model architecture are described in a params.yml file, which is used for the configuration process. The params.yml file defined for this test is shown below.

params.yml

BATCH_SIZE: 128
EPOCHS: 50
LEARNING_RATE: 0.001
DECAY: 0.1 ### float
EPSILON: 0.0000001
MEMENTUM: 0.9
LOSS: categorical_crossentropy
# choose one of l1,l2,None
REGULIZER: None
OPTIMIZER: SGD

The configuration of the ML model architecture is run with a dedicated pipeline, such as that defined below.

# pipeline
try:
  config = ConfigurationManager()
  prepare_base_model_config = config.get_prepare_base_model_config()
  prepare_base_model = PrepareBaseModel(config=prepare_base_model_config)
  prepare_base_model.base_model()
except Exception as e:
  raise e

The output of the ML model architecture configuration is displayed below, allowing the user to summarise the model and report the number of trainable and non-trainable parameters.

Model: "sequential"
___________________________________________________________________
 Layer (type)                    Output Shape              Param #   
===================================================================
 conv2d (Conv2D)                 (None, 64, 64, 32)        3776                                                              
 activation (Activation)         (None, 64, 64, 32)        0                                                                          
 conv2d_1 (Conv2D)               (None, 62, 62, 32)        9248                                                                       
 activation_1 (Activation)       (None, 62, 62, 32)        0        
 max_pooling2d (MaxPooling2D)    (None, 31, 31, 32)        0                                                   
 dropout (Dropout)               (None, 31, 31, 32)        0         
 conv2d_2 (Conv2D)               (None, 31, 31, 64)        18496     
 activation_2 (Activation)       (None, 31, 31, 64)        0         
 conv2d_3 (Conv2D)               (None, 29, 29, 64)        36928    
 activation_3 (Activation)       (None, 29, 29, 64)        0         
 max_pooling2d_1 (MaxPooling2D)  (None, 14, 14, 64)        0         
 dropout_1 (Dropout)             (None, 14, 14, 64)        0         
 flatten (Flatten)               (None, 12544)             0         
 dense (Dense)                   (None, 512)               6423040   
 activation_4 (Activation)       (None, 512)               0         
 dropout_2 (Dropout)             (None, 512)               0         
 dense_1 (Dense)                 (None, 10)                5130      
 activation_5 (Activation)       (None, 10)                0         
===================================================================
Total params: 6,496,618
Trainable params: 6,496,618
Non-trainable params: 0
===================================================================
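
For reference, the summary above corresponds to a standard Keras Sequential model. The sketch below reproduces the same layer structure and parameter counts, but it is illustrative only: the project builds its model through PrepareBaseModel and params.yml, and the dropout rates shown here are assumptions.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Activation, MaxPooling2D, Dropout, Flatten, Dense

# Two convolutional blocks followed by two dense layers and a softmax output
model = Sequential([
    Conv2D(32, (3, 3), padding="same", input_shape=(64, 64, 13)),
    Activation("relu"),
    Conv2D(32, (3, 3)),
    Activation("relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Conv2D(64, (3, 3), padding="same"),
    Activation("relu"),
    Conv2D(64, (3, 3)),
    Activation("relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Flatten(),
    Dense(512),
    Activation("relu"),
    Dropout(0.5),
    Dense(10),
    Activation("softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()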

Training and fine-tuning

The steps involved in the training phase are as follows:

  • Create the training entity
  • Create the configuration manager
  • Define the training component
  • Run the training pipeline

As mentioned in the Data Ingestion section, the training data was split into train, test and validation sets to ensure that the model is trained effectively and that its performance is evaluated accurately and without bias. The user trains the ML model on the train set for the number of epochs defined in the params.yml file; after each epoch the model is evaluated on the validation data to avoid overfitting. There are several approaches to address overfitting during training. One effective method is adding a regularizer to the model’s layers, which introduces a penalty term to the loss function to penalise larger weights (an example is given below). Finally, the test set, which is not used in any part of the training or validation process, is used to evaluate the final model’s performance.
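
As an illustration of the regularisation option mentioned above (in the project this is controlled by the regularizer entry of params.yml), an L2 kernel regularizer can be attached to a Keras layer as follows:

from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense

# Penalise large weights in this layer by adding an L2 term to the loss
dense = Dense(512, activation="relu", kernel_regularizer=regularizers.l2(1e-4))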

In order to assess the ML model’s performance and reliability, the user can plot the Loss and Accuracy curves of the Training and Validation sets. This can be done with the matplotlib library, as illustrated below.

# Import library
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 5))

# Plot Loss
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()

# Plot Accuracy
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy')
plt.legend()
plt.tight_layout()
plt.show()

Evaluation

The evaluation of the trained ML model was conducted on the test set. It is crucial for the user to prevent any data leakage between the train and test sets to ensure an independent and unbiased assessment of the training pipeline’s outcome. The model’s performance was measured using the following evaluation metrics: accuracy, recall, precision, F1-score, and the confusion matrix.

  • Accuracy: calculated as the ratio of correctly predicted instances to the total number of instances in the dataset
  • Recall: also known as sensitivity or true positive rate, recall is a metric that evaluates the ability of a classification model to correctly identify all relevant instances from a dataset
  • Precision: it evaluates the accuracy of the positive predictions made by a classification model
  • F1-score: it is a metric that combines precision and recall into a single value. It is particularly useful when there is an uneven class distribution (imbalanced classes) and provides a balance between precision and recall
  • Confusion Matrix: it provides a detailed breakdown of the model’s performance, highlighting instances of correct and incorrect predictions.

The pipeline for generating the evaluation metrics was defined as follows:

try:
  config = ConfigurationManager()
  eval_config = config.get_evaluation_config()
  evaluation = Evaluation(eval_config)
  test_dataset,conf_mat = evaluation.evaluation()
  evaluation.log_into_mlflow()
except Exception as e:
  raise e

The confusion matrix can be easily plotted with the seaborn library.

# Import libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

def plot_confusion_matrix(self):
  class_names = np.unique(self.y_true)
  fig, ax = plt.subplots()

  # Create a heatmap
  sns.heatmap(
    self.matrix,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=class_names,
    yticklabels=class_names
  )

  # Add labels and title
  plt.xlabel('Predicted')
  plt.ylabel('True')
  plt.title('Confusion Matrix')

  # Show the plot and return the figure so it can be logged to MLflow
  plt.show()
  return fig

[Figure: confusion matrix of the CNN model]

MLflow Tracking

The training, fine-tuning, and evaluation processes are executed multiple times, referred to as “runs”. Each run is generated by executing multiple jobs with different combinations of parameters, specified in the params.yml file described in the ML Model Architecture section. The user monitors all executed runs during the training and evaluation phases using mlflow and its built-in tracking functionalities, as shown in the code below.

# Import libraries
import os
from urllib.parse import urlparse

import mlflow
import mlflow.tensorflow

def log_into_mlflow(self):
  mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI"))
  tracking_url_type_store = urlparse(os.environ.get("MLFLOW_TRACKING_URI")).scheme
  confusion_matrix_figure = self.plot_confusion_matrix()

  with mlflow.start_run():
    mlflow.tensorflow.autolog()
    mlflow.log_params(self.config.all_params)
    mlflow.log_figure(confusion_matrix_figure, artifact_file="Confusion_Matrix.png")
    mlflow.log_metrics(
      {
        "loss": self.score[0], "test_accuracy": self.score[1],
        "test_precision": self.score[2], "test_recall": self.score[3],
      }
    )
    # Model registry does not work with file store
    if tracking_url_type_store != "file":
      mlflow.tensorflow.log_model(self.model, "model", registered_model_name="CNN")

The MLflow dashboard allows for visual and interactive comparisons of different runs, enabling the user to make informed decisions when selecting the best model. The user can access the MLflow dashboard by clicking on the dedicated icon from the user’s App Hub dashboard.

On the MLflow dashboard, the user can select the experiments to compare in the “Experiment” tab.

Subsequently, the user can select the specific parameters and metrics to include in the comparison from the “Visualizations” dropdown, which displays the behaviour and details of the runs for the selected evaluation metrics and parameters.

The comparison of the parameters and metrics is shown in the dedicated dropdown.
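
For users who prefer a programmatic comparison, the same information can be queried with the MLflow API; the sketch below is a minimal example (the experiment name is hypothetical, while the metric names match those logged above).

# Minimal sketch: compare runs programmatically with the MLflow API
import mlflow

runs = mlflow.search_runs(experiment_names=["my-experiment"])  # hypothetical experiment name
best = runs.sort_values("metrics.test_accuracy", ascending=False)
print(best[["run_id", "metrics.test_accuracy", "metrics.test_precision"]].head())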

Conclusion

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project to guide an ML practitioner through the development of a new ML model, together with the tracking functionalities provided by MLflow, including:

  • Data ingestion
  • Design of the ML model architecture
  • Training and fine-tuning of the ML model
  • Evaluation of the ML model performance with metrics such as accuracy, precision, recall, F1-score, and the confusion matrix
  • Inspection of the experiments with the MLflow dashboard and tools.

Useful links:


AI/ML Enhancement Project - Training and Inference on a remote machine

Introduction

In this scenario, the ML practitioner Alice develops two Earth Observation (EO) Application Packages using the Common Workflow Language (CWL), as described in the OGC proposed best practices. The two App Packages CWLs are executed on a remote machine with a CWL runner for Kubernetes, enabling the submission of (parallel) Kubernetes jobs distributed across the available resources in the cluster. MLflow is used for tracking experiments and related metrics for further analysis and comparison.

  • App Package CWL for a training job: Alice develops an ML training job with Random Forest, based on a segmentation approach for water bodies masking. Each training run is tracked and evaluated with MLflow and, in the end, the model with the highest performance is selected and used for the inference service.
  • App Package CWL for inference job: this service performs inference on Sentinel-2 data, based on the best model developed with the training service. It takes as input parameter the STAC Item(s) of Sentinel-2 data and generates the inference water-bodies output mask(s).

This post presents User Scenario 6 of the AI/ML Enhancement Project, titled “Alice starts a training job on a remote machine”. It demonstrates how the enhancements being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) will support users in packaging EO applications using the CWL format and in using a CWL runner to execute and manage Kubernetes jobs.

These new capabilities are implemented through the development and execution of the two App Packages CWLs to guide an ML practitioner, such as Alice, through the following steps:

  • Write a Python application with its containerised environment
  • Use the CWL standard, as described in the OGC proposed best practices, for writing EO Application Packages
  • Dockerise all processing nodes and write the App Package CWLs through a CI/CD pipeline with GitHub Actions, and release them as official repository releases
  • Use the CWL runners for Kubernetes to submit (parallel) jobs, distributing them across the available resources in the cluster
  • Configure MLflow server for tracking the experiments and log relevant information

Practical examples and commands are displayed to demonstrate how these new capabilities can be used within the App Hub environment.

Key Concepts

Common Workflow Language

The Common Workflow Language (CWL) is a powerful community-driven standard for describing, executing and sharing computational workflows. Originating from the bioinformatics community, CWL has gained traction across various scientific disciplines, including the EO sector.

The CWL standard supports multiple workflows and includes two key components (i.e. classes):

  • Workflow class/standard for describing workflows: this class defines all input parameters and the final output of the application. The processing steps are defined with only the inputs specific to each step, and the dependencies between steps are established by defining the output(s) of one step as the input(s) of another step. The scatter feature enables parallel execution of tasks over specific inputs and is defined at the step level.
  • CommandLineTool class/standard for describing command line tools: this class defines the inputs and arguments needed for each specific module, the environment variables (EnvVarRequirement), the docker container in which the module is executed (DockerRequirement), the specific resources allocated for its execution (ResourceRequirement), and the type of output generated.

A simple guide on how to get started on EO Application Packages with CWL is provided here.

Kubernetes

Kubernetes is an open-source container orchestration engine designed to automate the deployment, scaling, and management of containerized applications. It provides a robust framework to run distributed systems efficiently, ensuring high availability and scalability.

The development environment ML Lab runs on a Kubernetes cluster composed of multiple nodes. When a user starts their own ML Lab pod, this is instantiated using one of the available nodes, based on the resource requirements of the specific user pod. From within the ML Lab pod, the user can use two CWL runners to execute an Application Package CWL:

  • Using cwltool, the CWL is executed on the Kubernetes node that hosts the ML Lab user pod. This is done with podman, an open-source Linux tool for managing containerized applications.
  • Using calrissian, the CWL runner creates pods that leverage all the nodes available in the Kubernetes cluster. The Kubernetes cluster autoscaler facilitates this behaviour by creating new nodes on demand to satisfy all the requests made by calrissian. This architecture makes calrissian ideal for running large-scale workflows in a cloud or high-performance computing environment.

The implementation of both CWL runners cwltool and calrissian from the ML Lab user pod is shown in the diagram below.

App Package CWL for training job

Objective

The training job consists of a Random Forest model based on a segmentation approach for water bodies masking on EO data.

Application Workflow

The workflow of the Application Package for the training job is illustrated in the diagram below.

Application Inputs

The inputs are:

  • STAC Item(s) of EO Data (Sentinel-2)
  • STAC Item(s) of EO Labels
  • ML model parameters (e.g. RANDOM_STATE and n_estimators)
  • MLflow URL
  • AWS-related configuration settings

Application Outputs

The expected output is:

  • ML Model(s) saved and tracked with MLflow on the configured MLflow server

Processing Modules

The training workflow is structured in two processing modules:

  • make-dataframe for creating dataframes (e.g. a .pkl file) based on EO data sampling of the classified EO labels. This consists of extracting the reflectance values of the EO bands coastal, red, green, blue, nir, nir08, nir09, swir16, and swir22 at each EO label point, together with its classified label. Subsequently, three vegetation indices ndvi, ndwi1, and ndwi2 are calculated from a selection of these EO bands (see the sketch after this list). This approach is described in more detail, with its Python code, in the “Sample EO data with labels” function of the related article “Labelling EO Data User Scenario 2”. The output of this module is a list of dataframes with classified EO labels and their related spectral values and vegetation indices.

  • make-ml-model for initiating, training and evaluating multiple ML models based on the input dataframes and the input ML model parameters. The list of generated dataframes is first merged into a single dataframe, which is then split into train, validation and test sets. Subsequently, a number of estimators (based on the user input) are trained using a RandomForest classifier with the k-fold cross-validation method. The candidate model is chosen based on the accuracy of these estimators on the test dataset. Finally, the best estimator is evaluated using various metrics, including accuracy, recall, precision, f1-score, and the confusion matrix. A minimal sketch of this training logic follows the next paragraph.
  • All the generated ML models are tracked using MLflow and the artifacts of the training jobs are stored on a dedicated S3 bucket.
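
As a reference, the sketch below shows how such vegetation indices can be derived from the sampled band reflectances; the NDWI variants used here follow common definitions and are an assumption on our part rather than the module’s exact implementation.

# Minimal sketch (illustrative values): deriving vegetation indices from sampled band reflectances
import numpy as np

def compute_indices(red, green, nir, swir16):
    # NDVI from the standard (NIR - Red) / (NIR + Red) definition
    ndvi = (nir - red) / (nir + red)
    # NDWI variants below are common definitions, assumed here for illustration only
    ndwi1 = (green - nir) / (green + nir)
    ndwi2 = (nir - swir16) / (nir + swir16)
    return ndvi, ndwi1, ndwi2

red, green, nir, swir16 = np.array([0.12]), np.array([0.10]), np.array([0.35]), np.array([0.20])
print(compute_indices(red, green, nir, swir16))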

Both modules are developed as Python projects, with dedicated source code, environment.yml, Dockerfile and set-up configuration instructions. A Python project template can be found on the guidance document “Setup of software project template for my Python application”.
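
To make the training logic of the make-ml-model module more concrete, here is a minimal sketch following the same approach (RandomForest candidates with k-fold cross-validation, selection by accuracy). Synthetic data stands in for the merged dataframes, and all names are illustrative rather than the module’s actual code.

# Minimal sketch: cross-validate RandomForest candidates and evaluate the best one
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the merged dataframe of sampled bands, indices and labels
X, y = make_classification(n_samples=300, n_features=12, n_informative=6, random_state=19)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=19)

candidates = []
for n_estimators in (120, 150, 250):  # example values, as in the params.yml shown below
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=19)
    cv_accuracy = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy").mean()
    candidates.append((cv_accuracy, clf))

# Keep the estimator with the best cross-validated accuracy and evaluate it on the test set
best_clf = max(candidates, key=lambda c: c[0])[1].fit(X_train, y_train)
y_pred = best_clf.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))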

Application Package CWL

An extraction of the App Package CWL for the training job is shown below, where the following key components can be inspected:

  • Class Workflow: this includes all input parameters and the output of the application. The two processing steps are defined with their specific input and output dependencies, and the scatter method is applied to some inputs for parallel execution.
  • Class CommandLineTool for the base command make-dataframe: this class defines the inputs needed for the make-dataframe module, the environment variables (EnvVarRequirement), the docker container in which it is executed (DockerRequirement), the specific resources allocated for its execution (ResourceRequirement), and the type of output generated.
  • Class CommandLineTool for the base command make-ml-model: this class has the specifications described above, tailored to the needs of this step. In addition, the arguments section defines, in JavaScript (InlineJavascriptRequirement), how multiple inputs are handled by this module.
cwlVersion: v1.2
$namespaces:
  s: https://schema.org/
s:softwareVersion: 1.0.8
schemas:
  - http://schema.org/version/9.0/schemaorg-current-http.rdf
$graph:
  - class: Workflow
    id: water-bodies-app-training
    label: Water-Bodies Training on Sentinel-2 data
    doc: Training a RandomForest classifier on Sentinel-2 data to detect water bodies, and track it using MLFlow.
    requirements:
      - class: InlineJavascriptRequirement
      - class: ScatterFeatureRequirement
    inputs:
      ADES_AWS_S3_ENDPOINT: 
        label: ADES_AWS_S3_ENDPOINT
        type: string?
      ADES_AWS_REGION: 
        label: ADES_AWS_REGION
        type: string?
      ADES_AWS_ACCESS_KEY_ID: 
        label: ADES_AWS_ACCESS_KEY_ID
        type: string?
      ADES_AWS_SECRET_ACCESS_KEY: 
        label: ADES_AWS_SECRET_ACCESS_KEY
        type: string?
      labels_url:
        label: labels_url
        doc: STAC Item label url
        type: string[]
      eo_url:
        label: eo_url
        doc: STAC Item url to sentinel-2 eo data
        type: string[]
      MLFLOW_TRACKING_URI:
        label: MLFLOW_TRACKING_URI
        doc: URL for MLFLOW_TRACKING_URI
        type: string
      RANDOM_STATE:
        label: RANDOM_STATE
        doc: RANDOM_STATE
        type: int[]
      n_estimators:
        label: n_estimators
        doc: n_estimators
        type: int[]
      experiment_id: 
        label: experiment_id
        doc: experiment_id
        type: string
    outputs: 
      - id: artifacts
        outputSource: 
          - make_ml_model/artifacts
        type: Directory[]
    steps:
      create_datafram:
        run: "#create_datafram"
        in:
          labels_url: labels_url
          eo_url: eo_url
          ADES_AWS_S3_ENDPOINT: ADES_AWS_S3_ENDPOINT
          ADES_AWS_REGION: ADES_AWS_REGION
          ADES_AWS_ACCESS_KEY_ID: ADES_AWS_ACCESS_KEY_ID
          ADES_AWS_SECRET_ACCESS_KEY: ADES_AWS_SECRET_ACCESS_KEY
        out: 
          - dataframe
        scatter: 
          - labels_url
          - eo_url
        scatterMethod: dotproduct # "flat_crossproduct" to analyse all possible combination of inputs
      make_ml_model:
        run: "#make_ml_model"
        in:
          MLFLOW_TRACKING_URI: MLFLOW_TRACKING_URI
          labels_urls: labels_url
          eo_urls: eo_url
          RANDOM_STATE: RANDOM_STATE
          n_estimators: n_estimators
          experiment_id: experiment_id
          data_frames:
            source: create_datafram/dataframe
        out: 
          - artifacts
        scatter: 
          - RANDOM_STATE
          - n_estimators
        scatterMethod: dotproduct # "flat_crossproduct" to analyse all possible combination of inputs

  - class: CommandLineTool
    id: create_datafram
    requirements:
      InlineJavascriptRequirement: {}
      NetworkAccess:
        networkAccess: true
      EnvVarRequirement:
        envDef:
          PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/linus/conda/envs/env_make_df/bin:/home/linus/conda/envs/env_make_df/snap/bin
          AWS_S3_ENDPOINT: $( inputs.ADES_AWS_S3_ENDPOINT )
          AWS_REGION: $( inputs.ADES_AWS_REGION )
          AWS_ACCESS_KEY_ID: $( inputs.ADES_AWS_ACCESS_KEY_ID )
          AWS_SECRET_ACCESS_KEY: $( inputs.ADES_AWS_SECRET_ACCESS_KEY )
      ResourceRequirement:
        coresMax: 1
        ramMax: 1600
    hints:
      DockerRequirement:
        dockerPull: ghcr.io/ai-extensions/make_dataframe:latest
        "cwltool:Secrets":
          secrets: 
            - ADES_AWS_ACCESS_KEY_ID
            - ADES_AWS_SECRET_ACCESS_KEY
            - ADES_AWS_S3_ENDPOINT
            - ADES_AWS_REGION
    baseCommand: ["make-dataframe"]
    arguments: []
    inputs:
      ADES_AWS_S3_ENDPOINT:
        type: string?
        inputBinding:
          prefix: --AWS_S3_ENDPOINT
      ADES_AWS_REGION:
        type: string?
        inputBinding:
          prefix: --AWS_REGION
      ADES_AWS_ACCESS_KEY_ID:
        type: string?
        inputBinding:
          prefix: --AWS_ACCESS_KEY_ID
      ADES_AWS_SECRET_ACCESS_KEY:
        type: string?
        inputBinding:
          prefix: --AWS_SECRET_ACCESS_KEY
      labels_url:
        type: string
        inputBinding:
          prefix: --labels_url
      eo_url:
        type: string
        inputBinding:
          prefix: --eo_url
    outputs: 
      dataframe:
        outputBinding:
          glob: .
        type: Directory

  - class: CommandLineTool
    id: make_ml_model
    hints:
      DockerRequirement:
        dockerPull: ghcr.io/ai-extensions/make_ml_model:latest
    baseCommand: ["make-ml-model"]
    inputs:
      labels_urls:
        type: string[]
      eo_urls:
        type: string[]
      MLFLOW_TRACKING_URI:
        type: string
        inputBinding:
          prefix: --MLFLOW_TRACKING_URI
      data_frames:
        type: Directory[]
      RANDOM_STATE:
        type: int
        inputBinding:
          prefix: --RANDOM_STATE
      n_estimators:
        type: int
        inputBinding:
          prefix: --n_estimators
      experiment_id: 
        type: string
        inputBinding:
          prefix: --experiment_id
    arguments: 
    - valueFrom: |
            ${
                var args=[];
                for (var i = 0; i < inputs.data_frames.length; i++)
                {
                  args.push("--data_frames");
                  args.push(inputs.data_frames[i].path);
                }
                return args;
            }
    - valueFrom: |
            ${
                var args=[];
                for (var i = 0; i < inputs.labels_urls.length; i++)
                {
                  args.push("--labels_urls");
                  args.push(inputs.labels_urls[i]);
                }
                return args;
            }
    - valueFrom: |
            ${
                var args=[];
                for (var i = 0; i < inputs.eo_urls.length; i++)
                {
                  args.push("--eo_urls");
                  args.push(inputs.eo_urls[i]);
                }
                return args;
            }
    outputs: 
      artifacts:
        outputBinding:
          glob: .
        type: Directory
    requirements:
      InlineJavascriptRequirement: {}
      NetworkAccess:
        networkAccess: true
      EnvVarRequirement:
        envDef:
          MLFLOW_TRACKING_URI: $(inputs.MLFLOW_TRACKING_URI )
          MLFLOW_VERSION: 2.10.0
      ResourceRequirement:
        coresMax: 1
        ramMax: 3000

App Package CWL Execution

Executing an App Package workflow requires the App Package CWL file and a params.yml file containing the input parameters. An example of the params.yml file for the training job is shown below, followed by the execution commands with the CWL runners cwltool and calrissian.

params.yml

ADES_AWS_S3_ENDPOINT: # fill with AWS_S3_ENDPOINT
ADES_AWS_REGION: # fill with AWS_REGION
ADES_AWS_ACCESS_KEY_ID: # fill with AWS_ACCESS_KEY_ID
ADES_AWS_SECRET_ACCESS_KEY: # fill with AWS_SECRET_ACCESS_KEY
labels_url:
- https://ai-extensions-stac.terradue.com/collections/ai-extensions-svv-dataset-labels/items/label_S2A_10SFG_20230618_0_L2A
- https://ai-extensions-stac.terradue.com/collections/ai-extensions-svv-dataset-labels/items/label_S2B_10SFG_20230613_0_L2A
eo_url:
- https://earth-search.aws.element84.com/v1/collections/sentinel-2-l2a/items/S2A_10SFG_20230618_0_L2A
- https://earth-search.aws.element84.com/v1/collections/sentinel-2-l2a/items/S2B_10SFG_20230613_0_L2A
MLFLOW_TRACKING_URI: http://ml-flow-dev-mlflow:5000
RANDOM_STATE:
- 20
- 13
- 19
n_estimators:
- 120
- 150
- 250
experiment_id: "water-bodies"

Execution with cwltool

cwltool is used for the CWL execution on a single Kubernetes node. Note that this approach requires the user to log in to the ghcr.io registry where the Docker images are stored, so that they can be pulled during the cwltool execution.

podman login ghcr.io
<enter username and password>
cwltool --podman --no-read-only water-bodies-app-training.cwl#water-bodies-app-training params.yml

Execution with calrissian

calrissian is used for the CWL execution on distributed nodes, which makes it ideal for running large-scale workflows in a cloud or high-performance computing environment.

calrissian --debug --outdir /calrissian --max-cores 2 --max-ram 12G --tmp-outdir-prefix /calrissian --tmpdir-prefix /calrissian --stderr /calrissian/run.log --tool-logs-basepath /calrissian/ water-bodies-app-training.cwl#water-bodies-app-training params.yml

Selection of the best model for inference module

The user can use a Jupyter Notebook to retrieve the best trained model using the mlflow Python library, and save it within the inference module in the .onnx format using the skl2onnx library. This allows the inference module, described below, to run independently from the training module.

# Import Libraries
import json
import mlflow
import pickle
import onnx
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import rasterio
import pystac

# Search for the best run (`params` holds the input parameters, e.g. loaded from params.yml)
active_runs = (
    mlflow.search_runs(
        experiment_names=[params["experiment_id"]],
        # Select the best one with highest f1_score and test accuracy
        filter_string="metrics.f1_score > 0.8 AND metrics.test_accuracy > 0.98",
        search_all_experiments=True,
    )
    .sort_values(
        by=["metrics.f1_score", "metrics.test_accuracy", "metrics.precision"],
        ascending=False,
    )
    .reset_index()
    .loc[0]
)

artifact_path = json.loads(active_runs["tags.mlflow.log-model.history"])[0]["artifact_path"]
best_model_path = active_runs.artifact_uri + f"/{artifact_path}"

# Load the model as an MLflow PyFunc model
mlflow_model = mlflow.pyfunc.load_model(model_uri=best_model_path)

# Extract the underlying scikit-learn model
sklearn_model = mlflow.sklearn.load_model(model_uri=best_model_path)

# Define input type and convert the scikit-learn model to ONNX format
# (`features` is the list of input feature names and `model_path` the destination path, both defined elsewhere in the Notebook)
initial_type = [("float_input", FloatTensorType([None, len(features)]))]
onnx_model = convert_sklearn(sklearn_model, initial_types=initial_type)

# Save the ONNX model to a file
onnx.save_model(onnx_model, f"{model_path}")

App Package CWL for inference job

Objective

The inference job consists of performing inference on EO data using the trained Random Forest model for the detection of water bodies.

Application Workflow

The workflow of the Application Package for the inference job is illustrated in the diagram below.

Application Inputs

The inputs are:

  • STAC Item(s) of EO Data (Sentinel-2)

Application Outputs

The expected output is, for each EO Data input, a directory containing:

  • water bodies mask (in .tif format) with three classes (water, non-water and cloud)
  • overview (in .tif format) with three classes (water, non-water and cloud)
  • STAC Objects

Processing Modules

The inference workflow consists of a single processing module:

  • make-inference for performing inference on a (list of) Sentinel-2 data by applying the RandomForest model that was trained and selected during the training job (a minimal sketch of applying the saved model is shown below).
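
The internals of make-inference are not shown in this post; as an assumption, the sketch below illustrates how the .onnx model saved earlier with skl2onnx could be applied to per-pixel feature vectors using onnxruntime (the file name, array shapes and feature count are illustrative).

# Minimal sketch (assumed approach): apply the saved ONNX model to pixel feature vectors
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("best_model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

# Hypothetical array of shape (n_pixels, n_features): one row per pixel, 12 band/index features
pixels = np.random.rand(1000, 12).astype(np.float32)

# For a scikit-learn classifier converted with skl2onnx, the first output holds the predicted labels
labels = session.run(None, {input_name: pixels})[0]
print(labels[:10])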

Application Package CWL

An extraction of the App Package CWL for the inference job is shown below, where the following key components can be inspected:

  • Class Workflow: this class defines the input parameter (i.e. the list of Sentinel-2 data) and the output of the application. The single processing step make-inference scatters its execution over the Sentinel-2 data input.
  • Class CommandLineTool for the base command make-inference: this class defines the inputs needed for the make-inference module, the docker container in which it is executed (DockerRequirement), the specific resources allocated for its execution (ResourceRequirement), and the type of output generated.
cwlVersion: v1.2
$namespaces:
  s: https://schema.org/
s:softwareVersion: 1.0.9
schemas:
  - http://schema.org/version/9.0/schemaorg-current-http.rdf
$graph:
  - class: Workflow
    id: water-bodies-app-inference
    label: Water-Bodies Inference on Sentinel-2 data
    doc: A trained Random Forest model performs inference on Sentinel-2 data to detect water bodies
    requirements:
      - class: InlineJavascriptRequirement
      - class: ScatterFeatureRequirement
    inputs:
      s2_item: 
        label: s2_item
        doc: s2_item
        type: string[]
    outputs: 
      - id: artifacts
        outputSource: 
          - make_inference/artifacts
        type: Directory[]
    steps:
      make_inference:
        run: "#make_inference"
        in:
          s2_item: s2_item
        out: 
          - artifacts
        scatter: 
          - s2_item
        scatterMethod: dotproduct

  - class: CommandLineTool
    id: make_inference
    hints:
      DockerRequirement:
        dockerPull: ghcr.io/ai-extensions/water-bodies-inference:latest
    baseCommand: ["make-inference"]
    inputs:
      s2_item:
        type: string
        inputBinding:
          prefix: --s2_item 
    outputs: 
      artifacts:
        outputBinding:
          glob: .
        type: Directory
    requirements:
      InlineJavascriptRequirement: {}
      NetworkAccess:
        networkAccess: true
      ResourceRequirement:
        coresMax: 1
        ramMax: 3000

App Package CWL Execution

An example of the params.yml file for the inference job is shown below, followed by the commands to execute the App Package CWL with cwltool and calrissian.

params.yml

s2_item:
- https://earth-search.aws.element84.com/v1/collections/sentinel-2-l2a/items/S2A_10SFG_20230618_0_L2A
- https://earth-search.aws.element84.com/v1/collections/sentinel-2-l2a/items/S2B_10SFG_20230613_0_L2A

Execution with cwltool

cwltool --podman --no-read-only water-bodies-app-inference.cwl#water-bodies-app-inference params.yml

Execution with calrissian

calrissian --debug --outdir /calrissian/ --max-cores 2 --max-ram 12G --tmp-outdir-prefix /calrissian/ --tmpdir-prefix /calrissian/ --stderr /calrissian/run.log --tool-logs-basepath /calrissian/ water-bodies-app-inference.cwl#water-bodies-app-inference params.yml

Conclusion

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project to guide an ML practitioner through the development and execution of the two App Packages CWLs, including the following activities:

  • Write a Python application with its containerised environment
  • Use the CWL standard for writing EO Application Packages
  • Dockerise all processing nodes and write the App Package CWLs through a CI/CD pipeline with GitHub Actions
  • Use the CWL runners for Kubernetes to submit (parallel) training jobs as Kubernetes jobs, distributing them across the available resources in the cluster
  • Configure MLflow server for tracking the experiments and log relevant information

Useful links:

AI/ML Enhancement Project - Describing a trained ML model with STAC

Introduction

In this scenario, the ML practitioner Alice describes a trained ML model by leveraging the capabilities of the STAC format. By utilising STAC, Alice can describe her ML model by creating STAC Objects that encapsulate relevant metadata such as model name and version, model architecture and training process, specifications of input and output data formats, and hyperparameters. The STAC Objects can then be shared and published so that they can be discovered and accessed effectively. This enables Alice to provide a comprehensive and standardised description of her model, facilitating collaboration and promoting interoperability within the geospatial and ML communities.

This post presents User Scenario 7 of the AI/ML Enhancement Project, titled “Alice describes her trained ML model”. It demonstrates how the enhancements being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) will support users in describing an ML model using the STAC format and the ML-dedicated STAC Extensions.

These new capabilities are implemented with an interactive Jupyter Notebook to guide an ML practitioner, such as Alice, through the following steps:

  • Create a STAC Item, either with pystac or by uploading an existing STAC Item into the Notebook, and its related Catalog and Collection. The STAC Item contains all related ML model specific properties, related STAC extensions and hyperparameters.
  • Post STAC Objects onto S3 bucket
  • Publish STAC Objects onto STAC endpoint
  • Search STAC Item(s) on STAC endpoint with standard query params such as bbox and time range, but also ML-specific params such as model architecture or hyperparameters.

Practical examples and commands are displayed to demonstrate how these new capabilities can be used from a Jupyter Notebook.

Create / Upload STAC Objects

This section allows the user to either:

  • Create a STAC Item using pystac, or:
  • Upload an existing STAC Item (.json/.geojson file)

Create STAC Item

The STAC Item is created with the pystac library as follows:

# Import Libraries
import pystac
from datetime import datetime

# Create STAC Item with key properties
# (`bbox` and the `getGeom` geometry helper are assumed to be defined elsewhere in the Notebook)
item = pystac.Item(
   id='water-bodies-model-pystac',
   bbox=bbox,
   geometry=getGeom(bbox),
   datetime=datetime.now(),
   properties={
      "start_datetime": "2024-06-13T00:00:00Z",
      "end_datetime": "2024-07-13T00:00:00Z",
      "description": "Water bodies classifier with Scikit-Learn Random-Forest"
   }
)

Add the relevant STAC Extensions using their latest schema references, as shown below:

from pystac.extensions.eo import EOExtension

# Add Extensions
EOExtension.ext(item, add_if_missing=True)
item.stac_extensions.append('https://stac-extensions.github.io/ml-model/v1.0.0/schema.json')
item.stac_extensions.append('https://crim-ca.github.io/mlm-extension/v1.2.0/schema.json')
item.stac_extensions.append("https://stac-extensions.github.io/raster/v1.1.0/schema.json")
item.stac_extensions.append("https://stac-extensions.github.io/file/v2.1.0/schema.json")

Add ml-model properties

# Add "ml-model" properties
item.properties["ml-model:type"] = "ml-model"
item.properties["ml-model:learning_approach"] = "supervised"
item.properties["ml-model:prediction_type"] = "segmentation"
item.properties["ml-model:architecture"] = "RandomForestClassifier"
item.properties["ml-model:training-processor-type"] = "cpu"
item.properties["ml-model:training-os"] = "linux"

Add mlm-extension properties

# Add "mlm-extension" properties
item.properties["mlm:name"] = "Water-Bodies-S6_Scikit-Learn-RandomForestClassifier"
item.properties["mlm:architecture"] = "RandomForestClassifier"
item.properties["mlm:framework"] = "scikit-learn"
item.properties["mlm:framework_version"] = "1.4.2"
item.properties["mlm:tasks"] = [
      "segmentation",
      "semantic-segmentation"
    ]
item.properties["mlm:pretrained_source"] = None
item.properties["mlm:compiled"] = False
item.properties["mlm:accelerator"] = "amd64"
item.properties["mlm:accelerator_constrained"] = True

# Add hyperparameters
item.properties["mlm:hyperparameters"] = {
  "bootstrap": True,
  "ccp_alpha": 0.0,
  "class_weight": None,
  "criterion": "gini",
  "max_depth": None,
  "max_features": "sqrt",
  "max_leaf_nodes": None,
  "max_samples": None,
  "min_impurity_decrease": 0.0,
  "min_samples_leaf": 1,
  "min_samples_split": 2,
  "min_weight_fraction_leaf": 0.0,
  "monotonic_cst": None,
  "n_estimators": 200,
  "n_jobs": -1,
  "oob_score": False,
  "random_state": 19,
  "verbose": 0,
  "warm_start": True
  }

Add input and output to the mlm properties

# Add input and output to the properties
item.properties["mlm:input"] = [
      {
        "name": "EO Data",
        "bands": ["B01","B02","B03","B04","B08","B8A","B09","B11","B12","NDVI","NDWI1","NDWI2"],
        "input": {
          "shape": [-1,12,10980,10980],
          "dim_order": ["batch","channel","height","width"],
          "data_type": "float32"
        },
        "norm_type": None,
        "resize_type": None,
        "pre_processing_function": None
      }
    ]


item.properties["mlm:output"] = [
      {
        "name": "CLASSIFICATION",
        "tasks": ["segmentation","semantic-segmentation"],
        "result": {
          "shape": [-1,10980,10980],
          "dim_order": ["batch","height","width"],
          "data_type": "uint8"
        },
        "post_processing_function": None,
        "classification:classes": [
          {
            "name": "NON-WATER",
            "value": 0,
            "description": "pixels without water",
            "color_hint": "000000",
            "nodata": False
          },
          {
            "name": "WATER",
            "value": 1,
            "description": "pixels with water",
            "color_hint": "0000FF",
            "nodata": False
          },
          {
            "name": "CLOUD",
            "value": 2,
            "description": "pixels with cloud",
            "color_hint": "FFFFFF",
            "nodata": False
          }
        ]
      }
    ]

Add the raster:bands properties, which can describe either standard EO bands or bands calculated from expressions, for example to compute vegetation indices. Both examples are given below.

item.properties["raster:bands"] = [
    {
        "name": "B01",
        "common_name": "coastal",
        "nodata": 0,
        "data_type": "uint16",
        "bits_per_sample": 15,
        "spatial_resolution": 60,
        "scale": 0.0001,
        "offset": 0,
        "unit": "m"
    },
    ...,
    {
        "name": "NDVI",
        "common_name": "ndvi",
        "nodata": 0,
        "data_type": "float32",
        "processing:expression": {
            "format": "rio-calc",
            "expression": "(B08 - B04) / (B08 + B04)"
        }
    }
]

Now the user can add the assets to the STAC Item of the ML model. The required assets are:

  • Asset for App Package CWL for ML Training
  • Asset for App Package CWL for Inference
  • Asset for ML Model (i.e. .onnx file)
# Add Assets - ML Training
asset = pystac.Asset(
    title='Workflow for water bodies training', 
    href='https://github.com/ai-extensions/notebooks/releases/download/v1.0.8/water-bodies-app-training.1.0.8.cwl',
    media_type='application/cwl+yaml',
    roles = ['ml-model:training-runtime', 'runtime', 'mlm:training-runtime'])
item.add_asset("ml-training", asset)

# Add Assets - Inference
asset = pystac.Asset(
    title='Workflow for water bodies inference', 
    href='https://github.com/ai-extensions/notebooks/releases/download/v1.0.8/water-bodies-app-inference.1.0.8.cwl',
    media_type='application/cwl+yaml',
    roles = ['ml-model:inference-runtime', 'runtime', 'mlm:inference-runtime'])
item.add_asset("ml-inference", asset)

# Add Asset - ML model
asset = pystac.Asset(
    title='ONNX Model',
    href='https://github.com/ai-extensions/notebooks/raw/main/scenario-7/model/best_model.onnx',
    media_type='application/octet-stream; framework=onnx; profile=onnx',
    roles = ['mlm:model'])
item.add_asset("model", asset)

Now the created STAC Item can be validated.

item.validate()

Output of successful validation: 
['https://schemas.stacspec.org/v1.0.0/item-spec/json-schema/item.json',
 'https://stac-extensions.github.io/eo/v1.1.0/schema.json',
 'https://stac-extensions.github.io/ml-model/v1.0.0/schema.json',
 'https://crim-ca.github.io/mlm-extension/v1.2.0/schema.json',
 'https://stac-extensions.github.io/raster/v1.1.0/schema.json',
 'https://stac-extensions.github.io/file/v2.1.0/schema.json']

Upload STAC Item

If the user has manually written a .json/.geojson file of the STAC Item, this can simply be uploaded into the notebook with pystac. The Item can subsequently be validated as before.

# Read Item
item = pystac.read_file('./path/to/STAC_Item.json')

# Validate STAC Item
item.validate()

Output of successful validation: 
['https://schemas.stacspec.org/v1.0.0/item-spec/json-schema/item.json',
 'https://stac-extensions.github.io/eo/v1.1.0/schema.json',
 'https://stac-extensions.github.io/ml-model/v1.0.0/schema.json',
 'https://crim-ca.github.io/mlm-extension/v1.2.0/schema.json',
 'https://stac-extensions.github.io/raster/v1.1.0/schema.json',
 'https://stac-extensions.github.io/file/v2.1.0/schema.json']

STAC Objects

The STAC Catalog and STAC Collection need to be created and interlinked with each other and the STAC Item (see related Article dedicated to STAC for more information about the STAC format).

Create STAC Catalog

# Import libraries
import os

# Create folder structure
CAT_DIR = "ML_Catalog"
COLL_NAME = "ML-Models"
SUB_DIR = os.path.join(CAT_DIR, COLL_NAME)

# Create Catalog
catalog = pystac.Catalog(
    id = "ML-Models", 
    description = "A catalog to describe ML models",
    title="ML Models"
)

Create STAC Collection

collection = pystac.Collection(
    id = COLL_NAME,
    description = "A collection for ML Models",
    extent = pystac.Extent(
        spatial=<spatial_extent>,
        temporal=<temporal_extent>
    ),
    title = COLL_NAME,
    license = "proprietary",
    keywords = [],
    stac_extensions=["https://schemas.stacspec.org/v1.0.0/collection-spec/json-schema/collection.json"],
    providers=[
        pystac.Provider(
            name = "AI-Extensions Project",
            roles = ["producer"],
            url = "https://ai-extensions.github.io/docs"
        )
    ]
)

Now the user can create the interlinks between the STAC Catalog, the STAC Collection and the STAC Item:

# Add STAC Item to the Collection 
collection.add_item(item=item)

# Add Collection to the Catalog
catalog.add_child(collection)

Finally, the user can normalise and save the three STAC Objects locally, and then check that they have been created successfully, using the dedicated pystac methods:

# Save STAC Objects to files
catalog.normalize_and_save(root_href=CAT_DIR, 
                           catalog_type=pystac.CatalogType.SELF_CONTAINED)


# Check that the STAC Catalog contains the Collection, and the Collection contains the Item
catalog.describe()

Example output:
* <Catalog id=ML-Models>
    * <Collection id=ML-Models>
      * <Item id=water-bodies-model-pystac>

Post on S3 bucket

Once the STAC Objects are created, they can be posted on the AWS S3 bucket. A custom class is defined using the pystac, boto3 and botocore libraries to interact with S3. This class allows configuring access to a specific bucket using pre-defined user settings in the development environment ML Lab, including endpoint, access key credentials, and other related settings.

# Import libraries
import os
from urllib.parse import urlparse
from pystac.stac_io import DefaultStacIO, StacIO
import boto3
import botocore

# Create S3 client object
# (`UserSettings` reads the pre-defined S3 settings available in the ML Lab environment)
bucket_name = 'my_bucket'
settings = UserSettings("/etc/Stars/appsettings.json")
settings.set_s3_environment(f"s3://{bucket_name}/{SUB_DIR}")
StacIO.set_default(DefaultStacIO)
client = boto3.client(
    service_name="s3",
    region_name=os.environ.get("AWS_REGION"),
    use_ssl=True,
    endpoint_url=os.environ.get("AWS_S3_ENDPOINT"),
    aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
)

# Configure and set custom Class
class CustomStacIO(DefaultStacIO):
    """Custom STAC IO class that uses boto3 to read from S3."""

    def __init__(self):
        self.session = botocore.session.Session()
        self.s3_client = self.session.create_client(
            service_name="s3",
            region_name=os.environ.get("AWS_REGION"),
            use_ssl=True,
            endpoint_url=os.environ.get("AWS_S3_ENDPOINT"),
            aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
            aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
        )

    def write_text(self, dest, txt, *args, **kwargs):
        parsed = urlparse(dest)
        if parsed.scheme == "s3":
            self.s3_client.put_object(
                Body=txt.encode("UTF-8"),
                Bucket=parsed.netloc,
                Key=parsed.path[1:],
                ContentType="application/geo+json",
            )
        else:
            super().write_text(dest, txt, *args, **kwargs)
StacIO.set_default(CustomStacIO)

The user also makes use of the concurrent.futures and tqdm libraries. The former allows running tasks asynchronously, managing multiple threads or processes in parallel. The latter is used to monitor the progress of a running process.

# Import libraries
import concurrent.futures
from concurrent.futures import ProcessPoolExecutor,ThreadPoolExecutor
from tqdm import tqdm


# push assets and STAC objs to s3
def upload_asset(item, key, asset, SUB_DIR):
    # Build the destination key on the bucket and re-attach the asset to the Item
    # (the upload call itself, e.g. with the boto3 client created above, is omitted in this extract)
    s3_path = os.path.normpath(
        os.path.join(os.path.join(SUB_DIR, SUB_DIR, item.id, asset.href))
    )
    item.add_asset(key, asset)
    
futures = []
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    for item in tqdm(items):
        for key, asset in item.assets.items():
            future = executor.submit(upload_asset, item, key, asset, SUB_DIR)
            futures.append(future)
            


    # Wait for all uploads to complete
    for _ in tqdm(concurrent.futures.as_completed(futures), total=len(futures), desc="Uploading"):
        pass

Post STAC Objects to S3

# Update STAC Catalog with new urls pointing to S3
catalog.set_root(catalog)
catalog.normalize_hrefs(f"s3://{bucket_name}/{SUB_DIR}")
items = list(tqdm(catalog.get_all_items()))

# push STAC Item(s) to S3 in parallel
futures = []
def write_and_upload_item(client, item, bucket_name):
    # pystac.write_file uses the custom StacIO configured above to write the Item directly to S3
    s3_path = item.get_self_href()
    pystac.write_file(item, item.get_self_href())

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as execute:
    for item in tqdm(items, desc="Processing Items"):
        future = execute.submit(write_and_upload_item,client ,item, bucket_name)
        futures.append(future)

    # Wait for all processes to complete
    for _ in tqdm(concurrent.futures.as_completed(futures), total=len(futures), desc="Uploading Items"):
        pass

# push STAC Collection to S3
for col in tqdm(catalog.get_all_collections(),desc="Processing Collection"):
    pystac.write_file(col, col.get_self_href())

# push STAC Catalog to S3
pystac.write_file(catalog, catalog.get_self_href())
print("STAC Objects are pushed successfully")

An example of the output log is shown below:

100%|██████████| 2/2 [00:00<00:00, 266.00it/s]
Uploading: 100%|██████████| 5/5 [00:00<00:00, 59409.41it/s]
2it [00:00, 22610.80it/s]
Processing Items: 100%|██████████| 2/2 [00:00<00:00, 78.49it/s]
Uploading Items: 100%|██████████| 2/2 [00:00<00:00,  3.06it/s]
Processing Collection: 1it [00:00,  1.24it/s]
STAC Objects are pushed successfully

Publish on STAC endpoint

Now that the STAC Objects are posted on S3, the user can publish them on the STAC endpoint with the code below.

# Define STAC endpoint 
stac_endpoint = "https://ai-extensions-stac.terradue.com"

# Create a new Collection on the endpoint
# (`get_headers` is assumed to be a helper available in the Notebook that returns the authentication headers)
import requests
from urllib.parse import urljoin

def post_or_put(url: str, data, headers=None):
    """Post (or put) a STAC object to the endpoint using its REST API."""
    if headers is None:
        headers = get_headers()
    try:
        request = requests.post(url, json=data, timeout=20, headers=headers)
    except Exception:
        # If the POST fails, fall back to a PUT on the object-specific URL
        new_url = url if data["type"] == "Collection" else f"{url}/{data['id']}"
        request = requests.put(new_url, json=data, timeout=20, headers=headers)
    return request

response = post_or_put(urljoin(stac_endpoint, "/collections"), 
   collection.to_dict(), 
   headers=get_headers())
if response.status_code == 200: print(f"Collection {collection.id} created successfully") 
else: print(f"ERROR: Collection {collection.id} exists already, please check") 

# Set custom Class and read the STAC Catalog posted on S3
# (`read_url` and the `ingest_items` function below are assumed to be helpers provided with the Notebook)
StacIO.set_default(CustomStacIO)

catalog_s3 = read_url(catalog.self_href)

# Run function to publish STAC Item(s) in STAC endpoint
ingest_items(
    app_host=stac_endpoint,
    items=list(catalog_s3.get_all_items()),
    collection=collection,
    headers=get_headers(),
)

An example of the output log is shown below:

2024-07-26 08:02:43.961 | INFO     | utils:ingest_items:187 - Post item water-bodies-model-pystac to https://ai-extensions-stac.terradue.com/collections/ML-Models/items
https://ai-extensions-stac.terradue.com/collections/ML-Models/items/water-bodies-model-pystac

Discover ML Model with STAC

Once the STAC Objects are posted on S3 and published on the STAC endpoint successfully, the user can perform a search on the STAC endpoint using specific query parameters. Only the STAC Item(s) matching the provided criteria are retrieved. This can be achieved with the pystac and pystac_client libraries.

# Import libraries
import pystac
from datetime import datetime
from pystac_client import Client

# Define STAC endpoint and access to the Catalog
stac_endpoint = "https://ai-extensions-stac.terradue.com"
cat = Client.open(stac_endpoint, headers=get_headers(), ignore_conformance=True)

# Define date
start_date = datetime.strptime('20230614', '%Y%m%d')
end_date = datetime.strptime('20230620', '%Y%m%d')
date_time = (start_date, end_date)

# Define bbox
bbox = [-121.857043, 37.853934, -120.608968, 38.840424]

query = {
    # `ml-model` properties
    "ml-model:prediction_type": {"eq": 'segmentation'},
    "ml-model:architecture": {"eq": "RandomForestClassifier"},
    "ml-model:training-processor-type": {"eq": "cpu"},
    
    # `mlm-model` properties
    "mlm:architecture": {"eq": "RandomForestClassifier"},
    "mlm:framework": {"eq": "scikit-learn"},
    "mlm:hyperparameters.random_state": {"gt": 18},
    "mlm:compiled": {"eq": False},
    "mlm:hyperparameters.bootstrap": {"eq": True}
    }

# Query by AOI, TOI and ML-specific params
query_sel = cat.search(
    collections= collection,
    datetime=date_time,
    bbox=bbox,
    query = query
)
items = [item for item in query_sel.item_collection()]

For the example query above, the following items were retrieved:

[<Item id=water-bodies-model-pystac>,
 <Item id=water-bodies-model>]

Conclusion

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project to support an ML practitioner in describing an ML model using the STAC format. The activities covered are listed below:

  • Create a STAC Item, either with pystac or by uploading an existing STAC Item into the Notebook, and its related Catalog and Collection. The STAC Item contains all related ML model specific properties, related STAC extensions and hyperparameters.
  • Post STAC Objects onto S3 bucket
  • Publish STAC Objects onto STAC endpoint
  • Search STAC Item(s) on STAC endpoint with standard query params such as bbox and time range, but also ML-specific params such as model architecture or hyperparameters.

Useful links:

AI/ML Enhancement Project - Reusing an existing pre-trained model

Introduction

In this scenario, the ML practitioner Alice reuses a pre-trained model by leveraging the power of transfer learning. Transfer learning, a widely adopted technique in deep learning, involves using an existing model, pre-trained on a large dataset, to train a new model on a smaller, task-specific dataset. This approach allows the new model to utilise the features learned by the pre-trained model, enabling it to extract valuable information from the input data more efficiently. Consequently, the new model can achieve higher accuracy even with limited data.

This post presents User Scenario 8 of the AI/ML Enhancement Project, titled “Alice reuses an existing pre-trained ML model”. It demonstrates how the enhancements being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) will support users in performing a semantic segmentation task with transfer learning in the context of Earth Observation (EO). This also includes the option to leverage the GPU resources set up in the dedicated App Hub environment, significantly reducing the execution time of the model fine-tuning phase.

These new capabilities are implemented with an interactive Jupyter Notebook to guide an ML practitioner, such as Alice, through the following steps:

  • Import libraries (e.g. torch, sklearn, albumentations)
  • Data acquisition, including EO data search and a data loader implementing different augmentation techniques (e.g., RandomCrop, Resize, and RandomRotate90) and data loading in batches
  • Data visualization to enable the user to gain comprehensive insights into the data distribution
  • Selection of a pre-trained ML model, in this case a UNet with an ImageNet-pre-trained backbone, and subsequent implementation of fine-tuning adjustments
  • Evaluation of outputs using different techniques on an unseen dataset (e.g., plotting loss functions, calculating mIoU)
  • Inference on the unseen test dataset.

Practical examples and commands are displayed to demonstrate how these new capabilities can be used from a Jupyter Notebook.

Key Python Libraries

Three key Python libraries were used for this scenario: PyTorch, segmentation_models, and albumentations:

  • PyTorch: This deep learning framework is well-suited for transfer learning in the context of EO semantic segmentation due to its flexibility and extensive support for deep learning models. It is widely used in academia, and the code of research papers written with PyTorch is often available on GitHub for reproducibility purposes.
  • segmentation_models: Built on PyTorch, this library provides pre-trained models tailored for various segmentation tasks, significantly reducing training time.
  • albumentations: This library constructs data augmentation pipelines, enriching the training dataset and improving the generalisation of trained models.

These libraries work together seamlessly within the PyTorch ecosystem, enabling the user to utilise other tools like TorchMetrics for improved model performance evaluation.

Note: In order to leverage the GPU resources and make the most of these libraries, the user must switch to the dedicated profile on the ML Lab environment. This can be done with the following steps:

  1. Stop the current pod by going on https://app-hub-ai-extensions-dev.terradue.com/hub/user/
  2. Select Home > Stop my Server
  3. Select Start my Server
  4. Select the Machine Learning Lab with GPU vX.Y profile and click Start

Data Acquisition Pipeline

Data Acquisition

The candidate dataset for this task was OpenEarthMap, which is a benchmark dataset for global high-resolution land cover mapping. It consists of 5000 aerial and satellite images with manually annotated 8-class land cover labels and 2.2 million segments at a 0.25-0.5m ground sampling distance, covering 97 regions from 44 countries across 6 continents.

# Set-up dataset information (torch is imported here as it is used below to select the device)
import torch
DATA_URLS = {"OpenEarthMap_wo_xBD":"https://zenodo.org/records/7223446/files/OpenEarthMap.zip?download=1"}
SELECTED_DATA = {
    "DATASET_NAME": "OpenEarthMap_wo_xBD",
    "DATA_URL" : DATA_URLS["OpenEarthMap_wo_xBD"] }
CLASSES = [
    "Bareland",
    "Rangeland",
    "Developed_Space",
    "Road",
    "Tree",
    "Water",
    "Agriculture_Land",
    "Building",
]
RANDOM_STATE = 17
BATCH_SIZE = 4
IMAGE_SIZE = 300
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Data Acquisition (`Acquisition` is a helper class provided with the Notebook)
data_obj = Acquisition(data_url=SELECTED_DATA['DATA_URL'],
                       data_file=SELECTED_DATA["DATASET_NAME"]+'.zip',
            source = "Gdrive")
data_obj.download_and_unzip_file()
data_obj.data_dir = SELECTED_DATA["DATASET_NAME"]
data_obj.check_and_remove_empty_regions()
data_obj.read_files()
sorted_image_list, sorted_mask_list = data_obj.sort_image_mask_files()
data_obj.check_if_sorted(sorted_image_list, sorted_mask_list)

Data Loader and Data Augmentation

Once the dataset was downloaded and locally accessible, a custom class MultiClassSegDataset, inheriting from the Dataset class, was created in PyTorch. This class reads training or evaluation images and ground truth masks from disk, and carries out transformations such as normalisation. Subsequently, the user split the data into train, validation, and test datasets with ratios of 80%, 10%, and 10%, respectively. The training set is used for model training, the validation set for model selection, and the test set for assessing the model’s generalisation error on unseen images.

Data augmentation was applied to artificially increase the number of training examples. This involved applying image transformations such as random cropping, rotation, and brightness adjustments while ensuring the corresponding mask remained aligned with the transformed image. It was crucial to select transformations that produced an augmented training set representative of the target application images. For the test set, a centered cropping operation was applied to maintain consistency in the comparative model performance evaluations and ensure reproducible outcomes. Conversely, random cropping was applied to the training set to generate diverse samples, helping the model learn from different image patches and improving its generalisation capability.

Some example code of these key concepts is shown below.

# Import libraries
from torch.utils.data import Dataset, DataLoader, Subset
from sklearn.model_selection import KFold, train_test_split
import albumentations as A

# Split data into train, val, test
train_images, test_images, train_masks, test_masks = train_test_split(sorted_image_list, sorted_mask_list, test_size=0.1, random_state=RANDOM_STATE)
train_images, valid_images, train_masks, valid_masks = train_test_split(train_images, train_masks, test_size=0.1, random_state=RANDOM_STATE)

# Define transforms using Albumentations
train_transform = A.Compose(
    [
        A.RandomCrop(IMAGE_SIZE, IMAGE_SIZE, always_apply=True),
        A.Resize(256, 256, always_apply=True),
        A.RandomRotate90(0.5) 
    ]
)
test_transform = A.Compose(
    [
     A.CenterCrop(IMAGE_SIZE, IMAGE_SIZE, always_apply=True),
     A.Resize(256, 256, always_apply=True),
     ]
)

# Create the `MultiClassSegDataset` datasets
trainDS = MultiClassSegDataset(train_images, train_masks, classes=CLASSES, transform=train_transform)
validDS = MultiClassSegDataset(valid_images,valid_masks, classes=CLASSES, transform=test_transform)
testDS = MultiClassSegDataset(test_images,test_masks, classes=CLASSES, transform=test_transform)

Once the custom Datasets were created (trainDS, validDS, and testDS), the PyTorch DataLoader class was used to provide an efficient and flexible way to iterate over a dataset, managing batching, shuffling, and parallel data loading. This class facilitates the application of different augmentation techniques (e.g., RandomCrop, Resize, and RandomRotate90) and the loading of the data in batches.

# Define DataLoaders 
trainDL = DataLoader(trainDS,
                       batch_size=BATCH_SIZE,    
                       shuffle=True,    
                       num_workers=1,   
                       pin_memory=True, 
                       )
...

print(f"Number of Training Samples: {len(train_images)} \nNumber of Validation Samples: {len(valid_images)} \nNumber of Test Samples: {len(test_images)}")

# Printed Output: 
Number of Training Samples: 577
Number of Validation Samples: 102
Number of Test Samples: 76

Data Visualisation

Before proceeding to model training, some data and their corresponding segmented masks were plotted to allow performing a sanity check by visually inspecting them. This helps identify any obvious data transformation mistakes or inconsistencies.
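
As an illustration, a sanity check of this kind can be done with a few lines of matplotlib; the sketch below assumes that the custom MultiClassSegDataset returns (image, mask) tensor pairs, which is an assumption about its implementation.

# Minimal sketch (assumes trainDS yields (image, mask) tensor pairs): visual sanity check
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 2, figsize=(8, 12))
for row in range(3):
    image, mask = trainDS[row]
    axes[row, 0].imshow(image.permute(1, 2, 0))  # CHW -> HWC for plotting
    axes[row, 0].set_title("Image")
    axes[row, 1].imshow(mask.squeeze(), cmap="tab10")  # class indices per pixel
    axes[row, 1].set_title("Mask")
    for ax in axes[row]:
        ax.axis("off")
plt.tight_layout()
plt.show()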

Model Selection for Transfer Learning

Transfer learning involves taking a pre-trained deep learning model, which has been trained on a large dataset (often on tasks like image classification with thousands of images), and adapting it to a new, typically smaller, dataset for a different but related task. In this scenario the pre-trained model was adapted for semantic segmentation, a task that involves classifying each pixel in an image.

Many deep learning computer vision models are pre-trained on datasets such as ImageNet, where the task is multi-class classification. By training on this dataset, the deep learning model (often using convolutional layers), learns to identify textures, geometric properties, shapes, and features from each image rather than reasoning on a pixel level. Transfer learning typically involves freezing these pre-trained layers, adding more layers (or unfreezing the last few layers), adapting the output layers to the desired task, and then continuing training.

ResNet is popular for image segmentation tasks due to its deep architecture with residual connections, which helps in effectively training very deep networks by mitigating the vanishing gradient problem. UNet, designed specifically for biomedical image segmentation, features a symmetric encoder-decoder structure with skip connections, enabling precise localization by combining high-level and low-level features. Both architectures have proven effective in various segmentation benchmarks, demonstrating high accuracy and robustness in diverse applications.

In this case, we leveraged the Unet architecture with a ResNet34 backbone and pre-trained weights from ImageNet for the encoder. The first layer was adjusted to accommodate the three input channels. Additionally, the final layer was updated to match the number of classes required for our target segmentation task.

# Import libraries
import torch
import segmentation_models_pytorch as smp

# Initiate UNet Model
MULTICLASS_MODE: str = "multiclass"
ENCODER = "resnet34"
ENCODER_WEIGHTS = "imagenet"
DECODER_ATTENTION_TYPE = None 
EPOCHS = 50
ACTIVATION = None

model = smp.Unet(
    encoder_name=ENCODER,
    encoder_weights=ENCODER_WEIGHTS,
    in_channels=3,
    classes=9,
    activation=ACTIVATION
)
optimizer = torch.optim.Adam(
    [dict(params=model.parameters(), lr=0.0001)]
)
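
The paragraphs above mention freezing the pre-trained layers; the block below is a minimal sketch of one possible way to do this with segmentation_models_pytorch. It is an assumption about the strategy, not necessarily the exact configuration used in this scenario, and uses separate names to avoid clashing with the model defined above.

# Minimal sketch (assumed strategy): freeze the ImageNet-pre-trained encoder and train the rest
import torch
import segmentation_models_pytorch as smp

sketch_model = smp.Unet(encoder_name="resnet34", encoder_weights="imagenet", in_channels=3, classes=9)

# Freeze the encoder so only the decoder and segmentation head are updated during fine-tuning
for param in sketch_model.encoder.parameters():
    param.requires_grad = False

sketch_optimizer = torch.optim.Adam(
    (p for p in sketch_model.parameters() if p.requires_grad), lr=1e-4
)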

If the user is running the profile with GPU resources, the code below is used to check if multiple GPUs are available and, if so, enables certain CUDA backend features and prepares the model for parallel processing across these GPUs.

if torch.cuda.device_count() > 1:
    torch.backends.cudnn.enabled = True  # enable cuDNN backend features
    print("Number of GPUs :", torch.cuda.device_count())
    model = torch.nn.DataParallel(model)

Model Fine-tuning

The process involved modifying the architecture and hyperparameters of the selected pre-trained model (a UNet with a ResNet34 encoder) to meet the specific requirements of the semantic segmentation task. After these adjustments, the model was trained and validated over a specified number of epochs (e.g., 50 epochs) to ensure it learned the features relevant to the new task. To optimise the training process, we used the Adam optimizer, known for its efficiency in handling large datasets and complex models. Additionally, we employed the JaccardLoss function, which is well-suited for measuring the performance of segmentation tasks by evaluating the similarity between predicted and true labels (more on this in the next section). By leveraging transfer learning and customising the model architecture, semantic segmentation fine-tuning enables us to effectively segment images and extract valuable semantic information for a variety of applications. Below is the code used for the fine-tuning.

# Import training utilities and a logger (loguru is assumed for the `logger` used in the loop below)
from segmentation_models_pytorch import utils
from loguru import logger

# Define Loss and Metrics to Monitor (make sure mode = "multiclass")
loss = smp.losses.JaccardLoss(mode="multiclass")
loss.__name__ = "loss"
metrics = []
# Define training epoch
train_epoch = utils.train.TrainEpoch(
    model,
    loss=loss,
    metrics=metrics,
    optimizer=optimizer,
    device=DEVICE,
    verbose=True,
)

# Define testing epoch 
val_epoch = utils.train.ValidEpoch(
    model,
    loss=loss,
    metrics=metrics,
    device=DEVICE,
    verbose=True,
)

# Train model for 10 epochs 
min_score = 10.0
train_losses = []
val_losses = []
for epoch in range(EPOCHS):
    logger.info(f"Epoch: {epoch+1}/ {EPOCHS}")
    train_logs = train_epoch.run(trainDL)
    val_logs = val_epoch.run(validDL)
    train_losses.append(train_logs["loss"])
    val_losses.append(val_logs["loss"])
    torch.save(model, f'./out/current_model.pth')
    if min_score > train_logs["loss"]:
        min_score = train_logs["loss"]
        torch.save(model, f'./out/best_model.pth')

Model Evaluation

The performance of the fine-tuned model is evaluated on a separate test dataset to assess its accuracy and generalisation capabilities. This testing phase allows us to validate the model’s performance in real-world scenarios and determine its effectiveness in accurately segmenting new images that were not part of the training and validation datasets. Metrics such as the mean Intersection over Union (mIoU) are commonly used to evaluate segmentation performance.

It is important to note the relationship between the loss function and the evaluation metric. The Jaccard Index, also known as the Intersection over Union (IoU), is a popular metric for evaluating the performance of segmentation models because it measures how well the model’s predictions align with the ground truth annotations. It is calculated as the ratio of the intersection area to the union area of the predicted and ground truth masks.

On the other hand, JaccardLoss is a loss function commonly used in semantic segmentation tasks. It is defined as 1 - Jaccard Index, meaning that it penalises predictions with lower overlap with the ground truth masks.

The mean Intersection over Union (mIoU) is the average of the Intersection over Union values calculated for all samples in the dataset and across all classes. It provides a single scalar value representing the overall performance of the segmentation model across the entire dataset.

Both metrics assess the overlap between predicted and true segments, but JaccardLoss is used as a training objective, whereas mIoU is used for evaluation.
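To make the relationship concrete, the short sketch below (plain NumPy, not the project code) computes the per-class IoU and the mIoU for a pair of predicted and ground truth label masks; the JaccardLoss used during training behaves like 1 minus this overlap score.

# Minimal sketch (illustrative only): per-class IoU and mIoU from two label masks
import numpy as np

def mean_iou(pred_mask: np.ndarray, true_mask: np.ndarray, num_classes: int) -> float:
    ious = []
    for c in range(num_classes):
        pred_c = pred_mask == c
        true_c = true_mask == c
        union = np.logical_or(pred_c, true_c).sum()
        if union == 0:
            continue  # class absent from both masks: skip it
        intersection = np.logical_and(pred_c, true_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))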

By plotting the Loss functions and mIoU over the training epochs, users can gain insights into potential underfitting or overfitting during training on the train and validation datasets. This comprehensive evaluation helps ensure the robustness and reliability of the model in practical applications.

# Import Libraries
from segmentation_models_pytorch import utils
from tqdm import tqdm

# Load the best checkpoint saved during training
best_model = torch.load("./models/best_model.pth", map_location=DEVICE)
best_model.eval()
metrics = []  # No metrics other than loss for testing

val_epoch = utils.train.ValidEpoch(
    best_model,
    loss=loss,
    metrics=metrics,
    device=DEVICE,
    verbose=True,
)

# Initialise a list to store all test losses
test_losses = []

# Run the testing loop over the test dataloader
for x_test, y_test in tqdm(testDL):
    # Ensure data is on the correct device
    x_test, y_test = x_test.to(DEVICE), y_test.to(DEVICE)

    # Compute the loss for the current batch (ValidEpoch disables gradients internally)
    test_loss, _ = val_epoch.batch_update(x_test, y_test)

    # Append the current batch's loss to the list of test losses
    test_losses.append(test_loss.item())

# Calculate the average test loss
avg_test_loss = sum(test_losses) / len(test_losses)

The JaccardLoss and IoU over each training epoch can be plotted to assess whether the model is experiencing overfitting or underfitting, and to determine if the learning curve is behaving as expected. In this case, even after 50 epochs, the validation loss closely follows the training loss, indicating that there is no significant overfitting.
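As an illustration, the learning curves can be drawn from the train_losses and val_losses lists collected in the training loop above; the matplotlib sketch below is indicative rather than the exact notebook code.

# Minimal sketch (illustrative only): training vs validation loss per epoch
import matplotlib.pyplot as plt

epochs = range(1, len(train_losses) + 1)
plt.plot(epochs, train_losses, label="Training loss")
plt.plot(epochs, val_losses, label="Validation loss")
plt.xlabel("Epoch")
plt.ylabel("JaccardLoss")
plt.legend()
plt.title("Learning curves")
plt.show()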

Model Inference

To conclude, the fine-tuned model can then be used for inference to predict segmented masks for the test set. By plotting the original image, the ground truth mask, and the output prediction side-by-side, we can visually assess the model’s performance. As demonstrated below, leveraging transfer learning with a pre-trained model has enabled us to develop a robust segmentation model with commendable accuracy.
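The snippet below is a minimal sketch of such an inference step, reusing the best_model, DEVICE and testDL objects defined earlier; the plotting itself is omitted and the notebook code may differ in the details.

# Minimal sketch (illustrative only): predict a segmented mask for one test batch
import torch

best_model.eval()
with torch.no_grad():
    x_test, y_test = next(iter(testDL))        # one batch of images and ground truth masks
    logits = best_model(x_test.to(DEVICE))     # shape: (batch, classes, height, width)
    pred_masks = torch.argmax(logits, dim=1)   # per-pixel class index

# pred_masks can now be plotted next to x_test and y_test for visual inspection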

Conclusion

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project to guide an ML practitioner through the implementation of transfer learning for EO image segmentation with the following steps:

  • Configuring a custom dataset with the PyTorch data loader
  • Setting up data augmentation to artificially increase the size of a small dataset
  • Selecting a pre-trained backbone model for the UNet model
  • Fine-tuning the model
  • Evaluating the model with the loss function and mIoU
  • Running inference on the unseen test dataset.

Useful links:

AI/ML Enhancement Project - Creating a training dataset

Introduction

In this scenario, the ML practitioner Alice creates a deep learning training dataset. A high-quality, well-annotated dataset is crucial for any AI-driven task; without it, training an effective machine learning (ML) model is doomed to failure. Annotating a dataset is a meticulous and time-consuming process that demands precision and focus. To ensure accuracy, the dataset must be reviewed by multiple experts. However, various computer vision techniques can expedite this process. For instance, an initial labelling can be done by another ML model, with human supervision refining these labels to ensure their correctness.

This post presents User Scenario 9 of the AI/ML Enhancement Project, titled “Alice creates a training dataset”. It demonstrates how the enhancements being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) will support users in generating a labelled dataset for a semantic segmentation task in the context of Earth Observation (EO), using both manual and automated ML-driven solutions.

These new capabilities are implemented with an interactive Jupyter Notebook to guide an ML practitioner, such as Alice, through the following steps:

  • Import libraries (e.g., pystac, rasterio, boto3, sklearn)
  • Load the Sentinel-2 data using STAC
  • Generate Training Dataset using both manual and ML approach
  • Create STAC Objects
  • Post STAC Objects to a dedicated S3 bucket and then publish on STAC endpoint

Practical examples and commands are displayed to demonstrate how these new capabilities can be used from a Jupyter Notebook or from a dedicated, open-source annotation tool.

Load EO Data with STAC

The EO data used in this scenario were Sentinel-2 data published on the STAC endpoint. More information on this can be found in the related article “Discovering Labelled EO Data with STAC”.

# Import Libraries
import pystac
from pystac_client import Client

# get_headers() is a helper (defined elsewhere in the notebook) returning the authentication headers
catalog = Client.open("https://ai-extensions-stac.terradue.com",
                      headers=get_headers(), ignore_conformance=True)

query = catalog.search(collections=["ai-extensions-svv-dataset-labels"])

# Keep only the items that carry a link to their source EO product
eo_items_selected = [item for item in query.item_collection() if any(link.rel == 'source' for link in item.links)]

Generate Training Dataset

The training dataset generated in this Scenario consisted of:

  • EO data image patches
  • their corresponding masks, annotated with three classes: water, non-water and not-applicable

Human-annotation Approach

To demonstrate the process of creating an image segmentation dataset with a manual approach, we used the IRIS Tool. Developed by ESA/ESRIN Phi-Lab, this AI-assisted tool enhances image segmentation and classification for EO imagery and other types of images via a manual human-annotation approach. While this method is more accurate and reliable than automated alternatives, it is also time-consuming and not quickly reproducible, making automated methods the preferred choice when annotation time is limited. Key highlights of the IRIS Tool are provided below, while a detailed tutorial can be found in IRIS_Tutorial.md.

IRIS was designed to streamline the manual creation of ML training datasets for EO by fostering collaboration among multiple annotation experts. It features an iterative process aimed at refining annotation guidelines and enhancing the overall quality of the training dataset. The application is a Flask app that can be deployed both locally and in the cloud, with GitHub Codespaces being the recommended platform for cloud deployment.

As described by the official IRIS project documentation, the highlights of the IRIS tool are:

  • AI-assisted image segmentation powered by a Gradient Boosted Decision Tree model
  • Multiple and configurable views for multispectral imagery
  • Multi-user support: multiple annotators can be invited to collaborate on the same image segmentation labelling project. This collaborative approach helps merge results and reduces bias and errors, enhancing the accuracy and reliability of the results
  • Simple setup with pip and one configuration file.

The annotation workflow consists of the following steps:

  1. Annotate just a few pixels: to trigger the AI feature that assists the annotation process, the user only needs to manually label 10 pixels from 2 classes. This is done by selecting the specific class and then colouring the pixels that must be assigned to that class on the map, using the dedicated widgets. An example is shown below, where three classes are labelled with different colours: WATER (blue), NON-WATER (yellow), NOT-APPLICABLE (red).

  2. Train the AI: the Gradient Boosted Decision Tree training process can be triggered by the user with the dedicated button. This process is based on the pixels already annotated in step 1 and is very fast.
  3. Visualise results: based on the annotated pixels, the predictions of the AI-trained model are shown on the rest of the image. The user can visually inspect the segmentation predictions by panning and zooming on the image.
  4. Iterate: if the predicted segmentation needs improvement, the user can repeat steps 1 to 3 iteratively to manually create additional labelled pixels. This increases the quantity of training data for the AI model and improves the quality of the AI-segmented predictions.
  5. Save the dataset: when the predicted segmentation results are satisfactory, the user can simply save the predicted image mask, which is stored automatically in the same directory as the input image, in .png, .npy or .tif format, depending on the configuration settings.

Machine Learning Approach

For generating the image segmentation dataset via an automated ML approach, we used a model previously trained on a labelled dataset in Scenario 2 (see the related article “Labelling EO Data User Scenario 2”). This is a RandomForest model that segments a Sentinel-2 image and generates an output water-bodies mask with three classes: NON-WATER, WATER, NOT-APPLICABLE. Although this computer vision technique produces less accurate masks, the overall accuracy of the labels can be improved by manually refining the results with tools such as the IRIS tool. The dataset resulting from this combined approach can then be used to train a highly accurate segmentation model such as UNet.

The Python library mlflow was used to access the trained models from the MLflow server and to select the best model based on specific evaluation metrics. The best model was then loaded into this Notebook, and inference was performed on the input EO data to generate the water-bodies masks.

# Import Libraries
import os
import json
import mlflow

mlflow.set_tracking_uri(os.environ.get('MLFLOW_TRACKING_URI'))

# Select the best run: highest f1_score and test accuracy above the given thresholds
active_runs = mlflow.search_runs(
    experiment_names=["water-bodies"],
    filter_string="metrics.f1_score > 0.8 AND metrics.test_accuracy > 0.98",
    search_all_experiments=True,
).sort_values(
    by=['metrics.f1_score', 'metrics.test_accuracy', 'metrics.precision'],
    ascending=False,
).reset_index().loc[0]

# Build the artifact path of the best model and load it
artifact_path = json.loads(active_runs['tags.mlflow.log-model.history'])[0]['artifact_path']
best_model_path = active_runs.artifact_uri + f'/{artifact_path}'
MODEL = mlflow.pyfunc.load_model(model_uri=best_model_path)

The user defines a DataAcquisition class that is applied to each input EO product. The class is responsible for:

  • generating image chips from the input EO data
  • applying the trained ML model to generate a water-bodies mask for each image chip
  • creating an RGB thumbnail of each image chip in JPEG format.

The data is then stored in the defined output directory. An excerpt of the DataAcquisition class used for creating the dataset is shown below.

# Import libraries
import numpy as np
import pystac
import rasterio
from PIL import Image
from tqdm import tqdm
from typing import List

COMMON_BANDS = ['red', 'green', 'blue']
FEATURE_COLUMNS = ['coastal', 'red', 'green', 'blue', 'nir', 'nir08',
                   'nir09', 'swir16', 'swir22', 'ndvi', 'ndwi1', 'ndwi2']

# Only an excerpt of the class is shown here: the ml_helper module, the get_image_links method
# and the out_dir / dataset_dir attributes are defined in the full notebook.
class DataAcquisition:
    def __init__(self, eo_items: List[pystac.Item], common_bands: List[str], FEATURE_COLUMNS: List[str], model, base_resolution: float = 10):
        super().__init__()
        self.eo_items = eo_items
        self.common_bands = common_bands
        self.FEATURE_COLUMNS = FEATURE_COLUMNS
        self.model = model
        self.base_resolution = base_resolution

    def __getitem__(self, idx):
        # Get the source URL of the Sentinel-2 product
        source_sel = self.eo_items[idx].get_links(rel="source")[0].href
        self.band_urls = self.get_image_links(source_sel)

        # Resample all bands to the same shape and resolution
        self.band_urls = ml_helper.resample_bands(
            self.band_urls,
            self.base_resolution,
            self.out_dir,
            desired_shape=(10980, 10980)
        )

        # Run the prediction on each image patch to generate the segmented mask
        tif_path = ml_helper.data_and_mask_generator(model=self.model,
                                                     band_urls=self.band_urls,
                                                     feature_cols=self.FEATURE_COLUMNS,
                                                     common_bands=self.common_bands,
                                                     out_dir=self.dataset_dir,
                                                     eo_item=self.eo_items[idx])

    # Create a JPEG thumbnail for each image chip
    def create_thumbnail(self, tif_paths, size=(128, 128)):
        for tiff_path in tqdm(tif_paths, desc="Creating thumbnail"):
            with rasterio.open(tiff_path) as src:
                # Read the bands, normalise to 0-255, keep RGB and save the thumbnail in .jpeg format
                data = src.read()
                data = ((data - data.min()) / (data.max() - data.min()) * 255).astype(np.uint8)
                data = data[:3, :, :]
                data = np.transpose(data, (1, 2, 0))
                image = Image.fromarray(data)
                image.thumbnail(size)
                thumbnail_path = tiff_path.replace(".tif", ".jpeg")
                image.save(thumbnail_path)

# Apply the DataAcquisition class to the input EO data
dataset = DataAcquisition(eo_items=eo_items_selected, common_bands=COMMON_BANDS,
                          FEATURE_COLUMNS=FEATURE_COLUMNS, model=MODEL)

Create STAC Objects

Once all the image masks are created, the user proceeds by creating the STAC Objects (i.e. the STAC Catalog, the STAC Collection and the STAC Items). The assets of each STAC Item describe an image patch, its mask and its thumbnail. More information on the STAC specifications can also be found in the related article “Describing labelled EO Data with STAC”.

# STAC Catalog
catalog_metadata = {
    "id": IMAGES_DIRECTORY,
    "description": "A training dataset for water-bodies segmentation task",
    "catalog_type": pystac.CatalogType.SELF_CONTAINED,
}

# STAC Collection
collection_metadata = {
    "id": IMAGES_DIRECTORY,
    "keywords":["segmentation", "water-bodies","earthsearch"],
    "license": "MIT",
    "provider_name" : "Terradue",
    "provider_role" : ["producer"],  # Any of licensor, producer, processor or host.
    "provider_homepage_url" : "https://www.terradue.com/portal/",
    
}
# STAC Item
stac_item_properties_common_metadata = {
    "license" : "MIT", # for more license please check https://spdx.org/licenses/
    "provider_name" : "Terradue",
    "provider_role" : "producer",  # Any of licensor, producer, processor or host.
    "provider_homepage_url" : "https://www.terradue.com/portal/",
    "platform" : "ai-extension",
    "instruments" : "MSI",
    "constellation" : "sentinel-2",
    "mission" : "sentinel-2"
}
stac_item_properties_extensions = {
    # Label Extension properties
    "class_values": {
        0: "NON-WATER",
        1: "WATER",
        2: "NOT-APPLICABLE"
    },
    "label_tasks" : [LabelTask.SEGMENTATION],
    "label_description": "Water / Non-Water / Not-Applicable", 
}
item_metadata = {
    "image_paths": glob(f'{DATASET_PATH}/{IMAGES_DIRECTORY}/**/image_*.tif', recursive=True),
    "mask_paths": glob(f'{DATASET_PATH}/{IMAGES_DIRECTORY}/**/mask_*.tif', recursive=True),
    "properties": {
        "common_metadata": stac_item_properties_common_metadata,
        "extensions": stac_item_properties_extensions,
    },
}

# STAC Object generator
stac_generator_obj = stac_helper.StacGenerator(
        DATASET_PATH=DATASET_PATH,
        COMMON_BANDS=COMMON_BANDS,
        IMAGES_DIRECTORY=IMAGES_DIRECTORY,
        item_metadata = item_metadata,
        collection_metadata=collection_metadata,
        catalog_metadata=catalog_metadata)
catalog, collection, stac_items = stac_generator_obj.main()

Post on S3 bucket and publish on STAC endpoint

These two activities - posting the STAC Objects to an S3 bucket and then publishing them on the STAC endpoint - followed the same process described in detail in the related articles “Describing labelled EO Data with STAC” and “Describing a trained ML model with STAC”; please refer to those articles for the step-by-step instructions.
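For reference only, the sketch below outlines the general pattern with boto3 and requests; the bucket name and key are placeholders, the STAC Transactions endpoint layout is assumed, and the exact helpers used in the notebooks are documented in the articles referenced above.

# Minimal sketch (illustrative only): upload STAC files to S3 and post Items to the STAC API
import boto3
import requests

s3 = boto3.client("s3")  # credentials are taken from the environment / AWS configuration
s3.upload_file("catalog/collection.json", "my-bucket", "train-dataset/collection.json")  # placeholder bucket/key

stac_endpoint = "https://ai-extensions-stac.terradue.com"
for item in stac_items:
    response = requests.post(
        f"{stac_endpoint}/collections/{collection.id}/items",
        json=item.to_dict(),
        headers=get_headers(),  # same authentication helper used earlier in the notebook
    )
    response.raise_for_status()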

The screenshot below shows the STAC Collection train-dataset-water-bodies published on the endpoint. In addition to the Collection keywords and spatial / temporal extents, a preview of some listed STAC Items is shown underneath the map.

An example STAC Item is shown below. Key features of the STAC Item visible in the dashboard are: the temporal and spatial extent, the description, the Collection, key metadata, and the three assets (input image chip, segmented mask, and thumbnail).

Conclusion

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project to guide an ML practitioner through the generation of a labelled EO dataset for a semantic segmentation task, using both manual and automated approaches, with the following steps:

  • Load the Sentinel-2 data using STAC
  • Generate Training Dataset using both manual approach (with the IRIS Tool), and automated ML approach
  • Create STAC Objects
  • Post STAC Objects to a dedicated S3 bucket and then publish on STAC endpoint

Useful links:

AI/ML Enhancement Project - Discovering, deploying and consuming an ML model

Introduction

In this Scenario, a stakeholder/user such as Eric is seeking to discover an existing machine learning (ML) model that has been developed by other ML practitioners, such as Alice. Eric’s goal is to deploy this ML model on an Exploitation Platform, allowing him to integrate it into his own workflow. In this example, Eric wants to create a water-bodies mask using the “water-bodies” model based on a RandomForest segmentation classifier, previously trained by Alice. More details on the training and inference processes can be found in the dedicated article “Training and Inference on a remote machine”.

Firstly, Eric discovers the “water-bodies” ML model by utilising the STAC search functionalities, narrowing down his search by providing key metadata, such as date and geographic location, as well as ML model-specific properties like model architecture or hyperparameters. Once Eric identifies the ML model that aligns with his project requirements, he interacts with the Platform Operator to have it deployed as a processing service on an Exploitation Platform. After deployment, Eric can find the deployed service and execute it with his own input parameters, allowing integration into his own geospatial analysis workflows.

This post presents User Scenario 10 of the AI/ML Enhancement Project, titled “Eric discovers and consumes an ML model”. It demonstrates how the enhancements being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) will support stakeholders in discovering ML models using STAC and in interacting with a Platform Operator to deploy an ML model on an Exploitation Platform, after which the stakeholder can execute the service with their own data.

These new capabilities are implemented with an interactive Jupyter Notebook to guide a stakeholder, such as Eric, through the following steps:

  • Import Libraries (e.g. pystac, boto3)
  • Search ML model with pystac by defining specific metadata parameters
  • Configure Exploitation Platform (e.g. GEP) and deploy ML model as a processing service (supported by the Platform Operator)
  • Launch a new job and monitor its execution
  • Check job status and retrieve results

Practical examples and commands are displayed to demonstrate how these new capabilities can be used from a Jupyter Notebook.

Search ML model with STAC

The STAC format and related API can be used not only to discover EO data (as explained in the related article “Discovering Labelled EO Data with STAC”), but also to discover and access ML models that were previously described with the STAC format (see the related article “Describing a trained ML model with STAC”). The process, which leverages the pystac and pystac_client libraries, is the same as described in those articles.

In addition to standard key metadata such as date and geographic location, relevant query parameters for finding ML models include model-specific properties like model architecture and hyperparameters. The code below demonstrates how some of these fields are defined in the query dictionary and how these are used to filter results.

# Import Libraries
import pystac
from datetime import datetime
from pystac_client import Client

stac_endpoint = "https://ai-extensions-stac.terradue.com"

# Access the STAC Catalog (get_headers() returns the authentication headers)
cat = Client.open(stac_endpoint, headers=get_headers(), ignore_conformance=True)

# Define collection
collection = ["ML-Models"]

# Define date range
start_date = datetime.strptime('20230614', '%Y%m%d')
end_date = datetime.strptime('20230620', '%Y%m%d')
date_time = (start_date, end_date)

# Define bbox
bbox = [-121.857043, 37.853934, -120.608968, 38.840424]

query = {
    # `ml-model` properties
    "ml-model:prediction_type": {"eq": 'segmentation'},
    "ml-model:architecture": {"eq": "RandomForestClassifier"},
    "ml-model:training-processor-type": {"eq": "cpu"},
    
    # `mlm-model` properties
    "mlm:architecture": {"eq": "RandomForestClassifier"},
    "mlm:framework": {"eq": "scikit-learn"},
    "mlm:hyperparameters.random_state": {"gt": 10}
    }

# Query by AOI, TOI and ML-specific params
query_sel = cat.search(
    collections=collection,
    datetime=date_time,
    bbox=bbox,
    query = query
)
items = [item for item in query_sel.item_collection()]

When the query results are retrieved, the basic metadata, as well as the ML-specific properties and hyperparameters, can be fully visualised by inspecting the STAC Item(s).

# Select one item
item = items[0]

# Display properties
print(list(item.properties.keys()))

# Display ML-related properties
for p in item.properties:
    if 'ml' in p:
        print(p)

# Display Hyperparameters
display(item.properties['mlm:hyperparameters'])

An example of ML-related properties and hyperparameters of the ML model retrieved from STAC are shown below.

# ML-related properties
mlm:name
mlm:input
mlm:tasks
mlm:output
mlm:compiled
ml-model:type
mlm:framework
mlm:accelerator
mlm:architecture
mlm:hyperparameters
ml-model:training-os
ml-model:architecture
mlm:framework_version
ml-model:prediction_type
ml-model:learning_approach
mlm:accelerator_constrained
ml-model:training-processor-type

# Hyperparameters
{'n_jobs': -1,
 'verbose': 0,
 'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'gini',
 'oob_score': False,
 'warm_start': True,
 'max_features': 'sqrt',
 'n_estimators': 200,
 'random_state': 19,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_impurity_decrease': 0.0,
 'min_weight_fraction_leaf': 0.0}

Another key feature of a STAC Item is its assets. The expected assets of a STAC Item that describes an ML model are:

  • model: the asset of the ML model itself (e.g. in .onnx format)
  • ml-training: the App Package of the training process
  • ml-inference: the App Package of the inference process

These can be visualised by inspecting the STAC Item assets:

# Display assets
display(item.assets)

# Printed Output 
{'model': <Asset href=https://github.com/ai-extensions/notebooks/raw/main/scenario-7/model/best_model.onnx>,
 'ml-training': <Asset href=https://github.com/ai-extensions/notebooks/releases/download/v1.0.8/water-bodies-app-training.1.0.8.cwl>,
 'ml-inference': <Asset href=https://github.com/ai-extensions/notebooks/releases/download/v1.0.8/water-bodies-app-inference.1.0.8.cwl>}

Once the stakeholder/user Eric has identified the ML model that aligns with his project requirements, he can proceed to access and utilise it for his own purposes. This may involve running the model on his own geospatial data, integrating it into a larger workflow, or applying it within a specific application context.

In this Scenario, Eric wants to create a water-bodies mask using the water-bodies ML model trained by the ML practitioner Alice, therefore he’s interested in the inference service. The URL of the Application Package for the inference service is retrieved from STAC Item assets.

# Fetch URL of App Package CWL of the `inference` service 
print(item.assets["ml-inference"].href)

# Printed Output
https://github.com/ai-extensions/notebooks/releases/download/v1.0.8/water-bodies-app-inference.1.0.8.cwl

Configure and Deploy on Exploitation Platform

Once the App Package CWL of the inference service has been retrieved, the user interacts with the Exploitation Platform Operator to deploy it as a processing service, following the steps below:

  1. The Exploitation Platform Operator informs the user about the available Thematic Exploitation Platforms on which the service can be deployed. In this case we are using the [Geohazard Exploitation Platform (GEP)](https://geohazards-tep.eu/#!).
  2. The Exploitation Platform Operator accesses the GEP Services panel and deploys the Water-Bodies Inference on Sentinel-2 data service by adding a new service linked to the App Package CWL URL provided by the user (i.e. “https://github.com/ai-extensions/notebooks/releases/download/v1.0.8/water-bodies-app-inference.1.0.8.cwl”).

  3. The user can now log in through the Terradue sign-in portal:

  4. The user opens the selected Thematic App (i.e. GEP in this case):

  5. The user verifies that the Water-Bodies Inference on Sentinel-2 data service has been successfully deployed by checking whether it is listed in the Processing Services panel on the right side of the dashboard:

Launch and Monitor Job

Once the service is successfully deployed, the user can enter the input parameters and launch a new job with the following steps:

  1. The user opens the Water-Bodies Inference on Sentinel-2 data service panel to check its key properties (e.g. title, version, input parameters).

  2. The user checks that the required input parameter for the service is one or more Sentinel-2 products (S2 product). The user can search for the appropriate data with the following steps:

  • click on the EO Data icon at the top of the dashboard and select Sentinel 2;
  • open the Search panel on the left side and enter the desired parameters, e.g.:
    • product type: S2MSI2A - Note: this is important as this service works with L2A data only, not L1C data;
    • start and end date;
    • cloud cover;
    • spatial filter, by using the dedicated widget on the map;
  • when all the search parameters are entered, click on the Search button.

  3. After a few seconds, the search results are loaded and displayed in the Search panel. The user can double-click on an individual product to inspect its key metadata.

  4. The user can select and then drag and drop each product from the Results panel on the left into the S2 product field of the Water-Bodies Inference on Sentinel-2 data service panel on the right.

  5. The user can also edit the title of the job and, when satisfied, click on the Run job button.

Get Results

Once the job has been created, the user can check its status and then get the produced results with the steps below:

  1. The user can check the job status in the Jobs section of the Processing Services panel on the right of the screen.

  2. When the job finishes, the user can click on it to display the Job info. If the Status is Success, the Show Results button appears at the bottom of the panel, and the results are displayed in the Results section on the left side of the dashboard and on the map.

  3. The layer symbology and visualisation can be fully customised by double-clicking on the result product and opening the Layer Styling drop-down, where the user can select the asset to display, choose the colour map, change the histogram and adjust other visualisation options.

  4. The user can finally download the individual assets of the result products by simply clicking on the Download button at the bottom of the screen.

Conclusion

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project to guide a stakeholder/user in using STAC to discover ML models developed by an ML practitioner, and in interacting with a Platform Operator to deploy such a model on an Exploitation Platform, after which the stakeholder can execute the service with their own data. The following steps were covered in this article:

  • Search ML model with STAC by defining key metadata and ML-specific parameters
  • Configure an Exploitation Platform and deploy the discovered ML model as a processing service (supported by the Platform Operator)
  • Launch a new job and monitor its execution
  • Check job status and retrieve results.

Useful links: