Announcing the Launch of the AI/ML Enhancement Project for GEP and Urban TEP Exploitation Platforms

AI/ML Enhancement Project - Exploratory Data Analysis User Scenario

Introduction

Exploratory Data Analysis (EDA) is an essential step in the workflow of a data scientist or machine learning (ML) practitioner. The purpose of EDA is to analyse the data that will be used to train and evaluate ML models. This new capability, brought by the AI/ML Enhancement Project, will help users of the Geohazards Exploitation Platform (GEP) and the Urban Thematic Exploitation Platform (U-TEP) to better understand the structure and properties of their datasets, discover missing values and possible outliers, and identify correlations and patterns between features that can be used to tailor and improve model performance.

This post presents User Scenario 1 of the AI/ML Enhancement Project, titled “Alice does Exploratory Data Analysis (EDA)”. For this user scenario, an interactive Jupyter Notebook has been developed to guide an ML practitioner, such as Alice, through implementing EDA on her data. The Notebook first introduces connectivity with a STAC catalog, interacting with the STAC API to search and access EO data and labels by defining specific query parameters (we will cover this in a dedicated article). The user then loads the input dataframe and performs the EDA steps for understanding her data, such as data cleaning, correlation analysis, histogram plotting and feature engineering. Practical examples and commands are shown to demonstrate how easily this Notebook can be used for this purpose.
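
As a brief illustration of what such a STAC query can look like, the sketch below uses the pystac-client library; the catalogue endpoint, collection name and query parameters are placeholders for illustration only, not the project's actual configuration.

from pystac_client import Client

# Hypothetical STAC API endpoint, shown only to illustrate the query pattern
catalog = Client.open('https://example.com/stac/api')

# Search for Sentinel-2 items over an area and time range of interest
search = catalog.search(
    collections=['sentinel-2-l2a'],
    bbox=[13.0, 45.0, 14.0, 46.0],
    datetime='2021-06-01/2021-06-30',
)
items = list(search.items())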

Input Dataframe

The input data consisted of point data labelled with three classes, with features extracted from Sentinel-2 reflectance bands. Three vegetation indices were also computed from selected bands. The pre-arranged dataframe was loaded using the pandas library. The dataframe is composed of 13 columns:

  • column CLASSIFICATION: defines the land cover class of each label; the available classes are VEGETATION, NOT_VEGETATED and WATER.
  • columns with reflectance bands, extracted from the spectral bands of six Sentinel-2 scenes: coastal, red, green, blue, nir, nir08, nir09, swir16, and swir22.
  • columns for vegetation indices, calculated from the reflectance bands: ndvi, ndwi1, and ndwi2.

A snapshot of how the dataframe can be loaded and displayed is shown below.

import pandas as pd
dataset = pd.read_pickle('./input/dataframe.pkl')
dataset
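
The vegetation index columns are already present in the pre-arranged dataframe, but for reference they are typically derived from the band columns as normalised differences. The sketch below assumes the standard NDVI formulation and the two common NDWI variants; the project's exact definitions of ndwi1 and ndwi2 may differ.

# For reference only: typical normalised difference indices from the band columns
# (these columns already exist in the pre-arranged dataframe)
dataset['ndvi'] = (dataset['nir'] - dataset['red']) / (dataset['nir'] + dataset['red'])
dataset['ndwi1'] = (dataset['green'] - dataset['nir']) / (dataset['green'] + dataset['nir'])    # McFeeters NDWI
dataset['ndwi2'] = (dataset['nir'] - dataset['swir16']) / (dataset['nir'] + dataset['swir16'])  # Gao NDWI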

This analysis focused on differentiating between “water” and “no-water” labels; a pre-processing operation was therefore performed on the dataframe to relabel the VEGETATION and NOT_VEGETATED classes as “no-water”. This can be quickly achieved with the command below:

LABEL_NAME = 'water'
dataset[LABEL_NAME] = dataset['CLASSIFICATION'].apply(lambda x: 1 if x == 'WATER' else 0)

Data Cleaning

After loading, the user can inspect the dataframe with the pandas method dataset.info() to get a quick overview of the data, such as the number of rows, the columns and their data types. A further statistical analysis can then be performed for each feature with the method dataset.describe(), which extracts relevant information including count, mean, min and max, standard deviation and the 25%, 50% and 75% percentiles.

dataset.info()
dataset.describe()

The user can quickly check whether null values are present in the dataframe. In general, if features with null values are identified, the user should either remove the affected rows from the dataframe, or replace the missing values with appropriate ones, if known.

dataset[dataset.isnull().any(axis=1)]
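
As an illustration (not part of the original Notebook output), missing values could be handled with the standard pandas helpers, for example:

# Drop rows containing null values...
dataset = dataset.dropna()
# ...or replace them with a sensible value, e.g. the column median (hypothetical choice)
# dataset['ndvi'] = dataset['ndvi'].fillna(dataset['ndvi'].median())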

Correlation Analysis

The correlation analysis between “water” and “no-water” pixels for all features was performed with the pairplot() function of the seaborn library.

import seaborn as sns
sns.pairplot(dataset, hue=LABEL_NAME, kind='reg', palette='Set1')

This simple command generates pairwise bivariate distributions for all features in the dataset, with the diagonal plots showing univariate distributions. The relationships for all (n, 2) combinations of variables in the dataframe are displayed as a matrix of plots, as depicted in the figure below (with ‘water’ points shown in blue).

The correlation between variables can also be visually represented by the correlation matrix, generated with the pandas corr() method and rendered with the seaborn heatmap() function (see figure below). Each cell in the matrix contains the correlation coefficient, which quantifies the degree to which two variables are linearly related. Values close to 1 (in yellow) and -1 (in dark blue) represent positive and negative correlations respectively, while values close to 0 indicate no correlation. The matrix is highly customisable, with different formats and colour maps available.
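
A minimal sketch of how such a matrix can be produced is shown below; the colour map and formatting options are illustrative choices.

# Correlation matrix of the numeric columns, rendered as a heatmap
corr_matrix = dataset.select_dtypes(include='number').corr()
sns.heatmap(corr_matrix, vmin=-1, vmax=1, cmap='viridis', annot=True, fmt='.2f')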

Distribution Density Histograms

Another good practice is to understand the distribution density of values for each feature column. The user can also target the distribution of individual features in relation to the corresponding “water” label, overlay this on the histograms, and save the output figure to file.

import matplotlib.pyplot as plt

plt.figure(figsize=(15, 12))
for i, c in enumerate(dataset.select_dtypes(include='number').columns):
    plt.subplot(4, 4, i + 1)  # grid large enough for all numeric columns (including the added water label)
    sns.histplot(dataset[c], kde=True)  # distplot is deprecated in recent seaborn releases
    plt.title('Distribution plot for field: ' + c)
plt.tight_layout()
plt.savefig('./distribution_hist.png')  # save the complete grid of histograms once

Outlier detection

The statistical analysis and histogram plots provide an assessment of the data distribution of each feature. To analyse the distributions further, it is advisable to run a dedicated check for possible outliers in the data. The Tukey IQR method identifies outliers as values lying more than 1.5 times the interquartile range (IQR) away from the quartiles, i.e. below Q1 - 1.5 IQR or above Q3 + 1.5 IQR. An example of the Tukey IQR method applied to the NDVI index is shown below:

import numpy as np

def find_outliers_tukey(x):
    # Compute the quartiles and the interquartile range
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3 - q1
    # Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
    floor = q1 - 1.5 * iqr
    ceiling = q3 + 1.5 * iqr
    outlier_indices = list(x.index[(x < floor) | (x > ceiling)])
    outlier_values = list(x[outlier_indices])
    return outlier_indices, outlier_values

tukey_indices, tukey_values = find_outliers_tukey(dataset['ndvi'])
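
If the flagged values turn out to be genuine anomalies rather than valid extreme observations, one possible follow-up (not shown in the original Notebook) is to drop the corresponding rows:

# Remove the rows flagged as NDVI outliers, only if they are judged to be invalid
dataset_no_outliers = dataset.drop(index=tukey_indices)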

Feature engineering and dimensionality reduction

Feature engineering can be used when the available features are not sufficient for training an ML model, for example when only a small number of features is available or they are not representative enough. In such cases, feature engineering can increase and/or improve the representativeness of the dataframe. Here, the PolynomialFeatures function from the sklearn library was used to increase the overall number of features through an iterative combination of the available ones. For an algorithm like a Random Forest, where decisions are made on individual features, adding more features through feature engineering can provide a substantial improvement. Algorithms like convolutional neural networks, however, might not need this step, since they can extract patterns directly from the data.

from sklearn.preprocessing import PolynomialFeatures

# PolynomialFeatures expects numeric input, so use only the numeric feature columns
features = dataset.select_dtypes(include='number').drop(columns=[LABEL_NAME])
poly = PolynomialFeatures(interaction_only=True, include_bias=False)
new_dataset = pd.DataFrame(poly.fit_transform(features))
new_dataset

On the other hand, Principal Component Analysis (PCA) is a technique that transforms a dataset with many features into a smaller set of principal components that best summarise the variance underlying the data. It can be used to extract the principal components of the feature set so that a reduced representation can be used in training. PCA is also available as a function in the sklearn Python library.

from sklearn.decomposition import PCA

pca = PCA(n_components=len(new_dataset.columns))
X_pca = pd.DataFrame(pca.fit_transform(new_dataset))
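
Setting n_components to the full number of columns keeps every component. A common next step, sketched below and not part of the original Notebook, is to inspect the explained variance ratio and retain only the components that capture most of the variance; the 95% threshold here is a hypothetical choice.

# Fraction of the total variance explained by each principal component
print(pca.explained_variance_ratio_)

# Keep just enough components to explain 95% of the variance (hypothetical threshold)
pca_reduced = PCA(n_components=0.95)
X_reduced = pd.DataFrame(pca_reduced.fit_transform(new_dataset))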

Conclusion

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project, with simple steps and commands that an ML practitioner like Alice can follow to analyse a dataframe in the preparatory step of an ML application lifecycle. Using this Jupyter Notebook, Alice can iteratively conduct the EDA steps to gain insights into data patterns, calculate statistical summaries, generate histograms and scatter plots, understand correlations between features, and share results with her colleagues.

Useful links:
