Sentinel-2 L2A#

Product description#

The Sentinel-2 L2A data product available in the SALDi Data Cube (SDC) is a mirror of Digital Earth Africa’s product of the same name.

Detailed information can be found here.

The product abbreviation used in this package is s2_l2a

Loading data#

from sdc.load import load_product

ds = load_product(product="s2_l2a", 
                  vec="site06", 
                  time_range=("2021-01-01", "2022-01-01"),
                  s2_apply_mask=True)  # True by default
ds
[WARNING] Loading data for an entire SALDi site will likely result in performance issues as it will load data from multiple tiles. Only do so if you know what you are doing and have optimized your workflow! It is recommended to start with a small subset to test your workflow before scaling up.
<xarray.Dataset> Size: 689GB
Dimensions:      (time: 438, latitude: 5500, longitude: 6500)
Coordinates:
  * latitude     (latitude) float64 44kB -24.9 -24.9 -24.9 ... -26.0 -26.0 -26.0
  * longitude    (longitude) float64 52kB 30.75 30.75 30.75 ... 32.05 32.05
    spatial_ref  int32 4B 4326
  * time         (time) datetime64[ns] 4kB 2021-01-01T08:17:00 ... 2021-12-29...
Data variables:
    B02          (time, latitude, longitude) float32 63GB dask.array<chunksize=(438, 360, 425), meta=np.ndarray>
    B03          (time, latitude, longitude) float32 63GB dask.array<chunksize=(438, 360, 425), meta=np.ndarray>
    B04          (time, latitude, longitude) float32 63GB dask.array<chunksize=(438, 360, 425), meta=np.ndarray>
    B05          (time, latitude, longitude) float32 63GB dask.array<chunksize=(438, 360, 425), meta=np.ndarray>
    B06          (time, latitude, longitude) float32 63GB dask.array<chunksize=(438, 360, 425), meta=np.ndarray>
    B07          (time, latitude, longitude) float32 63GB dask.array<chunksize=(438, 360, 425), meta=np.ndarray>
    B08          (time, latitude, longitude) float32 63GB dask.array<chunksize=(438, 360, 425), meta=np.ndarray>
    B8A          (time, latitude, longitude) float32 63GB dask.array<chunksize=(438, 360, 425), meta=np.ndarray>
    B09          (time, latitude, longitude) float32 63GB dask.array<chunksize=(438, 360, 425), meta=np.ndarray>
    B11          (time, latitude, longitude) float32 63GB dask.array<chunksize=(438, 360, 425), meta=np.ndarray>
    B12          (time, latitude, longitude) float32 63GB dask.array<chunksize=(438, 360, 425), meta=np.ndarray>

We have now lazily loaded Sentinel-2 L2A data, meaning that the underlying data is not yet loaded into memory (please refer to Xarray, Dask and lazy loading for a short introduction on this topic). What is available, however, is an interactive representation of the xarray.Dataset object, which can be explored to get a better understanding of the data. You can click on different elements to expand them and see their content.

The dimensions of the xarray.Dataset are latitude, longitude and time. The values are by default available in memory and the underlying data can be inspected directly, e.g. by indexing the first 10 elements:

ds.time[0:10]
<xarray.DataArray 'time' (time: 10)> Size: 80B
array(['2021-01-01T08:17:00.000000000', '2021-01-01T08:17:14.000000000',
       '2021-01-03T08:07:00.000000000', '2021-01-03T08:07:03.000000000',
       '2021-01-03T08:07:14.000000000', '2021-01-03T08:07:17.000000000',
       '2021-01-06T08:16:58.000000000', '2021-01-06T08:17:13.000000000',
       '2021-01-08T08:07:02.000000000', '2021-01-08T08:07:05.000000000'],
      dtype='datetime64[ns]')
Coordinates:
    spatial_ref  int32 4B 4326
  * time         (time) datetime64[ns] 80B 2021-01-01T08:17:00 ... 2021-01-08...

Having a closer look at one of the bands (Data variable in Xarray terminology), we get an overview of the data structure without actually loading any of the almost 60 GB into memory. This representation is showing us how the data is “chunked” (i.e. how it is divided into smaller parts for parallel processing).

The example below shows the B08-band (Near Infrared 1 / NIR), which is available as an xarray.DataArray object. By default, sdc-tools has applied the chunking spatially while keeping the time dimension as a single chunk. This is a common strategy for time series analysis. Visualizing individual time slices, as we will do in the following sections of this notebook, is still possible but not very efficient.

ds.B08
<xarray.DataArray 'B08' (time: 438, latitude: 5500, longitude: 6500)> Size: 63GB
dask.array<rechunk-p2p, shape=(438, 5500, 6500), dtype=float32, chunksize=(438, 360, 425), chunktype=numpy.ndarray>
Coordinates:
  * latitude     (latitude) float64 44kB -24.9 -24.9 -24.9 ... -26.0 -26.0 -26.0
  * longitude    (longitude) float64 52kB 30.75 30.75 30.75 ... 32.05 32.05
    spatial_ref  int32 4B 4326
  * time         (time) datetime64[ns] 4kB 2021-01-01T08:17:00 ... 2021-12-29...
Attributes:
    nodata:   0

Sometimes it makes sense to assign data to new Python variables, e.g. to make the code more readable. Let’s assign the B08 band to a new variable called nir1. As you can see, the data structure is still the same.

nir1 = ds.B08
nir1
<xarray.DataArray 'B08' (time: 438, latitude: 5500, longitude: 6500)> Size: 63GB
dask.array<rechunk-p2p, shape=(438, 5500, 6500), dtype=float32, chunksize=(438, 360, 425), chunktype=numpy.ndarray>
Coordinates:
  * latitude     (latitude) float64 44kB -24.9 -24.9 -24.9 ... -26.0 -26.0 -26.0
  * longitude    (longitude) float64 52kB 30.75 30.75 30.75 ... 32.05 32.05
    spatial_ref  int32 4B 4326
  * time         (time) datetime64[ns] 4kB 2021-01-01T08:17:00 ... 2021-12-29...
Attributes:
    nodata:   0

Xarray Shorts: Indexing and basic plotting#

Note

Sections with the “Xarray Shorts”-prefix are meant to be practical introductions to various Xarray features. They are not product-specific and can (and should) be applied to other products and data as well.

The indexing and selecting capabilities of Xarray allow for a lot of flexibility in how to access the data.

In the following example we select a single time step using the .sel-method. Passing a date string and the method='nearest' argument, the nearest time step is selected. Looking at the data structure, we can see that it is now a 2-dimensional xarray.DataArray object and that the time-dimension has been dropped.

nir1_scene = nir1.sel(time="2021-06-26", method="nearest")
nir1_scene
<xarray.DataArray 'B08' (latitude: 5500, longitude: 6500)> Size: 143MB
dask.array<getitem, shape=(5500, 6500), dtype=float32, chunksize=(360, 425), chunktype=numpy.ndarray>
Coordinates:
  * latitude     (latitude) float64 44kB -24.9 -24.9 -24.9 ... -26.0 -26.0 -26.0
  * longitude    (longitude) float64 52kB 30.75 30.75 30.75 ... 32.05 32.05
    spatial_ref  int32 4B 4326
    time         datetime64[ns] 8B 2021-06-25T08:17:15
Attributes:
    nodata:   0

We can index in the same way along the latitude and longitude dimensions to select a single pixel. The result is a 1-dimensional xarray.DataArray object with only the time-dimension left.

px_water = (31.571, -24.981)
px_veg1 = (31.5384, -25.0226)
px_veg2 = (31.551, -25.034)

nir1_px_water = nir1.sel(longitude=px_water[0], latitude=px_water[1], method="nearest")
nir1_px_veg1 = nir1.sel(longitude=px_veg1[0], latitude=px_veg1[1], method="nearest")
nir1_px_veg2 = nir1.sel(longitude=px_veg2[0], latitude=px_veg2[1], method="nearest")

nir1_px_veg1
<xarray.DataArray 'B08' (time: 438)> Size: 2kB
dask.array<getitem, shape=(438,), dtype=float32, chunksize=(438,), chunktype=numpy.ndarray>
Coordinates:
    latitude     float64 8B -25.02
    longitude    float64 8B 31.54
    spatial_ref  int32 4B 4326
  * time         (time) datetime64[ns] 4kB 2021-01-01T08:17:00 ... 2021-12-29...
Attributes:
    nodata:   0

Let’s visualize the time series of the selected pixels. We can use the .plot-methods of the xarray.DataArray object to create different plots using the Matplotlib backend. An overview of different types can be found here, where you can also find more information on which arguments can be passed.

We use the .plot.scatter-method to create a scatter plot of the time series. Even though Matplotlib is already used in the background by Xarray, we can also load it here directly in order to further customize the plot.

from matplotlib import pyplot as plt

nir1_px_water.plot.scatter(x="time", label="Water")
nir1_px_veg1.plot.scatter(x="time", label="Vegetation 1")
nir1_px_veg2.plot.scatter(x="time", label="Vegetation 2")
plt.title("Comparison of Sentinel-2 L2A Near Infrared time series")
plt.ylabel("Reflectance")
plt.legend()
<matplotlib.legend.Legend at 0x7f09831e8230>
../../_images/8bc1928a407ba6c1f21080778a1447a4d0012722075444e0e4610d16010338cf.png

Tip

A lot of beginners start visualizing their data using the .plot-method, which is an obvious choice. However, this can often times lead to confusing or misleading time series plots. See here for what I mean exactly. Better start with a simple scatter plot and then move on to more complex plots, e.g. line plots in combination with interpolation methods.

Have you noticed that all cells up to this point were executed almost instantly, whereas this one took some time to finish?

Until now we have just looked at the lazily loaded representations of the data. In order to create the plot, however, the .plot-method has to load the underlying data into memory, do whatever calculations are necessary and then create the plot. This is why it took longer to execute.

Xarray Shorts: Spatial subset#

In Xarray Shorts: Indexing and basic plotting we have indexed our data by time and also selected a point coordinate. Using the .sel-method we can also select a spatial subset by slicing the latitude and longitude dimensions. This is also described here.

ds_subset = ds.sel(longitude=slice(31.4, 31.8), 
                   latitude=slice(-25.8, -25.4))
ds_subset.B08
<xarray.DataArray 'B08' (time: 438, latitude: 0, longitude: 2000)> Size: 0B
dask.array<getitem, shape=(438, 0, 2000), dtype=float32, chunksize=(438, 0, 425), chunktype=numpy.ndarray>
Coordinates:
  * latitude     (latitude) float64 0B 
  * longitude    (longitude) float64 16kB 31.4 31.4 31.4 31.4 ... 31.8 31.8 31.8
    spatial_ref  int32 4B 4326
  * time         (time) datetime64[ns] 4kB 2021-01-01T08:17:00 ... 2021-12-29...
Attributes:
    nodata:   0

Wait, why does it say the size array size is 0 Bytes and doesn’t show a data structure?

Maybe Xarray wants the coordinates in a different order?

Let’s try again…

ds_subset = ds.sel(longitude=slice(31.4, 31.8), 
                   latitude=slice(-25.4, -25.8))
ds_subset.B08
<xarray.DataArray 'B08' (time: 438, latitude: 2000, longitude: 2000)> Size: 7GB
dask.array<getitem, shape=(438, 2000, 2000), dtype=float32, chunksize=(438, 360, 425), chunktype=numpy.ndarray>
Coordinates:
  * latitude     (latitude) float64 16kB -25.4 -25.4 -25.4 ... -25.8 -25.8 -25.8
  * longitude    (longitude) float64 16kB 31.4 31.4 31.4 31.4 ... 31.8 31.8 31.8
    spatial_ref  int32 4B 4326
  * time         (time) datetime64[ns] 4kB 2021-01-01T08:17:00 ... 2021-12-29...
Attributes:
    nodata:   0

That’s better! 🙂

As you can see it is very useful to check the data structure inbetween steps to make sure everything is as expected.

This quick example only shows how to subset using a square bounding box. If you want to learn how to subset using more complex geometries, please refer to: …clip data to the extent of a vector file?

Cloud mask#

Sentinel-2 L2A data loaded via the load_product function is by default cloud masked. The mask is created from the SCL (Scene Classification) band and valid classes defined according to Table 4 in Baetens et al. (2019).

Set the s2_apply_mask-parameter to False to load the data without the cloud mask.

Let’s have a look at the Near Infrared band again and plot a scene by selecting a time index (10th scene in the array) instead of a specific date:

scene_B08 = ds_subset.B08.isel(time=10)

scene_B08.plot(robust=True)
plt.title(f"Sentinel-2 L2A, {scene_B08.time.data}")
Text(0.5, 1.0, 'Sentinel-2 L2A, 2021-01-08T08:07:16.000000000')
../../_images/317b72570a469297c5f689ac8623ff66802817c5b972e74516f7f04c3cf372d2.png

Tip

The robust=True argument of the .plot-method clips the color scale to the 2nd and 98th percentiles. You might be familiar with this from other software to handle geospatial raster data like QGIS or ArcGIS.

Notice, that you can see some areas that are not perfectly masked (e.g., cloud shadows). Depending on your analysis, further cleaning steps of the data might be necessary.

Masked areas are numpy.nan values in the data array, which we can plot by selecting them using Xarray’s isnull-method:

scene_B08.isnull().plot(cmap="cividis")
plt.title(f"Sentinel-2 L2A masked pixels, {scene_B08.time.data}")
Text(0.5, 1.0, 'Sentinel-2 L2A masked pixels, 2021-01-08T08:07:16.000000000')
../../_images/43754bc55eeaf1f7d8b00768ede8af6f5c460e711cd64b99328deaef0f774b70.png

Group acquisition slices#

One issue with the Sentinel-2 L2A product is that the data is split into multiple acquisition slices due to the overlapping grid used (which unfortunately is not the most efficient, see this recent publication by Bauer-Marschallinger et al. (2023).

You can view the entire grid over continental Africa in the Digital Earth Africa Explorer.

The following example shows an acquisition that has been sliced into four parts:

date = "2021-01-08"
ds.sel(time=date).time
<xarray.DataArray 'time' (time: 4)> Size: 32B
array(['2021-01-08T08:07:02.000000000', '2021-01-08T08:07:05.000000000',
       '2021-01-08T08:07:16.000000000', '2021-01-08T08:07:19.000000000'],
      dtype='datetime64[ns]')
Coordinates:
    spatial_ref  int32 4B 4326
  * time         (time) datetime64[ns] 32B 2021-01-08T08:07:02 ... 2021-01-08...
ds.sel(time=date).B08.plot(robust=True,
                           col="time",
                           col_wrap=2)
<xarray.plot.facetgrid.FacetGrid at 0x7f059cc436e0>
../../_images/08b7937ea15ddce6c3ea0ce5612aa2af66c29b513e4eed9849acc12f0827d399.png

To mitigate this issue, we can group these slices by using the sdc.utils.groupby_acq_slices-function. It returns an xarray.DataSet object with the same dimensions as the input data, but with groups of acquisition slices aggregated by calculating the mean of each group.

Notice that the number of time steps has decreased from 438 in the original data to 146.

from sdc.utils import groupby_acq_slices

ds_grouped = groupby_acq_slices(ds)
ds_grouped.B08
<xarray.DataArray 'B08' (time: 146, latitude: 5500, longitude: 6500)> Size: 21GB
dask.array<transpose, shape=(146, 5500, 6500), dtype=float32, chunksize=(146, 360, 425), chunktype=numpy.ndarray>
Coordinates:
  * latitude     (latitude) float64 44kB -24.9 -24.9 -24.9 ... -26.0 -26.0 -26.0
  * longitude    (longitude) float64 52kB 30.75 30.75 30.75 ... 32.05 32.05
    spatial_ref  int32 4B 4326
  * time         (time) datetime64[ns] 1kB 2021-01-01T08:00:00 ... 2021-12-29...
Attributes:
    nodata:   0

Executing the same code as we have at the beginning of this section, we can see that the acquisition slices have been grouped correctly:

ds_grouped.sel(time=date).time
<xarray.DataArray 'time' (time: 1)> Size: 8B
array(['2021-01-08T08:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
    spatial_ref  int32 4B 4326
  * time         (time) datetime64[ns] 8B 2021-01-08T08:00:00
ds_grouped.B08.sel(time=date).plot(robust=True)
plt.title(f"Sentinel-2 L2A, {ds_grouped.B08.sel(time=date).time.data}")
Text(0.5, 1.0, "Sentinel-2 L2A, ['2021-01-08T08:00:00.000000000']")
../../_images/8a98ec1f6549dc2a2fad78f8be783a504f4f455dbc5593d037cbbc535c9b2453.png