A Framework for Computational Reproducibility in Environmental Science with Support for Machine Learning Applications

Many satellite data users lack the expertise and infrastructure to access, pre-process, and use the growing volume of space-based data efficiently for local, regional, and national decision-making (Barker et al., 2022; Chue Hong et al., 2021). This is a significant obstacle to realizing the full potential of space-based imagery. In response, various countries and international organizations have sought support from the Committee on Earth Observation Satellites (CEOS), the pioneering advocate behind the CEOS Analysis Ready Data (ARD) initiative (Sazonov et al., 2019).

The CEOS ARD initiative focuses on streamlining access to and processing of satellite data, transforming it into CEOS Analysis Ready Data for Land (CARD4L) products. These products are designed to support time-series analysis without requiring supplementary acquisition information. Providing CARD4L data systematically and consistently is expected to reduce the burden on satellite data users worldwide and so make the data more usable. Delivery of CARD4L data will adhere to the FAIR Guiding Principles (Findable, Accessible, Interoperable, and Reusable) through systematic data processing, hosting platforms, and toolkits made available to users (Wilkinson et al., 2016; Musen et al., 2022; Jacobsen et al., 2020; Weigel et al., 2020).

However, the sheer volume of data and services produced by more than 300 Earth observation satellites poses a substantial challenge for analysts, scientists, and non-experts alike (FrontierSI, 2020). This underlines the urgent need for a comprehensive FAIRified data model framework for computational workflows, aimed at unlocking the potential of Earth observation products in the environmental sector and beyond. Recent advances in storage and computing power have made it both feasible and cost-effective to process and analyze data across various scales. Machine Learning (ML), a fundamental component of modern science and industry, simplifies scientific analysis by efficiently identifying patterns, outliers, and discrepancies in datasets, streamlining data preprocessing, and supporting the development of more accurate, robust, and scalable applications (Peng et al., 2023).

Extensive research across various domains has examined FAIRified data model frameworks integrated with ML. One notable outcome is the Source-augmented Partial Convolution v2 model (SAPC2), an innovative solution for pixel reconstruction. Built on a partial convolution-enabled U-Net framework, SAPC2 uses a complete source image that is temporally adjacent to the image being reconstructed, and it employs an encoder-decoder structure to extract high-level features (Chen M, 2020). Traditionally, many applications have simply disregarded observations affected by cloud cover or sensor saturation. A recent study highlighted errors of commission arising from the Fmask technique. It emphasizes the importance of retaining and improving quality assurance bands in Analysis Ready Data (ARD), and it advises users to apply pixel quality flagging to mitigate potential biases under varying conditions (Ernst, 2018). Integrating FAIR principles with ML in this way offers significant potential for standardizing the documentation of workflows and increasing trust in Earth Observation (EO) data products, thereby raising productivity and uptake in the environmental sector and elsewhere. Furthermore, in-depth research has been dedicated to dynamic environmental simulations, covering the interplay between human activities and land-use change (Searchinger et al., 2018; Newbold et al., 2015), climate dynamics (Findell et al., 2017), water resource management (Spera et al., 2016), and the socioeconomic system (Hostert et al., 2011).
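
As a rough illustration of the pixel quality flagging mentioned above, the following minimal Python sketch filters one pixel's time series using a bit-packed quality assurance value. The bit positions, the function names, and the sample values are hypothetical and chosen only for illustration; real ARD products document their own QA bit layout, and this is not the method used in the cited studies.

```python
# Hypothetical QA bit positions, for illustration only; consult the
# product documentation for the actual layout of a given QA band.
CLOUD_BIT = 1 << 0
SHADOW_BIT = 1 << 1
SATURATED_BIT = 1 << 2
REJECT_MASK = CLOUD_BIT | SHADOW_BIT | SATURATED_BIT


def is_clear(qa_value, reject_mask=REJECT_MASK):
    """Return True when none of the rejected flags are set."""
    return (qa_value & reject_mask) == 0


def clear_sky_mean(values, qa_values, reject_mask=REJECT_MASK):
    """Mean of the observations whose QA flags pass, or None if none pass."""
    kept = [v for v, qa in zip(values, qa_values) if is_clear(qa, reject_mask)]
    if not kept:
        return None
    return sum(kept) / len(kept)


# One pixel observed on four dates (made-up reflectance values).
reflectance = [0.25, 0.75, 0.5, 1.0]
qa = [0, CLOUD_BIT, 0, SATURATED_BIT | SHADOW_BIT]
result = clear_sky_mean(reflectance, qa)  # averages the two clear dates
```

Passing a narrower `reject_mask` lets a user decide which flags to trust, which may be relevant where a cloud mask is suspected of errors of commission.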

Overall, this research advocates a comprehensive approach that incorporates the FAIR Guiding Principles into machine learning workflows. This integration aims to facilitate the generation of ARD and to improve efficiency and reproducibility in land observation and monitoring. The central tenet of this work is that standardized data management procedures and automated workflows are critical to data-driven investigations in the environmental sector. By embracing this integrated framework, the global scientific community can harness the untapped potential of ARD products, foster collaboration, and open the way to new discoveries. The framework also promotes interoperability between public and commercial data sources, enabling a wide array of applications. It is a cornerstone of more sustainable and informed land management, supporting the optimal use of Earth Observation (EO) products across many sectors.

P3.42s

Project Leader:
Dr Ivana Ivanova, Curtin University

PhD Student:
Zhengyuan Chai, Curtin University

Participants: