8 Recent Data-Driven Applications That Strive for Nature Positivity
Here is how big data and machine learning models are transforming biodiversity protection and climate action.
1. FireBench: A High-Fidelity Integrated Simulation Framework for Exploring Wildfire Behavior and Data-Driven Modeling
FireBench, a high-fidelity integrated simulation framework for exploring wildfire behavior and data-driven modeling, was recently released. The accompanying Python library is designed for systematic benchmarking and comparison of fire models, addressing gaps in the field around accuracy, computational efficiency, sensitivity, validity domains, and model compatibility. FireBench supports a dual evaluation approach: models can be compared against one another even when large observational datasets are unavailable, and benchmarked against validation datasets when they are. Its core capabilities include accurately predicting fire-front locations and plume dynamics, assessing computational resource requirements, analyzing how model outputs respond to input variations, and mapping each model's domain of validity. By coupling the Swirl-Fire large-eddy simulation tool with the Vizier optimization platform, FireBench delivers efficient, high-fidelity simulations, with all simulations executed on tensor processing units (TPUs). Results indicate that the framework performs well in predicting fire spread rates, fire acceleration, and fire-front intensity. An illustrative benchmarking sketch is included at the end of this entry.
For more details about this project, please visit the following link.
https://github.com/wirc-sjsu/firebench/blob/main/README.md
Related paper link:
https://arxiv.org/abs/2406.08589
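To illustrate the kind of comparison FireBench standardizes, without claiming to reproduce its API, here is a minimal Python sketch that benchmarks two simplified rate-of-spread formulas against a small validation table. The formulas, validation values, and variable names are illustrative assumptions, not part of the FireBench library.

import numpy as np

# Illustrative only: two toy rate-of-spread (ROS) models, not FireBench's actual models.
def ros_model_a(wind_speed, slope):
    # Hypothetical linear response to wind (m/s) and slope (degrees).
    return 0.05 + 0.12 * wind_speed + 0.03 * slope

def ros_model_b(wind_speed, slope):
    # Hypothetical mildly nonlinear wind response.
    return 0.05 + 0.09 * wind_speed ** 1.2 + 0.03 * slope

# Hypothetical validation observations: wind (m/s), slope (deg), observed ROS (m/s).
validation = np.array([
    [2.0, 0.0, 0.30],
    [5.0, 5.0, 0.80],
    [8.0, 10.0, 1.40],
])
wind, slope, ros_obs = validation.T

for name, model in [("model_a", ros_model_a), ("model_b", ros_model_b)]:
    rmse = np.sqrt(np.mean((model(wind, slope) - ros_obs) ** 2))
    print(f"{name}: RMSE against validation ROS = {rmse:.3f} m/s")

FireBench wraps this kind of workflow in a standardized form and adds the further checks described above, such as computational cost, input sensitivity, and validity-domain analysis.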
2. Leafy Spurge: An aerial drone image dataset for weed classification
Invasive plants threaten both agricultural and wildland ecosystems. Leafy spurge (Euphorbia esula) is one such plant, having spread from Eastern Europe across much of North America. To address this challenge, the research team used commercial drones to collect aerial imagery over grasslands in western Montana, USA, and trained image classifiers on this data with modern computer vision methods to better identify and manage the invasive plant. To stimulate further research on detecting and controlling invasive plants, the team released the Leafy Spurge dataset, a valuable resource for machine learning, ecology, and remote sensing; a minimal training sketch is included at the end of this entry.
For more details about this project, please visit the following link.
https://github.com/leafy-spurge-dataset/leafy-spurge-dataset
Related paper link:
https://arxiv.org/abs/2405.03702
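As a minimal sketch of how such a classifier could be trained, the code below fine-tunes a standard torchvision model on drone image crops arranged in class folders. The directory layout, path, and class structure are assumptions for illustration, not the dataset's published loader or the authors' training recipe.

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Assumed local layout (not the official loader): leafy_spurge_crops/train/<class_name>/*.jpg
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("leafy_spurge_crops/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

# Pretrained backbone with a new head sized to the folder classes (e.g. spurge vs. other).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in train_loader:  # one epoch of fine-tuning
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()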
3. Fish-Vista: A dataset for fish trait recognition from images
Fish are an integral part of ecosystems and economies, and studying fish traits is crucial for understanding biodiversity patterns and macroevolutionary trends. Fish-Vista is a large, annotated collection of approximately 60K fish images covering 1,900 species, and it supports several challenging, biologically relevant tasks, including species classification, trait identification, and trait segmentation. The images are museum fish images drawn from the Great Lakes Invasive Network (GLIN), iDigBio, and Morphbank databases. Fish-Vista provides fine-grained labels for the visual traits present in each image, as well as pixel-level annotations of 9 different traits across 2,427 fish images to support trait segmentation and localization; a minimal multi-label sketch of the trait-identification task is included at the end of this entry.
For more details about this project, please visit the following link.
https://github.com/sajeedmehrab/Fish-Vista/blob/main/README.md
Related paper link:
https://arxiv.org/abs/2407.08027
Figure 1: Fish-Vista
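The trait-identification task is, in effect, multi-label classification: several traits can appear in a single image, so each trait gets its own sigmoid output. The sketch below shows that setup under an assumed CSV schema and placeholder trait names; neither reflects Fish-Vista's actual file format.

import pandas as pd
import torch
from PIL import Image
from torch import nn
from torch.utils.data import DataLoader, Dataset
from torchvision import models, transforms

# Assumed schema (not Fish-Vista's actual files): a CSV with an image path column
# and one 0/1 column per trait; the trait names below are placeholders.
TRAITS = ["adipose_fin", "barbel", "multiple_dorsal_fins"]

class FishTraitDataset(Dataset):
    def __init__(self, csv_path):
        self.df = pd.read_csv(csv_path)
        self.tf = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

    def __len__(self):
        return len(self.df)

    def __getitem__(self, i):
        row = self.df.iloc[i]
        image = self.tf(Image.open(row["image_path"]).convert("RGB"))
        labels = torch.tensor(row[TRAITS].values.astype("float32"))
        return image, labels

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(TRAITS))  # one logit per trait
criterion = nn.BCEWithLogitsLoss()                       # independent sigmoid per trait
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for images, labels in DataLoader(FishTraitDataset("fish_vista_traits.csv"), batch_size=16):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()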
4. Arboretum: A Large Multimodal Dataset for Promoting Biodiversity
Arboretum is currently the largest publicly accessible dataset aimed at advancing AI applications in biodiversity. Curated from the iNaturalist citizen science platform and vetted by domain experts, it contains 134.6 million images, exceeding existing datasets in scale by an order of magnitude. Arboretum provides image-language paired data spanning birds, arachnids, insects, plants, fungi, mollusks, and reptiles, with each image annotated with its scientific name, taxonomic detail, and common name, which strengthens AI model training. To demonstrate the dataset's value, the research team released a set of CLIP models trained on a 40-million-image captioned subset and introduced new benchmarks for evaluating zero-shot learning and accuracy across life stages, rare species, confounding species, and taxonomic hierarchies. The dataset is expected to support applications such as pest control, crop monitoring, global biodiversity assessment, and environmental protection; a zero-shot classification sketch is included at the end of this entry.
For more details about this project, please visit the following link.
https://github.com/baskargroup/Arboretum/blob/main/README.md
Related paper link:
https://arxiv.org/abs/2406.17720
Figure 2: Arboretum Database
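Because the released models are CLIP-style, zero-shot species classification amounts to comparing an image embedding against text embeddings of candidate names. The sketch below uses the standard OpenAI CLIP checkpoint from Hugging Face as a stand-in and a hypothetical species list and image path; the actual Arboretum checkpoints should be taken from the repository above.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in checkpoint; substitute the Arboretum-trained CLIP weights released in the repository.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical candidate labels combining scientific and common names, as in the dataset's captions.
candidates = [
    "a photo of Danaus plexippus, the monarch butterfly",
    "a photo of Apis mellifera, the western honey bee",
    "a photo of Quercus robur, the English oak",
]
image = Image.open("observation.jpg")  # hypothetical query image

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
for label, p in zip(candidates, probs):
    print(f"{p:.3f}  {label}")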
5. BIOSCAN-5M: A Multimodal Dataset of Insect Biodiversity
BIOSCAN-5M is a comprehensive dataset of multimodal information for over 5 million insect specimens, including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers (BINs), and geographic information, substantially extending existing image-based biodiversity datasets. The research team demonstrated the impact of multimodal data on classification and clustering accuracy through three benchmark experiments. First, a masked language model pretrained on BIOSCAN-5M's DNA barcode sequences improved species- and genus-level classification performance. Second, zero-shot transfer learning applied to images and DNA barcodes produced meaningful clusterings. Finally, contrastive learning was used to align the modalities in a shared embedding space, enabling cross-modal classification; a sketch of the contrastive objective is included at the end of this entry. The release of this dataset provides a valuable resource for machine learning, ecology, and biodiversity research, and is expected to accelerate the study of insects.
For more details about this project, please visit the following link.
https://github.com/zahrag/BIOSCAN-5M/blob/main/README.md
Related paper link:
https://arxiv.org/abs/2406.12723
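The idea behind the third benchmark, aligning image and DNA-barcode embeddings in a single space, can be summarized by a symmetric contrastive (CLIP-style InfoNCE) objective. The sketch below computes only that loss on random placeholder embeddings; the encoders, embedding size, and batch contents are assumptions rather than the paper's implementation.

import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, dna_emb, temperature=0.07):
    # L2-normalize, then score every image against every DNA barcode in the batch.
    image_emb = F.normalize(image_emb, dim=-1)
    dna_emb = F.normalize(dna_emb, dim=-1)
    logits = image_emb @ dna_emb.t() / temperature
    targets = torch.arange(logits.size(0))  # matching image/DNA pairs lie on the diagonal
    # Symmetric cross-entropy over both directions (image-to-DNA and DNA-to-image).
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Placeholder embeddings standing in for the outputs of an image encoder and a DNA encoder.
batch_size, dim = 8, 512
print(contrastive_loss(torch.randn(batch_size, dim), torch.randn(batch_size, dim)))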
6. AMI Dataset: Advancing Insect Identification in the Field
Insects are a key component of global biodiversity, yet many species face extinction, with serious consequences for ecosystems and agriculture. Because human experts are scarce and monitoring tools are limited, data on insect diversity and abundance remain severely lacking. To address this, ecologists have begun deploying camera traps to record insects and applying computer vision algorithms to process the resulting data. In response to the challenges of field monitoring, such as long-tailed class distributions, visually similar species, and substantial distribution shift between data sources, the research team released the large-scale AMI dataset. It consists of two parts: AMI-GBIF, with 2.5 million insect images sourced from citizen science platforms and museums, and AMI-Traps, with 2,893 expert-annotated images captured by automated camera traps around the world, labeling 52,948 insects. The AMI dataset aims to improve generalization across geographic locations and hardware setups, advancing insect monitoring technology; a class-balanced sampling sketch addressing the long-tail issue is included at the end of this entry.
For more details about this project, please visit the following link.
https://github.com/rolnicklab/ami-dataset?tab=readme-ov-file
Related paper link:
https://arxiv.org/abs/2406.12452
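One practical consequence of long-tailed data is that rare species are almost never seen during training under uniform sampling. A common mitigation, shown here as an illustrative choice rather than the authors' training recipe, is class-balanced (inverse-frequency) sampling; the labels and features below are placeholders.

from collections import Counter

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Hypothetical long-tailed labels: class 0 is very common, classes 1-3 are rare.
labels = torch.tensor([0] * 900 + [1] * 60 + [2] * 30 + [3] * 10)
features = torch.randn(len(labels), 16)  # placeholder features

# Weight each sample by the inverse frequency of its class so rare classes are drawn more often.
counts = Counter(labels.tolist())
weights = [1.0 / counts[int(y)] for y in labels]
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

loader = DataLoader(TensorDataset(features, labels), batch_size=64, sampler=sampler)
batch_labels = next(iter(loader))[1]
print("class counts in one resampled batch:", Counter(batch_labels.tolist()))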
7. Open Animal Tracks: A Dataset for Animal Footprint Recognition
Understanding animal habitats is crucial for conserving terrestrial biodiversity. Identifying animal footprints can provide valuable information about species distribution, abundance, and behavior, but progress has been hindered by the lack of public datasets. To address this, the research team released OpenAnimalTracks, the first publicly available labeled dataset aimed at automatic classification and detection of animal footprints. It covers footprints of 18 wild animal species and establishes benchmarks for species classification and detection. In experiments with representative classifiers and detection models, a Swin Transformer classifier reaches an average accuracy of 69.41% and a Faster R-CNN detector reaches an mAP of 0.295; a detector setup sketch is included at the end of this entry. OpenAnimalTracks lays the groundwork for automated animal tracking, promising to strengthen biodiversity conservation and management.
For more details about this project, please visit the following link.
https://github.com/dahlian00/OpenAnimalTracks/blob/main/README.md
Related paper link:
https://arxiv.org/abs/2406.09647
Figure 3: Dataset for animal footprint recognition
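For the detection benchmark, a COCO-pretrained Faster R-CNN from torchvision can be re-headed for footprint classes. The sketch below shows that setup with one dummy training step; the 18-class count follows the species count above, while the dummy image, box, and label are assumptions and real data loading is omitted.

import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a COCO-pretrained detector and re-head it for footprint classes.
num_classes = 18 + 1  # 18 footprint species plus the background class
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# In training mode the detector takes a list of images and a list of target dicts
# (boxes in xyxy pixel coordinates, integer class labels); these values are dummies.
images = [torch.rand(3, 512, 512)]
targets = [{"boxes": torch.tensor([[100.0, 120.0, 220.0, 260.0]]),
            "labels": torch.tensor([1])}]
model.train()
loss_dict = model(images, targets)
print({name: float(value) for name, value in loss_dict.items()})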
8. AGBD: A global-scale biomass dataset
Accurate estimation of above-ground biomass (AGB) is crucial for addressing climate change and biodiversity loss, yet existing datasets are either limited to specific regions or too coarse to provide high-resolution information at global scale. To fill this gap, the research team released AGBD, a high-resolution, globally distributed benchmark dataset. It combines AGB reference data from the GEDI mission with Sentinel-2 and PALSAR-2 imagery, and additionally includes pre-processed auxiliary features such as dense canopy height maps, elevation maps, and land cover classification maps. The team also generated dense, high-resolution (10 m) AGB prediction maps for the entire area covered by the dataset. The release of this dataset is expected to greatly advance global biomass research and related applications; a patch-regression sketch is included at the end of this entry.
For more details about this project, please visit the following link.
https://github.com/ghjuliasialelli/AGBD/blob/main/README.md
Related paper link:
https://arxiv.org/abs/2406.04928
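Conceptually, AGBD frames biomass mapping as patch-level regression: a multi-band Sentinel-2/PALSAR-2 patch goes in, and a single GEDI-referenced AGB value comes out. The sketch below illustrates that framing with a small CNN on random tensors; the band count, patch size, and architecture are assumptions rather than the paper's reference model.

import torch
from torch import nn

# Hypothetical input: 15-band patches (e.g. Sentinel-2 bands plus PALSAR-2 polarizations and
# auxiliary layers) at 25 x 25 pixels, regressed to a single AGB value in Mg/ha.
class AGBRegressor(nn.Module):
    def __init__(self, in_bands=15):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_bands, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = AGBRegressor()
patches = torch.randn(8, 15, 25, 25)  # placeholder satellite patches
agb_reference = torch.rand(8) * 300   # placeholder GEDI-derived AGB targets (Mg/ha)
loss = nn.functional.mse_loss(model(patches), agb_reference)
loss.backward()
print(float(loss))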