Reference

Glossary

A comprehensive list of terms and definitions used throughout the EMO-BON project.

Physical concepts

Partner:
An EMO BON member, which is typically but not exclusively an institute.
Station:
An EMO BON Station. Stations collect EMO BON samples. Stations may have multiple observatories, and each observatory can involve contributions from one or more partners.
Observatory:
An EMO BON organisational unit linked to the collection of a specific sample type (e.g. water column, soft sediment) from a fixed, pre-determined location. While technically an observatory is tied to a sample type, this distinction is often ignored in casual use since the observatory’s base name (obs_id) is the same for all sample types.
- Notes: Definition may need an update once the ARMS units are fully incorporated in EMO BON.
Sampling event:
A sampling action performed at a particular observatory at a specific time, resulting in the collection of one or more samples.
Material Sample:
Refers to a material sample collected during a sampling event. Each unique material sample has a unique material sample ID. Also used to refer to the sample material that was sequenced, where the physical sample no longer exists but it is virtually present via its sample ID.

Digital concepts

Logsheet:
The spreadsheets in which the observatories record their sample and event data. The source spreadsheets are on the EMO BON Google Drive, from where they are harvested as CSV into EMO BON's GitHub space. The “transformed” logsheets are those that have been subjected to a date-range selection and quality control (QC).
EMO BON data:
Here we mean: the content of the logsheets, which are filled by the observatories to describe their collected samples; the sequences in ENA; the outputs from bioinformatics processing.
- Notes: And once ARMS units are incorporated, this will also include ARMS images.
EMO BON metadata:
Here we mean the data used specifically to describe EMO BON data, allowing it to be discovered, understood, organised, catalogued, etc. Metadata are recorded in the ro-crate-metadata.json files, are added to ENA accessions, and are held in files in the EMO BON repositories governance-data, sequencing-data and observatory-profile, among others.
EMO BON record:
A digital representation of a sampling event, capturing the relevant data and metadata associated with it. There is no fixed idea of what is included in an EMO BON record, as that depends on the system that these records are being held in; for example, EMO BON records in EurOBIS and in Blue Cloud will not necessarily be the same.
Catalogue asset:
The smallest unit of “EMO BON dataset” that goes into a dataset’s metadata catalogue, i.e. a specific “EMO BON record” in a specific catalogue. Can be a single data file or a set of files.
EMO BON Repository:
A GitHub repository* that contains EMO BON data and metadata.
*A GitHub repository represents a storage location for files and their version history, managed using Git version control which allows users to track changes, collaborate with others, and maintain a complete record of the project’s development over time.
EMO BON RO-Crate:
EMO BON RO-Crates* contain data associated with logsheets from observatories, MetaGOflow runs, and sequencing metadata. Usually each RO-Crate is a single repository, but in some cases one repository contains multiple RO-Crates. An RO-Crate is manifested by a ro-crate-metadata.json file.
*An RO-Crate is a collection of data files, metadata, and contextual information that organizes research data in a structured format, enabling easy sharing, reuse, and understanding in both machine-readable and human-readable forms.
EMO BON ro-crate-metadata.json:
A ro-crate-metadata.json file* that describes the contents of an EMO BON RO-Crate.
*A ro-crate-metadata.json file is a JSON-LD file that provides a detailed description of the contents and structure of an RO-Crate. It maps relationships between files and their metadata, ensuring traceability, context, and purpose for data within research workflows.
Sequence:
A DNA string. Specifically, we mean (raw) sequences as produced from the material samples by Genoscope and held on their cloud drive and then archived on ENA.
processed sequences / OTUs / ASVs:
These are sequences that have been processed by a bioinformatics pipeline to a stage where they can be, or have been, compared to taxonomic reference libraries.
Stub:
N/A

URI namespace

Below is a comprehensive overview of the URI namespaces used within EMO-BON, including defined patterns for RO-profile, RO-Crate, and data entity URIs. These URI patterns ensure consistency and interoperability, and they are designed so that EMO-BON data can be dereferenced and published as static content, without reliance on a triple store.

Overview

Notes:

The base URL for all of these is http(s)://data.emobon.embrc.eu

For Entities of Type	URI split into `/repository` part	`/path-to/file.ext` part	`#fragment-identifier` part
RO-Profiles
ro-profiles	`/{name}-profile`	`/{version}`
RO-Crates
governance-crate	`/governance-crate`
observatory-crate	`/observatory-{obsid}-crate`
analysis-results-crate	`/analysis-results-{cluster}-crate`
sequencing-crate	`/sequencing-crate`
Data Entities
Observatory	`/observatory-{obs_id}-crate`	`/{env_package}/observatory/{obs_id}`
Sampling event	`/observatory-{obs_id}-crate`	`/{env_package}/sampling/{sampling_event}`
Sample	`/observatory-{obs_id}-crate`	`/{env_package}/sample/{source_mat_id}`
Observation	`/observatory-{obs_id}-crate`	`/{env_package}/observation/{source_mat_id}`	`#{observedProperty}`

Taxon summary	`/analysis-results-{cluster}-crate`	`/{genoscopeID}/taxonomy-summary`
Functional annotation	`/analysis-results-{cluster}-crate`	`/{genoscopeID}/functional-annotation`	`#{annotationID}`

batch	`/sequencing-crate`	`/shipment/batch/{batchID}`
sequence-run	`/sequencing-crate`	`/shipment/batch/{batchID}`	`#SequenceAnalysis`
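As an illustration of how the three columns of the overview combine into one URI, here is a minimal Python sketch. The helper names (`entity_uri`, `observation_uri`, `batch_uri`) and all example values are hypothetical, the choice of `https` is an assumption, and the path segments follow the overview table above rather than any published implementation.

```python
from urllib.parse import quote

# Base URL from the overview; using https (rather than http) is an assumption.
BASE_URL = "https://data.emobon.embrc.eu"


def entity_uri(repository: str, path: str, fragment: str = "") -> str:
    """Join repository part, path part and optional fragment into one URI."""
    uri = f"{BASE_URL}/{repository.strip('/')}/{path.strip('/')}"
    if fragment:
        uri += "#" + quote(fragment, safe="")
    return uri


def observation_uri(obs_id: str, env_package: str, source_mat_id: str,
                    observed_property: str) -> str:
    """Observation row: observatory crate, path part, and fragment identifier."""
    return entity_uri(
        f"observatory-{obs_id}-crate",
        f"{env_package}/observation/{quote(source_mat_id, safe='')}",
        observed_property,
    )


def batch_uri(batch_id: str) -> str:
    """Batch row: all batches live in the single sequencing crate."""
    return entity_uri("sequencing-crate",
                      f"shipment/batch/{quote(batch_id, safe='')}")


# Placeholder identifiers, not real EMO BON values.
print(observation_uri("bpns", "water", "SAMPLE_001", "temperature"))
print(batch_uri("001"))
```

Percent-encoding the identifiers keeps the result a valid URI even if an identifier contains characters such as spaces or slashes; whether real identifiers need this is not stated in the source.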

Note - TODO - a similar overview could be useful for the most important external things we refer to

ena entities by «accession no» :: https://www.ebi.ac.uk/ena/browser/view/{accession no}
S3 objects stored via dvc :: https://TBD
other?

RO-Profiles

RO-profiles are Research Object (RO) profiles: they describe conformity to a fixed set of expectations for data structures. They ensure consistency in the way research data is formatted and provide practical assets such as templates, SHACL files, documentation, and other reference materials. RO-profiles are used as metadata descriptors for crates that comply with specific profile expectations, which helps standardize research data management. More information is available in the RO-Crate profiles documentation.

Creation & Management:
https://github.com/emo-bon/{name}-profile
URI Format:
https://data.emobon.embrc.eu/{name}-profile/{version}
- {name}: observatory | sequencing | analysis-results
- {version}: latest or a specific version vM.m.p. (e.g., v1.0.0)
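The profile URI pattern can be checked mechanically. The following Python sketch builds a profile URI and rejects names or versions outside the ranges listed above; the function name `profile_uri` is hypothetical, and the regular expression is one possible reading of the `vM.m.p` notation (three dot-separated integers).

```python
import re

BASE_URL = "https://data.emobon.embrc.eu"

# Allowed values taken from the {name} and {version} bullets above.
PROFILE_NAMES = {"observatory", "sequencing", "analysis-results"}
# Assumption: vM.m.p means "v" followed by three dot-separated integers.
VERSION_PATTERN = re.compile(r"^(latest|v\d+\.\d+\.\d+)$")


def profile_uri(name: str, version: str = "latest") -> str:
    """Return the URI of an RO-profile, validating both placeholders."""
    if name not in PROFILE_NAMES:
        raise ValueError(f"unknown profile name: {name!r}")
    if not VERSION_PATTERN.match(version):
        raise ValueError(f"version must be 'latest' or vM.m.p, got {version!r}")
    return f"{BASE_URL}/{name}-profile/{version}"


print(profile_uri("observatory"))
print(profile_uri("analysis-results", "v1.0.0"))
```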

RO-Crates

Crates are structured data packages that encapsulate various types of research data. They serve as containers for datasets, metadata, and other related information, ensuring consistency, traceability, and compliance with RO-profiles. Within the EMO-BON ecosystem, we categorize crates into different types, each serving a specific purpose:

Governance Crate

Concept:
Contains information relating to the governance of EMO BON GitHub activities and actions.
URI Format:
https://data.emobon.embrc.eu/governance-crate/
Creation & Management:
https://github.com/emo-bon/governance-crate/

Observatory Crate

Concept:
Holds information about an EMO BON observatory, its associated sampling events, and the data generated from those events.
Creation & Management:
https://github.com/emo-bon/observatory-{obsid}-crate/
URI Format:
https://data.emobon.embrc.eu/observatory-{obsid}-crate
- {obsid} = …

Analysis-Results Crate

Concept:
Holds information and data resulting from MetaGOflow data analysis processes.
Creation & Management:
https://github.com/emo-bon/analysis-results-{cluster}-crate/
URI Format:
https://data.emobon.embrc.eu/analysis-results-{cluster}-crate/
- {cluster} = …

Sequencing Crate

Concept:
Holds information and data resulting from sequencing runs.
Creation & Management:
https://github.com/emo-bon/sequencing-crate
URI Format:
https://data.emobon.embrc.eu/sequencing-crate
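To make the link between a crate and its RO-profile concrete, the sketch below generates a minimal ro-crate-metadata.json document in Python. The metadata descriptor and root dataset follow the RO-Crate 1.1 layout; declaring the profile through `conformsTo` on the root dataset is the usual RO-Crate profile convention, but whether EMO BON crates are written exactly this way is an assumption, as are the function name and the crate name used here.

```python
import json

# Assumed profile URI, built from the RO-Profiles URI format with "latest".
PROFILE_URI = "https://data.emobon.embrc.eu/sequencing-profile/latest"


def minimal_crate_metadata(name: str, profile_uri: str) -> dict:
    """Build the smallest useful ro-crate-metadata.json content."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                # Metadata descriptor: identifies this file and the root dataset.
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                # Root dataset: the crate itself, pointing at its profile.
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "conformsTo": {"@id": profile_uri},
            },
        ],
    }


metadata = minimal_crate_metadata("EMO BON sequencing crate", PROFILE_URI)
print(json.dumps(metadata, indent=2))
```

A real crate would add one entry to `@graph` per file or contextual entity it describes.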

Data Entities

The RDF-based triple files within each crate collectively form the EMO-BON data graph, with entities as its core components. Each entity has a unique URI and is defined by properties and connections to other entities. This ensures structured data, consistent referencing within the knowledge graph, and allows for efficient data retrieval and analysis within the EMO-BON ecosystem. Below is an overview of the primary entities, their roles, and their respective URI formats.

Observatory

Concept:
An EMO BON organisational unit linked to the collection of a specific sample type (e.g. water column, soft sediment) from a fixed, pre-determined location.
URI Format:
http://data.emobon.embrc.eu/observatory-{obs_id}-crate/{env_package}/observatory/{obs_id}
- {obs_id}: …
- {env_package}: …

Sampling Event

Concept:
A sampling action performed at a particular observatory at a specific time, resulting in the collection of one or more samples.
URI Format:
http://data.emobon.embrc.eu/observatory-{obs_id}-crate/{env_package}/sampling/{sampling_event}
- {obs_id}: …
- {env_package}: …
- {sampling_event}: …

Sample

Concept:
A material sample collected during a sampling event.
URI Format:
http://data.emobon.embrc.eu/observatory-{obs_id}-crate/{env_package}/measured/{source_mat_id}
- {obs_id}: …
- {env_package}: …
- {source_mat_id}: …

Sample Replicate

Concept:
A material sample collected during a sampling event.
URI Format:
http://data.emobon.embrc.eu/observatory-{obs_id}-crate/{env_package}/sampling/{source_mat_id}
- {obs_id}: …
- {env_package}: …
- {source_mat_id}: …

Observation

Concept:
A measurement or observation made on a sample.
URI Format:
http://data.emobon.embrc.eu/observatory-{obs_id}-crate/{env_package}/measured/{source_mat_id}#{observedProperty}
- {obs_id}: …
- {env_package}: …
- {source_mat_id}: …
- {observedProperty}: …

Batch

Concept:
…
URI Format:
http://data.emobon.embrc.eu/shipment/batch/{batchID}
- {batchID}: …

Sequence Run

Concept:
…
URI Format:
http://data.emobon.embrc.eu/shipment/batch/{batchID}#SequenceAnalysis
- {batchID}: …

Taxon

Concept:
…
URI Format:
tbd

FunctionalAnnotation

Concept:
…
URI Format:
tbd

Ontologies

To contain a description of what we have in this area and where (/ns) it is published.

RO-Profiles

analysis-results-profile

To describe our analysis-results profile, link to its rocrate-to-pages output, and explain how conforming crates (and their Git repositories) are structured, i.e. how content is organized in a crate.

observatory-profile

To describe our observatory profile, link to its rocrate-to-pages output, and explain how conforming crates (and their Git repositories) are structured, i.e. how content is organized in a crate.

sequencing-profile

To describe our sequencing profile, link to its rocrate-to-pages output, and explain how conforming crates (and their Git repositories) are structured, i.e. how content is organized in a crate.

Software Components

Context

EMO BON = EMBRC Marine Omics Biodiversity Observation Network
Environmental DNA (eDNA) metabarcoding is applied to samples taken at sea (either water or sediment)
Species occurrences may be represented via RDF or DwC-A (species + location + time)

Relating samples with observatories (Part I)

Relevant repositories:

https://github.com/emo-bon/governance-data
https://github.com/emo-bon/repo-constructor-action

Samples are taken by observatories. Each observatory has an observatory id and a predefined ENA project accession number. All observatories are grouped under a single, predefined ENA umbrella project, PRJEB51688.

A single observatory may be operated by multiple organizations. For example, the observatory identified by BPNS is operated by Ghent University (UGENT), Flanders Marine Institute (VLIZ), Royal Belgian Institute of Natural Sciences (RBINS) and Katholieke Universiteit Leuven (KULeuven).

A single observatory may take multiple samples. Therefore, each observatory maintains a list of the samples taken (in Google Sheets), along with their unique identifier (sample id) and other relevant attributes. These spreadsheets are known as “logsheets” (see https://github.com/emo-bon/governance-data/blob/main/logsheets.csv).

In order to manage the observatories’ data on GitHub, a repository is automatically constructed for each observatory via a GitHub action, repo-constructor-action, acting on the governance-data repository. More specifically, this action reads the logsheets.csv file and generates a repository with these properties:

observatory id (repository name becomes observatory-{observatory_id}-crate)
Google Sheets URLs
RO-Crate profile URI
Downstream GitHub action workflow (see Part II)

The properties are eventually stored in the newly created repo under ./config/workflow_properties.yml
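The mapping from logsheets.csv rows to repositories described above can be sketched in Python as follows. This is not the code of repo-constructor-action: the CSV column names, the example URLs, the property keys, and the lowercasing of the observatory id (suggested only by the repository name observatory-bpns-crate) are all assumptions made for illustration, and the downstream workflow property is left out.

```python
import csv
import io

# Hypothetical stand-in for logsheets.csv; the real column names may differ.
LOGSHEETS_CSV = """observatory_id,water_column_url,soft_sediment_url
BPNS,https://example.org/sheet-a,https://example.org/sheet-b
"""


def plan_repositories(csv_text: str, profile_uri: str) -> list:
    """Derive one repository plan per observatory row in the CSV text."""
    plans = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: repository names use the lowercased observatory id.
        obs_id = row["observatory_id"].strip().lower()
        urls = [value for key, value in row.items()
                if key and key.endswith("_url") and value]
        plans.append({
            "repository": f"observatory-{obs_id}-crate",
            # Values that would end up in ./config/workflow_properties.yml.
            "properties": {
                "observatory_id": obs_id,
                "logsheet_urls": urls,
                "rocrate_profile_uri": profile_uri,
            },
        })
    return plans


plans = plan_repositories(
    LOGSHEETS_CSV, "https://data.emobon.embrc.eu/observatory-profile/latest")
for plan in plans:
    print(plan["repository"], plan["properties"]["logsheet_urls"])
```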

Relating samples with observatories (Part II)

Relevant repositories:

https://github.com/emo-bon/observatory-bpns-crate
https://github.com/emo-bon/observatory-profile

Once the observatory-{observatory_id}-crate repository is generated, a series of GitHub actions acts on it:

logsheet-downloader-action: Downloads the spreadsheets from Google Sheets and stores them under the ./logsheets folder, with each spreadsheet tab split out into a separate CSV file. The download is scheduled to occur every 6 months.
data-quality-control-action: Reads the data_quality_control_threshold_date from ./config/workflow_properties.yml and runs a data quality control pipeline, repairing data where possible. Initially, the logsheets are filtered up to the threshold date and stored under ./logsheets-filtered. Next, data rules and corresponding repairs are applied to the filtered data and the results are stored under ./logsheets-transformed. Violations, errors and warnings are reported under ./data-quality-control:
- dqc.csv: Full list of data rule violations.
- logfile: Full list of errors and warnings.
- report.csv: Reduced list of data rule violations, with only the violations that can’t be repaired automatically.
Eventually, a GitHub issue is created, pointing the end user to the logfile and report.
rocrate-sembench-setup: Prepares for the next action, semantify, by initializing an RO-Crate from a default profile if necessary and by assembling the required files (i.e. files coming from the observatory-profile) into the ~sembench_data_cache folder and the required variables into the ~sembench_kwargs.json file. These steps are not handled by semantify, because we wanted to separate rocrate-specific logic from pysembench logic on a conceptual level. The utility files produced by this action and used by semantify are excluded from version control via .gitignore.
TODO: semantify
- generate ttl (with pysubyt task)
- validate ttl (with pyshacl task)
- generate ldes feed
- create list of generated items for reuse by rocrate-validate
TODO: rocrate-validate
- validate
- repair
TODO: rocrate-to-pages
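The first step of data-quality-control-action, filtering logsheet rows up to the threshold date, can be sketched in Python as below. This is an illustrative sketch only, not the action's code: the column names, the inclusive comparison, and the shape of the violation records are assumptions, and the threshold is passed in directly instead of being read from ./config/workflow_properties.yml.

```python
from datetime import date


def filter_until(rows, threshold, date_column="collection_date"):
    """Split rows into those dated on or before the threshold and violations.

    Rows whose date is missing or not in ISO format (YYYY-MM-DD) are reported
    as violations instead of being silently dropped.
    """
    kept, violations = [], []
    for index, row in enumerate(rows):
        try:
            collected = date.fromisoformat(row[date_column])
        except (KeyError, TypeError, ValueError):
            violations.append({"row": index, "rule": "valid ISO date required"})
            continue
        # Assumption: the threshold date itself is included.
        if collected <= threshold:
            kept.append(row)
    return kept, violations


# Hypothetical logsheet rows; real logsheets carry many more columns.
rows = [
    {"source_mat_id": "A", "collection_date": "2022-05-01"},
    {"source_mat_id": "B", "collection_date": "2023-01-15"},
    {"source_mat_id": "C", "collection_date": "not-a-date"},
]
kept, violations = filter_until(rows, date(2022, 12, 31))
print([row["source_mat_id"] for row in kept])
print(violations)
```

Keeping violations separate from the filtered rows mirrors the split between ./logsheets-filtered and the reports under ./data-quality-control.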

Relating samples with sequencing runs

Relevant repositories:

https://github.com/emo-bon/sequencing-data
https://github.com/emo-bon/sequencing-profile

Samples coming from several observatories are aggregated into a single batch by EMBRC and are sent to Genoscope for DNA sequencing. A batch thus consists of a list of sample identifiers.

For each sample in the batch, an ENA sample accession number needs to be generated and immutably stored.

Genoscope will upload the sequencing data under a run accession number linked to the given sample accession number.

TODO …

Relating samples with species occurrences

TODO

metagoflow
uses existing computational workflow profile

References

European Marine Biological Resource Centre (EMBRC)
Resource Description Framework (RDF)
Darwin Core Archive (DwC-A)
European Nucleotide Archive (ENA)
Research Object Crate (RO-Crate)
