This section describes how EMO-BON handles large data files.

The Challenge

Git is designed for text and source code, not large binary files. EMO-BON analysis results often include:

These files are too large for Git and would bloat repositories.

DVC (Data Version Control)

EMO-BON uses DVC to manage large files.

What is DVC?

DVC is a version control system for data. It:

How DVC Works

  1. Add Large File: dvc add large_file.csv

    • Creates large_file.csv.dvc (metadata)

    • Adds large_file.csv to .gitignore

  2. Commit Metadata: git add large_file.csv.dvc .gitignore, then git commit

    • Git tracks the small .dvc file

    • The actual data stays out of Git

  3. Push Data: dvc push

    • Uploads data to S3

    • S3 stores the actual file

  4. Pull Data: dvc pull

    • Downloads data from S3

    • Restores files locally
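To make the role of the .dvc file in step 1 concrete, here is a minimal Python sketch of the kind of information such a pointer holds: a content hash, a size, and a path. The function name describe_file is hypothetical, and this is a simplified illustration under those assumptions, not DVC's actual implementation.

```python
import hashlib
import os
import tempfile


def describe_file(path, chunk_size=1024 * 1024):
    """Return pointer-style metadata (hash, size, path) for one file.

    Simplified illustration only; not DVC's actual implementation.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_path = os.path.join(tmp, "large_file.csv")
        with open(data_path, "w", newline="") as handle:
            handle.write("sample_id,count\nA,1\nB,2\n")
        # The small dictionary, not the data itself, is what Git would track.
        print(describe_file(data_path))
```

Because the pointer is derived from the file's content, any change to the data produces a different hash, which is how a small text file in Git can identify one exact version of a large file stored elsewhere.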

DVC Configuration

In .dvc/config:

[remote "s3-storage"]
    url = s3://emobon-data/analysis-results
    region = eu-west-1

[core]
    remote = s3-storage
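The link between the two sections is that core.remote names which remote section is the default. The Python sketch below resolves that link using the standard-library configparser on the exact text shown above. The function name default_remote is hypothetical, and config files written by DVC itself may quote the section header differently, which this sketch does not handle.

```python
import configparser

# The configuration shown above, embedded so the example is self-contained.
CONFIG_TEXT = '''
[remote "s3-storage"]
    url = s3://emobon-data/analysis-results
    region = eu-west-1

[core]
    remote = s3-storage
'''


def default_remote(config_text):
    """Return (name, settings) for the remote named by core.remote."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    name = parser["core"]["remote"]
    # Remote sections are named: remote "<name>"
    section = parser[f'remote "{name}"']
    return name, dict(section)


if __name__ == "__main__":
    name, settings = default_remote(CONFIG_TEXT)
    print(name, settings["url"], settings["region"])
```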

DVC in Workflows

GitHub Actions can use DVC:

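steps:
  # Check out the repository so the .dvc files and .dvc/config are available
  - name: Checkout
    uses: actions/checkout@v4

  - name: Setup DVC
    uses: iterative/setup-dvc@v1

  # Credentials for the S3 remote come from repository secrets
  - name: Pull data
    run: dvc pull
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}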

S3 Storage

EMO-BON uses Amazon S3 (Simple Storage Service) for large file storage.

Bucket Structure

emobon-data/
├── analysis-results/
│   ├── cluster-01/
│   │   ├── taxonomy-summary.tsv
│   │   ├── functional-annotation.tsv
│   │   └── ...
│   └── cluster-02/
└── sequences/
    └── raw/

Access Control

Storage Classes

Using Large Files

In Analysis Results Crates

  1. Generate Large Files: MetaGOflow produces outputs

  2. Add to DVC: dvc add results/*.tsv

  3. Commit Metadata: Git tracks .dvc files

  4. Push to S3: dvc push

  5. Reference in RO-Crate: Metadata links to S3 URLs
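Steps 2 to 4 can be scripted. The Python sketch below only assembles the command lines and prints them by default; nothing is executed unless dry_run is set to False. The function names publish_commands and run_all, the commit message, and the file names are all hypothetical. It assumes that dvc add writes a .dvc pointer next to each file and a .gitignore in the same directory.

```python
import os
import subprocess


def publish_commands(paths, message="Track analysis outputs with DVC"):
    """Return the dvc/git command lines for steps 2 to 4, in order."""
    pointer_files = [f"{path}.dvc" for path in paths]
    # dvc add writes a .gitignore beside each tracked file.
    ignore_files = sorted(
        {os.path.join(os.path.dirname(path), ".gitignore") for path in paths}
    )
    return [
        ["dvc", "add", *paths],
        ["git", "add", *pointer_files, *ignore_files],
        ["git", "commit", "-m", message],
        ["dvc", "push"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical output files; a real run would list MetaGOflow results.
    run_all(publish_commands(["results/taxonomy-summary.tsv"]))
```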

Accessing Data

Users can access large files:

Example

In ro-crate-metadata.json:

{
  "@id": "taxonomy-summary.tsv",
  "@type": "File",
  "name": "Taxonomic Summary",
  "contentUrl": "s3://emobon-data/analysis-results/cluster-01/taxonomy-summary.tsv",
  "encodingFormat": "text/tab-separated-values",
  "contentSize": "524288000"
}
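An entity like the one above could be generated from a local file rather than written by hand. The Python sketch below is one possible way to do that. The helper file_entity and the MEDIA_TYPES table are hypothetical, and the sketch assumes the file is reachable under the given S3 prefix with its original name.

```python
import json
import os
import tempfile

# Assumed mapping from file extension to media type (illustrative only).
MEDIA_TYPES = {".tsv": "text/tab-separated-values", ".csv": "text/csv"}


def file_entity(local_path, name, s3_prefix):
    """Build a File entity dict for a local file stored under s3_prefix."""
    filename = os.path.basename(local_path)
    extension = os.path.splitext(filename)[1].lower()
    return {
        "@id": filename,
        "@type": "File",
        "name": name,
        "contentUrl": f"{s3_prefix.rstrip('/')}/{filename}",
        "encodingFormat": MEDIA_TYPES.get(extension, "application/octet-stream"),
        # Size in bytes, stored as a string as in the example above.
        "contentSize": str(os.path.getsize(local_path)),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "taxonomy-summary.tsv")
        with open(path, "w") as handle:
            handle.write("taxon\tcount\nBacteria\t42\n")
        entity = file_entity(
            path,
            "Taxonomic Summary",
            "s3://emobon-data/analysis-results/cluster-01",
        )
        print(json.dumps(entity, indent=2))
```

For scale, the contentSize of 524288000 bytes in the example corresponds to 500 MiB.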

Best Practices

When to Use DVC

Use DVC for files that are:

Don’t Use DVC For

File Organization

repository/
├── .dvc/           # DVC configuration
├── data/
│   ├── raw.csv.dvc # DVC metadata
│   └── .gitignore  # Ignore actual data files
├── src/            # Source code (in Git)
└── README.md       # Documentation (in Git)

Performance Optimization

Caching

DVC caches data locally:

Partial Downloads

DVC can download only needed files:

dvc pull data/subset.csv.dvc

Compression

Large text files are compressed before upload:
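As a minimal sketch of this step, the Python example below compresses a text file with gzip before it would be added to DVC. The compression format is an assumption, since this page does not name one, and the function name gzip_file is hypothetical.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(source_path):
    """Write a gzip copy next to source_path and return the new path."""
    target_path = source_path + ".gz"
    with open(source_path, "rb") as src, gzip.open(target_path, "wb") as dst:
        # Stream the data so large files are never loaded fully into memory.
        shutil.copyfileobj(src, dst)
    return target_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "functional-annotation.tsv")
        with open(source, "w") as handle:
            # Repetitive tabular text, which typically compresses well.
            handle.write("gene\tannotation\tcount\n" * 10000)
        compressed = gzip_file(source)
        print(os.path.getsize(source), os.path.getsize(compressed))
```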

Monitoring and Costs

Storage Usage

Transfer Costs

Backup and Recovery

Redundancy

Versioning

Disaster Recovery