This section describes how EMO-BON handles large data files.
The Challenge¶
Git is designed for text and source code, not large binary files. EMO-BON analysis results often include:
Large sequence files
Processed datasets
Analysis outputs
Binary data
These files are too large for Git and would bloat repositories.
DVC (Data Version Control)¶
EMO-BON uses DVC to manage large files.
What is DVC?¶
DVC is a version control system for data. It:
Tracks large files without storing them in Git
Stores metadata in Git (small .dvc files)
Stores actual data in remote storage (S3)
Maintains version history
Enables reproducible workflows
How DVC Works¶
Add Large File:
dvc add large_file.csv
Creates large_file.csv.dvc (metadata)
Adds large_file.csv to .gitignore
Commit Metadata:
git add large_file.csv.dvc
Git tracks the small .dvc file
Actual data not in Git
Push Data:
dvc push
Uploads data to S3
S3 stores the actual file
Pull Data:
dvc pull
Downloads data from S3
Restores files locally
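To make the idea of a small metadata file concrete, the sketch below computes the kind of information a .dvc file records about a tracked file: a content hash and a size. It is an illustration only, not DVC's implementation; the function name `describe_for_pointer` and the dictionary layout are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def describe_for_pointer(path: Path, chunk_size: int = 1 << 20) -> dict:
    """Return name, MD5 hash, and size in bytes for a file.

    Hypothetical helper: it mimics the spirit of a .dvc pointer file
    but does not reproduce DVC's actual file format.
    """
    digest = hashlib.md5()
    size = 0
    with path.open("rb") as handle:
        # Read in chunks so very large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {"path": path.name, "md5": digest.hexdigest(), "size": size}
```

Because the hash depends only on the file contents, two identical files yield the same entry, which is what lets a few lines of metadata in Git stand in for data stored elsewhere.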
DVC Configuration¶
In .dvc/config:
[remote "s3-storage"]
url = s3://emobon-data/analysis-results
region = eu-west-1
[core]
remote = s3-storage
DVC in Workflows¶
GitHub Actions can use DVC:
steps:
  - name: Setup DVC
    uses: iterative/setup-dvc@v1
  - name: Pull data
    run: dvc pull
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
S3 Storage¶
EMO-BON uses Amazon S3 (Simple Storage Service) for large file storage.
Bucket Structure¶
emobon-data/
├── analysis-results/
│   ├── cluster-01/
│   │   ├── taxonomy-summary.tsv
│   │   ├── functional-annotation.tsv
│   │   └── ...
│   └── cluster-02/
└── sequences/
    └── raw/
Access Control¶
Public: Some datasets are publicly accessible
Private: Sensitive data has restricted access
IAM: AWS IAM controls access permissions
Storage Classes¶
Standard: Frequently accessed data
Infrequent Access: Archived data
Glacier: Long-term preservation
Using Large Files¶
In Analysis Results Crates¶
Generate Large Files: MetaGOflow produces outputs
Add to DVC:
dvc add results/*.tsv
Commit Metadata: Git tracks .dvc files
Push to S3:
dvc push
Reference in RO-Crate: Metadata links to S3 URLs
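The last step, linking RO-Crate metadata to an S3 URL, could be sketched as below. This is a hedged illustration: the helper `file_entity`, the `S3_PREFIX` constant, and the added `"@type": "File"` key are assumptions, and the sketch assumes the file is reachable at a path that follows the bucket structure described in this section.

```python
from pathlib import Path

# Assumption: bucket and prefix follow the analysis-results layout of this section.
S3_PREFIX = "s3://emobon-data/analysis-results"


def file_entity(local_path: Path, cluster: str, name: str, encoding_format: str) -> dict:
    """Build a file entity for ro-crate-metadata.json (hypothetical helper)."""
    return {
        "@id": local_path.name,
        "@type": "File",
        "name": name,
        "contentUrl": f"{S3_PREFIX}/{cluster}/{local_path.name}",
        "encodingFormat": encoding_format,
        # Size in bytes, stored as a string.
        "contentSize": str(local_path.stat().st_size),
    }
```

Reading the size from the local file keeps `contentSize` in step with the data that was actually produced, rather than relying on a hand-typed value.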
Accessing Data¶
Users can access large files:
Via DVC: Clone repo, run dvc pull
Direct Download: S3 URLs in metadata
Web Interface: Links in GitHub Pages sites
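For the direct-download route, an `s3://` URL from the metadata has to be turned into something a browser or HTTP client can fetch. The sketch below converts it to a virtual-hosted-style HTTPS URL. It assumes the object is publicly readable and that the bucket lives in `eu-west-1`, the region given in the DVC configuration; the function name `s3_to_https` is hypothetical.

```python
from urllib.parse import quote, urlparse


def s3_to_https(s3_url: str, region: str = "eu-west-1") -> str:
    """Map s3://bucket/key to https://bucket.s3.<region>.amazonaws.com/key."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    # Percent-encode the key but keep "/" so the path structure survives.
    key = quote(parsed.path.lstrip("/"))
    return f"https://{parsed.netloc}.s3.{region}.amazonaws.com/{key}"
```

Private objects would still need credentials or a pre-signed URL, which this sketch does not cover.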
Example¶
In ro-crate-metadata.json:
{
  "@id": "taxonomy-summary.tsv",
  "name": "Taxonomic Summary",
  "contentUrl": "s3://emobon-data/analysis-results/cluster-01/taxonomy-summary.tsv",
  "encodingFormat": "text/tab-separated-values",
  "contentSize": "524288000"
}
Best Practices¶
When to Use DVC¶
Use DVC for files that are:
Larger than 10 MB
Binary formats
Frequently updated
Part of reproducible workflows
Don’t Use DVC For¶
Source code (use Git)
Small text files (use Git)
Configuration files (use Git)
Documentation (use Git)
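The two lists above can be read as a simple decision rule. The sketch below is one possible encoding of it. The suffix sets are hypothetical examples rather than project policy, and "10 MB" is interpreted here as 10 MiB.

```python
from pathlib import Path

SIZE_THRESHOLD_BYTES = 10 * 1024 * 1024  # "Larger than 10 MB", read as 10 MiB

# Hypothetical suffix lists, chosen only to illustrate the rule.
GIT_SUFFIXES = {".py", ".md", ".yml", ".yaml", ".json", ".toml", ".cfg"}
BINARY_SUFFIXES = {".gz", ".bam", ".h5", ".parquet"}


def belongs_in_dvc(path: Path, size_bytes: int) -> bool:
    """Return True if the file should be tracked with DVC rather than Git."""
    suffix = path.suffix.lower()
    if suffix in GIT_SUFFIXES:
        # Source code, configuration, and documentation stay in Git.
        return False
    if suffix in BINARY_SUFFIXES:
        # Binary formats go to DVC regardless of size.
        return True
    return size_bytes > SIZE_THRESHOLD_BYTES
```

A check like this could run as a pre-commit hook to catch large files before they reach Git history, though the thresholds and suffixes would need to match the project's actual conventions.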
File Organization¶
repository/
├── .dvc/                 # DVC configuration
├── data/
│   ├── raw.csv.dvc       # DVC metadata
│   └── .gitignore        # Ignore actual data files
├── src/                  # Source code (in Git)
└── README.md             # Documentation (in Git)
Performance Optimization¶
Caching¶
DVC caches data locally:
Faster access to frequently used files
Avoids redundant downloads
Cache can be shared across projects
Partial Downloads¶
DVC can download only needed files:
dvc pull data/subset.csv.dvc
Compression¶
Large text files are compressed before upload:
Reduces storage costs
Faster transfers
Transparent to users
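A compression step of this kind might look like the sketch below. The section does not say which tool or format is used, so gzip and the helper name `gzip_file` are assumptions made for illustration.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(source: Path) -> Path:
    """Write a gzip-compressed copy next to source and return its path."""
    target = source.with_name(source.name + ".gz")
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        # Stream the copy so large files are never loaded whole into memory.
        shutil.copyfileobj(src, dst)
    return target
```

Tab-separated text with many repeated values usually compresses well, which is where the storage and transfer savings come from.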
Monitoring and Costs¶
Storage Usage¶
Monitor S3 bucket size
Track costs per repository
Lifecycle policies for old data
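A lifecycle policy for old data could be expressed as the sketch below, which builds a rule in the shape used by S3 lifecycle configurations. The day counts, the rule ID scheme, and the function name `lifecycle_rules` are hypothetical; actual values would depend on how EMO-BON data is accessed.

```python
def lifecycle_rules(prefix: str, ia_after_days: int = 30, glacier_after_days: int = 365) -> dict:
    """Build a lifecycle configuration moving objects to colder storage classes."""
    # Conservative sanity check (an assumption of this sketch); consult the
    # S3 documentation for the exact minimum intervals it enforces.
    if ia_after_days < 30 or glacier_after_days < ia_after_days + 30:
        raise ValueError("transition days too short or out of order")
    rule_id = "archive-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }
```

Serialized with `json.dumps`, the returned dictionary can be reviewed and versioned alongside the repository before it is applied to a bucket.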
Transfer Costs¶
Minimize redundant uploads/downloads
Use caching effectively
Compress data when possible
Backup and Recovery¶
Redundancy¶
S3 provides automatic redundancy
Data replicated across availability zones
99.999999999% (11 9’s) durability
Versioning¶
S3 versioning enabled
Can recover previous versions
Protection against accidental deletion
Disaster Recovery¶
Regular backups to separate bucket
Cross-region replication
Export to institutional archives