Metadata and QC
Metadata files are all stored on OSF in the AllTheBacteria Metadata component.
These files cover the INSDC metadata, which samples have been processed, and the results of running the assembly pipeline (and related tools). They include:
- ENA metadata (this is a snapshot at the time AllTheBacteria was updated to add more samples)
- Sample status at a high level: included in AllTheBacteria, or rejected for some reason when running the assembly pipeline
- Sylph results on the reads, and species calls made from the Sylph results
- Assembly statistics and checkm2 output
- Nucmer matches from aligning the assembly contigs to the human genome
- “High quality” samples (defined below)
Metadata availability
The metadata are available in two forms:
- Flat files. These are described below. This is how all of the data was released when we first uploaded everything to OSF in 2024.
- In an SQLite database, available for releases 2024-08 and 2025-05. It gathers together all of the ENA metadata and assembly metadata, plus the sylph, checkm2 and assembly stats output. The 2024-08 database contains more stringent checks on the metadata, and so more samples are flagged as possibly unreliable than in the flat files. From 2025-05 onwards, the database has the same information as the flat files.
This page describes the flat files, which are simpler than the database. The details of the metadata and the SQLite database are complicated, and are described on a separate page: SQLite metadata
Metadata files
The latest complete set of data is release 0.2 plus incremental releases
2024-08 and 2025-05. The latest metadata files for this set are in the
Aggregated/Latest_2025-05/ folder of the Metadata
component.
Status file
The latest status of all processed samples is in the file status.202505.tsv.gz. It tracks the result of trying to download the reads, run sylph, assemble, and then human decontamination. The columns are:
- Sample: the sample accession (SAM…)
- Status: status of the sample. This is either “PASS”, meaning that the pipeline finished successfully and we have an assembly, or “FAIL:…” if it failed and for what reason
- Dataset: the dataset the sample belongs to
- HQ_filter: “PASS” means this sample is in the “high quality” dataset
- Assembly_on_OSF: 1 or 0 to indicate if the assembly is available from OSF (and AWS). The complete set of AllTheBacteria assemblies is the samples with 1 in this column.
Note that the older status file for 2024-08 did not have the HQ_filter and
Assembly_on_OSF columns, and had an extra column, Comments.
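As an illustration of how the status file might be used, here is a minimal Python sketch that selects the samples with 1 in the Assembly_on_OSF column. It assumes the file is tab-separated with a header line that uses the column names listed above. The rows in EXAMPLE, including the accessions, the dataset name, the failure reason and the non-PASS HQ_filter value, are made up for illustration, and the function names are hypothetical.

```python
import csv
import gzip
import io


def read_status(handle):
    """Return the rows of a tab-separated status file as a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def assemblies_on_osf(rows):
    """Return the accessions of samples with 1 in the Assembly_on_OSF column."""
    return [row["Sample"] for row in rows if row["Assembly_on_OSF"] == "1"]


# Made-up rows for illustration only; real values will differ.
EXAMPLE = (
    "Sample\tStatus\tDataset\tHQ_filter\tAssembly_on_OSF\n"
    "SAMN00000001\tPASS\texample_dataset\tPASS\t1\n"
    "SAMN00000002\tFAIL:example_reason\texample_dataset\tNA\t0\n"
)

rows = read_status(io.StringIO(EXAMPLE))
print(assemblies_on_osf(rows))

# For the real file (assumed to have been downloaded to the current directory):
# with gzip.open("status.202505.tsv.gz", "rt", newline="") as handle:
#     rows = read_status(handle)
```

Reading by column name rather than by position means the same code tolerates the older 2024-08 layout for any columns that both versions share.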
Sample lists
The file sample_list.txt.gz lists all samples that have an assembly.
For aggregated data, it is the samples that have “PASS” in the “Status”
column of the status file (described above).
All of the samples in sample_list.txt.gz will be in the files
described later (sylph, checkm2 etc). Those files will contain more
samples because not every sample results in an assembly. For example,
the reads for a given sample could be downloaded and sylph run
successfully, and then the assembly fails. That sample would have sylph
results, but no assembly, and so does not appear in
sample_list.txt.gz.
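The relationship between the sample list and the other files can be sketched in Python as a set difference. This assumes sample_list.txt.gz contains one accession per line, which is not stated on this page. The accessions and function names below are made up for illustration.

```python
import io


def read_sample_list(handle):
    """Read one accession per line, skipping blank lines."""
    return [line.strip() for line in handle if line.strip()]


def without_assembly(samples_with_results, assembled):
    """Return samples that have results (e.g. from sylph) but no assembly."""
    return sorted(set(samples_with_results) - set(assembled))


# Made-up accessions for illustration only.
assembled = read_sample_list(io.StringIO("SAMN00000001\nSAMN00000003\n"))
sylph_samples = ["SAMN00000001", "SAMN00000002", "SAMN00000003"]

# SAMN00000002 has sylph results, but its assembly failed.
print(without_assembly(sylph_samples, assembled))
```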
ENA metadata
When processing new samples, the first thing we do is download all
metadata from the ENA for all bacteria. The results are in
ena_metadata.YYYYMMDD.tsv.gz (where YYYYMMDD is the download date from
the ENA) providing a snapshot at the time of download. These files are
only included with each individual release. We do not make an aggregated
file across releases, since it does not really make sense to do so.
Sylph
After downloading the reads, sylph is run on them to get species
abundances. The results are in the file sylph.tsv.gz, which is the
original sylph output, except for these differences:
- The Sample_file column is replaced with the INSDC accession columns Sample and Run.
- An extra column Species is added, which is a species call from the Genome_file column, using GTDB species names.
Some samples have no matches and there is no output - these samples are
listed in the file sylph.no_matches.tsv.gz.
We also try to make a species call from the sylph output, which can be
found in species_calls.tsv.gz. The method used to make these calls has
changed over time. Please see species calls for more
information.
- r0.2 and incremental release 2024-08 used a naive method of requiring a sylph match with more than 99% abundance
- Incremental release 2025-05 used a more stringent method. The aggregated file species_calls.tsv.gz for everything up to and including release 2025-05 has the old calls in the column sylph_species_pre_202505 and the old “high quality” call in in_hq_pre_202505. The updated species and high quality calls are in the columns sylph_species and HQ.
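To see which samples are affected by the change of method, the old and new columns can be compared. The Python sketch below assumes the aggregated species calls file is tab-separated with a header line, and that it has a Sample column alongside sylph_species_pre_202505 and sylph_species. The Sample column name, the example rows and the "unknown" placeholder are assumptions made for illustration.

```python
import csv
import io


def changed_calls(handle):
    """Map each sample to (old call, new call) where the two calls differ."""
    changed = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        old = row["sylph_species_pre_202505"]
        new = row["sylph_species"]
        if old != new:
            changed[row["Sample"]] = (old, new)
    return changed


# Made-up rows for illustration only; "unknown" is a hypothetical placeholder.
EXAMPLE = (
    "Sample\tsylph_species_pre_202505\tsylph_species\n"
    "SAMN00000001\tEscherichia coli\tEscherichia coli\n"
    "SAMN00000002\tEscherichia coli\tunknown\n"
)

changed = changed_calls(io.StringIO(EXAMPLE))
print(changed)
```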
Decontamination
After assembly, we use nucmer to align the contigs to the human genome
(plus HLA sequences). Matching contigs are removed from the assembly.
The complete nucmer output is given in human_nucmer.gz. We do not
provide an aggregated nucmer file of the latest data because it is
relatively large.
Assembly statistics
The results of running assembly-stats (from
https://github.com/sanger-pathogens/assembly-stats) are provided in
assembly-stats.tsv.gz.
Checkm2
The results of running checkm2 are provided in checkm2.tsv.gz. The
columns in the output file are the original output from checkm2 but with
the first “Name” column replaced with “Sample”, and then the values are
the INSDC sample accession IDs.
High quality dataset
We define a high quality dataset for each release. For releases up to and including 2024-08, the requirements were:
- A sylph species call with at least 99% abundance. If a sample has more than one call (e.g. because it has more than one run), all species calls must be the same
- Minimum checkm2 completeness of 90%
- Maximum checkm2 contamination of 5%
- Total assembly length between 100kbp and 15Mbp
- Maximum number of contigs 2,000
- Minimum N50 2,000
From 2025-05 onwards, the sylph parsing was made stricter, as described here. All other rules remained the same.
The high quality samples are listed in hq_set.samples_list.txt.gz. The
rejected samples are listed in hq_set.removed_samples.tsv.gz.
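The rules used up to and including 2024-08 can be written as a single check. The Python sketch below is an illustration of those rules, not the code used to build the dataset. The function and parameter names are hypothetical, the thresholds are assumed to be inclusive, and the species calls are assumed to have already been made from the sylph output (one per run, with an empty list meaning no call).

```python
def passes_hq(species_calls, completeness, contamination,
              total_length, num_contigs, n50):
    """Apply the pre-2025-05 high quality rules to one sample.

    species_calls: sylph species calls for the sample, one per run.
    completeness, contamination: checkm2 values, in percent.
    total_length, n50: assembly statistics, in base pairs.
    """
    # Exactly one distinct species across all calls, and at least one call.
    one_species = len(set(species_calls)) == 1
    return (
        one_species
        and completeness >= 90
        and contamination <= 5
        and 100_000 <= total_length <= 15_000_000
        and num_contigs <= 2_000
        and n50 >= 2_000
    )


# Made-up values for illustration only.
print(passes_hq(["Escherichia coli"], 98.5, 1.2, 5_000_000, 120, 150_000))
print(passes_hq(["Escherichia coli", "Shigella flexneri"],
                98.5, 1.2, 5_000_000, 120, 150_000))
```

Keeping the species calls as an input means the same check could be reused with the stricter 2025-05 sylph parsing, since the other rules did not change.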