Documentation
Full documentation is on readthedocs.
Some highlighted features are also documented here.
Searching and querying the assemblies
The ATB dataset has been indexed with LexicMap, which supports searching and alignment against the full dataset (i.e. a BLAST-like search) using sequences longer than 100 bp.
Local search with pre-built index
This option avoids the high-RAM indexing step, but it does mean a very large (5.24 TiB) download.
Install awscli via conda.
conda install -c conda-forge awscli
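If conda is not available, awscli can also be installed with pip (assuming a working Python environment; this is an alternative route, not the one used in the rest of this section):
# Alternative install via pip
pip install awscli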
Test access.
aws s3 ls s3://allthebacteria-lexicmap/202408/ --no-sign-request
# output
PRE genomes/
PRE seeds/
2025-04-08 16:39:17 62488 genomes.chunks.bin
2025-04-08 16:39:17 54209660 genomes.map.bin
2025-04-08 22:32:35 619 info.toml
2025-04-08 22:32:36 160032 masks.bin
Download the index (it’s 5.24 TiB, so make sure you have enough disk space).
aws s3 cp s3://allthebacteria-lexicmap/202408/ atb.lmi --recursive --no-sign-request
# dirsize atb.lmi
atb.lmi: 5.24 TiB (5,758,875,365,595)
2.87 TiB seeds
2.37 TiB genomes
51.70 MiB genomes.map.bin
156.28 KiB masks.bin
61.02 KiB genomes.chunks.bin
619 B info.toml
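Because the transfer is so large, it may be interrupted; aws s3 sync can resume it, re-copying only files that are missing locally or differ in size:
# Resume an interrupted download of the index
aws s3 sync s3://allthebacteria-lexicmap/202408/ atb.lmi --no-sign-request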
Install and search with LexicMap.
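As a minimal sketch of this step (assuming LexicMap is available from the bioconda channel, and using query.fasta as a placeholder for your own query file of sequences longer than 100 bp):
# Install LexicMap (a pre-built binary release also works; see the EC2 section below)
conda install -c bioconda lexicmap
# Search the downloaded index; the output is a gzipped TSV
lexicmap search -d atb.lmi query.fasta -o query.lexicmap.txt.gz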
Building an index and searching locally
This option requires a comparatively small (80 GB) download, but indexing the assemblies took 150 GB of RAM and 47 hours with 48 CPUs.
Make sure you have enough disk space: at least 8 TB, with more than 10 TB preferred.
Tools:
- LexicMap
- rush (https://github.com/shenwei356/rush), for running jobs in parallel
Steps:
- Downloading the list file of all assemblies in the latest version (v0.2 plus incremental versions).
mkdir -p atb;
cd atb;
# Attention, the URL might change,
# please check in the browser: https://osf.io/zxfmy/files/osfstorage
wget https://osf.io/download/4yv85/ -O file_list.all.latest.tsv.gz
If you only need to add assemblies from an incremental version, manually download the corresponding file list from the same OSF page.
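To sanity-check the downloaded list, peek at the first rows; the tarball-name and URL columns assumed here are the same ones the awk command in the next step extracts:
# Inspect the header line and the first two records
zcat file_list.all.latest.tsv.gz | head -n 3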
- Downloading assembly tarball files.
# Tarball file names and their URLs
zcat file_list.all.latest.tsv.gz | awk 'NR>1 {print $3"\t"$4}' | uniq > tar2url.tsv
# Download. If it's interrupted, just rerun the same command.
cat tar2url.tsv | rush --eta -j 2 -c -C download.rush 'wget -O {1} {2}'
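Optionally, verify the tarballs before decompressing them; xz -t tests the integrity of an .xz stream, and the rush pattern mirrors the download step above:
# Test every tarball and report any corrupt files
ls *.tar.xz | rush --eta 'xz -t {} || echo "corrupt: {}"'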
- Decompressing all tarballs. The decompressed genomes are stored in plain text, so we use gzip (which can be replaced with the faster pigz) to compress them to save disk space.
# {^.tar.xz} removes the suffix ".tar.xz"
ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^.tar.xz}/*.fa'
cd ..
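If pigz is installed, a drop-in variant of the decompression command above (run from inside atb/, as before) uses it for parallel compression; only the compressor changes:
# Same as above, but compress with pigz (parallel gzip)
ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; pigz -f {^.tar.xz}/*.fa'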
After that, the assemblies directory will contain multiple subdirectories. When you give the directory to lexicmap index -I, it can recursively scan for genome files (plain or gz/xz/zstd-compressed). You can also give a file list with selected assemblies.
$ tree atb | more
atb
├── atb.assembly.r0.2.batch.1
│ ├── SAMD00013333.fa.gz
│ ├── SAMD00049594.fa.gz
│ ├── SAMD00195911.fa.gz
│ ├── SAMD00195914.fa.gz
- Prepare a file list of assemblies.
Just use find or fd (much faster).
# find
find atb/ -name "*.fa.gz" > files.txt
# fd
fd .fa.gz$ atb/ > files.txt
What it looks like:
$ head -n 2 files.txt
atb/atb.assembly.r0.2.batch.1/SAMD00013333.fa.gz
atb/atb.assembly.r0.2.batch.1/SAMD00049594.fa.gz
(Optional) Only keep high-quality assemblies. Download the hq_set.sample_list.txt.gz file (linked from the ATB documentation), then filter the file list:
find atb/ -name "*.fa.gz" | grep -w -f <(zcat hq_set.sample_list.txt.gz) > files.txt
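As a quick sanity check before indexing, count the entries in the final list; for the full 202408 release this should be on the order of 2.44 million (fewer if the high-quality filter above was applied):
# Count the assemblies to be indexed
wc -l < files.txt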
- Creating a LexicMap index (tutorials). It took 47h40m and 145 GB of RAM with 48 CPUs for 2.44 million ATB genomes.
lexicmap index -S -X files.txt -O atb.lmi -b 25000 --log atb.lmi.log
# dirsize atb.lmi
atb.lmi: 5.24 TiB (5,758,875,365,595)
2.87 TiB seeds
2.37 TiB genomes
51.70 MiB genomes.map.bin
156.28 KiB masks.bin
61.02 KiB genomes.chunks.bin
619 B info.toml
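After indexing, the info.toml file at the top of the output directory records the index metadata; printing it is a quick way to confirm the build completed (file names as in the listing above):
# Show the index metadata
cat atb.lmi/info.toml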
- Searching with LexicMap (tutorial).
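A minimal search against the local index looks like the following (query.fasta is a placeholder for your own FASTA file of sequences longer than 100 bp):
# Align queries against all indexed genomes; the output is a gzipped TSV
lexicmap search -d atb.lmi query.fasta -o query.lexicmap.txt.gz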
Searching on AWS EC2
- Launch an EC2 instance in Europe London region (eu-west-2) where the index is located.
OS: Amazon Linux 2023 64-bit (Arm)
Instance type (you might need to request a vCPU quota increase):
- c7g.8xlarge (32 vCPU, 64 GiB memory, 15 Gigabit, 1.3738 USD per hour)
- c6gn.12xlarge (48 vCPU, 96 GiB memory, 75 Gigabit, 2.46 USD per hour) (recommended)
Storage: 20 GiB General purpose (gp3), only for storing queries and results.
Mount the LexicMap index with mount-s3 (it’s fast but still slower than local disks).
# Install mount-s3. You might need to replace arm64 with x86_64 for other architectures
wget https://s3.amazonaws.com/mountpoint-s3-release/latest/arm64/mount-s3.rpm
sudo yum install -y ./mount-s3.rpm
rm ./mount-s3.rpm
# Mount the v0.2+202408 index
mkdir -p atb.lmi
UNSTABLE_MOUNTPOINT_MAX_PREFETCH_WINDOW_SIZE=65536 \
mount-s3 --read-only --prefix 202408/ allthebacteria-lexicmap atb.lmi --no-sign-request
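Verify the mount by listing the mount point; the top-level entries should match the S3 listing shown earlier:
# Should show genomes/, seeds/, genomes.chunks.bin, genomes.map.bin, info.toml and masks.bin
ls atb.lmi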
- Install LexicMap.
# Check the latest version here: https://github.com/shenwei356/LexicMap/releases
# You can also check the pre-release here: https://github.com/shenwei356/LexicMap/issues/10
# The binary to download depends on the CPU architecture: amd64 or arm64
wget https://github.com/shenwei356/LexicMap/releases/download/v0.7.0/lexicmap_linux_arm64.tar.gz
mkdir -p bin
tar -zxvf lexicmap_linux_arm64.tar.gz -C bin
rm lexicmap_linux_arm64.tar.gz
# Put the binary on the PATH so it can be invoked as `lexicmap` below
export PATH="$PWD/bin:$PATH"
# Here, we use a demo query
wget https://github.com/shenwei356/LexicMap/raw/refs/heads/main/demo/bench/b.gene_E_faecalis_SecY.fasta
- Run LexicMap (tutorial).
# Create and enter a screen session
screen -S lexicmap
# Run
# Searching with b.gene_E_faecalis_SecY.fasta takes 20 minutes with c7g.8xlarge, 12.5 minutes with c6gn.12xlarge
# Searching with b.gene_E_coli_16S.fasta takes 1h54m with c6gn.12xlarge.
lexicmap search -d atb.lmi b.gene_E_faecalis_SecY.fasta -o query.lexicmap.txt.gz --debug
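The result file is a gzipped TSV; the first rows can be previewed directly (see the LexicMap documentation for the column definitions):
# Preview the first few alignment records
zcat query.lexicmap.txt.gz | head -n 5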
- Unmount the index.
sudo umount atb.lmi