AFESM Cluster

Name	Uploaded	Size
v1/1-AFESMClusters-repId_memId_cluFlag_taxId_biomeId.tsv.gz	Wed, 04 Jun 2025 04:23:18 GMT	4.4 GB
v1/2-repID_isOnlyESM_nMem_nAllMem_repPlddt_avgPlddt_avgAllPlddt_repLen_avgLen_avgAllLen_LCAtaxID_nBiome_LCBID.tsv.gz	Wed, 04 Jun 2025 03:20:51 GMT	118.7 MB
v1/3-entryId_taxId.tsv.gz	Sun, 20 Apr 2025 15:44:25 GMT	3.3 GB
v1/3-repId_memId_cluFlag_taxId_biomeId.tsv.gz	Wed, 04 Jun 2025 04:24:38 GMT	6.7 GB
v1/4-biomeID_biome.tsv.gz	Wed, 23 Apr 2025 12:44:15 GMT	5.0 kB
v1/5-domId_afesmRepId_boundary_length_plddt_cathLabel_cathLevel_cathMethod_isGlobular.tsv.gz	Mon, 21 Apr 2025 03:44:25 GMT	135.7 MB
v1/6-cathID_presenceFlag_nNovel_nNonNovel_statTestMethod_pValue_adjPValue_log2Ratio_diffAbundanceFlag_nNovelPartners_cathName.tsv.gz	Wed, 23 Apr 2025 12:43:16 GMT	81.6 kB
v1/7-taxonomy.tar.gz	Thu, 01 May 2025 07:23:35 GMT	57.2 MB

Readme

This dataset presents 5.12 million non-singleton structural clusters generated by our Foldseek-based clustering algorithm. The clusters are derived from the AFESM database, which integrates AlphaFold Database (AFDB) and ESMFold structure predictions. Comprehensive cluster data is provided on this site, including taxonomic and biome annotations, domain-level assignments, clustering metadata, biome context, and protein similarity information for downstream analysis.

Yeo J., Han Y., Bordin N., Lau A. M., Kandathil S. M., Kim H., Karin E. L., Mirdita M., Jones D. T., Orengo C., Steinegger M. Metagenomic-scale analysis of the predicted protein structure universe. bioRxiv doi: doi.org/10.1101/2025.04.23.650224 (2025)

Updates

2025-06-04: Missing clusters were added to File no. 1. The entry IDs in File no. 2 were trimmed. Members that had been included in the wrong clusters in File no. 3 were fixed.
2025-05-07: File no. 2 was updated.

Website

For an interactive view of a subset of the data, please check out our website here.

Code Availability

The code used for our analysis is available on our GitHub page here.

Data description

1-AFESMClusters-repId_memId_cluFlag_taxId_biomeId.tsv.gz: AFESM cluster information

repId: Identifier of the representative
memId: Identifier of the member (AFDB sequence)
cluFlag: 1 = AFESM30, 2 = AFESM Foldseek clusters
taxId: Taxonomy ID of the member
biomeId: Biome ID of the member (see 4-biomeID_biome.tsv.gz)
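
The column layout above can be read with a few lines of Python. The following is a minimal sketch, not part of the released code: it assumes the file is tab-separated with exactly the five columns named in the file name, and it skips any line whose third field is not an integer (for example a header row, if one exists). The names ClusterRow, parse_cluster_rows, count_members and count_members_in_file are hypothetical helpers introduced here for illustration.

```python
import gzip
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class ClusterRow(NamedTuple):
    rep_id: str
    mem_id: str
    clu_flag: int
    tax_id: str
    biome_id: str  # may be empty if the member has no biome annotation (assumption)


def parse_cluster_rows(lines: Iterable[str]) -> Iterator[ClusterRow]:
    """Yield one ClusterRow per well-formed, tab-separated line."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 5:
            continue  # blank or malformed line
        rep_id, mem_id, clu_flag, tax_id, biome_id = fields
        if not clu_flag.isdigit():
            continue  # tolerate a header line, if the file has one
        yield ClusterRow(rep_id, mem_id, int(clu_flag), tax_id, biome_id)


def count_members(rows: Iterable[ClusterRow]) -> Counter:
    """Count how many member rows are listed under each representative."""
    return Counter(row.rep_id for row in rows)


def count_members_in_file(path: str) -> Counter:
    """Stream the gzipped file so the full table never has to fit in memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_members(parse_cluster_rows(handle))
```

Because the file is several gigabytes compressed, the sketch streams it line by line instead of loading it into a data frame.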

2-repID_isOnlyESM_nMem_nAllMem_repPlddt_avgPlddt_avgAllPlddt_repLen_avgLen_avgAllLen_LCAtaxID_nBiome_LCBID.tsv.gz: Cluster overview file containing metadata for Foldseek clusters.

repID: ID of the representative protein in the Foldseek cluster
isOnlyESM: 1 if the cluster consists only of ESM-predicted structures; 0 otherwise
nMem: Number of non-fragment members in the Foldseek cluster
nAllMem: Number of non-fragment members in the full cluster, including MMseqs2-level clustering
repPlddt: pLDDT score of the representative structure
avgPlddt: Average pLDDT score of the Foldseek cluster
avgAllPlddt: Average pLDDT score of the full cluster
repLen: Length of the representative sequence
avgLen: Average sequence length of the Foldseek cluster
avgAllLen: Average sequence length of the full cluster
LCAtaxID: Taxonomy ID of the lowest common ancestor
nBiome: Number of entries with biome annotations
LCBID: Biome ID of the LCB (see 4-biomeID_biome.tsv.gz)
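
As an illustration of how the overview columns can be combined, the sketch below selects ESM-only clusters whose average pLDDT exceeds a threshold. It assumes the thirteen columns appear in the order given in the file name, that numeric fields parse as plain numbers, and that pLDDT is on a 0 to 100 scale; the default threshold of 70 and the helper names are hypothetical choices, not values taken from the dataset.

```python
from typing import Dict, Iterable, Iterator, List

# Column order taken from the file name of File no. 2.
OVERVIEW_COLUMNS = [
    "repID", "isOnlyESM", "nMem", "nAllMem", "repPlddt", "avgPlddt",
    "avgAllPlddt", "repLen", "avgLen", "avgAllLen", "LCAtaxID", "nBiome", "LCBID",
]


def parse_overview(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield each data row as a dict keyed by column name."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != len(OVERVIEW_COLUMNS) or fields[0] == "repID":
            continue  # skip malformed lines and a header row, if present
        yield dict(zip(OVERVIEW_COLUMNS, fields))


def esm_only_confident(rows: Iterable[Dict[str, str]],
                       min_avg_plddt: float = 70.0) -> List[str]:
    """Return repIDs of ESM-only clusters with avgPlddt >= min_avg_plddt."""
    return [
        row["repID"]
        for row in rows
        if row["isOnlyESM"] == "1" and float(row["avgPlddt"]) >= min_avg_plddt
    ]
```

To run it on the real file, pass the handle returned by gzip.open(path, "rt") to parse_overview.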

3-repId_memId_cluFlag_taxId_biomeId.tsv.gz: Cluster information of 668 million AFESM sequences

repId: Identifier of the representative
memId: Identifier of the member (AFDB sequence)
cluFlag: 1 = AFESM30, 2 = AFESM Foldseek clusters, 3 = removed (low pLDDT)
taxId: Taxonomy ID of the member
biomeId: Biome ID of the member (see 4-biomeID_biome.tsv.gz)
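
The three cluFlag values can be tallied in a single pass. This is a small sketch that assumes cluFlag is the third tab-separated field, as the file name suggests; any other value in that position (including a header label, if the file has one) is counted as "unknown". The names CLU_FLAG_LABELS and tally_flags are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Labels follow the cluFlag description for File no. 3.
CLU_FLAG_LABELS = {
    "1": "AFESM30",
    "2": "AFESM Foldseek clusters",
    "3": "removed (low pLDDT)",
}


def tally_flags(lines: Iterable[str]) -> Counter:
    """Count rows per cluFlag label, reading the third column of each line."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # blank or truncated line
        counts[CLU_FLAG_LABELS.get(fields[2], "unknown")] += 1
    return counts
```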

4-biomeID_biome.tsv.gz: Biome information mapping

biome: Unique biome path
biomeID: Serial number of the biome path
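
The mapping is small enough to hold in a dictionary and use to resolve the biome IDs found in the other files. The sketch below assumes two tab-separated columns; because the column order is not stated explicitly, it treats whichever field is purely numeric as the biomeID. The function name load_biome_map is hypothetical, as are the biome paths in the usage example.

```python
from typing import Dict, Iterable


def load_biome_map(lines: Iterable[str]) -> Dict[str, str]:
    """Build a biomeID -> biome path dictionary from two-column lines."""
    biome_by_id: Dict[str, str] = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 2:
            continue  # blank or malformed line
        first, second = fields
        if first.isdigit():
            biome_by_id[first] = second
        elif second.isdigit():
            biome_by_id[second] = first
        # neither field numeric: probably a header row, so skip it
    return biome_by_id


# Usage with made-up rows (the real file would be opened with gzip.open(path, "rt")).
example = load_biome_map(["1\troot:Host-associated\n", "2\troot:Environmental\n"])
print(example.get("2"))
```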

5-domId_afesmRepId_boundary_length_plddt_cathLabel_cathLevel_cathMethod_isGlobular.tsv.gz: Domain annotation for representatives in AFESM clusters. We provide domain assignments for all non-singleton cluster representatives, including (1) ESM-only and (2) AF-including clusters. For (2), if the representative was from ESM, we promoted the longest AFDB member to ensure AF-based annotation.

domId: Domain ID (repID_domainSerialNumber, N-to-C)
afesmRepId: Representative ID of the AFESM cluster (original if replaced)
boundary: Domain boundaries (start–end)
length: Length of the domain (aa)
plddt: Average pLDDT score of the domain
cathLabel: Assigned CATH ID (e.g., 1.10.10.10)
cathLevel: Assignment level - T (Topology) or H (Homologous superfamily)
cathMethod: Assignment method - foldseek or foldclass
isGlobular: 1 = globular (G), 0 = non-globular (NG)
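
Two of these fields need a little parsing before use. The sketch below splits domId at its last underscore, since the description gives it as repID_domainSerialNumber and the repID itself may contain underscores. It also parses boundary under the assumption that it holds one or more start-end ranges with inclusive coordinates, written with a hyphen or an en dash; the exact on-disk format, in particular how discontinuous domains are written, is an assumption. All function names and the example ID are hypothetical.

```python
import re
from typing import List, Tuple

# One "start-end" range; accepts an ASCII hyphen or an en dash between the numbers.
_SEGMENT = re.compile(r"(\d+)\s*[-\u2013]\s*(\d+)")


def split_dom_id(dom_id: str) -> Tuple[str, int]:
    """Split 'repID_serial' at the last underscore."""
    rep_id, serial = dom_id.rsplit("_", 1)
    return rep_id, int(serial)


def parse_boundary(boundary: str) -> List[Tuple[int, int]]:
    """Return every (start, end) range found in the boundary string."""
    return [(int(start), int(end)) for start, end in _SEGMENT.findall(boundary)]


def covered_length(segments: List[Tuple[int, int]]) -> int:
    """Number of residues covered, assuming inclusive coordinates."""
    return sum(end - start + 1 for start, end in segments)


print(split_dom_id("EXAMPLE_REP_2"))           # ('EXAMPLE_REP', 2)
print(covered_length(parse_boundary("10-120")))  # 111
```

Comparing covered_length against the length column is a quick way to check whether the inclusive-coordinate assumption holds for the real data.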

6-cathID_presenceFlag_nNovel_nNonNovel_statTestMethod_pValue_adjPValue_log2Ratio_diffAbundanceFlag_nNovelPartners_cathName.tsv.gz: Comparison of 2,906 CATH H-level categories between Novel and Non-novel MDPs. Each row includes the identifier, occurrence statistics, statistical test results, and abundance and presence labels.

cathID: CATH H-level identifier
presenceFlag: Presence category; 0 = shared, 1 = unique to Novel, 2 = unique to Non-novel
nNovel: Number of occurrences in Novel MDPs
nNonNovel: Number of occurrences in Non-novel MDPs
statTestMethod: Statistical test method; 0 = Chi-square test, 1 = Fisher exact test
pValue: Raw p-value from statistical test
adjPValue: Adjusted p-value (Benjamini–Hochberg correction)
log2Ratio: log₂(Novel / Non-novel frequency ratio)
diffAbundanceFlag: Differential abundance label; 0 = no significant difference, 1 = overrepresented, 2 = underrepresented
nNovelPartners: Number of unique novel domain pairing partners
cathName: Name of the H-level category (from CATH)
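
The integer codes in this table are easier to work with once they are translated into labels. The sketch below is one way to do that, assuming the eleven columns appear in the order given in the file name; the label dictionaries copy the code descriptions above, and the helper names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

CATH_COLUMNS = [
    "cathID", "presenceFlag", "nNovel", "nNonNovel", "statTestMethod", "pValue",
    "adjPValue", "log2Ratio", "diffAbundanceFlag", "nNovelPartners", "cathName",
]
PRESENCE = {"0": "shared", "1": "unique to Novel", "2": "unique to Non-novel"}
STAT_TEST = {"0": "Chi-square test", "1": "Fisher exact test"}
DIFF_ABUNDANCE = {
    "0": "no significant difference",
    "1": "overrepresented",
    "2": "underrepresented",
}


def parse_cath_rows(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield each data row with its three coded columns replaced by labels."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != len(CATH_COLUMNS) or fields[0] == "cathID":
            continue  # skip malformed lines and a header row, if present
        row = dict(zip(CATH_COLUMNS, fields))
        # Unrecognised codes are kept unchanged rather than dropped.
        row["presenceFlag"] = PRESENCE.get(row["presenceFlag"], row["presenceFlag"])
        row["statTestMethod"] = STAT_TEST.get(row["statTestMethod"], row["statTestMethod"])
        row["diffAbundanceFlag"] = DIFF_ABUNDANCE.get(
            row["diffAbundanceFlag"], row["diffAbundanceFlag"]
        )
        yield row


def overrepresented_ids(rows: Iterable[Dict[str, str]]) -> List[str]:
    """Return cathIDs labelled as overrepresented."""
    return [row["cathID"] for row in rows if row["diffAbundanceFlag"] == "overrepresented"]
```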

7-taxonomy.tar.gz: NCBI and GTDB taxdumps, plus the conversion table between NCBI and GTDB taxonomy IDs

License

All files are available under a Creative Commons Attribution 4.0 International License.