Readme

This dataset presents 5.12 million non-singleton structural clusters generated by our Foldseek-based clustering algorithm. The clusters are derived from the AFESM database, which integrates AlphaFold Database (AFDB) and ESMFold structure predictions. Comprehensive cluster data are provided on this site, including taxonomic and biome annotations, domain-level assignments, clustering metadata, and protein similarity information for downstream analysis.

Yeo J., Han Y., Bordin N., Lau A. M., Kandathil S. M., Kim H., Karin E. L., Mirdita M., Jones D. T., Orengo C., Steinegger M. Metagenomic-scale analysis of the predicted protein structure universe. bioRxiv doi: doi.org/NN.NN/2025.04.NN.NN (2025)

Updates

Website

For an interactive view of a subset of the data, please check out our website here.

Code Availability

The code used for our analysis is available on our GitHub page here.

Data description

1-AFESMClusters-entryId_repId_taxId.tsv.gz: AFESM clusters' information

  1. repId: Identifier of the representative
  2. memId: Identifier of the member (AFDB sequence)
  3. cluFlag: 1 = AFESM30, 2 = AFESM Foldseek clusters
  4. taxId: Taxonomy ID of the member
  5. biomeID: Biome ID of the member (see 4-biome-biomeID.tsv)

2-repID_isOnlyESM_nMem_nAllMem_repPlddt_avgPlddt_avgAllPlddt_repLen_avgLen_avgAllLen_LCAtaxID_nBiome_LCBID.tsv.gz: Cluster overview file containing metadata for Foldseek clusters.

  1. repID: ID of the representative protein in the Foldseek cluster
  2. isOnlyESM: 1 if the cluster consists only of ESM-predicted structures; 0 otherwise
  3. nMem: Number of non-fragment members in the Foldseek cluster
  4. nAllMem: Number of non-fragment members in the full cluster, including MMseqs2-level clustering
  5. repPlddt: pLDDT score of the representative structure
  6. avgPlddt: Average pLDDT score of the Foldseek cluster
  7. avgAllPlddt: Average pLDDT score of the full cluster
  8. repLen: Length of the representative sequence
  9. avgLen: Average sequence length of the Foldseek cluster
  10. avgAllLen: Average sequence length of the full cluster
  11. LCAtaxID: Taxonomy ID of the lowest common ancestor
  12. nBiome: Number of entries with biome annotations
  13. LCBID: Biome ID of the lowest common biome, LCB (see 4-biome-biomeID.tsv)
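
The column list above can be turned into a small reader. The following is a minimal sketch, not part of the released code: it assumes the file is tab-separated with columns in the order listed, and it does not know whether the real file has a header row, so a first row starting with "repID" is skipped. The helper names (open_tsv, read_overview, esm_only_clusters) and the sample rows are hypothetical.

```python
import csv
import gzip
import io

# Column order as listed in this README. Whether the real file carries a
# header row is an assumption, so a row starting with "repID" is skipped.
OVERVIEW_COLUMNS = [
    "repID", "isOnlyESM", "nMem", "nAllMem", "repPlddt", "avgPlddt",
    "avgAllPlddt", "repLen", "avgLen", "avgAllLen", "LCAtaxID", "nBiome",
    "LCBID",
]


def open_tsv(path):
    """Open a plain or gzip-compressed TSV file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")


def read_overview(handle):
    """Yield one dict per cluster from an open text handle of file 2."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < len(OVERVIEW_COLUMNS) or row[0] == OVERVIEW_COLUMNS[0]:
            continue  # skip short/blank lines and a header row, if present
        yield dict(zip(OVERVIEW_COLUMNS, row))


def esm_only_clusters(handle, min_members=2):
    """Return representative IDs of ESM-only clusters with >= min_members."""
    return [
        rec["repID"]
        for rec in read_overview(handle)
        if rec["isOnlyESM"] == "1" and int(rec["nMem"]) >= min_members
    ]


# Made-up rows for illustration only (not real data).
SAMPLE_OVERVIEW = (
    "repA\t1\t5\t9\t80.1\t75.0\t74.2\t120\t118.0\t117.5\t2\t3\t12\n"
    "repB\t0\t2\t4\t90.3\t88.0\t87.1\t300\t290.0\t295.5\t9606\t1\t7\n"
    "repC\t1\t2\t2\t70.5\t71.0\t71.0\t95\t96.0\t96.0\t1\t2\t3\n"
)
print(esm_only_clusters(io.StringIO(SAMPLE_OVERVIEW), min_members=3))
```

For the real file, the handle returned by open_tsv on the downloaded path can be passed to esm_only_clusters in place of the in-memory sample.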

3-repId_memIDcluFlag_taxId_biomeID.tsv.gz: Cluster information of 668 million AFESM sequences

  1. repId: Identifier of the representative
  2. memId: Identifier of the member (AFDB sequence)
  3. cluFlag: 1 = AFESM30, 2 = AFESM Foldseek clusters, 3 = removed (low pLDDT)
  4. taxId: Taxonomy ID of the member
  5. biomeID: Biome ID of the member (see 4-biome-biomeID.tsv)
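
With 668 million rows, this file is best processed as a stream rather than loaded into memory. The sketch below tallies members per cluFlag one row at a time. It assumes tab-separated columns in the order listed above and skips a header row if one exists; the function name count_by_clu_flag and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Flag meanings as documented in this README.
CLU_FLAG_LABELS = {
    "1": "AFESM30",
    "2": "AFESM Foldseek clusters",
    "3": "removed (low pLDDT)",
}


def count_by_clu_flag(handle):
    """Count members per cluFlag label while reading one row at a time."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 5 or row[0] == "repId":
            continue  # skip short/blank lines and a header row, if present
        counts[CLU_FLAG_LABELS.get(row[2], "unknown")] += 1
    return counts


# Made-up rows for illustration only (not real data).
SAMPLE_MEMBERS = (
    "repA\tmem1\t2\t562\t12\n"
    "repA\tmem2\t1\t562\t12\n"
    "repB\tmem3\t3\t9606\t7\n"
    "repB\tmem4\t2\t9606\t7\n"
)
for label, n in sorted(count_by_clu_flag(io.StringIO(SAMPLE_MEMBERS)).items()):
    print(label, n)
```

For the real file, a handle from gzip.open(path, "rt", newline="") can be passed in place of the in-memory sample.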

4-biome-biomeID.tsv: Biome information mapping

  1. biome: Unique biome path
  2. biomeID: Serial number of the biome path
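
The biomeID values used in the other files can be resolved to biome paths with a simple lookup table. This is a minimal sketch assuming two tab-separated columns in the order listed above (biome, then biomeID); the function name load_biome_map and the example biome paths are made up for illustration.

```python
import csv
import io


def load_biome_map(handle):
    """Return a dict mapping biomeID (as a string) to its biome path."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[1] == "biomeID":
            continue  # skip short/blank lines and a header row, if present
        mapping[row[1]] = row[0]
    return mapping


# Made-up biome paths for illustration only (not real data).
SAMPLE_BIOMES = (
    "root:Example:Aquatic\t1\n"
    "root:Example:Soil\t2\n"
)
biome_of = load_biome_map(io.StringIO(SAMPLE_BIOMES))
print(biome_of.get("2", "unknown biome"))
```

IDs are kept as strings so they can be compared directly with the biomeID and LCBID fields read from the other TSV files.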

5-domId_afesmRepId_boundary_length_plddt_cathLabel_cathLevel_cathMethod_isGlobular.tsv.gz: Domain annotation for representatives in AFESM clusters. We provide domain assignments for all non-singleton cluster representatives, including (1) ESM-only and (2) AF-including clusters. For (2), if the representative was from ESM, we promoted the longest AFDB member to ensure AF-based annotation.

  1. domId: Domain ID (repID_domainSerialNumber, N-to-C)
  2. afesmRepId: Representative ID of the AFESM cluster (original if replaced)
  3. boundary: Domain boundaries (start–end)
  4. length: Length of the domain (aa)
  5. plddt: Average pLDDT score of the domain
  6. cathLabel: Assigned CATH ID (e.g., 1.10.10.10)
  7. cathLevel: Assignment level - T (Topology) or H (Homologous superfamily)
  8. cathMethod: Assignment method - foldseek or foldclass
  9. isGlobular: 1 = globular (G), 0 = non-globular (NG)
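
The domId and boundary fields are composite strings that usually need to be split before use. The sketch below shows one way to do this under stated assumptions: the domain serial number is an integer after the last underscore of domId, and boundary is a single start-end segment written with either a hyphen or an en dash. If the real file encodes discontinuous domains differently, parse_boundary would need to be adapted. Both function names and the example IDs are hypothetical.

```python
import re

# Accepts "10-120" or the same range written with an en dash (U+2013).
_BOUNDARY_RE = re.compile(r"(\d+)\s*[-\u2013]\s*(\d+)")


def parse_dom_id(dom_id):
    """Split 'repID_domainSerialNumber' into (repID, serial number)."""
    # rsplit keeps any underscores that belong to the representative ID.
    rep_id, serial = dom_id.rsplit("_", 1)
    return rep_id, int(serial)


def parse_boundary(boundary):
    """Parse a single 'start-end' segment into a (start, end) tuple."""
    match = _BOUNDARY_RE.fullmatch(boundary.strip())
    if match is None:
        raise ValueError(f"unexpected boundary format: {boundary!r}")
    return int(match.group(1)), int(match.group(2))


# Made-up identifiers for illustration only (not real data).
print(parse_dom_id("exampleRep_01_2"))
print(parse_boundary("10-120"))
```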

6-cathID_presenceFlag_nNovel_nNonNovel_statTestMethod_pValue_adjPValue_log2Ratio_diffAbundanceFlag_nNovelPartners_cathName.tsv: Comparison of 2,906 CATH H-level categories between Novel and Non-novel multi-domain proteins (MDPs). Each row includes the identifier, occurrence statistics, statistical test results, and abundance and presence labels.

  1. cathID: CATH H-level identifier
  2. presenceFlag: Presence category; 0 = shared, 1 = unique to Novel, 2 = unique to Non-novel
  3. nNovel: Number of occurrences in Novel MDPs
  4. nNonNovel: Number of occurrences in Non-novel MDPs
  5. statTestMethod: Statistical test method; 0 = Chi-square test, 1 = Fisher exact test
  6. pValue: Raw p-value from statistical test
  7. adjPValue: Adjusted p-value (Benjamini–Hochberg correction)
  8. log2Ratio: log₂(Novel / Non-novel frequency ratio)
  9. diffAbundanceFlag: Differential abundance label; 0 = no significant difference, 1 = overrepresented, 2 = underrepresented
  10. nNovelPartners: Number of unique novel domain pairing partners
  11. cathName: Name of the H-level category (from CATH)
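
The numeric flags in this table are easier to work with once decoded. The following sketch selects rows whose adjusted p-value falls below a threshold and reports the decoded abundance label. It assumes tab-separated columns in the order listed above; the threshold alpha=0.05 is an illustrative choice and not a value stated in this README, and the function name significant_rows and the sample rows are hypothetical.

```python
import csv
import io

CATH_COLUMNS = [
    "cathID", "presenceFlag", "nNovel", "nNonNovel", "statTestMethod",
    "pValue", "adjPValue", "log2Ratio", "diffAbundanceFlag",
    "nNovelPartners", "cathName",
]

# Flag meanings as documented in this README.
ABUNDANCE_LABELS = {
    "0": "no significant difference",
    "1": "overrepresented",
    "2": "underrepresented",
}


def significant_rows(handle, alpha=0.05):
    """Return (cathID, abundance label) for rows with adjPValue <= alpha."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < len(CATH_COLUMNS) or row[0] == "cathID":
            continue  # skip short/blank lines and a header row, if present
        rec = dict(zip(CATH_COLUMNS, row))
        try:
            adj_p = float(rec["adjPValue"])
        except ValueError:
            continue  # skip rows whose adjusted p-value is not numeric
        if adj_p <= alpha:
            label = ABUNDANCE_LABELS.get(rec["diffAbundanceFlag"], "unknown")
            hits.append((rec["cathID"], label))
    return hits


# Made-up rows for illustration only (not real data).
SAMPLE_CATH = (
    "1.10.10.10\t0\t120\t300\t0\t0.0001\t0.002\t1.5\t1\t14\texample name A\n"
    "2.40.50.140\t0\t10\t12\t1\t0.4\t0.6\t0.1\t0\t2\texample name B\n"
    "3.40.50.300\t0\t5\t50\t1\t0.001\t0.01\t-3.2\t2\t0\texample name C\n"
)
for cath_id, label in significant_rows(io.StringIO(SAMPLE_CATH)):
    print(cath_id, label)
```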

License

All files are available under a Creative Commons Attribution 4.0 International License.