Name | Uploaded | Size |
---|---|---|
v1/1-AFESMClusters-repId_memId_cluFlag_taxId_biomeId.tsv.gz | Wed, 04 Jun 2025 04:23:18 GMT | 4.4 GB |
v1/2-repID_isOnlyESM_nMem_nAllMem_repPlddt_avgPlddt_avgAllPlddt_repLen_avgLen_avgAllLen_LCAtaxID_nBiome_LCBID.tsv.gz | Wed, 04 Jun 2025 03:20:51 GMT | 118.7 MB |
v1/3-entryId_taxId.tsv.gz | Sun, 20 Apr 2025 15:44:25 GMT | 3.3 GB |
v1/3-repId_memId_cluFlag_taxId_biomeId.tsv.gz | Wed, 04 Jun 2025 04:24:38 GMT | 6.7 GB |
v1/4-biomeID_biome.tsv.gz | Wed, 23 Apr 2025 12:44:15 GMT | 5.0 kB |
v1/5-domId_afesmRepId_boundary_length_plddt_cathLabel_cathLevel_cathMethod_isGlobular.tsv.gz | Mon, 21 Apr 2025 03:44:25 GMT | 135.7 MB |
v1/6-cathID_presenceFlag_nNovel_nNonNovel_statTestMethod_pValue_adjPValue_log2Ratio_diffAbundanceFlag_nNovelPartners_cathName.tsv.gz | Wed, 23 Apr 2025 12:43:16 GMT | 81.6 kB |
v1/7-taxonomy.tar.gz | Thu, 01 May 2025 07:23:35 GMT | 57.2 MB |
This dataset presents 5.12 million non-singleton structural clusters generated by our Foldseek-based clustering algorithm. The clusters are derived from the AFESM database, which integrates AlphaFold Database (AFDB) and ESMFold structure predictions. This site provides the full cluster data for downstream analysis, including taxonomic and biome annotations, domain-level assignments, clustering metadata, and protein similarity information.
2025-06-04: Missing clusters were added to File no. 1. Entry IDs were trimmed in File no. 2. Some members had been placed in the wrong clusters in File no. 3; this was corrected.
2025-05-07: File no. 2 was updated.
For an interactive view of a subset of the data, please check out our website here.
The code used for our analysis is available on our GitHub page here.
1-AFESMClusters-repId_memId_cluFlag_taxId_biomeId.tsv.gz: Cluster information for the AFESM clusters
2-repID_isOnlyESM_nMem_nAllMem_repPlddt_avgPlddt_avgAllPlddt_repLen_avgLen_avgAllLen_LCAtaxID_nBiome_LCBID.tsv.gz: Cluster overview file containing metadata for Foldseek clusters.
3-repId_memId_cluFlag_taxId_biomeId.tsv.gz: Cluster information for 668 million AFESM sequences
4-biomeID_biome.tsv.gz: Mapping from biome IDs to biomes
5-domId_afesmRepId_boundary_length_plddt_cathLabel_cathLevel_cathMethod_isGlobular.tsv.gz: Domain annotation for representatives in AFESM clusters. We provide domain assignments for all non-singleton cluster representatives, including (1) ESM-only and (2) AF-including clusters. For (2), if the representative was from ESM, we promoted the longest AFDB member to ensure AF-based annotation.
6-cathID_presenceFlag_nNovel_nNonNovel_statTestMethod_pValue_adjPValue_log2Ratio_diffAbundanceFlag_nNovelPartners_cathName.tsv.gz: Comparison of 2,906 CATH H-level categories between Novel and Non-novel MDPs. Each row includes the identifier, occurrence statistics, statistical test results, and abundance and presence labels.
7-taxonomy.tar.gz: NCBI and GTDB taxdumps, plus a conversion table between NCBI and GTDB tax IDs
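
The file names above appear to encode the column layout: the part after the last hyphen lists the fields, separated by underscores. The following is a minimal sketch of streaming one of the `.tsv.gz` files row by row in Python. It assumes the files are tab-separated, have no header row, and store their columns in the order given by the file name; none of this is stated explicitly on this page, so check the first few rows before relying on it. The helper names `columns_from_filename` and `iter_rows` are hypothetical and not part of the released code.

```python
import csv
import gzip
from pathlib import Path


def columns_from_filename(filename):
    """Guess column names from a file name such as '4-biomeID_biome.tsv.gz'.

    Assumption: the text after the last '-' and before the first '.'
    is an underscore-separated list of column names.
    """
    name = Path(filename).name
    stem = name.split(".", 1)[0]      # drop the '.tsv.gz' suffix
    fields = stem.split("-")[-1]      # drop the numeric prefix and any label
    return fields.split("_")


def iter_rows(path, columns=None):
    """Yield one dict per line without loading the whole file into memory.

    Assumption: the file is gzip-compressed, tab-separated, and headerless.
    """
    if columns is None:
        columns = columns_from_filename(str(path))
    with gzip.open(path, "rt", newline="") as handle:
        # QUOTE_NONE keeps any quote characters in the data as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield dict(zip(columns, row))
```

Because the larger files are several gigabytes when compressed, iterating with `for row in iter_rows(path)` and filtering as you go is likely to be more practical than reading a whole file into memory at once.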
All files are available under a Creative Commons Attribution 4.0 International License.