Name | Uploaded | Size |
---|---|---|
v1/1-AFESMClusters-repId_memId_cluFlag_taxId_biomeId.tsv.gz | Fri, 25 Apr 2025 00:21:51 GMT | 3.5 GB |
v1/2-repID_isOnlyESM_nMem_nAllMem_repPlddt_avgPlddt_avgAllPlddt_repLen_avgLen_avgAllLen_LCAtaxID_nBiome_LCBID.tsv.gz | Fri, 25 Apr 2025 04:57:25 GMT | 110.4 MB |
v1/3-entryId_taxId.tsv.gz | Sun, 20 Apr 2025 15:44:25 GMT | 3.3 GB |
v1/3-repId_memId_cluFlag_taxId_biomeId.tsv.gz | Wed, 23 Apr 2025 12:05:19 GMT | 7.1 GB |
v1/4-biomeID_biome.tsv.gz | Wed, 23 Apr 2025 12:44:15 GMT | 5.0 kB |
v1/5-domId_afesmRepId_boundary_length_plddt_cathLabel_cathLevel_cathMethod_isGlobular.tsv.gz | Mon, 21 Apr 2025 03:44:25 GMT | 135.7 MB |
v1/6-cathID_presenceFlag_nNovel_nNonNovel_statTestMethod_pValue_adjPValue_log2Ratio_diffAbundanceFlag_nNovelPartners_cathName.tsv.gz | Wed, 23 Apr 2025 12:43:16 GMT | 81.6 kB |
This dataset presents 5.12 million non-singleton structural clusters generated by our Foldseek-based clustering algorithm. The clusters are derived from the AFESM database, which integrates AlphaFold Database (AFDB) and ESMFold structure predictions. Comprehensive cluster data is provided on this site, including taxonomic and biome annotations, domain-level assignments, clustering metadata, biome context, and protein similarity information for downstream analysis.
For an interactive view of a subset of the data, please visit our website here.
The code used for our analysis is available on our GitHub page here.
1-AFESMClusters-repId_memId_cluFlag_taxId_biomeId.tsv.gz: Information on AFESM clusters
2-repID_isOnlyESM_nMem_nAllMem_repPlddt_avgPlddt_avgAllPlddt_repLen_avgLen_avgAllLen_LCAtaxID_nBiome_LCBID.tsv.gz: Cluster overview file containing metadata for Foldseek clusters.
3-repId_memId_cluFlag_taxId_biomeId.tsv.gz: Cluster information for 668 million AFESM sequences
4-biomeID_biome.tsv.gz: Mapping from biome ID to biome
5-domId_afesmRepId_boundary_length_plddt_cathLabel_cathLevel_cathMethod_isGlobular.tsv.gz: Domain annotation for representatives in AFESM clusters. We provide domain assignments for all non-singleton cluster representatives, including (1) ESM-only and (2) AF-including clusters. For (2), if the representative was from ESM, we promoted the longest AFDB member to ensure AF-based annotation.
6-cathID_presenceFlag_nNovel_nNonNovel_statTestMethod_pValue_adjPValue_log2Ratio_diffAbundanceFlag_nNovelPartners_cathName.tsv.gz: Comparison of 2,906 CATH H-level categories between Novel and Non-novel MDPs. Each row includes the identifier, occurrence statistics, statistical test results, and abundance and presence labels.
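The file names above appear to encode each file's column layout. The following is a minimal Python sketch for streaming one of the gzipped TSV files with column names taken from its file name. It rests on assumptions not stated on this page: that the underscore-separated tokens in the file name are the column names in order, that the files are tab-separated, and that they have no header row. The helper names `columns_from_filename` and `iter_rows` are hypothetical and are not part of the published analysis code.

```python
import csv
import gzip
import itertools
import re
from pathlib import PurePosixPath


def columns_from_filename(name):
    """Derive column names from a file name such as '4-biomeID_biome.tsv.gz'.

    Assumption: the underscore-separated tokens after the numeric prefix
    (and after any descriptive label such as 'AFESMClusters-') name the
    columns in order. Check this against the first few rows of each file.
    """
    stem = PurePosixPath(name).name
    stem = re.sub(r"\.tsv(\.gz)?$", "", stem)  # drop the extension
    stem = re.sub(r"^\d+-", "", stem)          # drop the numeric prefix, e.g. '4-'
    stem = stem.rsplit("-", 1)[-1]             # drop a label such as 'AFESMClusters-'
    return stem.split("_")


def iter_rows(path, limit=None):
    """Stream rows of a gzipped TSV as dicts without loading the whole file.

    Assumption: the file has no header row. If it does, skip the first row.
    """
    columns = columns_from_filename(str(path))
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in itertools.islice(reader, limit):
            yield dict(zip(columns, row))


if __name__ == "__main__":
    # Peek at the first five rows of the small biome mapping file,
    # assuming it has been downloaded to the working directory.
    for record in iter_rows("4-biomeID_biome.tsv.gz", limit=5):
        print(record)
```

Streaming row by row keeps memory use low, which matters for the multi-gigabyte files in the listing. Starting with the small biome mapping file is a quick way to check the column-name assumption before processing the larger files.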
All files are available under a Creative Commons Attribution 4.0 International License.