AlphaFold Wiki

Hey there! This page collects useful links about AlphaFold, as well as HPC centers that provide it and notable GitHub issues.

Site Introduction
DeepMind AlphaFold Github official site
DeepMind AlphaFold colab official colab
ColabFold colab by Sergey Ovchinnikov, Milot Mirdita and Martin Steinegger
MoonBear Use AlphaFold 2 in your browser
AlphaFold Protein Structure Database by DeepMind and EMBL-EBI
MMseqs2 by Martin Steinegger
AlphaFold2 dissection guide ("Kaitai Shinsho") by Yoshitaka Moriwaki
AlphaFold2 IDR complex prediction by Balint Meszaros
simultaneous folding and docking protocol FoldDock by Patrick Bryant, Gabriele Pozzati and Arne Elofsson

HPC centers

Shanghai Jiao Tong University AlphaFold2 on π2.0
NIH AlphaFold2 on Biowulf
University of Lausanne AlphaFold2 on DCSR
SBGrid AlphaFold2 at Harvard Medical School
Deutsches Elektronen-Synchrotron DESY AlphaFold2 on Maxwell
Tokyo Institute of Technology AlphaFold2 on TSUBAME3.0
University of Florida AlphaFold2 on HiPerGator
University of Virginia AlphaFold2 on Rivanna
RCCS Okazaki National Institute AlphaFold2 on cclx
Czech National Grid Infrastructure AlphaFold2 on MetaCentrum
Kyoto University AlphaFold2 on SCL
Cornell University AlphaFold2 on BioHPC Cloud
The University of Texas at Austin AlphaFold2 on TACC
Georgia Advanced Computing Resource Center AlphaFold2 on Sapelo2
Northwestern University AlphaFold2 on Quest

Video

GitHub Issues

Issue 5: Database disk type

(Augustin-Zidek) The genetic search tools are very I/O-intensive, hence having an SSD helps.

For more details, see e.g. the HH-suite wiki, which discusses HHblits performance: https://github.com/soedinglab/hh-suite/wiki#running-hhblits-efficiently-on-a-computer-cluster

Issue 6: How long does it take on T1050 (779 residues)

(Augustin-Zidek) This is hard to answer without more context, especially without knowing the speed of your CPU and your hard drive (whether it is an SSD or an HDD).

But in general, you can expect the run time to grow with the length of the protein, and the MSA search can take up to a few hours with a slow disk/CPU.

For the actual folding (i.e. running the AlphaFold model), the disk speed no longer matters; what matters is whether you are using a GPU and how fast it is.

Issue 9: AlphaFold's speed

(tfgg) If you read the Nature paper, you'll see that AlphaFold 2 is more accurate, and the GPU times are in fact very fast: 0.6 minutes at 256 residues, 1.1 minutes at 384 residues, and 2.1 hours at 2,500 residues. These appear to be comparable to or faster than RoseTTAFold.

Issue 12: GPU required?

(tfgg) You can run without a GPU (with the --use_gpu=False flag) but it'll be much slower.

(tfgg) You can run outside Docker without a GPU by setting CUDA_VISIBLE_DEVICES to be empty (0 would correspond to the first GPU). If your machine doesn't have a GPU, this isn't necessary and it will run on the CPU.
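
For what it's worth, here is a minimal sketch of doing this from Python (my own illustration, not an official snippet; the same effect can be had by exporting the variable in your shell before launching run_alphafold.py):

import os

# Hide all GPUs from JAX so inference falls back to the CPU.
# This must be set before jax is imported.
os.environ['CUDA_VISIBLE_DEVICES'] = ''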

(huhlim) Regarding timing on CPU: it took much longer than predictions using GPUs. For a ~140-residue protein, it took ~25 minutes x 5 models for model building, plus ~25 minutes for input feature generation. I have two Intel Xeon Silver 4214 @ 2.2 GHz CPUs (24 threads each) on a node. I have no idea how many threads were used for the inference.

Issue 20: jaxlib version

(tfgg) We require jaxlib version 0.1.69 to be able to use CUDA unified memory for running long sequences. If you don't need this, you can probably run with 0.1.68, but that might be related to the illegal address error that you see.

Issue 53: GDT and lDDT Scores

(abridgland) There are a number of external tools available for computing these metrics. For GDT consider using LGA (http://proteinmodel.org/AS2TS/LGA/lga.html) and for lDDT consider the SWISS-MODEL server (https://swissmodel.expasy.org/lddt). These tools are reference implementations for their respective metrics. Please note that the lDDT scores will be computed on all atoms, not just C-alpha atoms. When we use the latter in our paper it is referred to as lDDT-Ca.

Issue 61: predicted TM-score (pTM)

(abridgland) The *_ptm models were fine-tuned from the non-pTM models (see section 1.9.7 in the supplementary information of our paper). This is why the outputs from these models do not match exactly.

We recommend running the non-pTM models for structure prediction because these were used in CASP14 and have been the most thoroughly validated. We think that the pTM models are very slightly worse than the regular models (around 0.5 GDT on CASP14). You can also run one of the pTM models separately in order to get predicted aligned error. This is the protocol we use in our Colab notebook (we choose model_2_ptm for predicted aligned errors).
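
As a rough illustration (paths and the feature dict are placeholders, and the exact RunModel API may differ between releases, so treat this as a sketch rather than the official recipe), pulling the predicted aligned error out of a single pTM model might look like this:

from alphafold.model import config, data, model

# Hedged sketch: load one pTM model and read its predicted aligned error.
# 'feature_dict' and '/path/to/params' are assumed to exist already.
model_name = 'model_2_ptm'
model_config = config.model_config(model_name)
model_params = data.get_model_haiku_params(model_name=model_name, data_dir='/path/to/params')
model_runner = model.RunModel(model_config, model_params)

processed = model_runner.process_features(feature_dict, random_seed=0)
prediction = model_runner.predict(processed)
pae = prediction['predicted_aligned_error']        # (num_res, num_res) matrix
max_pae = prediction['max_predicted_aligned_error']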

Issue 30&66: Distribution over multiple GPUs

(abridgland) This code is not designed to make use of more than one GPU.

(tfgg) We don't parallelize the model itself in JAX over multiple GPUs, but we do enable unified memory, which should at least allow host RAM to be used as well (CUDA's paging may be smarter than that).

Issue 74: Speed up prediction

(abridgland) See https://github.com/deepmind/alphafold#inferencing-many-proteins for advice on how to avoid re-compilations when inferencing many proteins if that is relevant to your use case.

In general, having more GPU memory won't help speed up model inference unless you're finding that the GPU memory is full (i.e. you're overflowing to host RAM). However, you may be able to improve speed a bit by using a larger subbatch size (alphafold/alphafold/model/config.py, line 325):

'subbatch_size': 4,

which trades memory for speed. Picking the right value will depend on your inputs and hardware.
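
If you build the model config in your own script, a sketch of overriding this value could look like the following (the attribute path is an assumption based on the current config.py layout):

from alphafold.model import config

# Hedged sketch: increase subbatch_size to trade memory for speed.
# The default is 4; larger values use more GPU memory per step.
model_config = config.model_config('model_1')
model_config.model.global_config.subbatch_size = 8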

Finally, depending on how much you care about accuracy, you could simply run just 1 or 2 models rather than all 5. I hope that helps!

Issue 67: compiling the AlphaFold model

(abridgland) Model compilation is performed by the CPU. Only model inference is performed on the GPU.

Issue 92: visualization in Colab

(ValZapod) That is using molstar/molstar#236

Issue 150: standard amino acids in FASTA sequences

(AnyaP) AlphaFold expects an input query sequence with capitalized one-letter amino-acid types from this set:

restypes = [ 'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V' ]

Just to clarify, the input can't be a Multiple Sequence Alignment, so it shouldn't include '-' or lowercase characters. In case there are any unknown amino-acid types (represented by capital letters not from the set above), they can be converted to 'X' (see here), but 'X' is not supported at the Amber relaxation stage.
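
As a small illustrative helper (my own sketch, not part of AlphaFold), mapping any non-standard capital letter to 'X' before submission could look like this:

# Hypothetical pre-processing helper: replace any letter outside the 20
# standard residue types with 'X'. Note that 'X' is not accepted by the
# Amber relaxation stage.
RESTYPES = set('ARNDCQEGHILKMFPSTWYV')

def to_standard(sequence: str) -> str:
    return ''.join(aa if aa in RESTYPES else 'X' for aa in sequence)

print(to_standard('MKVLUAZ'))  # -> 'MKVLXAX'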

Issue 123: speed up the HHblits

(andzajan) Increasing the memory for the job on the HPC from 32 GB to 64 GB sped up the HHblits step from 12 hours to about 2 hours for an 1,100-residue sequence. Once the database has been cached, it's even faster.

Some other recommendations are here: https://github.com/soedinglab/hh-suite/wiki#running-hhblits-efficiently-on-a-computer-cluster

HHblits has to do a lot of I/O operations, so storing the databases on a very fast SSD will help as well.

Issue 33&149: CUDA Unified Memory

(AnyaP) You could try increasing the value of the XLA_PYTHON_CLIENT_MEM_FRACTION flag, which is set to 4.0 by default (alphafold/docker/run_docker.py, line 185):

'XLA_PYTHON_CLIENT_MEM_FRACTION': '4.0',

Currently this allows allocating no more than 4*GPU_RAM in total, so you could be hitting this limit.

(tfgg) As far as I understand, the paging mechanism with unified memory will use the VRAM first, so there shouldn't be any performance hit from increasing the memory fraction.

(ashrafgt) Consider fiddling with these options to be able to run the prediction with limited resources:
https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html#gpu-memory-allocation
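
For reference, a sketch of the two relevant environment variables (mirroring what run_docker.py sets; the values here are just examples, and they must be set before JAX initialises):

import os

# Hedged sketch: enable CUDA unified memory and raise the memory fraction so
# allocations can spill over into host RAM. Set these before importing jax.
os.environ['TF_FORCE_UNIFIED_MEMORY'] = '1'
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '4.0'  # allow up to 4x GPU RAM in total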

Issue 136: change the number of recycling

(tfgg) You can change the number of recycling iterations in AlphaFold by changing these two configuration options (to the same new number):

config.data.common.num_recycle
config.model.num_recycle


This could be done, for example, in this function.
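
For instance, a sketch of setting both options to the same new value when constructing the model config (attribute paths as listed above; treat the exact code as an assumption):

from alphafold.model import config

# Hedged sketch: raise the number of recycling iterations from the default 3 to 6.
num_recycle = 6
model_config = config.model_config('model_1')
model_config.data.common.num_recycle = num_recycle
model_config.model.num_recycle = num_recycle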

However, I would caution that increasing the number of recycling iterations beyond the number used at training time ("hypercycling", as I like to call it) isn't officially tested or validated to actually provide higher accuracy on an appropriate test set. Anecdotally, there appear to be some successes, and the unofficial Colab notebook you mention does provide the ability to do so.

It might be the case that pLDDT is no longer an accurate estimate of model confidence when the model is hypercycled, but you can use your own judgement and knowledge of the target to decide when to trust the prediction. Good luck!

Issue 147: skip Amber relaxation

(tfgg) The easiest thing to do right now would be to comment out these lines:

https://github.com/deepmind/alphafold/blob/main/run_alphafold.py#L183-L193

and replace them with

relaxed_pdbs[model_name] = protein.to_pdb(unrelaxed_protein)

I haven't tested this, but it should write out the unrelaxed proteins at the end and skip the relax step.