Great summary! Apart from James and Mark's caveats about experimental structures being biased toward low-energy conformations, there's also the basic fact that every "experimental" structure is itself a model, and that even the PDB contains models with errors: ill-defined or missing densities, extra (or unnecessary) water molecules, strained ligands, etc. Also, out of the 200K PDB structures, only about 20K have co-crystallized ligands. That's not a very small number, but I think it's a major reason why AF, while trained on the PDB and otherwise working well, works less well at predicting small-molecule interactions, which are still the bottleneck for structure-based design.
Very nice summary, thank you for the writeup!
A small nit on PropEn:
> Grouping datapoints is important because it expands the training set exponentially.
My understanding is that, in the best case, the dataset is expanded quadratically.
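To make the counting argument concrete, here is a minimal sketch, assuming the grouping step pairs each datapoint with nearby datapoints that have a better property value. The function names, the 1-D inputs, and the thresholds `dx` and `dy` are hypothetical and are not taken from the PropEn code. With n datapoints there are at most n(n-1)/2 such pairs, which is quadratic in n rather than exponential.

```python
def count_matched_pairs(xs, ys, dx, dy):
    """Count ordered pairs (i, j) where x_j lies within dx of x_i and
    y_j improves on y_i by more than 0 and at most dy.
    This matching criterion is an illustrative assumption."""
    n = len(xs)
    count = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            close = abs(xs[i] - xs[j]) <= dx
            improves = 0 < ys[j] - ys[i] <= dy
            if close and improves:
                count += 1
    return count


def max_pairs(n):
    """Upper bound on matched pairs: each unordered pair can match in at
    most one direction, because the improvement must be strictly positive."""
    return n * (n - 1) // 2


# Toy data: four points with strictly increasing property values.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0, 3.0]
inf = float("inf")

best_case = count_matched_pairs(xs, ys, inf, inf)   # no thresholds applied
restricted = count_matched_pairs(xs, ys, 1.0, inf)  # only adjacent x values match
```

With the thresholds removed, every pair matches and the count reaches the bound; tightening `dx` or `dy` only lowers it.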