PDB-IHM

System for Archiving Integrative Structures

User guide

Understanding the PDB-IHM Validation Report
1. Overview
- 1.1. Summary
- 1.2. Overall Quality Assessment
2. Model Details
3. Data Quality Assessment
4. Model Quality Assessment
5. Fit to Data Used for Modeling Assessment
6. Fit to Data Used for Validation Assessment
Understanding the PDB-IHM Summary Table
References

Understanding the PDB-IHM Validation Report

This validation report was created based on the guidelines and recommendations from the IHM Task Force (Berman et al. 2019). The current version of the PDB-IHM validation report consists of the following categories:

1. Overview: This section provides a succinct "executive" summary of the entry's content and key quality indicators. If there should be serious issues with a structure, this would usually be evident from this summary.

2. Model Details: This section outlines model details and includes information on ensembles deposited, chains and residues of domains, model representation, software, protocol, and methods used. All deposited structures have this section.

3. Data Quality Assessment: Data quality assessments are available for Small Angle Scattering (SAS), Chemical Crosslinking Mass Spectrometry (crosslinking-MS), and 3D Electron Microscopy (3DEM) datasets and are based on the guidelines published by the wwPDB SAS validation task force (Trewhella et al., 2017) and by the crosslinking-MS (Leitner et al., 2020) and 3DEM (Kleywegt et al., 2024) communities.

4. Model Quality Assessment: Model quality for models at atomic resolution is assessed using MolProbity (Williams et al., 2018), consistent with wwPDB. Model quality for coarse-grained or multi-resolution structures is assessed by computing excluded volume satisfaction based on reported distances and sizes of beads in the structures. Model precision, defined as the variability among the models that satisfy the input data and calculated as the density-weighted root mean-square fluctuation (RMSF) from the bead/atom center of density, is annotated and visualized using PrISM (Ullanat et al., 2022). Only coarse-grained beads (or CA atoms for atomic models) of deposited models are used for precision assessment and visualization.

5. Fit to Data Used for Modeling Assessment: Fit to data used to build the model is available for SAS, crosslinking-MS, and 3DEM datasets. This section was developed in collaboration with the SAS, crosslinking-MS, and 3DEM communities and the SASBDB (Valentini et al., 2015), PRIDE (Perez-Riverol et al., 2025), and EMDB (Turner et al., 2024) resources. For details on the metrics, guidelines, and recommendations used, refer to the community guidelines (Trewhella et al., 2017, Leitner et al., 2020, Kleywegt et al., 2024). All experimental datasets used to build the model are listed; however, validation criteria for other types of experimental data are currently under development.

6. Fit to Data Used for Validation Assessment: Fit to data not used during the modeling. This category is under development.

1. Overview

1.1. Summary: Summary of the structure, including number of models deposited, datasets used to build the models and information on model representation.

1.2. Overall Quality Assessment: This is a set of plots that represent a snapshot view of the validation results. There are four tabs, one for each validation criterion: (i) model quality, (ii) data quality, (iii) fit to data used for modeling, and (iv) fit to data used for validation.

1.2.1. Model quality: For atomic structures, MolProbity is used for evaluation. We evaluate bond outliers, side chain outliers, clash score, rotamer satisfaction, and Ramachandran dihedral satisfaction (Williams et al. 2018). Details on MolProbity evaluation and tables can be found here. For coarse-grained structures of beads, we evaluate excluded volume satisfaction. An excluded volume violation or overlap between two beads occurs if the distance between the two beads is less than the sum of their radii (S. J. Kim et al. 2018). Excluded volume satisfaction is the percentage of pair distances in a structure that are not violated (higher values are better).
1.2.2. Data quality: Data quality assessments are available for SAS and crosslinking-MS datasets. The current plot displays radius of gyration (R_g) for each SAS dataset used to build the model. R_g is obtained from both a P(r) analysis (see more here), and a Guinier analysis (see more here). For the crosslinking-MS datasets we assess consistency between experimental data and data used for modeling (see more here).
1.2.3. Fit to data used for modeling: Fit to data used for modeling assessments are available for SAS and crosslinking-MS datasets. The plot displays Χ² Goodness of Fit Assessment for SAS-model fits (see more here) and percentage of satisfied crosslinking-MS restraints (see more here).
1.2.4. Fit to data used for validation: Fit to data used for validation is currently under development.

2. Model Details

2.1. Ensemble Information: Number of ensembles deposited, where each ensemble consists of two or more structures.

2.2. Representation: Number and details on rigid and flexible elements of the structure.

2.3. Datasets Used: Number and type of experimental datasets used to build the model.

2.4. Methodology and Software: Methods, protocols, and software used to build the integrative structure.

3. Data Quality Assessment

3.1. SAS

3.1.1. Scattering Profiles: Scattering data from solutions of biological macromolecules are presented as both log I(q) vs. q and log I(q) vs. log(q). I(q) is the scattering intensity (preferably on an absolute scale in cm⁻¹, but arbitrary units are accepted) and q is the modulus of the scattering vector (nm⁻¹ or Å⁻¹).

3.1.2. Experimental Estimates: Molecular weight (MW) and volume data are displayed. Theoretical MW can be compared to SAS-derived values using the forward scatter (I(0)) and the known concentration and partial specific volume of the scattering particle, or as estimated from the Porod volume and partial specific volume (Trewhella et al., 2017, Trewhella et al., 2023).

3.1.3. Flexibility analysis: In a Porod-Debye plot, a clear plateau is observed for globular (partially or fully folded) domains, whereas flexible-modular, fully unfolded domains or extended/stiff rod-shaped domains lack a discernible plateau (Rambo and Tainer 2011). A bell-shaped Kratky plot (q²I(q) vs. q) with a well-defined maximum is observed for compact/folded structures. For partially flexible/modular or extended structures the Kratky plot can show multiple maxima and/or an increase in intensity at higher q-values depending on the degree of flexibility and extension. Fully intrinsically disordered structures yield a Kratky plot that systematically increases with increasing q values and will be near linear for highly extended molecules. The dimensionless Kratky plot ((qR_g)²I(q) vs. qR_g) is useful for quantifying differences in shape and foldedness among scattering objects of different sizes (Trewhella et al., 2023).
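
The Kratky transforms described above can be sketched in a few lines. This is a minimal illustration, not the code used to generate the report: the function names are hypothetical, the synthetic intensity is a Guinier-like decay chosen only to mimic a compact particle, and the division by I(0) in the dimensionless form is an assumption (a common normalization that makes curves from different samples comparable).

```python
import math

def kratky(q, intensity):
    """Return Kratky coordinates (q, q^2 * I(q))."""
    return [(qi, qi ** 2 * ii) for qi, ii in zip(q, intensity)]

def dimensionless_kratky(q, intensity, rg, i0):
    """Return dimensionless Kratky coordinates (q*Rg, (q*Rg)^2 * I(q) / I(0)).

    Dividing by I(0) is an assumed normalization, not taken from the guide.
    """
    return [(qi * rg, (qi * rg) ** 2 * ii / i0) for qi, ii in zip(q, intensity)]

# Synthetic compact particle (illustration only): I(q) = I0 * exp(-(q*Rg)^2 / 3)
rg, i0 = 20.0, 100.0
q = [0.0005 * k for k in range(1, 601)]  # q from 0.0005 to 0.3
intensity = [i0 * math.exp(-(qi * rg) ** 2 / 3.0) for qi in q]

curve = dimensionless_kratky(q, intensity, rg, i0)
peak_x, peak_y = max(curve, key=lambda point: point[1])
```

For this synthetic curve the bell-shaped maximum sits near qR_g = √3 with a height near 3/e, which follows directly from the assumed Guinier form. Flexible or disordered particles would shift or flatten this maximum.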

3.1.4. P(r) Analysis: The atom-pair distance distribution function (PDDF) or P(r) represents the distribution of distances between all pairs of atoms within the particle weighted by the respective scattering contrasts (Moore, 1980). The second moment of P(r) yields the radius of gyration (R_g), which is a measure of the overall size and shape of a macromolecule (i.e. the spatial distribution of volume elements). A protein with a smaller R_g is more compact than a protein with a larger R_g, provided both have the same molecular weight.
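
As a worked illustration of the second-moment relation, R_g² = ∫r²P(r)dr / (2∫P(r)dr), the sketch below integrates a P(r) curve numerically. The function names are hypothetical, the integration uses a plain trapezoid rule, and the test curve is the analytical P(r) of a uniform sphere, chosen because its R_g is known to be √(3/5) times the sphere radius.

```python
import math

def trapezoid(x, y):
    """Trapezoid-rule integral of y over x."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i]) for i in range(len(x) - 1))

def rg_from_pr(r, pr):
    """Radius of gyration from the second moment of P(r)."""
    numerator = trapezoid(r, [ri ** 2 * pi for ri, pi in zip(r, pr)])
    denominator = trapezoid(r, pr)
    return math.sqrt(numerator / (2.0 * denominator))

# Uniform sphere of radius 50 (illustration only): with x = r / Dmax,
# P(r) is proportional to x^2 * (2 - 3x + x^3) on 0 <= r <= Dmax.
sphere_radius = 50.0
d_max = 2.0 * sphere_radius
r = [d_max * k / 2000 for k in range(2001)]
pr = [(ri / d_max) ** 2 * (2 - 3 * (ri / d_max) + (ri / d_max) ** 3) for ri in r]

rg = rg_from_pr(r, pr)
```

The scale of P(r) cancels in the ratio, so the curve does not need to be normalized before the calculation.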

3.1.5. Guinier Analysis: The linearity of the Guinier plot (ln I(q) vs. q²) at very low angles (qR_g < 1.3) is a sensitive indicator of the quality of the sample in relation to its homogeneity; a linear Guinier plot is a necessary but not sufficient demonstration that a solution contains monodisperse particles of the same size. Deviations from linearity can point to strong interference effects from particle attraction or repulsion, polydispersity of the samples, or improper background subtraction (Feigin et al., 2013). Residual difference plots and the coefficient of determination (R²) are used to assess the quality of the linear fit to the Guinier region. A perfect fit has an R² value of 1. Residual values should be equally and randomly spaced around the horizontal axis with no evident systematic upward or downward curvature. Agreement between the P(r) and Guinier-determined R_g is a good measure of the self-consistency of the SAS profile.
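
A Guinier fit can be sketched as an ordinary least-squares line through ln I(q) against q², using the Guinier approximation ln I(q) ≈ ln I(0) − (R_g²/3)q². The code below is a minimal sketch under stated assumptions, not the report's implementation: the function name is hypothetical, the data are synthetic and noise-free, and the caller is assumed to have already restricted the points to the Guinier region.

```python
import math

def guinier_fit(q, intensity):
    """Least-squares line through (q^2, ln I); returns (Rg, I0, R^2)."""
    x = [qi ** 2 for qi in q]
    y = [math.log(ii) for ii in intensity]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope >= 0:
        raise ValueError("non-negative slope: no physical Rg")
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    rg = math.sqrt(-3.0 * slope)  # slope = -Rg^2 / 3
    return rg, math.exp(intercept), r_squared

# Synthetic, noise-free data (illustration only) with Rg = 20 and I(0) = 100
true_rg, true_i0 = 20.0, 100.0
q = [0.005 + 0.005 * k for k in range(12)]  # 0.005 .. 0.060, so q*Rg <= 1.2
intensity = [true_i0 * math.exp(-(qi * true_rg) ** 2 / 3.0) for qi in q]

rg, i0, r_squared = guinier_fit(q, intensity)
within_guinier_limit = max(q) * rg < 1.3
```

With real data the fitted R_g should be checked against the qR_g < 1.3 limit after the fit, and the range narrowed and refitted if the limit is exceeded.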

3.2. Crosslinking-MS: At present, data validation is only available for crosslinking-MS data deposited as a fully compliant dataset in the PRIDE database. Data completeness shows how many experimentally detected crosslinks for given entities were actually used for modeling. We compare entities in the MS search database with those reported in the mmCIF file using pyHMMER and match corresponding crosslinks. The values are reported as percentages of crosslinks present in the data and have to be interpreted in the context of the experiment (i.e., only a minor fraction of an in-situ or in-vivo dataset can be used for modeling).

3.3. 3DEM: The PDB-IHM validation pipeline for 3DEM data reuses elements of the wwPDB EM validation pipeline. Detailed descriptions of data quality metrics and visualisations are available on the wwPDB EM map validation help page.

3.4. Other datasets: Validation for other types of input data is currently under development.

4. Model Quality Assessment

Excluded volume assessments are performed for coarse-grained structures and MolProbity analysis is performed for atomic structures.

4.1a. Excluded Volume Analysis: Excluded volume violation is defined as the percentage of overlaps between coarse-grained beads in a structure. This percentage is obtained by dividing the number of overlaps/violations by the total number of pair distances in a structure. An overlap or violation between two beads occurs if the distance between the two beads is less than the sum of their radii (S. J. Kim et al. 2018).
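
The definition above can be sketched as a count over all bead pairs. This is a minimal illustration under stated assumptions: the function name and the (x, y, z, radius) tuple layout are hypothetical, and every pair of beads is counted, whereas a production pipeline might treat sequence-adjacent beads differently.

```python
import math
from itertools import combinations

def excluded_volume(beads):
    """Return (violation %, satisfaction %) for a list of (x, y, z, radius) beads."""
    n_pairs = 0
    n_violations = 0
    for (x1, y1, z1, r1), (x2, y2, z2, r2) in combinations(beads, 2):
        n_pairs += 1
        distance = math.dist((x1, y1, z1), (x2, y2, z2))
        if distance < r1 + r2:  # beads overlap
            n_violations += 1
    if n_pairs == 0:
        raise ValueError("need at least two beads")
    violation = 100.0 * n_violations / n_pairs
    return violation, 100.0 - violation

# Four hypothetical beads of radius 5; only the first two overlap (8 < 5 + 5)
beads = [
    (0.0, 0.0, 0.0, 5.0),
    (8.0, 0.0, 0.0, 5.0),
    (30.0, 0.0, 0.0, 5.0),
    (0.0, 30.0, 0.0, 5.0),
]
violation_pct, satisfaction_pct = excluded_volume(beads)
```

Here one of the six pair distances is violated, so the violation percentage is 1/6 of 100 and the satisfaction percentage is the remainder.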

4.1b. MolProbity Analysis: MolProbity analysis for atomic structures reported is consistent with PDB standards for X-ray structures (Williams et al. 2018). Summarized information is available in both the HTML and PDF reports. Detailed descriptions are available on the wwPDB X-ray validation help page, in section 5. Model quality.

4.2. PrISM Precision Analysis: Regions of low and high precision, defined as the variability among the models that satisfy the input data and calculated as the density-weighted root mean-square fluctuation (RMSF) from the bead/atom center of density, are annotated and visualized using PrISM (Ullanat et al. 2022). The per-bead or per-residue precision is computed from the deposited ensemble of superposed integrative models. High- and low-precision regions are then determined by clustering beads of similar precision based on their proximity in the structure. Only coarse-grained beads (or CA atoms for atomic models) of deposited models are used for precision assessment and visualization, and three projections for each representative model are generated.

5. Fit to Data Used for Modeling Assessment

5.1. SAS

Recommendations from the SAS validation task force (SASvtf) for model fit assessment include reporting:

- All software, including version numbers, used for modelling; three-dimensional shape, bead or atomistic modelling.
- All modelling assumptions clearly stated, including adjustable parameter values. In the case of imposed symmetry, especially in the case of shape models, comparison with results obtained in the absence of symmetry restraints.
- For atomistic modelling, a description of how the starting models were obtained (e.g. crystal or NMR structure of a domain, homology model etc.), connectivity or distance restraints used and flexible regions specified and the basis for their selection.
- Any additional experimental or bioinformatics-based evidence supporting modelling assumptions and therefore enabling modelling restraints or independent model validation.
- For three-dimensional models, values for adjustable parameters, constant adjustments to intensity, χ² and associated p-values and a clear representation of the model fit to the experimental I(q) versus q including a residual plot that clearly identifies systematic deviations.
- Analysis of the ambiguity and precision of models, e.g. based on cluster analysis of results from multiple independent optimizations of the model against the SAS profile or profiles, with examples of any distinct clusters in addition to any final averaged model.

5.1.1. Model versus Experimental Scattering Profiles: Model fits displayed in this section are obtained from SASBDB. χ² values are a measure of fit of the model to data. A perfect fit has a χ² value of 1.0 (Trewhella et al. 2013, Schneidman-Duhovny, Kim, and Sali 2012, and Rambo and Tainer 2013).

5.1.2. Χ² Goodness of Fit Assessment: χ² values are a measure of the overall fit of the model to the 1D scattering profile. A model that fits the data within its error estimates will have a χ² value close to one, provided that the dominant errors are the random statistical errors (i.e. no systematic errors) from the SAS measurement that are correctly propagated (Trewhella et al. 2013, Schneidman-Duhovny, Kim, and Sali 2012, and Rambo and Tainer 2013).
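
One common form of this statistic is the reduced χ² = (1/(N−1)) Σ [(I_exp(q_i) − c·I_model(q_i)) / σ(q_i)]², where c is a scale factor fitted by weighted least squares. The sketch below implements that form as an assumption: the function name is hypothetical, and fitting programs differ in the normalization (N or N−1) and in whether a constant offset is also fitted, so the values in the report may be computed differently.

```python
def reduced_chi_square(i_exp, i_model, sigma):
    """Reduced chi-square after scaling the model onto the data.

    The scale factor c minimizes the weighted squared residuals.
    Normalizing by N - 1 is an assumption; conventions vary by program.
    """
    weights = [1.0 / s ** 2 for s in sigma]
    scale = (
        sum(w * e * m for w, e, m in zip(weights, i_exp, i_model))
        / sum(w * m * m for w, m in zip(weights, i_model))
    )
    total = sum(w * (e - scale * m) ** 2 for w, e, m in zip(weights, i_exp, i_model))
    return total / (len(i_exp) - 1), scale

# Toy numbers (illustration only): every residual is exactly one sigma
i_exp = [2.0, 4.0, 2.0, 4.0]
i_model = [1.0, 1.0, 1.0, 1.0]
sigma = [1.0, 1.0, 1.0, 1.0]
chi2, scale = reduced_chi_square(i_exp, i_model, sigma)
```

Because each residual is divided by its reported error, underestimated errors inflate χ² and overestimated errors deflate it, which is why the value is only meaningful when the errors are correctly propagated.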

5.1.3. CorMap Test: The Correlation Map (CorMap) test (Franke et al., 2015) is a variance-covariance analysis on the scattering intensities comparing two (or more) scattering profiles (e.g. model versus experiment or multiple measurements from the same sample). The CorMap test complements χ² and, importantly, is independent of the reported errors. The method assigns a probability (P-value based on a 1-tailed Schilling test) for finding the longest string of experimental data points that lie systematically above (+1) or below (-1) the model profile. The P-value lies between 0 and 1, and a significance threshold is chosen below which the model fit is judged to show systematic deviation from experiment. A typical range statisticians use to indicate significant deviation is 0.01 to 0.05. As implemented in the ATSAS suite (Manalastas-Cantos et al. 2021), the reported CorMap P-value is green (model fit is good) for P > 0.05, yellow for 0.01 < P < 0.05, and red (model deviates significantly) for P < 0.01.
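
The idea behind the test, as described above, can be illustrated with a longest-run calculation. The sketch below is not the ATSAS implementation: all function names are hypothetical, the probability is obtained by exact counting of sign sequences under the assumption that each point is independently above or below the model with probability 1/2, and a point exactly equal to the model is arbitrarily counted as below.

```python
from fractions import Fraction

def longest_run(i_exp, i_model):
    """Length of the longest stretch of points all above or all below the model."""
    signs = [1 if e > m else -1 for e, m in zip(i_exp, i_model)]  # ties count as -1
    best = current = 1
    for previous, sign in zip(signs, signs[1:]):
        current = current + 1 if sign == previous else 1
        best = max(best, current)
    return best

def prob_longest_run_at_least(n, k):
    """P(longest run of identical signs >= k) for n fair, independent signs."""
    if n < 1:
        raise ValueError("n must be positive")
    if k <= 1:
        return 1.0
    if k > n:
        return 0.0
    # counts[r]: sequences whose current run has length r and with no run >= k
    counts = [0] * k
    counts[1] = 2
    for _ in range(n - 1):
        new = [0] * k
        new[1] = sum(counts)        # sign flips: run restarts at length 1
        for r in range(1, k - 1):
            new[r + 1] = counts[r]  # same sign: run grows by one
        counts = new
    return float(1 - Fraction(sum(counts), 2 ** n))

def flag(p_value):
    """Colour bands quoted in the text for the reported P-value."""
    if p_value > 0.05:
        return "green"
    return "red" if p_value < 0.01 else "yellow"

model = [10.0] * 10
biased = [m + 0.5 for m in model]  # every point above the model
run = longest_run(biased, model)
p_value = prob_longest_run_at_least(len(model), run)
```

In this toy case all ten points lie above the model, so the run length is 10 and the probability of such a run arising by chance is 2/1024, which falls in the red band.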

5.2. Crosslinking-MS

5.2.1. Restraint types: This table summarizes information about the crosslinker(s) used for data generation, and how crosslinking information was translated into actual modeling restraints. Restraints assigned "by-residue" are interpreted as between CA atoms. Restraints between coarse-grained beads are indicated as "coarse-grained". A restraint group represents a set of crosslinking restraints applied collectively in the modeling. Restraints with identical thresholds are grouped into one plot. Only the best distance per restraint per model group/ensemble is plotted. Inter- and intramolecular (including self-links) restraints are also grouped into one plot. The distance for a restraint between coarse-grained beads is calculated as the minimal distance between shells; if the beads intersect, the distance is reported as 0.0. The bead with the highest available resolution for a given residue is used for the assessment. Distograms (i.e., histogram plots of distances) provide an overview of distributions of distances between residues for which chemical crosslinks were identified. A shift of the distogram relative to the threshold value may indicate a poor model.
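
The shell-to-shell distance and the "best distance" rule can be sketched as follows. This is an illustration under stated assumptions: the function names and the tuple layout are hypothetical, beads are treated as spheres, and "best" is taken to mean the smallest distance across the models of a group.

```python
import math

def shell_distance(center_a, radius_a, center_b, radius_b):
    """Minimal distance between two bead surfaces; 0.0 if the beads intersect."""
    gap = math.dist(center_a, center_b) - radius_a - radius_b
    return max(0.0, gap)

def best_distance(bead_pairs_per_model):
    """Smallest shell distance for one restraint across all models in a group."""
    return min(shell_distance(*pair) for pair in bead_pairs_per_model)

# One crosslink evaluated in three hypothetical models:
# (center of bead A, radius of A, center of bead B, radius of B)
models = [
    ((0.0, 0.0, 0.0), 5.0, (30.0, 0.0, 0.0), 5.0),  # shells 20.0 apart
    ((0.0, 0.0, 0.0), 5.0, (18.0, 0.0, 0.0), 5.0),  # shells 8.0 apart
    ((0.0, 0.0, 0.0), 5.0, (6.0, 0.0, 0.0), 5.0),   # beads intersect
]
distances = [shell_distance(*pair) for pair in models]
best = best_distance(models)
```

Clamping at 0.0 means that intersecting beads always satisfy any non-negative distance threshold, regardless of how deeply they overlap.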

5.2.2. Satisfaction rates: Satisfaction of restraints is calculated at the level of a restraint group (a set of crosslinking restraints applied collectively in the modeling). Satisfaction of a restraint group depends on satisfaction of individual restraints in the group and the conditionality (all/any). A restraint group is considered satisfied if the condition was met in at least one model of the model group/ensemble. At present, only deposited models are used for validation.
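
The group-level logic can be sketched as below. The data layout, the function names and the "ALL"/"ANY" labels are hypothetical, and two further assumptions are made: every restraint is an upper-bound distance restraint, and a distance equal to the threshold counts as satisfied.

```python
def group_satisfied(distances_by_model, threshold, conditionality):
    """True if the group condition is met in at least one model.

    distances_by_model holds one list of restraint distances per model.
    """
    combine = {"ALL": all, "ANY": any}[conditionality]
    return any(
        combine(distance <= threshold for distance in model_distances)
        for model_distances in distances_by_model
    )

def satisfaction_rate(groups):
    """Percentage of restraint groups that are satisfied."""
    satisfied = sum(group_satisfied(**group) for group in groups)
    return 100.0 * satisfied / len(groups)

# Three hypothetical restraint groups, each evaluated in two models
groups = [
    {"distances_by_model": [[12.0, 40.0], [31.0, 33.0]],
     "threshold": 35.0, "conditionality": "ANY"},
    {"distances_by_model": [[12.0, 40.0], [36.0, 20.0]],
     "threshold": 35.0, "conditionality": "ALL"},
    {"distances_by_model": [[50.0], [34.5]],
     "threshold": 35.0, "conditionality": "ALL"},
]
rate = satisfaction_rate(groups)
```

The second group fails because no single model satisfies both of its restraints at once, even though each restraint is satisfied in some model. The condition is evaluated per model, then combined across models.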

5.3. 3DEM: The PDB-IHM validation pipeline for 3DEM data reuses elements of the wwPDB EM validation pipeline. Detailed descriptions of model-to-map fit metrics and visualisations are available on the wwPDB EM validation help page, in section 9. Map-Model fit.

5.4. Other datasets: Validation for other types of input data is currently under development.

6. Fit to Data Used for Validation Assessment

This includes assessing model fit to data that was not used explicitly or implicitly in modeling. This section is currently under development.

Understanding the PDB-IHM Summary Table

1. Model composition: Summary description of the entry.

1.1. Entry composition: List of unique molecules that are present in the entry.
1.2. Datasets used for modeling: List of input experimental datasets used for modeling.

2. Representation: Representation of modeled structure.

2.1. Number of representations: Total number of representations used in the entry.
2.2. Scale: Types and sizes of geometric objects comprising structural models.
2.3. Number of rigid and flexible segments: A rigid segment consists of multiple coarse-grained (CG) beads or atomic residues. In a rigid segment, the beads (or residues) have their relative distances constrained during conformational sampling. Flexible segments consist of strings of beads that are restrained by the sequence connectivity.

3. Restraints: A set of restraints used to compute modeled structure.

3.1. Physical restraints: A list of restraints derived from physical principles to compute modeled structure.
3.2. Experimental information: A list of restraints derived from experimental datasets to compute modeled structure.

4. Validation: Assessment of models based on validation criteria set by IHM task force (Sali et al. 2015 and Berman et al. 2019).

4.1. Sampling validation: Validation metrics used to assess sampling convergence for stochastic sampling. Sampling precision is defined as the largest allowed root-mean-square deviation (RMSD) between the cluster centroid and a model within any cluster in the finest clustering for which each sample contributes structures proportionally to its size (considering both the significance and magnitude of the difference) and for which a sufficient proportion of all structures occur in sufficiently large clusters (Viswanath et al. 2017).
4.2. Number of ensembles: Number of solutions or ensembles of modeled structure.
4.3. Number of models in ensemble(s): Number of structures in the solution ensemble(s).
4.4. Number of deposited models: Total number of models in the entry.
4.5. Model precision: Measurement of variation among the models in the ensemble upon a global least-squares superposition. Provided by the depositor.
4.6. Data quality: Assessment of data on which modeled structures are based. See section 3. Data quality of the full validation report for more details.
4.7. Model quality: Assessment of modeled structures based on physical principles. See section 4. Model Quality Assessment of the full validation report for more details.
4.8. Fit to data used for modeling: Assessment of modeled structure based on data used for modeling. See section 5. Fit to Data Used for Modeling Assessment of the full validation report for more details.
4.9. Fit to data used for validation: Assessment of modeled structure based on data not used for modeling. See section 6. Fit to Data Used for Validation Assessment of the full validation report for more details.

5. Methodology and Software: List of methods on which modeled structures are based and software used to obtain structures.

5.1. Method name: Name(s) of the modeling step(s).
5.2. Method type: Name(s) of method(s) used to generate modeled structures.
5.3. Method description: Details of method(s) used to generate modeled structures.
5.4. Number of computed models: Number of models computed at each modeling step.
5.5. Software: Software used to compute modeled structure, including scripts used to generate and analyze models.

References

Berman, Helen M., Paul D. Adams, Alexandre A. Bonvin, Stephen K. Burley, Bridget Carragher, Wah Chiu, Frank DiMaio, et al. 2019. "Federating Structural Models and Data: Outcomes from A Workshop on Archiving Integrative Structures." Structure 27 (12): 1745–59.

Manalastas-Cantos, Karen, Petr V. Konarev, Nelly R. Hajizadeh, Alexey G. Kikhney, Maxim V. Petoukhov, Dmitry S. Molodenskiy, Alejandro Panjkovich, et al. 2021. "ATSAS 3.0: Expanded Functionality and New Tools for Small-Angle Scattering Data Analysis." Journal of Applied Crystallography 54 (Pt 1): 343–55.

Rambo, Robert P., and John A. Tainer. 2011. "Characterizing Flexible and Intrinsically Unstructured Biological Macromolecules by SAS Using the Porod-Debye Law." Biopolymers 95 (8): 559–71.

Sali, Andrej, Helen M. Berman, Torsten Schwede, Jill Trewhella, Gerard Kleywegt, Stephen K. Burley, John Markley, et al. 2015. "Outcome of the First wwPDB Hybrid/Integrative Methods Task Force Workshop." Structure 23 (7): 1156–67.

Trewhella, Jill, Anthony P. Duff, Dominique Durand, Frank Gabel, J. Mitchell Guss, Wayne A. Hendrickson, Greg L. Hura, et al. 2017. "2017 Publication Guidelines for Structural Modelling of Small-Angle Scattering Data from Biomolecules in Solution: An Update." Acta Crystallographica. Section D, Structural Biology 73 (Pt 9): 710–28.

Valentini, Erica, Alexey G. Kikhney, Gianpietro Previtali, Cy M. Jeffries, and Dmitri I. Svergun. 2015. "SASBDB, a Repository for Biological Small-Angle Scattering Data." Nucleic Acids Research 43 (Database issue): D357–63.

Perez-Riverol, Yasset, et al. 2025. "The PRIDE database at 20 years: 2025 update." Nucleic Acids Research 53 (Database issue): D543–D553.

Viswanath, Shruthi, Ilan E. Chemmama, Peter Cimermancic, and Andrej Sali. 2017. "Assessing Exhaustiveness of Stochastic Sampling for Integrative Modeling of Macromolecular Structures." Biophysical Journal 113 (11): 2344–53.

Williams, Christopher J., Jeffrey J. Headd, Nigel W. Moriarty, Michael G. Prisant, Lizbeth L. Videau, Lindsay N. Deis, Vishal Verma, et al. 2018. "MolProbity: More and Better Reference Data for Improved All-Atom Structure Validation." Protein Science: A Publication of the Protein Society 27 (1): 293–315.

Leitner, Alexander, Alexandre Bonvin, et al. 2020. "Toward Increased Reliability, Transparency, and Accessibility in Cross-linking Mass Spectrometry." Structure 28 (11): 1259–1268.

Ullanat, Varun, Nikhil Kasukurthi, and Shruthi Viswanath. 2022. "PrISM: precision for integrative structural models." Bioinformatics 38 (15): 3837–3839.

Kleywegt, Gerard J., Paul D. Adams, Sarah J. Butcher, Catherine L. Lawson, et al. 2024. "Community recommendations on cryoEM data archiving and validation." IUCrJ 11: 140–151.

IHMValidation Version 3.1