Data quality and model parameterisation Martyn Winn CCP4, Daresbury Laboratory, U.K. Prague, April 2009


  • Data quality and model parameterisation

    Martyn Winn, CCP4, Daresbury Laboratory, U.K.

    Prague, April 2009

  • Model Parameters

    E.g. asymmetric unit contains n copies of a protein of N atoms.

    Coordinates:
      • 3 x N x n xyz coordinates
      • or ... 6 x M x n if each protein is modelled as M rigid bodies
      • or ... ~ 0.5 x N x n torsion angles

    Displacement parameters:
      • 1 x N x n B factors
      • or ... 6 x N x n anisotropic U factors
      • or ... 20 x M x n if each protein has M TLS groups
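The counts on this slide can be turned into a small calculator. This is only a sketch: the function names and the `model` keywords are made up for illustration, and the torsion-angle count uses the slide's rough factor of 0.5.

```python
def coordinate_parameters(n_atoms, n_copies, model="xyz", n_groups=0):
    """Positional parameters for n_copies of a protein with n_atoms atoms."""
    if model == "xyz":        # 3 x N x n
        return 3 * n_atoms * n_copies
    if model == "rigid":      # 6 x M x n (3 rotations + 3 translations per body)
        return 6 * n_groups * n_copies
    if model == "torsion":    # ~0.5 x N x n, approximate by construction
        return round(0.5 * n_atoms * n_copies)
    raise ValueError(f"unknown coordinate model: {model}")


def displacement_parameters(n_atoms, n_copies, model="iso", n_groups=0):
    """Displacement parameters for the same asymmetric unit."""
    if model == "iso":        # 1 x N x n B factors
        return n_atoms * n_copies
    if model == "aniso":      # 6 x N x n anisotropic U factors
        return 6 * n_atoms * n_copies
    if model == "tls":        # 20 x M x n (T: 6, L: 6, S: 8 per group)
        return 20 * n_groups * n_copies
    raise ValueError(f"unknown displacement model: {model}")


# Illustrative numbers only: 2 copies of a 1000-atom protein.
full = coordinate_parameters(1000, 2) + displacement_parameters(1000, 2)
coarse = (coordinate_parameters(1000, 2, "rigid", n_groups=3)
          + displacement_parameters(1000, 2, "tls", n_groups=3))
print(full, coarse)
```

The point of the comparison is how sharply rigid-body and TLS descriptions cut the parameter count relative to per-atom x, y, z, B.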

  • Model Parameters (2)

    Occupancies:
      • Usually fixed at 1.0 for protein
      • ... except for alternative conformations (usually sum to 1.0)
      • Water/ligand occupancies

    Scaling parameters etc.:
      • k_overall, B_overall, k_Babinet, B_Babinet, k_solvent, B_solvent
      • twin fraction

    Ultra-high resolution:
      • Multipolar expansion coefficients
      • Interatomic scatterers

  • Reflection Data

    Number of independent reflections, dependent on:
      • spacegroup
      • resolution
      • completeness

    For each reflection, one has at least F/sigF.
    Might also have reliable experimental phases or F(+)/F(-).

  • Data / parameter ratio

    Refinement means minimise -log(likelihood):
      • Nonlinear function of model parameters.
      • Global minimum and many local minima.
      • Need good data/parameter ratio.

    Strong dependence on resolution. No strong dependence on protein size.

    Generally not enough data ....
      • Reduce number of parameters - constraints
      • Add data - restraints
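Why the ratio depends on resolution but not on protein size can be seen from a back-of-envelope estimate. The sketch below rests on assumptions that are not in the slides: a complete dataset, a primitive lattice, a chiral space group with Friedel pairs merged, isotropic refinement (4 parameters per atom), and an illustrative 33 Å^3 of asymmetric-unit volume per non-hydrogen atom. Under those assumptions the number of unique reflections is about (2π/3) V_asu / d^3, while the parameter count scales with V_asu, so the volume cancels.

```python
import math


def unique_reflections(v_asu, d_min):
    """Approximate unique reflections to resolution d_min (Angstrom).

    Lattice points inside a sphere of radius 1/d_min number
    (4*pi/3) * V_cell / d_min**3; dividing by 2Z (Friedel pairs plus
    Z symmetry operators) leaves (2*pi/3) * V_asu / d_min**3.
    """
    return (2.0 * math.pi / 3.0) * v_asu / d_min ** 3


def isotropic_parameters(v_asu, vol_per_atom=33.0):
    """x, y, z, B for every atom; vol_per_atom is an assumed typical value."""
    return 4.0 * v_asu / vol_per_atom


def data_per_parameter(v_asu, d_min, vol_per_atom=33.0):
    return unique_reflections(v_asu, d_min) / isotropic_parameters(v_asu, vol_per_atom)


for d_min in (3.0, 1.8, 1.0):
    small = data_per_parameter(5.0e4, d_min)   # small asymmetric unit
    large = data_per_parameter(5.0e5, d_min)   # ten times larger
    print(f"{d_min:.1f} A: {small:.2f} vs {large:.2f}")
```

The two columns agree at every resolution, and the ratio grows as 1/d^3, which is the sense in which there is strong dependence on resolution and none on size.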

  • Restraints

    Expected geometry of the protein treated as additional data:
      • bond lengths
      • bond angles
      • torsions / dihedral (but not φ, ψ)
      • chirality (e.g. chiral volume)
      • planarity
      • non-bonded (VdW, H-bonds, etc.)
      • B factors (between bonded atoms)
      • U factor restraints (similarity, sphericity, rigid bond)
      • NCS (position or conformation)

  • Data / parameter ratio

    Estimate as: (no. reflections + no. restraints) / no. parameters

    Not really true ... assumes all data are independent. Counter-examples:
      • bond lengths, bond angles and planar restraints in a ring system
      • bond length restraint vs. high resolution diffraction data

    Restraints may be more necessary in poorly determined parts of the structure.

    Restraints have associated weights:
      • Overall w.r.t. reflection data
      • Individual weights e.g. WB
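One common way to picture the two kinds of weight is a least-squares restraint term added to the data term. The sketch below assumes that form; it is not the target function of any particular program (some programs put the overall weight on the X-ray term instead), and the bond lengths and sigmas are invented for illustration.

```python
def restraint_term(values, ideals, sigmas):
    """Sum of squared deviations from ideal values, each weighted by 1/sigma^2.

    The sigmas play the role of the individual weights.
    """
    return sum(((v - i) / s) ** 2 for v, i, s in zip(values, ideals, sigmas))


def total_target(minus_log_likelihood, restraint_terms, overall_weight):
    """Data term plus restraint terms, balanced by one overall weight (assumed form)."""
    return minus_log_likelihood + overall_weight * sum(restraint_terms)


# Hypothetical bond lengths (Angstrom) in the current model vs. ideal values.
bonds = restraint_term([1.54, 1.31, 1.25], [1.52, 1.33, 1.23], [0.02, 0.02, 0.02])
print(total_target(1000.0, [bonds], overall_weight=0.5))
```

Raising `overall_weight` pulls the model towards ideal geometry; lowering it lets the reflection data dominate.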

  • calmodulin at 1.8 Å (1clm)

    1132 protein atoms, 4 Ca atoms, 71 waters
    4828 x, y, z, B factors

    No. of unique reflections 10610 (deposited 1993, no test set!)
    data/parameter = 2.2

    Bond restraints: 1144
    Angle restraints: 1536
    Torsion restraints: 429
    Chiral restraints: 170
    Planar restraints: 874
    Non-bonded restraints: 1391
    B factor restraints: 2680
    (no NCS)
    total restraints = 8224
    (data + restraints)/parameter = 3.9

  • calmodulin at 1.0 Å (1exr)

    1467 protein atoms (inc. alt. conf.), 5 Ca atoms, 178 waters
    4950 x, y, z
    + 9900 anisotropic U factors
    + 316 occupancy parameters
    total parameter count = 15166

    No. of unique reflections 77150
    No. in test set 7782 (10%)
    Data for refinement 69368

    No. of restraints (PDB header) 22732

    data/parameter = 4.6
    (data + restraints)/parameter = 6.1

  • GCPII at 1.75 Å (3d7g)

    5724 protein atoms (inc. alt. conf.), 211 ligand atoms, 617 waters
    26046 x, y, z, B factors
    + 162 anisotropic U factors (S, Zn, Ca, Cl only)
    + 225 occupancy parameters
    total parameter count = 26433

    No. of unique reflections 105077
    No. in test set 1550 (1.5%)
    Data for refinement 103527

    No. of restraints (PDB header) 44652

    data/parameter = 3.9
    (data + restraints)/parameter = 5.6

  • Thioredoxin reductase at 3.0 Å (1h6v)

    22514 protein atoms, 552 ligand atoms, 9 waters
    92300 x, y, z, residual B factors
    6 TLS groups
    120 TLS parameters

    No. of unique reflections 69328
    No. in test set 3441 (5%)
    Data for refinement 65887

    No. of restraints 209378 (inc. 44484 NCS restraints)

    data/parameter = 0.7
    (data + restraints)/parameter = 3.0
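The ratios quoted for the four example structures can be re-derived from the counts on the slides. The sketch below only repackages those numbers; the dictionary layout and the function name are illustrative, and the 1h6v parameter count is taken as 92300 + 120 TLS parameters.

```python
# name: (reflections used in refinement, restraints, parameters), from the slides
STRUCTURES = {
    "1clm (1.8 A)":  (10610, 8224, 4828),
    "1exr (1.0 A)":  (69368, 22732, 15166),
    "3d7g (1.75 A)": (103527, 44652, 26433),
    "1h6v (3.0 A)":  (65887, 209378, 92300 + 120),
}


def ratios(n_reflections, n_restraints, n_parameters):
    """Return (data/parameter, (data + restraints)/parameter)."""
    return (n_reflections / n_parameters,
            (n_reflections + n_restraints) / n_parameters)


for name, counts in STRUCTURES.items():
    data_only, with_restraints = ratios(*counts)
    print(f"{name:14s} {data_only:4.1f} {with_restraints:4.1f}")
```

At 3.0 Å the reflections alone give fewer than one observation per parameter, so the restraints (including NCS) carry most of the weight.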

  • Getting a good R-factor

    The old way:
      • Refine parameters so that Fcalc (from model) agrees with Fobs for all reflections
      • Calculate: $R = \sum \bigl|\,|F_{obs}| - s\,|F_{calc}|\,\bigr| \,/\, \sum |F_{obs}|$
        (Note: precise value may depend on scaling used)
      • Add parameters until R is sufficiently low

    What's wrong with that?
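A minimal sketch of the R-factor calculation follows. The default scale, the ratio of summed amplitudes, is an assumption chosen for simplicity; as the slide notes, the precise value depends on the scaling used.

```python
def r_factor(f_obs, f_calc, scale=None):
    """R = sum| |Fobs| - s|Fcalc| | / sum|Fobs| over the given reflections."""
    if scale is None:
        # Assumed simple overall scale: match the summed amplitudes.
        scale = sum(abs(fo) for fo in f_obs) / sum(abs(fc) for fc in f_calc)
    numerator = sum(abs(abs(fo) - scale * abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator


# Invented amplitudes, for illustration only.
print(r_factor([10.0, 20.0, 30.0], [11.0, 19.0, 32.0]))
```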

  • Avoiding overfitting: Rfree

    What's wrong?
      • Can add any old parameters to improve R-factor, when data/parameter ratio is low
      • May not be physically correct: "overfitting"

    Solution:
      • Calculate R-factor on a set of reflections not used in refinement = "Rfree"
      • If changes to model improve Rfree as well as R, then they are good.
      • Note: Rfree is a global number - useful for refinement strategies, not useful for assessing changes to a few atoms
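In code, the only difference between R and Rfree is which reflections enter the sums. The sketch below assumes a per-reflection boolean flag and a fixed scale of 1.0; both are illustrative choices, and the amplitudes are invented.

```python
def r_work_and_free(f_obs, f_calc, is_free, scale=1.0):
    """Return (R_work, R_free) for reflections split by a free flag."""
    def r(pairs):
        numerator = sum(abs(abs(fo) - scale * abs(fc)) for fo, fc in pairs)
        return numerator / sum(abs(fo) for fo, _ in pairs)

    triples = list(zip(f_obs, f_calc, is_free))
    work = [(fo, fc) for fo, fc, free in triples if not free]
    free = [(fo, fc) for fo, fc, free in triples if free]
    return r(work), r(free)


r_work, r_free = r_work_and_free(
    [10.0, 20.0, 30.0, 40.0],
    [11.0, 19.0, 30.0, 50.0],
    [False, False, False, True],
)
print(r_work, r_free, r_free - r_work)
```

A change to the model that lowers `r_work` while `r_free` stays flat or rises is the signature of overfitting.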

  • Choosing your free reflections

    Usually a randomly chosen subset.
    Typically 5-10% (CCP4 default is 5%).
    If you have enough reflections, impose maximum number (2000 in phenix.refine).
    Free set also used in maximum likelihood to estimate σA parameters.
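A random free set with a fraction and an upper cap might be drawn as below. This is a sketch, not the procedure used by CCP4 or phenix.refine; the function name, the seed argument and the index-based representation are assumptions, and only the 5% and 2000 defaults come from the slide.

```python
import random


def choose_free_set(n_reflections, fraction=0.05, max_free=2000, seed=0):
    """Return the indices of a randomly chosen free set.

    The set holds `fraction` of the reflections, capped at `max_free`.
    """
    n_free = min(round(fraction * n_reflections), max_free)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return set(rng.sample(range(n_reflections), n_free))


print(len(choose_free_set(10_000)))    # fraction applies
print(len(choose_free_set(100_000)))   # cap applies
```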

  • Rfree and NCS

    NCS operators map different regions of the reciprocal asymmetric unit onto each other. Reflections in these regions are correlated.

    [Figure: reciprocal-space diagram; gaps = free set. Legend: working reflections, free reflections.]

  • Rfree and NCS

    Solution: choose free set from thin shells in reciprocal space

    Pros:
      • NCS operators link regions of the same resolution, so related reflections are either both in a shell or both outside it
    Cons:
      • Large number of shells -> thin shells -> most free reflections close to an edge and correlated to non-free reflections
      • Small number of shells -> significant gaps in resolution range, poor determination of σA

    SFTOOLS: RFREE 0.05 SHELL 0.001
    3rd argument = width of shells in Å^-1
    Also DATAMAN.
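The thin-shell idea can be sketched as follows. This is not the SFTOOLS or DATAMAN algorithm, whose details are not given here; it is an assumed scheme that spaces shells of fixed width evenly in 1/d and picks the number of shells so that roughly the requested fraction of reflections ends up free.

```python
def shell_free_flags(inv_d, fraction=0.05, width=0.001):
    """Flag reflections lying in thin resolution shells as free.

    inv_d holds 1/d for each reflection (in 1/Angstrom); width is the
    shell thickness in the same units. Shell centres are evenly spaced
    (an assumption of this sketch).
    """
    s_max = max(inv_d)
    n_shells = max(1, round(fraction * s_max / width))
    spacing = s_max / n_shells
    flags = []
    for s in inv_d:
        # Index of the resolution bin holding s, and that bin's centre.
        i = min(int(s / spacing), n_shells - 1)
        centre = (i + 0.5) * spacing
        flags.append(abs(s - centre) < width / 2.0)
    return flags


# Reflections at the same resolution always share a flag, which is
# what keeps NCS-related reflections together.
print(shell_free_flags([0.1, 0.1, 0.25, 0.5]))
```

Choosing a larger `width` gives fewer, thicker shells, which is the trade-off listed under the cons.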

  • [Figure: free-set shells in reciprocal space. 1xmp (1.8 Å): width 0.01, 3 shells; width 0.0013, 20 shells (default). XXX (3.8 Å): width 0.005, 3 shells; width 0.0005, 20 shells (default).]

  • Rfree and NCS

    Can increase size of free set to mitigate edge effects.
    Or use NCS-related free set islands.

    Reflections are also correlated to immediate neighbours in reciprocal space - can exclude these from working and free sets.
    Fabiola, Korostelev & Chapman, Acta Cryst D62, 227 (2006)
    Rapidly run out of working reflections!

    Be aware that correlations can artificially reduce your Rfree.

  • Rfree and twinning

    Twinning operator might relate e.g. reflection (1,2,3) to (2,1,-3).

    These two reflections should both be in the working set or the free set.

    Options:
      • Select free set in thin shells (as NCS)
      • Select free reflections in higher lattice symmetry
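Keeping twin-related reflections on the same side of the split amounts to assigning flags per group of related reflections rather than per reflection. The sketch below assumes the example twin operator (h,k,l) -> (k,h,-l) from the slide plus Friedel symmetry, and ignores any further crystal symmetry; the function names are illustrative.

```python
import random


def twin_mate(hkl):
    """Example twin operator from the slide: (h, k, l) -> (k, h, -l)."""
    h, k, l = hkl
    return (k, h, -l)


def friedel(hkl):
    h, k, l = hkl
    return (-h, -k, -l)


def group_key(hkl):
    """Canonical representative of a reflection, its twin mate and their Friedel mates."""
    mates = [hkl, twin_mate(hkl)]
    mates += [friedel(m) for m in mates]
    return min(mates)


def twin_aware_free_flags(reflections, fraction=0.05, seed=0):
    """Assign one free flag per twin-related group, not per reflection."""
    rng = random.Random(seed)
    keys = sorted({group_key(r) for r in reflections})
    free_keys = {key for key in keys if rng.random() < fraction}
    return {r: group_key(r) in free_keys for r in reflections}


flags = twin_aware_free_flags([(1, 2, 3), (2, 1, -3), (2, 1, 3), (0, 0, 4)], fraction=0.5)
print(flags[(1, 2, 3)] == flags[(2, 1, -3)])
```

Because the operator is its own inverse, each group closes after one application, so taking the minimum over four candidates is enough here.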

  • Transferring free R sets

    Use the same free set for:
      • additional datasets for the same protein
      • datasets from isomorphous proteins (derivatives, complexes, etc.)
        (how isomorphous is not clear, but play safe ...)

    Otherwise initial R & Rfree will be similar and low for the second structure - the model has already been refined against most of your free reflections.
    Further refinement may lead to divergence of R & Rfree, masking the bias. Harder to detect over-fitting. Although refinement may eventually reset Rfree.

    How: Use "CAD" / "Merge MTZ files (CAD)" in CCP4.
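Conceptually, transferring a free set means copying flags by Miller index and only generating new flags for reflections the first dataset did not contain. The sketch below illustrates that logic with plain dictionaries; it is not how CAD works internally, and the function name and arguments are hypothetical.

```python
import random


def transfer_free_flags(old_flags, new_reflections, fraction=0.05, seed=0):
    """Copy free flags onto a new dataset, keyed by (h, k, l).

    Reflections absent from old_flags (e.g. from extended resolution)
    receive fresh random flags at the given fraction.
    """
    rng = random.Random(seed)
    new_flags = {}
    for hkl in new_reflections:
        if hkl in old_flags:
            new_flags[hkl] = old_flags[hkl]           # keep the original assignment
        else:
            new_flags[hkl] = rng.random() < fraction  # new reflection: new flag
    return new_flags


old = {(1, 0, 0): True, (0, 1, 0): False}
print(transfer_free_flags(old, [(1, 0, 0), (0, 1, 0), (0, 0, 1)]))
```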

  • Useful resources

    http://ccp4wiki.org/ - CCP4 Wiki
    http://strucbio.biologie.uni-konstanz.de/ccp4wiki/ - CCP4 community wiki
    Proceedings of Study Weekend 2004 (Acta Cryst D, Dec 2004)

    * Data is only 71% complete.
    * SHELXL-97. Additional restraints probably for aniso Us.
    * Refmac5. Paper describes mixed isotropic/anisotropic model, but not clear why or if it helped!
    * Also NCS-related islands
    * (2,1,-3) -> (-2,-1,3) -> (2,1,3)
    * Someone mentioned phenix.xtriage but not sure if it does freeR sets - just to find twin law I think
    * phenix.refine seems to have something about the symmetry of free R set