A guide to benchmarking enzymatically catalysed reactions: the importance of accurate reference energies and the chemical environment

We explore two factors that significantly affect the outcomes of benchmark studies for enzymatically catalyzed reactions, namely the level of theory of the benchmarks and the size of the model system used to represent the enzyme active site. For the benchmarks, we compare two potential alternatives to canonical coupled cluster results for situations where CCSD(T) is computationally too demanding: a strategy to estimate finite basis set coupled cluster values and the local-correlation DLPNO-CCSD(T) method at the complete basis set limit. We confirm the high accuracy of DLPNO-CCSD(T) used with tight thresholds. We also show that notable differences can be seen when using both sets of references for a benchmark study, with absolute deviations from the higher-quality references generally smaller than those from lower-quality ones, as well as changes in the ranking of the assessed methods. For geometries, we test three models for the active site of 4-oxalocrotonate tautomerase: one typical of the QM region that may be used in QM/MM studies, and two smaller variants that neglect the surrounding chemical environment. Benchmarking of 12 density functionals known to perform well on enzymatically catalyzed reactions shows inconsistent performance of each method across the three models, contradicting the common idea that small representative systems can be used to accurately assess the applicability of low-level methods for larger biochemical applications. Our findings shall serve as a reminder of the standards that should be adhered to in benchmark studies, and as a guide for future studies, both on enzyme-related and other chemical problems.


Introduction
Enzymes have long been of keen interest to (bio)chemists due to their remarkable catalytic properties: not only are they especially efficient, they are also highly selective and allow for reaction conditions that are milder than those required with many inorganic catalysts. The treatment of enzymatically catalyzed reactions on the molecular level through computational techniques enables researchers to explore possible mechanisms and the particular structural factors that affect an enzyme's efficiency. This information can then be used to explain experimental data, as well as modify or design new enzymes for specific applications, such as treatment of pollutants and drug discovery [1][2][3][4][5][6][7]. While theoretically designed enzymes have yet to perform with the same efficiency as naturally occurring ones [8], they can often be improved with the highly successful directed evolution method [9][10][11][12], the importance of which can be seen in the 2018 Nobel Prize in Chemistry, one half of which was awarded to Prof. Frances Arnold. Complementarily, directed evolution works best from a good starting point, so synthetic enzyme design also benefits from the insights of computational studies.
Quantum-chemical methods grow ever more useful for the analysis of enzymatically catalyzed reactions due to improvements in computer hardware and software that allow for the treatment of larger systems. One approach involves "cluster" models [13] which treat the active site and surrounding enzyme environment inside a dielectric cavity to account for polarization effects, but the models must often be large to include enough of the chemically important environment. It can thus be more efficient to use a hybrid quantum mechanics/molecular mechanics (QM/MM) approach [14][15][16][17][18], which involves treating the active site with a higher (QM) level of theory, and using faster (MM) methods to calculate the indirect contributions of the rest of the enzyme. As this provides a more specific definition of the surrounding environment, the "QM region" that contains the active site can be smaller. Density functional approximations (DFAs) are a common choice of QM method, but semi-empirical or ab initio methods are also possible. A reliable QM method is necessary to ensure good results, but accuracy must often be balanced with computational cost, and thus benchmark studies focussing on enzyme active-site models can be beneficial as they enable computational (bio)chemists to make informed choices of appropriate QM methods.
Benchmarking is ubiquitous in computational chemistry, as it benefits both the method developer who seeks to make accurate DFAs and the general user who wishes to use them. The process involves selecting a test set, calculating the relevant energies or properties at a highly accurate level of theory, and then comparing lower-level methods against the benchmarks to assess their performance. The success of a study is, thus, highly dependent on the quality of the benchmarks, and poor reference values will alter the perceived performance of the tested methods, which can significantly change the conclusions of the study [19,20]. The choice of test set is also important, as studies are most useful when they are guided by the result one aims to achieve. When the goal is to find robust and widely applicable methods, one should use broad test sets like GMTKN55 [20][21][22][23], MGCDB84 [24] and Database 2015B [25], but when choosing a method for a specific application, a more targeted test set of relevant reactions often gives better guidance.
One such specific set for biochemically relevant reactions is our set of barrier heights (BHs) and reaction energies (REs) for enzymatically catalyzed reactions [26,27], which is an updated version of an earlier set published by Kromann et al. [28], and contains active-site models of varying sizes for five specific enzymatically catalyzed reactions. Our work on this set focused on the importance of London dispersion corrections in geometry optimisations to ensure that the intermolecular enzyme-substrate interactions are represented correctly, and then subsequently finding appropriate benchmark levels of theory for systems of intermediate size using DLPNO-CCSD(T), the domain-based local pair natural orbital coupled cluster method with singles, doubles and perturbative triples [29,30]. This method is commonly recommended as a reliable second choice when canonical CCSD(T) [31], often referred to as the "gold standard" of chemical accuracy, is not achievable due to computational constraints. It reduces the cost significantly by prescreening the pair correlations and only including those above a certain threshold at each step [32], thereby decreasing the number of pairs treated at the coupled cluster level while retaining most of the conventional approach's accuracy [32,33]. Therefore, we thoroughly tested the different default sets of "PNO thresholds", as well as the choice of basis sets used for the extrapolation to the complete basis set (CBS) limit, to find an appropriate balance between cost and accuracy and suggested different strategies for obtaining reliable benchmarks based on system size [26].
Other recent studies [34,35] have used smaller models that included considerably smaller portions of the specific enzyme environment, which allowed for the use of canonical CCSD(T). Paiva et al. [35] have used a set of four minimalistic models that represented biochemical reactions to benchmark DLPNO-CCSD(T) alongside a range of DFAs, and showed that the error of DLPNO-CCSD(T)/CBS, extrapolated from double- and triple-ζ atomic-orbital (AO) basis sets, was 0.56 kcal/mol with the NormalPNO thresholds and 0.40 kcal/mol with the TightPNO thresholds, significantly larger than the reported <0.25 kcal/mol general error of DLPNO-CCSD(T) for reaction energies when used with the TightPNO thresholds in ref [32]. The benchmarks used were estimated CCSD(T)/CBS values, i.e., generally second-order Møller-Plesset perturbation theory (MP2/CBS) extrapolated from triple- and quadruple-ζ basis sets and then corrected with the difference in correlation energy between CCSD(T) and MP2 at the triple-ζ level or lower. While various extrapolation strategies and basis sets were tested for these estimated benchmarks to find an appropriate balance of accuracy and efficiency [36][37][38][39], the error of this type of estimation scheme with an aug-cc-pVDZ [40] correction term is around 0.6 kcal/mol [36,41,42], already larger than the observed deviations of DLPNO-CCSD(T)/CBS in their study. We have also shown in our own tests of basis sets that not all extrapolations are equal, and we have recommended against DLPNO-CCSD(T)/CBS values extrapolated from double-/triple-ζ basis set pairs as they are often significantly different from values extrapolated from triple-/quadruple-ζ basis sets [26]. It is likely then that significantly lower errors will be seen when comparing DLPNO-CCSD(T)/CBS(TZ/QZ) to true CCSD(T)/CBS results, which is something that we will demonstrate herein.
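Composite schemes of this type reduce to a single additive formula: a CBS-quality MP2 energy plus a higher-order correction evaluated in a smaller basis. A minimal sketch (our own illustration, with placeholder energies rather than values from either study):

```python
def est_ccsdt_cbs(e_mp2_cbs, e_ccsdt_small, e_mp2_small):
    """Estimate CCSD(T)/CBS as MP2/CBS + [CCSD(T) - MP2] in a small basis.

    The correction assumes the CCSD(T)-MP2 difference is nearly
    basis-set independent, which is the main source of error of such
    estimation schemes. All energies in hartree.
    """
    return e_mp2_cbs + (e_ccsdt_small - e_mp2_small)

# Hypothetical energies for illustration only (hartree):
e_est = est_ccsdt_cbs(-100.0, -100.5, -100.2)
```

The quality of the estimate thus hinges entirely on how well the small-basis correction transfers to the basis set limit.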
Another recently published, small test set is that of Sirirak et al. [34], which takes the approach of modeling steps in enzyme reaction mechanisms through reactions of small molecules that represent common functional groups, with all reactant and product molecules being treated individually. Again, an estimated CCSD(T) benchmark was used by taking CCSD(T)/aug-cc-pVDZ numbers and correcting them with the difference between aug-cc-pVTZ [40] and aug-cc-pVDZ for spin-component-scaled MP2 (SCS-MP2) [43]. The use of a single, finite-basis-set result is problematic for coupled cluster methods due to the slow convergence of the correlation energy [44], and therefore energies should be extrapolated to the CBS limit to get reliable results. Thus, we question the use of estimated finite basis set results as benchmarks, as the process can be unreliable, and any error in the benchmarks will also affect the calculated errors of any methods tested against them; previous studies on pericyclic reactions and inorganic reactions have already shown how the quality of the reference values influences the outcome of Density Functional Theory (DFT) benchmarks [19,20]. While quadruple-ζ-level CCSD(T) results may not be achievable even with smaller models, estimation is unlikely to be the best second-choice method for calculating reference values.
Based on these two studies and in light of the previous examples on how to conduct better benchmark studies, we pursue two main aims with this work. The first is to explore how various coupled cluster-based levels of theory compare to each other, and then how the choice of benchmark affects the observed performance of a range of QM methods. For this purpose, we use the set of Sirirak et al. [34], which consists of 20 REs associated with reactions designed to be characteristic of steps in enzyme reaction mechanisms. The first 13 reactions in the set are proton transfer reactions, and the remaining seven are non-proton transfer reactions. All 20 reactions are shown in Fig. 1, with the enzyme used to catalyze each reaction listed alongside its scheme.
The second factor we aim to explore is how the choice of active-site model structures affects benchmarking. We have previously explored how the geometry optimisation level of theory, particularly the inclusion of London dispersion corrections, changes the calculated BHs and REs of a set of models for five enzymatically catalyzed reactions [26], and although it was not a specific focus of the work, the effect of increasing the model size could also be indirectly gauged from the reactions which had multiple models of different sizes. Herein, we consider how decreasing the size of the model changes the results, as many studies choose to use smaller models to test methods which will then be applied to larger ones.
There are multiple possibilities for how an enzyme's active site can be modelled: the enzyme RE test set we use herein uses small molecules that represent general functional groups of substrates and enzymes, while other studies have used models of the active site that contain the substrate and some parts of the surrounding enzyme environment like a cluster model or QM region [13,26,28,[45][46][47]. Smaller structures have the benefit of being less specific to a particular enzyme and computationally less demanding, but they can miss a large portion of potentially important non-covalent interactions and geometric factors when neighboring amino-acid residues are not taken into account. The limitations of small models are most notable when the enzyme and substrate components are optimized and calculated separately. When benchmarking toward the goal of choosing a QM method for enzyme-related QM/MM studies, one seeks to find a method that will appropriately treat the enzyme-substrate interactions in the QM region, and so exclusion of these interactions brings into question whether conclusions drawn from testing smaller models apply to larger ones.
To explore the extent to which smaller, simplified models can impact the recommendations of a benchmark study, we use three models of the reaction catalyzed by 4-oxalocrotonate tautomerase (4-OT), shown in Fig. 2. This enzyme converts unconjugated α-keto acids to their conjugated tautomers through a two-step mechanism, which involves a proton transfer from the substrate to the N-terminal catalytic proline residue (Pro1) in the first step, and another proton transfer from Pro1 back to the substrate in the second step. This reaction has been studied extensively with QM/MM methods [48][49][50][51], and is a clear example of the importance of the chemical environment, as the negatively charged substrate is strongly stabilized through hydrogen bonding and electrostatic effects, particularly in the charge-separated intermediate. The smallest recommended model that adequately captures this stabilisation consists of the substrate, Pro1 residue, three arginine residues and two water molecules (for further details see the description of "Model A" in ref [51]). While a smaller QM region is inappropriate for mechanistic studies of 4-OT and the results will not be consistent with experimental data of the whole enzyme, one can still question the extent to which a reduced model can impact the performance of methods in a benchmark study, where the calculated reference data will also be influenced by the deficiencies of the model.
When exploring both of these factors, we use a select set of DFAs to conduct multiple small benchmark studies (against two sets of references for the RE test set, and with three different models for the 4-OT reaction) in order to demonstrate how the perceived performance of the DFAs depends on the references and active-site model. We are confident this study serves as a guide for those seeking to benchmark model systems of enzymatically catalyzed reactions for QM/MM applications, in their choices of both reference values and model systems. In fact, many findings of the present study can also be transferred to benchmarking of other scenarios and our guide can be seen as a standard that could be adopted in such studies.
In the following, we list the relevant computational details before we proceed with updating the reference values for the enzyme RE test set. We then use these and the old reference values to conduct an example benchmark study. We also briefly discuss the importance of using dispersion corrections with DFAs, comparing dispersion-uncorrected functionals to their DFT-D3(0) and -D3(BJ) corrected variants [52,53]. We then turn to the model systems of the 4-OT active site, and again conduct an example benchmark study in order to assess the similarity of the results between the models.

Calculation of reference values
Herein, we follow the protocol for calculating reference values determined in our previous work on enzymatically catalyzed reactions. We therefore refrain from a detailed study of basis set effects and other factors that affect the results, as they have all been discussed in ref [26]. Benchmark REs for all systems were obtained at the DLPNO-CCSD(T)/CBS level of theory, using the predefined TightPNO thresholds [32]. For the small molecules in the RE test set, the values were extrapolated from the augmented correlation consistent Dunning AO basis sets aug-cc-pVTZ and aug-cc-pVQZ [40]. The minimally augmented Ahlrichs-type basis sets ma-def2-TZVPP and ma-def2-QZVPP [54] were used for the new 4-OT models to ensure consistency with the original model, for which reference data at this level had already been published [26]. The resolution of the identity approximation for Coulomb integrals and chain of spheres approximation for exchange integrals (RIJCOSX) [55] with the default "GridXS2" setting was also used with the Ahlrichs-type basis sets to speed up these calculations. Values were extrapolated to the CBS limit following the standard two-point extrapolation schemes for HF [56] and correlation [57] energies:

E_SCF(X) = E_SCF(∞) + A exp(−α√X)
E_corr(X) = E_corr(∞) + B X^(−β)

Here, E_SCF and E_corr are the related HF (self-consistent field, SCF) and correlation energies, respectively, the ∞ indicates the CBS energies, X is the cardinal number of the basis set, and α and β are basis-set-specific constants. For a triple-/quadruple-ζ extrapolation with the Dunning-type aug-cc-pVnZ basis sets these are 5.46 and 3.05, respectively; for the equivalent extrapolation with the Ahlrichs-type ma-def2-nZVPP basis sets they are 7.88 and 2.97 [58].
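As an illustration, the two-point extrapolations described above take only a few lines of code; the function names are our own choosing, and only the TZ/QZ constants quoted in the text appear:

```python
import math

# Basis-set-specific constants (alpha, beta) for a TZ/QZ (X=3, Y=4)
# extrapolation, as quoted in the text:
AUG_CC_PVNZ = (5.46, 3.05)    # Dunning-type aug-cc-pVnZ
MA_DEF2_NZVPP = (7.88, 2.97)  # Ahlrichs-type ma-def2-nZVPP

def cbs_scf(e_x, e_y, x, alpha):
    """HF/SCF extrapolation, E(X) = E(inf) + A*exp(-alpha*sqrt(X)),
    solved for E(inf) from the energies at cardinal numbers X and X+1."""
    fx = math.exp(-alpha * math.sqrt(x))
    fy = math.exp(-alpha * math.sqrt(x + 1))
    a = (e_x - e_y) / (fx - fy)
    return e_x - a * fx

def cbs_corr(e_x, e_y, x, beta):
    """Correlation extrapolation, E(X) = E(inf) + B*X**(-beta),
    solved for E(inf) from the energies at cardinal numbers X and X+1."""
    y = x + 1
    return (x**beta * e_x - y**beta * e_y) / (x**beta - y**beta)
```

Both functions recover the exact limit whenever the finite-basis energies follow the assumed model forms.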
We calculated CCSD(T) results for reactions 3, 5, 7, 9 and 13 from the RE test set shown in Fig. 1 using the aug-cc-pVTZ and aug-cc-pVQZ basis sets, and these results are also extrapolated using the above schemes. These particular reactions were chosen to compare the differences between various coupled-cluster approaches for calculating benchmarks as they are small enough to obtain quadruple-ζ-level CCSD(T) results, and they cover the whole range of elements included in the set. All coupled cluster calculations were conducted using ORCA version 4.2.1 [59,60].

Construction of new model systems for 4-OT
Our previous work on the 4-OT reaction [26] used an active-site model, originally created by Sevastik and Himo [51], which we reoptimised at the PBEh-3c [61] level of theory. The model contained the substrate, Pro1 residue and some surrounding environment involved in stabilising the charge. Herein, we have used this model, which we refer to as the original model, to create two further models: a minimal model, which contains only the substrate and Pro1 residue, and a separated model, with individual substrate and Pro1 structures. Geometries involved in these two new models were also optimised at the PBEh-3c level of theory using ORCA version 4.2.1, with the "tight" setting for SCF convergence and the default setting for geometry convergence. The "grid3" and "finalgrid5" settings were used for ORCA's multigrid option.

Tests of density functional approximations
All methods tested in our exemplary benchmark studies are listed in Table 1. We have chosen 12 DFAs from the best performers in the categories of Generalised Gradient Approximation (GGA), meta-GGA, hybrid and double-hybrid density functionals from our previous study on enzymatically catalyzed reactions as our main set of DFAs, and for the enzyme RE test set we also take the results of the benchmark study conducted by Sirirak et al. and calculate the deviations of each method from our new reference values. We additionally test BLYP and B3LYP when looking at DFT-D3-type London dispersion corrections. All calculations, except those using the SCAN functional, were run with ORCA version 4.2.1 [59,60] with the default settings for SCF and geometry convergence. TURBOMOLE version 7.4.1 [91][92][93][94] was used for the SCAN functional, with the recommended grid options "gridsize 4" and "radsize 40" [20,70,71] and a convergence criterion of 1 × 10^−7 Eh. The RIJCOSX approximation (again with "GridXS2") was used with most functionals, and the frozen-core approximation was used with the double-hybrid functionals. Van der Waals DFAs used the nonlocal VV10 kernel in its post-SCF implementation; it was shown in ref [22] that this does not impact the results but halves the computational effort compared to the originally developed, full-SCF version. All functionals were evaluated with the def2-QZVP basis set [87]. All deviations in our following discussions were calculated as the difference between an assessed method and the reference value.
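For completeness, the error statistics used in the following discussions (MAD, MD, RMSD and error range) can be sketched as follows, with each deviation taken as assessed method minus reference:

```python
def error_stats(method_values, reference_values):
    """Return (MAD, MD, RMSD, error range) for signed deviations,
    each computed as (assessed method - reference)."""
    devs = [m - r for m, r in zip(method_values, reference_values)]
    n = len(devs)
    mad = sum(abs(d) for d in devs) / n           # mean absolute deviation
    md = sum(devs) / n                            # mean (signed) deviation
    rmsd = (sum(d * d for d in devs) / n) ** 0.5  # root mean square deviation
    err_range = max(devs) - min(devs)             # spread of the errors
    return mad, md, rmsd, err_range
```

Note that the MD retains the sign of systematic over- or underestimation, which the MAD and RMSD deliberately discard.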

On the quality of reference values
To provide a comparison between various CCSD(T)-based approaches, Table 2 presents the estimated and actual CCSD(T)/aug-cc-pVTZ REs alongside the CBS-extrapolated results for the five selected reactions. Comparison of the estimated and actual CCSD(T)/aug-cc-pVTZ REs shows that estimation does not fully replicate the results of the latter. The estimated method gives differences under 0.2 kcal/mol for reactions 5, 9 and 13, but is slightly worse for reactions 3 and 7, with differences approaching 0.4 kcal/mol. The differences between triple-ζ and extrapolated CCSD(T) results are larger, with differences of at least 0.6 kcal/mol seen for reactions 3, 5 and 13. As CCSD(T)/CBS results are our ideal benchmarks, the accuracy of the other approaches can be gauged by how closely they replicate the CCSD(T)/CBS values. The est. CCSD(T)/aug-cc-pVTZ approach shows the most noticeable absolute deviations, in the range of 0.53-0.96 kcal/mol. The difference between DLPNO-CCSD(T)/CBS and CCSD(T)/CBS for these reactions is almost negligible, with all absolute differences being in the range of 0.01-0.06 kcal/mol. As predicted, this range is significantly lower than the 0.4 kcal/mol error of DLPNO-CCSD(T)/CBS (DZ/TZ) reported by Paiva et al. [35], and also lower than the error of the est. CCSD(T)/CBS references used in that study.
While none of the approaches tested here deviate by more than 1 kcal/mol (the generally accepted chemical accuracy limit for REs) for these five tested reactions, it is clear that the estimation strategy is not appropriate for calculating reference data for a benchmark study and that one should not rely on TZ data as a benchmark. We also see that the DLPNO-CCSD(T)/CBS (TZ/QZ) level of theory is the best alternative when CCSD(T)/CBS is computationally unfeasible, which is often the case when benchmarking models of enzymatically catalyzed reactions and indeed is the case here with the saccharides in reaction 20 (Fig. 1), which contain 25 and 22 atoms for the reactant and product, respectively.

We therefore compare our DLPNO-CCSD(T)/CBS references with the est. CCSD(T)/aug-cc-pVTZ values across the entire set. While the five reactions in Table 2 show comparatively small differences between these two methods, absolute differences of >1 kcal/mol are seen for six of the 20 reactions in the set, and four of those, namely reactions 14-17, have absolute differences of >2 kcal/mol. These large differences occur only in the non-proton transfer reactions, suggesting that the estimation strategy is even less appropriate when considering processes other than proton transfers. When conducting a benchmark study, a DFA is judged by its ability to replicate a reference value, so the quality of the chosen reference value will directly affect the DFA's perceived accuracy. In Fig. 3, we present the mean absolute deviations (MADs) of each method in Table 1 from two sets of references, est. CCSD(T)/aug-cc-pVTZ and DLPNO-CCSD(T)/CBS, to explore how the observed performance of the methods changes with the benchmark against which they are tested. We also show the est. CCSD(T)/aug-cc-pVTZ results in this plot and mention that the overall MAD of this method against our updated reference values is 1.1 kcal/mol. Again, we discourage its use as a reference in further studies.
Most of the MADs are lower against the DLPNO-CCSD(T)/CBS references, with the majority of the exceptions being MP2, SCS-MP2 and CCSD(T) results from the original study. Considering that the SCS-MP2 and CCSD(T)/aug-cc-pVDZ results were used in the calculation of the est. CCSD(T)/aug-cc-pVTZ values, it is understandable that their deviations from those estimated references are lower than against our updated ones. Interestingly, we also see that BP86-D3(0) and TPSS-D3(0) perform worse against the DLPNO-CCSD(T)/CBS references than the est. CCSD(T)/aug-cc-pVTZ ones; in the case of BP86, the DFT-D3(0) variant even ends up with a higher MAD than the uncorrected DFA. In passing, we note that the semi-empirical MO methods SCC-DFTB, AM1 and PM3 have some of the largest MAD decreases when using the new references, but proportional to the actual values these are not significant improvements and they are by far outperformed by all assessed DFAs. The largest difference between the two sets of references is seen for B97M-D3(BJ), which has an MAD of 3.3 kcal/mol against the est. CCSD(T)/aug-cc-pVTZ references and 2.1 kcal/mol against DLPNO-CCSD(T)/CBS, a decrease of 1.2 kcal/mol. We also see significant reductions in the other statistics for B97M-D3(BJ) when going to higher quality references, with its root mean square deviation (RMSD) dropping from 3.9 to 2.8 kcal/mol and the error range from 11.4 to 8.0 kcal/mol.
In general, we see that good methods, i.e., ones which have performed well in previous studies [20][21][22][24], perform even better against better references, and therefore have larger changes in the MADs when going from est. CCSD(T)/aug-cc-pVTZ to DLPNO-CCSD(T)/CBS references. B97M-V and B97M-D3(BJ), two meta-GGAs which have been shown to outperform many hybrid functionals [22,26], are prime examples of this behaviour.

There are also differences in the ordering of the DFAs, particularly in the hybrids and double hybrids newly tested in this work. With the old references, ωB97M-V and SOS0-PBE0-2-D3(BJ) are the best of their respective categories, while ωB97M-D3(BJ) and revDOD-PBE-D3(BJ) outperform these with respect to the new references. For the GGA and meta-GGA DFAs, we see the same ranking with both sets of references, although the MADs of the meta-GGAs are slightly more spread out with the new ones. Overall, we conclude that est. CCSD(T)/aug-cc-pVTZ REs are not good enough references for benchmarking, and we stress the importance of using high-level reference values.
The ordering of the functionals also allows us to qualitatively compare this set to our previous benchmark set of active-site models in ref [26]. In Fig. 3, the 12 DFAs newly chosen for this work are presented in order of their accuracy for our previous test set within their respective rungs of Jacob's ladder: for example, OLYP-D3(BJ) was the best performing GGA, followed by PBE-D3(BJ) and revPBE-NL; consequently, a small upward trend within each rung is also expected for the present set. This behaviour is indeed seen for the GGAs and meta-GGAs, but the rankings of both the hybrids and double hybrids are reversed. While slight differences between test sets are to be expected, one would expect similar trends from two sets that are both designed to represent similar types of reactions. Therefore, this is an indication that the approach of modeling enzymatically catalyzed reactions without the surrounding chemical environment may not be appropriate when trying to find DFAs for subsequent QM/MM studies.

The effect of London dispersion corrections
Having updated the references for the reactions shown in Fig. 1, we briefly detour from our main aims and update the test of dispersion corrections conducted by Sirirak et al. with our new reference data, additional DFAs, and the DFT-D3(BJ) correction, which the developers recommended as the default, preferring it over the original DFT-D3(0) correction [53]. In Fig. 4 we show the MADs and mean deviations (MDs) from the DLPNO-CCSD(T)/CBS benchmarks for each method in its uncorrected form and with the DFT-D3(0) and DFT-D3(BJ) dispersion corrections. We note that M06-2X-D3(0) is the only dispersion-corrected form of M06-2X as the D3(BJ) correction has been shown to overcorrect for the Minnesota functionals [65,95], and we do not use DFT-D3(0) with DOD-SCAN or revDOD-PBE because these functionals have been specifically parameterised for use with DFT-D3(BJ) [75].
The DFAs used with the Pople basis sets, as done by Sirirak et al., overestimate the REs, resulting in positive MDs, whereas our choice of the def2-QZVP basis set results in consistent underestimation of the REs. We additionally provide BLYP/def2-QZVP and B3LYP/def2-QZVP results here to confirm that this is caused by the basis set, not the specific functionals. Methods that consistently underestimate REs can result in the perception that dispersion corrections make the results worse, as the energies are further lowered and thus deviate more from the values one wishes to replicate. This perception is incorrect, however, and should not be considered as a reason not to use the corrections, as they are merely revealing weaknesses in the underlying DFA that are otherwise cancelled out by the incorrect long-range behaviour [20]. This also applies to the higher MADs seen when DFT-D3(BJ) is applied compared to DFT-D3(0), as the Becke-Johnson damping function additionally recovers some short-range dispersion effects and as such provides larger absolute dispersion energies than its DFT-D3(0) predecessor [53].
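The different short-range behaviour of the two damping schemes can be seen directly from the pairwise dispersion energy expression with Becke-Johnson damping, which tends to a finite, non-zero value as the interatomic distance goes to zero, whereas zero damping switches the contribution off entirely. The sketch below uses illustrative placeholder coefficients and damping parameters, not those of any particular functional:

```python
def e_disp_bj(r, c6, c8, r0, s6=1.0, s8=1.0, a1=0.4, a2=4.8):
    """Pairwise D3(BJ) dispersion energy for one atom pair (a.u.).

    E = -sum_n s_n * C_n / (r^n + f(r0)^n) with f(r0) = a1*r0 + a2.
    The a1/a2 values (and any C6/C8/r0 inputs) are placeholders.
    """
    f = a1 * r0 + a2
    return -(s6 * c6 / (r**6 + f**6) + s8 * c8 / (r**8 + f**8))
```

Because the denominator stays finite at r = 0, the BJ-damped energy remains attractive even at short range, which is why it generally produces larger absolute dispersion energies than the zero-damped variant.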

The importance of the enzyme environment
While the RE test set we have just explored is adequate for testing different benchmark levels of theory, the approach of using small, separate molecules cannot realistically represent a particular active site due to the missing surrounding enzyme environment. This is one of the potential reasons why we saw trends in DFA performance in Sect. 3.1 that were inconsistent with our previous work on larger active-site models. In this section, we explore this aspect further by testing the inadequacies of smaller models using three models of the 4-OT reaction (see Fig. 2): a large active-site model (the original model from our previous benchmark study that includes neighboring residues that stabilise the reaction, updated from a model created by Sevastik and Himo [51]), a minimal model that contains only the substrate and catalytic Pro1 residue, and a model that involves separate structures for the substrate and Pro1 to mimic the approach used for the RE test set discussed earlier. Geometries of the original and minimal models are shown in Fig. 5, and DLPNO-CCSD(T)/CBS REs for all three models are given in Table 3. In Sevastik and Himo's study of the mechanism of 4-OT, it was stated that small models that did not account for the surrounding environment would give unrealistic energies. Indeed, the REs obtained by us show that the minimal and separated models give significantly different results from the original one.
The main difficulty in modeling the 4-OT reaction is the charge separation in the intermediate, and this is the main cause of the differences between the two smaller and the original models. The first problem caused by the charge separation is the extremely large REs for the separated model, where the newly formed charges are not stabilised at all. We note that although anionic species can potentially be problematic for the ma-def2-nZVPP basis sets as they have only been minimally augmented with diffuse functions, this is unlikely to be a factor in this case because CBS results extrapolated from the aug-cc-pVTZ/QZ basis sets are very similar. The original model accounts for the charge-separated intermediate best by including three cationic arginine residues to stabilise all three negative charges, but even without these the minimal model provides some stabilisation through electrostatic interactions between the substrate and protonated Pro1 residue. The strength of the stabilisation can be quantified by taking the difference between the separated and minimal models for each structure, and we find that combining the structures results in a lowering of the total energy of 282.39 kcal/mol for the intermediate when these electrostatic effects are involved, while the reactant and product energies only decrease by 11.44 and 13.84 kcal/mol, respectively (see Table S14).
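The stabilisation quantified above is simply a difference of total energies between the combined and separately computed structures. A sketch with hypothetical input energies (the actual values are those listed in Table S14):

```python
HARTREE_TO_KCAL = 627.509  # standard conversion factor

def stabilisation_energy(e_combined, e_frag_a, e_frag_b):
    """Energy lowering (kcal/mol) on combining two separately computed
    fragments into one structure; positive values mean the combined
    structure is stabilised. Inputs are total energies in hartree."""
    return (e_frag_a + e_frag_b - e_combined) * HARTREE_TO_KCAL

# Hypothetical fragment and combined energies (hartree), for illustration:
e_stab = stabilisation_energy(-2.1, -1.0, -1.0)
```

The same bookkeeping applied to the reactant, intermediate and product structures yields the 11.44, 282.39 and 13.84 kcal/mol values quoted above.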
The second problem caused by the charge separation is reflected in the signs of the REs, which reverse between the original and minimal models. The intermediate form of the substrate is conjugated all the way along the molecule, and therefore is thermodynamically favoured over the reactant and product as long as the charges are appropriately balanced. This results in a negative RE for the first step (RE-A, see Fig. 2) and a positive RE for the second step (RE-B) in the original model, whereas the signs are reversed for the minimal model, in which the charge-separated intermediate is insufficiently stabilised (Table 3; see also Table S15).

Before continuing the discussion of the 4-OT models, a comparison with the enzyme RE test set discussed in the previous subsections is appropriate, as the issue of charge separation is also seen for reaction 13, which is the only reaction in the set that involves a cationic species reacting with an anionic one. The RE for this reaction (−111.23 kcal/mol) is the highest-magnitude RE in the set, and it is almost four times that of the second highest-magnitude RE (reaction 3 with RE = −29.62 kcal/mol). The strongly negative RE in reaction 13 is due to the neutral products being significantly more stable than the ionic reactants, especially when they cannot interact with each other through electrostatic effects; this is similar to RE-B of the separated 4-OT model, which is also large and negative as the system leaves the charge-separated intermediate state. This leads us to question the suitability of the current form of reaction 13 for the assessment of methods for subsequent use in enzymatically catalyzed reactions.
Aside from the charge-separation problem involving the intermediate, the overall REs in the 4-OT models are also influenced by the lack of charge stabilisation provided by the chemical environment, showing a trend of increasing thermodynamic stability of the product relative to the reactant with increasing model size. The overall RE of the separated model is purely the difference between the two tautomers of the substrate, as the proline energies are identical and cancel each other. In this model, the product tautomer is slightly less stable than the reactant one. While the product is the tautomer with the larger π-system, owing to conjugation between the double bond and the pyruvate group, the position of the double bond in the reactant allows for conjugation with the carboxylate group, and the additional minor resonance form this provides may stabilise the negative charge in the absence of any external stabilisation. In the minimal model, Pro1 provides some stabilisation through ion-dipole interactions, and the arginine residues in the original model provide even more through electrostatic effects. Once the charge is even partially stabilised, the conjugated tautomer becomes the thermodynamically favoured product, as expected.
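The cancellation of the proline terms in the separated model can be made explicit. Writing E for total electronic energies and using R_sub and P_sub as illustrative labels for the reactant and product tautomers of the substrate (these symbols are ours, not the paper's notation), the overall RE reduces to a pure tautomer energy difference:

```latex
\Delta E_{\text{overall}}
  = \bigl[E(\mathrm{P_{sub}}) + E(\mathrm{Pro})\bigr]
  - \bigl[E(\mathrm{R_{sub}}) + E(\mathrm{Pro})\bigr]
  = E(\mathrm{P_{sub}}) - E(\mathrm{R_{sub}})
```

No such cancellation occurs in the minimal or original models, where the substrate-residue interaction energy differs between the reactant and product structures.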
Overall, one can see that there are significant differences between the models, and that the smaller ones have an incomplete representation of the chemistry within the active site, particularly with respect to charged species. The approach of treating each component separately is especially problematic; while it is unlikely to be considered by those conducting mechanistic studies-since it is not suited to the calculation of transition states and barrier heights-we nonetheless strongly recommend that it be avoided even in simplified, preliminary studies that are designed to guide further computational enzyme work.
Having identified the main differences between the three 4-OT models, we continue by testing our set of DFAs on them to determine if and how strongly the deficiencies of the smaller models would impact the conclusions of a QM benchmark study. In Fig. 6 we present MADs and MDs for each model, calculated with the 12 DFAs picked from our previous work [26]. Since each model only has two associated REs, these statistics are not a reliable indicator of the performance of each DFA for subsequent applications; however, the numbers exemplify the impact of the different models on the DFA rankings.
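For clarity, the two statistics can be sketched as follows; the snippet is illustrative, with hypothetical RE values rather than numbers from the study. With only two REs per model, each average runs over just two values, which is exactly why these statistics are a weak indicator of general DFA performance.

```python
# Error statistics for a benchmark: deviations of a DFA's reaction energies
# (REs) from the DLPNO-CCSD(T)/CBS reference values.
# The RE values below are hypothetical placeholders, not data from the study.

def mean_deviation(calc, ref):
    """Mean deviation (MD): signed average of (calculated - reference).
    A negative MD indicates net underestimation of the REs."""
    return sum(c - r for c, r in zip(calc, ref)) / len(ref)

def mean_absolute_deviation(calc, ref):
    """Mean absolute deviation (MAD): average of |calculated - reference|."""
    return sum(abs(c - r) for c, r in zip(calc, ref)) / len(ref)

# Each 4-OT model contributes two REs (RE-A and RE-B), in kcal/mol.
reference = [-5.0, 3.0]    # hypothetical reference REs
dfa_result = [-6.2, 3.9]   # hypothetical DFA REs

md = mean_deviation(dfa_result, reference)
mad = mean_absolute_deviation(dfa_result, reference)
print(f"MD = {md:.2f} kcal/mol, MAD = {mad:.2f} kcal/mol")
```

Note how an underestimated RE-A and an overestimated RE-B partially cancel in the MD while both contribute fully to the MAD; this is why a small MD can coexist with a sizeable MAD.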
While almost all MDs are negative, not all REs are underestimated. The general trend is that RE-A is underestimated while RE-B is overestimated, but this is not consistent across models or methods. The minimal model gives the lowest MADs of the three 4-OT models for all but three DFAs; interestingly, these three are all hybrid functionals. All MADs for this model are at or below 1 kcal/mol, which would imply that any of these DFAs, including the GGAs, could be an appropriate choice for applications on systems similar to this model. The same cannot be said for the other two models, whose MADs are generally larger, with the worst combination being GGAs applied to the original model (for example, PBE-D3(BJ) gives the largest MAD at 8.6 kcal/mol).
Based on the MADs shown here, the best DFAs for the original model are SOS0-PBE0-2-D3(BJ), B97M-V, and B97M-D3(BJ); for the minimal model, the three double-hybrid DFAs perform best; and for the separated model, M062X-D3(0), B97M-V, and B97M-D3(BJ) lead. The only DFA that appears in the top three for more than one model is SOS0-PBE0-2-D3(BJ), yet it has the fourth-highest MAD for the separated model. Significant differences can also be seen in the ordering of the DFAs within their rungs of Jacob's Ladder: based on the original model, OLYP-D3(BJ) is the best GGA, yet it has the highest MAD of the three GGAs for the minimal model. Similarly, SCAN-D3(BJ) is the best meta-GGA for the minimal model but the worst for the original one. The overall impression from comparing the three models in this way is that conclusions drawn for one model do not necessarily hold for the others.
While benchmark studies do not necessarily require perfectly realistic models, especially when the references are independent of any experimental data, the models must reflect the systems one ultimately aims to treat. It is understandable that many studies involve relatively small models, as the computational demands of treating larger ones can be high, but the comparison between our three example models shows that the results for the smaller systems are not consistent with those for the largest model. When the goal is to choose a DFA for a QM/MM analysis of an enzyme, one must also consider the parts of the active site that influence the reaction without being directly involved in it; a simplified study on a smaller model may therefore wrongly identify and recommend DFAs that are not the most appropriate for the actual application. We thus strongly recommend that anyone benchmarking methods for their own computational treatment of enzyme-related problems use the whole QM region, or at least the largest model for which high-level reference values can be obtained while retaining the necessary chemical environment.

Summary and conclusions
Herein, we explored two important factors that influence the outcomes of benchmark studies, namely the quality of the benchmarks themselves and the choice of underlying structures, in the context of studying enzymatically catalyzed reactions and choosing QM methods for QM/MM analyses thereof. Using a set of 20 reactions of small molecules designed to represent common steps in enzyme mechanisms, we showed that DLPNO-CCSD(T)/CBS with the TightPNO thresholds is a good alternative to conventional CCSD(T)/CBS when computational resources are limited, while estimating high-level data from corrected low-level numbers and single-basis-set results such as CCSD(T)/aug-cc-pVTZ should be avoided. A selection of QM methods, mainly density functional approximations, was then applied to the enzyme RE test set, and the deviations from the estimated CCSD(T)/aug-cc-pVTZ and DLPNO-CCSD(T)/CBS benchmarks were assessed. Most methods had lower mean absolute deviations when compared to DLPNO-CCSD(T)/CBS, and the biggest decreases occurred for DFT methods that had performed well in other studies. The fact that good methods perform better with improved reference data is further proof of their general reliability and robustness, as also pointed out in refs [19] and [20]. The ordering of the tested methods also depends on the choice of references, particularly for hybrid and double-hybrid DFAs. Using our new DLPNO-CCSD(T)/CBS reference values, we also briefly compared some of the pure DFAs with their DFT-D3(0)- and DFT-D3(BJ)-corrected variants, and confirmed the importance of using a dispersion correction in DFT studies.
We also used a two-step tautomerisation reaction catalyzed by the enzyme 4-oxalocrotonate tautomerase to explore whether smaller active-site models can give benchmarking results similar to larger ones despite their chemical deficiencies. The three tested models of this enzyme comprise an active-site model containing some neighboring amino acid residues, a simplified model containing only the substrate and the catalytic proline residue directly involved in the proton-transfer steps, and a separated model in which the substrate and proline residue were treated separately. Considering the high-level REs calculated as benchmarks, it is clear that ignoring the chemical environment neglects significant intermolecular interactions, particularly when the important components are also separated from each other. This is especially problematic for charge-separated systems, which are strongly stabilised by electrostatic effects. Benchmarking twelve DFAs on all three models showed very little consistency between the results for each model, both in the ordering of the functionals and in the actual magnitudes of the deviations. As such, one must ensure that the model used is large enough to adequately represent the chemistry to be studied in order to obtain the most relevant recommendations.
Overall, the main point we wish to stress to those seeking to conduct their own benchmark studies on enzymatically catalyzed reactions is to not be unnecessarily "cheap". Lower-quality references and smaller model systems will undoubtedly make the benchmarking process faster, but they can easily make poor-performing methods look just as appealing as good ones. Instead, one should always aim to use the highest-quality references and largest model systems that are computationally feasible, in order to obtain the most accurate results not only in the benchmark study itself, but also in subsequent applications, by choosing an appropriate method for the right reasons. While our study focused on enzymatically catalyzed reactions, we hope that our insights can also serve as a guideline for future benchmark studies on other chemical problems.