A Machine Learning Model for Predicting Enantioselectivity in Hypervalent Iodine(III) Catalyzed Asymmetric Phenolic Dearomatizations

Abstract

Open Access | CCS Chemistry | Research Article | 22 Feb 2024

Ben Gao (a), Liu Cai (a), Yuchen Zhang (a), Huaihai Huang (b), Yao Li (a)* and Xiao-Song Xue (a, b)*

(a) Key Laboratory of Fluorine and Nitrogen Chemistry and Advanced Materials, Shanghai Institute of Organic Chemistry, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200032
(b) School of Chemistry and Materials Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024
*Corresponding authors

https://doi.org/10.31635/ccschem.024.202303774

Catalytic asymmetric dearomatization (CADA) of phenols has emerged as a powerful strategy for constructing stereochemically complex architectures from planar aromatic feedstocks. However, the development of novel catalysts for highly enantioselective phenolic oxidative dearomatization remains time- and resource-intensive, mainly because a reliable predictive catalyst design strategy is lacking. In this study, we systematically compiled a dataset of 847 literature-reported asymmetric phenolic dearomatization reactions catalyzed by hypervalent iodine(III) catalysts (the HVI-CADA dataset), a class of catalysts that is gaining increasing attention owing to its ecofriendly features. Leveraging this reaction dataset, we established a machine learning model to predict enantioselectivity. The XGBoost algorithm exhibited the best performance, with a root-mean-square error of 0.26 kcal/mol and an R² of 0.84 on the test set. In out-of-sample tests, the model effectively guided the selection of the optimal catalyst and additives. Independent experiments were subsequently conducted to validate the model predictions. We anticipate that our current work will facilitate further design, optimization, and development of novel chiral hypervalent iodine catalysts for new asymmetric phenolic dearomatization reactions.

Introduction

Hypervalent iodine (HVI) compounds have evolved from chemical curiosities into mainstream reagents, garnering extensive use in organic chemistry as a safe, mild, economical, and environmentally friendly alternative to heavy metal or rare metal reagents.1–4 Their unique reactivities have inspired the exploration of new synthetic transformations that would be challenging to achieve otherwise.1 In recent years, the discovery of enantioselective molecular catalysts based on iodine(I/III) redox chemistry has introduced a new aspect to hypervalent iodine chemistry.5–9 To date, chiral hypervalent iodine catalysts have been used in various chemical reactions, such as oxidations of sulfides, arylation and alkynylation reactions, α-functionalization of carbonyl compounds, and dearomatization reactions.4,10

Catalytic asymmetric dearomatization (CADA) is a powerful strategy with significant potential for constructing stereochemically complex structures from readily available planar aromatic feedstocks.11–21 The primary challenge in achieving asymmetry lies in controlling enantioselectivity while overcoming the loss of aromaticity, especially in hypervalent iodine-mediated/catalyzed phenolic oxidative dearomatization. Indeed, the development of chiral hypervalent iodines as stoichiometric reagents or catalysts with high efficiency and satisfactory enantioselectivity has long been a challenging task.4,9,10 In 2008, Kita and coworkers22 made a breakthrough with the introduction of a novel conformationally rigid μ-oxo-bridged iodine(III) reagent based on the axially chiral spirobiindane scaffold. Two years later, Ishihara et al.23 marked another pivotal milestone by designing conformationally flexible iodoarenes for the same oxidative dearomatization reaction. Ishihara's catalyst demonstrated good to excellent enantioselectivity across a broad range of substrate scopes.
Subsequently, numerous chemists explored asymmetric phenolic dearomatizations by chiral iodine catalysis, with representative structures of aryl iodide catalysts and reactants as shown in Figure 1a,b.

Figure 1 | (a) Asymmetric dearomatization reaction; (b) some representative structures of hypervalent iodine catalysts and representative reactants (in the middle of the diagram); and (c) machine learning-assisted chemical experiment workflows.

Despite these notable advances in this area, the development of chiral hypervalent iodine catalysts for phenolic oxidative dearomatization remains a time- and resource-consuming process, primarily due to the lack of a predictive catalyst design strategy. The classic strategy to design or improve asymmetric catalysis is the mechanism-based approach by quantum mechanical calculations.24–28 Recently, density functional theory (DFT) calculations have been employed to elucidate issues related to stereoselectivity or even catalyst design in chiral hypervalent iodine catalyzed processes.29–34 However, the mechanism-based approach often encounters challenges because the performance of asymmetric catalysis is sensitive to the molecular properties of every reaction component. A subtle variation in the reaction condition may lead to a profound impact on stereoselectivity, thereby limiting DFT calculations in the screening and optimization of asymmetric catalysis reactions.

In the last decade, with the advancement of artificial intelligence technology, machine learning has been applied in predicting reaction activity and selectivity.35–38 Notable examples include Doyle et al.,39 who predicted the yields of Buchwald–Hartwig C–N coupling reactions using quantum chemical descriptors of substrates, catalysts, and additives, achieving a final model performance with a 7.8% root-mean-square error (RMSE) and an R² value of 0.92.
Denmark et al.40 employed support vector machines (SVMs) and feedforward neural networks to predict the selectivity of thiol addition reactions to imines catalyzed by chiral phosphoric acids. Sunoj et al.41 predicted the enantioselectivity of asymmetric hydrogenation reactions, achieving good predictive performance with an 8.4% RMSE on the test set. Luo et al.42–44 utilized XGBoost algorithms to predict pKa in diverse solvents and proposed the use of SPOC descriptors for studying chemical reactions. Hong et al.45–49 employed machine learning to predict regioselectivity in radical C–H functionalization of heterocycles, enantioselectivity of asymmetric hydrogenation of olefins, and pallada-electrocatalysed C–H activation reactions. Most recently, Liao et al.50 reported the optimization of the reaction conditions for the iridium-catalyzed cross-dimerization of sulfoxonium ylides by utilizing high-throughput experiments combined with machine learning algorithms. Furthermore, Yu and coworkers51 successfully employed machine learning algorithms to predict the yield of Cu-catalyzed radical-type oxy-alkylation reactions, specifically in the synthesis of 2-oxazolidones, yielding promising results regarding predictive accuracy. 

Building upon recent applications of machine learning in catalysis optimization, a universal model that employs unsupervised algorithms to expedite asymmetric catalyst development appears both viable and promising.52 Although machine learning has been adopted to predict outcomes in a series of organic reactions, a predictive model specifically dedicated to chiral hypervalent iodine catalysis has not yet been established.53,54 Previously, we conducted computational studies on the mechanism of chiral hypervalent iodine-catalyzed asymmetric phenolic oxidative dearomatizations, aiming to facilitate catalyst design and development using the mechanism-based approach.29,32 In this contribution, we aim to establish a machine learning predictive model for the enantioselectivity of this important type of reaction. We compiled a dataset of the literature-reported hypervalent iodine-catalyzed asymmetric phenolic dearomatization reactions with a reasonable distribution of enantioselectivity (details in Supporting Information Tables S1–S3).22,23,29,55–76 Utilizing this reaction dataset, we developed a machine learning model to predict enantioselectivity, allowing rapid screening of catalyst structures and solvents for reactions (Figure 1c).

Experimental Method

General procedure for hypervalent iodine-catalyzed asymmetric dearomatization reaction

In an oven-dried reaction tube under a nitrogen atmosphere, chiral iodoarene (0.0075 mmol, 15 mol %) and meta-chloroperoxybenzoic acid (m-CPBA, 14.6 mg, 0.065 mmol, 1.3 equiv) were dissolved in CH2Cl2 or distilled chloroform (2.5 mL); naphthol 1a (0.05 mmol) was then added, and the mixture was stirred at −20 °C for 48 h. The resulting mixture was poured into aqueous Na2S2O3 (5 mL) and aqueous NaHCO3 and extracted with CH2Cl2 (2 times). The organic layers were dried over anhydrous Na2SO4, and solvents were removed in vacuo.
The residue was purified by flash column chromatography on silica gel (eluent: hexane-EtOAc = 5:1 to 3:1) to give 2a.

Computational method

All the DFT calculations in this work were performed with the Gaussian 16 software package.77 Geometry optimizations were carried out using the B3LYP-D3(BJ) functional in conjunction with a mixed basis set: Stuttgart/Dresden ECPs (SDD) for the I atom and 6-31G** for other main group atoms.78–82 The Multiwfn software was used to convert log files to xyz files.83 The DFT-based Steric Parameters (DBSTEP) package was employed to calculate the Sterimol steric parameters and buried volume parameters with all settings left at their defaults.84 Additionally, extensive conformational analyses were carried out with the second-generation extended tight-binding method (GFN2-xTB), which is designed to yield reasonable geometries, vibrational frequencies, and noncovalent interactions, within the conformer–rotamer ensemble sampling tool (CREST) program and the molclus software (details in Supporting Information Tables S6 and S7).85–87

HVI-CADA Dataset

We have curated a database of 847 reaction entries for asymmetric dearomatization reactions mediated by hypervalent iodine catalysts (the HVI-CADA dataset), originally scattered throughout the literature. This database primarily consists of two types of reactions: ortho-dearomatization reactions and para-dearomatization reactions. There are 637 instances of ortho-dearomatization reactions and 210 instances of para-dearomatization reactions (Figure 2a). Overall, only 20% of the reactions exhibit an enantiomeric excess (ee) value above 90%, whereas 60% of the reactions have an ee value below 80%. The distribution of data within a dataset critically influences the predictive ability of machine learning models developed for asymmetric catalysis.
An overrepresentation of reactions with high ee values, specifically above 90%, biases a model towards predicting disproportionately favorable outcomes. This is a notable issue in databases of asymmetric transition metal catalysis, where high ee outcomes are prevalent. Such a bias is not conducive to creating a well-rounded predictive model. A balanced dataset, embodying a spectrum of ee values, enhances model robustness by providing a more accurate portrayal of reaction landscapes. For hypervalent iodine-catalyzed asymmetric dearomatization reactions, our dataset exhibits a balanced composition, with a substantial fraction of reactions reporting lower ee values, thereby mitigating the potential bias and supporting more reliable global prediction ability in the resulting machine learning model.

Figure 2 | (a) The distribution of data and the first highly enantioselective spiro-type chiral hypervalent iodine reagent reported by Kita; and (b) the top three most commonly used catalyst types in the dataset.

In terms of catalyst structures employed in these reactions, the most commonly utilized catalyst is of the Ishihara type (including both first and second generations). The second and third most prevalent catalyst types are the Nachtsheim type and Maruoka type catalysts, respectively (Figure 2b). The asymmetric dearomatization reaction catalyzed by chiral hypervalent iodine reagents involves several key components: substrates, catalysts, oxidants, solvents, and additives. In this study, the oxidant is fixed as m-CPBA, so the only oxidant descriptor is its number of equivalents.
For substrates, catalysts, solvents, and additives, we recorded their relevant reaction information using simplified molecular input line entry system (SMILES) notation and employed RDKit descriptors and molecular access system keys (MACCS keys) as features. It is worth noting that in dearomatization reactions, additives and solvents are often not single entities; two solvents (or additives) are often mixed in a certain proportion, and the blending of different solvents can enhance chiral control in the reaction. Therefore, we made some modifications to the input format for solvents and additives (details in Supporting Information Tables S4 and S5). In addition, other reaction conditions need to be considered, such as reaction temperature, catalyst equivalents, reaction concentration, and whether the solvent is distilled. We directly use numerical values as descriptors for reaction temperature, catalyst equivalents, and reaction concentration. To indicate whether the solvent is distilled, we use one-hot encoding. Finally, we performed geometry optimization of the catalysts using DFT calculations and selected certain geometric parameters. For example, the buried volume of the iodine atom and the Sterimol parameters, with the iodine atom as the initial atom, were calculated using the program provided by the Paton group.

We aim to enhance the predictive capability of our model by selecting relevant descriptors based on our understanding of the reactions. In asymmetric dearomatization reactions, solvent effects are significant. Kita et al.22 pointed out that the ET(30) value of the solvent can, to some extent, reflect the solvent's influence on enantioselectivity. Based on this idea, we examined other literature reports on dearomatization reactions (Figure 3a,b). We also found a good linear relationship between the ET(30) value and enantioselectivity within a certain range.
When the solvent polarity is too high, the dissociation mechanism may become the predominant pathway, hindering effective chiral control (Figure 3c).88 Therefore, we selected ET(30) as a descriptor for solvents. Other descriptors can be found in the Supporting Information Figures S1 and S2.

Figure 3 | (a) The relationship between solvent ET(30) values and enantioselectivity; (b) the reactions corresponding to plots in (a); and (c) the mechanism of hypervalent iodine-catalyzed asymmetric dearomatization reactions.

How the samples are partitioned is also crucial for constructing machine learning models. Here, 80% of the data was used as the training set, while the remaining 20% served as the test set. Finally, we conducted extrapolation experiments using the established model to assess its practicality.

Performance Evaluation of Machine Learning Models

The general process of establishing a machine learning model is depicted in Figure 4a. We transform substrates, catalysts, solvents, and other factors into feature vectors, which are then concatenated (Figure 4b). The concatenated vector becomes the reaction vector. The reaction vectors of all samples constitute the feature dataset. Subsequently, a machine learning model is established based on the feature dataset.

Figure 4 | (a) The workflow for building a machine learning model; (b) descriptor selection for dearomatization reactions; and (c) the performance of the XGBoost model.

In this work, we employed a diverse array of machine learning algorithms, namely least absolute shrinkage and selection operator (LASSO) regression,89 SVM,90 k-Nearest Neighbors,91 Decision Trees,92 Random Forest,93 XGBoost,94 AdaBoost,95 and Artificial Neural Networks,96 for model development.
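
The concatenation of component features into a reaction vector, together with the 80/20 partition, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names (`one_hot_distilled`, `build_reaction_vector`, `split_80_20`), the shuffled split with a fixed seed, and all numeric inputs in the usage note are hypothetical. In the actual work, the per-component features come from RDKit descriptors, MACCS keys, ET(30) values, and DFT-derived steric parameters.

```python
import random
from typing import List, Sequence, Tuple


def one_hot_distilled(distilled: bool) -> List[float]:
    """Two-slot one-hot code: [not distilled, distilled]."""
    return [0.0, 1.0] if distilled else [1.0, 0.0]


def build_reaction_vector(
    substrate: Sequence[float],
    catalyst: Sequence[float],
    solvent: Sequence[float],
    additive: Sequence[float],
    temperature_c: float,
    catalyst_equiv: float,
    concentration_m: float,
    oxidant_equiv: float,
    distilled: bool,
) -> List[float]:
    """Concatenate component feature vectors and reaction conditions."""
    # Numeric conditions are used directly as descriptors.
    conditions = [temperature_c, catalyst_equiv, concentration_m, oxidant_equiv]
    return (
        list(substrate)
        + list(catalyst)
        + list(solvent)
        + list(additive)
        + conditions
        + one_hot_distilled(distilled)
    )


def split_80_20(n_samples: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and split them 80% train / 20% test.

    The random shuffle and the seed are assumptions made for illustration.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(0.8 * n_samples))
    return indices[:cut], indices[cut:]
```

For example, `build_reaction_vector([0.1, 0.2], [1.0, 0.0, 1.0], [39.1], [0.0], -20.0, 0.15, 0.02, 1.3, True)` returns a 13-element vector (toy feature values), and `split_80_20(847)` yields 678 training and 169 test indices for a dataset of the size reported here.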
To enhance model performance and robustness, we employed fivefold cross-validation during model training, and the hyperparameters that performed best in fivefold cross-validation were taken as the optimal ones. The final machine learning models were then trained using these optimized hyperparameters. This methodology aimed to ensure the accuracy and generalizability of our predictive models.

First, we trained the model using the % ee as the target variable. Among these models, XGBoost exhibited the most robust predictive performance. This model achieved an average coefficient of determination (R²) of 0.80 during fivefold cross-validation on the training dataset, demonstrating good stability (refer to Supporting Information Tables S8–S11 for details). Furthermore, the model performed well on the test dataset, with an R² of 0.81 and an RMSE of 12.75% ee.

Subsequently, we trained the model using the free energy difference (ΔΔG‡) as the target value. Converting % ee to a ΔΔG‡ label requires a logarithmic transformation, which inherently alters the distribution of the data on which the model trains and therefore the functional relationship that the algorithm learns. Consequently, the differences between the outcomes of the two predictive models may be attributed to the different internal representations formed in response to this shifted data distribution. The fivefold cross-validation R² on the training set was 0.79, and on the test set the model reached an R² of 0.84 with an RMSE of 0.26 kcal/mol. This performance was slightly better than that of the model trained with enantiomeric excess as the target value. Compared to the energy errors typically associated with traditional DFT calculations, an error of 0.26 kcal/mol is a highly favorable result.
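
The logarithmic relationship between % ee and ΔΔG‡ can be made concrete with a short sketch. The text does not state the exact formula used, so the code below assumes the standard relation ΔΔG‡ = RT ln[(100 + ee)/(100 − ee)], evaluated at each reaction's temperature. The function names and the choice of kcal/mol units for the gas constant are illustrative assumptions. The `rmse` and `r2` helpers follow the usual definitions of the two metrics quoted above.

```python
import math
from typing import Sequence

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ee_to_ddg(ee_percent: float, temp_c: float) -> float:
    """Convert % ee to ddG (kcal/mol) via ddG = R*T*ln((100+ee)/(100-ee))."""
    if not -100.0 < ee_percent < 100.0:
        raise ValueError("ee must lie strictly between -100 and 100")
    temp_k = temp_c + 273.15
    return R_KCAL * temp_k * math.log((100.0 + ee_percent) / (100.0 - ee_percent))


def ddg_to_ee(ddg_kcal: float, temp_c: float) -> float:
    """Inverse transform: ee = 100 * tanh(ddG / (2*R*T))."""
    temp_k = temp_c + 273.15
    return 100.0 * math.tanh(ddg_kcal / (2.0 * R_KCAL * temp_k))


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Under these assumptions, 90% ee at −20 °C corresponds to about 1.48 kcal/mol. Because the transform is steep at high ee, a fixed error in ΔΔG‡ maps to a small % ee error for highly selective reactions and a larger one near the racemic limit, which is one reason the two target variables give differently shaped learning problems.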
The performance results of the two models are illustrated in Figure 4c.

Application of machine learning model

To illustrate the practicality of the machine learning model, we next evaluated it through the following applications. The first three instances leveraged data extracted from the literature, where we intentionally selected experimental datasets pertaining to the optimization of reaction conditions to assess the model's ability to extrapolate.23,67,72 In the last application, we conducted experiments, utilizing new experimental data to further interrogate the predictive accuracy of the machine learning model.

For the first application, catalyst screening, we employed experimental results reported by Nachtsheim et al.66 Although the predictions from the model did not attain perfect accuracy, they effectively identified optimal catalysts in agreement with the experimental outcomes (Figure 5). In predicting the enantioselectivities associated with various catalysts, the % ee and the energy models exhibited a trend aligning closely with experimental observations. Remarkably, the energy model yielded predictions with a notably higher degree of accuracy.

Figure 5 | Case 1: Model-driven catalyst structure screening. (a) Experimental % ee taken from ref 67.

In the second instance, we further assessed the model for catalyst screening with experimental data derived from Chen et al.72 As shown in Figure 6, our analysis was broadened to include a more diverse array of catalyst structures. Both the % ee model and the energy model successfully identified the optimal catalyst structures through screening. Although a degree of deviation exists between theoretical predictions and experimental values, there is a qualitative agreement with the experimental outcomes.

Figure 6 | Case 2: Model-assisted catalyst structure screening. (a) Experimental % ee taken from ref 72.

In the third case study, we conducted solvent screening for the reaction using the % ee prediction model with experimental data reported by Ishihara et al.23 As shown in Table 1, our model correctly predicted the optimal solvent for this case. Furthermore, the error in the predicted enantioselectivity was relatively low for most solvents (toluene showed the largest deviation), which demonstrates the effectiveness of our machine learning model in aiding solvent selection.

Table 1 | Case 3: Solvent Condition Screening

Entry   Solvent            Exp. (% ee)(a)   Pred. (% ee)
1       DCM                83               85.4
2       CHCl3              90               92.4
3       Toluene            77               89.4
4       MeCN               83               85.7
5       EtOAc              81               76.4
6       TFE                70               68.9
7       HFIP               41               46.3
8       CHCl3/MeNO2 = 2    90               92.1

(a) Experimental % ee from ref 23. DCM, dichloromethane; TFE, trifluoroethanol; HFIP, hexafluoroisopropanol.

The three aforementioned case studies demonstrate the potential of our established machine learning models in assisting with reaction condition selection. Subsequently, we conducted additional experiments and compared the predicted results with the experimental outcomes. As shown in Figure 7, the model gave relatively accurate predictions when ethanol was introduced as an additive to the reaction, so it can provide feedback on the expected effect of an additive. For Cat3-1, the model's predictions are accurate, and the introduction of ethanol enhances chiral control; note that the predicted outcome of Cat3-1 in chloroform comes from the training set. In the case of Cat3-2, a qualitative analysis of the predicted ee values indicates that the addition of ethanol has little to no impact on chiral control, that is, no positive effect. This observation is consistent with the experimental results as well.

Figure 7 | Case 4: Prediction of the influence of additives and catalysts on the reaction. (a) Experimental % ee taken from ref 23. (b) Experiments in this work.
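
As a worked check on the solvent screening in Table 1, the short sketch below summarizes the agreement between experimental and predicted % ee. The numbers are copied from Table 1. The helper names (`mean_absolute_error`, `largest_deviation`, `top_solvents`) and the choice of mean absolute error as the summary statistic are illustrative assumptions, not part of the reported analysis.

```python
from typing import List, Tuple

# (solvent, experimental % ee, predicted % ee), values from Table 1.
TABLE_1: List[Tuple[str, float, float]] = [
    ("DCM", 83.0, 85.4),
    ("CHCl3", 90.0, 92.4),
    ("Toluene", 77.0, 89.4),
    ("MeCN", 83.0, 85.7),
    ("EtOAc", 81.0, 76.4),
    ("TFE", 70.0, 68.9),
    ("HFIP", 41.0, 46.3),
    ("CHCl3/MeNO2 = 2", 90.0, 92.1),
]


def mean_absolute_error(rows: List[Tuple[str, float, float]]) -> float:
    """Average of |predicted - experimental| over all entries."""
    return sum(abs(pred - exp) for _, exp, pred in rows) / len(rows)


def largest_deviation(rows: List[Tuple[str, float, float]]) -> Tuple[str, float]:
    """Solvent with the largest absolute prediction error, and that error."""
    name, exp, pred = max(rows, key=lambda r: abs(r[2] - r[1]))
    return name, abs(pred - exp)


def top_solvents(rows: List[Tuple[str, float, float]], column: int) -> List[str]:
    """Solvents sharing the highest value in the given column (1 = exp, 2 = pred)."""
    best = max(row[column] for row in rows)
    return [row[0] for row in rows if row[column] == best]


if __name__ == "__main__":
    print(f"MAE: {mean_absolute_error(TABLE_1):.2f} % ee")
    print("Largest deviation:", largest_deviation(TABLE_1))
    print("Best by experiment:", top_solvents(TABLE_1, 1))
    print("Best by prediction:", top_solvents(TABLE_1, 2))
```

On the Table 1 values this gives a mean absolute error of about 4.1% ee, with toluene (12.4% ee) as the largest deviation. The top predicted solvent, CHCl3, is one of the two conditions tied for the best experimental result (90% ee).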

In our previous computational investigations, we identified the presence of an N–H···O hydrogen bond and π···π stacking to stabilize the favored transition state in this asymmetric Kita-dearomative spirolactonization.29 Employing the machine learning model, we sought to assess its proficiency in the aforementioned [...]. However, regarding the influence of [...] in the properties on the catalyst resulting from aromatic [...] for a more accurate prediction [...] with the other results [...] some degree of [...] to the enantioselectivity. Therefore, there is [...] for [...] in the model's capability to [...] and noncovalent interactions within the reaction.

In this study, we [...] and curated literature-reported asymmetric phenolic dearomatization reactions catalyzed by hypervalent iodine(III) compounds, creating the HVI-CADA dataset. Based on the HVI-CADA dataset, we successfully established a machine learning model with robust predictive capability for enantioselectivity in hypervalent iodine(III) catalyzed asymmetric phenolic [...]. The [...] of this model was demonstrated through case studies [...] the screening of chiral hypervalent iodine(III) [...]. The development of this model [...] and reliable prediction of the enantioselectivity in chiral hypervalent [...] dearomatization reactions of phenolic [...]. We anticipate that the current work will [...] the development of novel chiral hypervalent iodine(III) catalysts and new asymmetric phenolic dearomatization reactions.

Supporting Information

Supporting Information is available and [...] dataset [...], computational [...], experimental [...], and machine learning [...]. The HVI-CADA dataset and [...] are available on [...] at [...].

There is no conflict of interest to report.

This work was supported by [...] and [...] of the Chinese Academy of Sciences. The numerical calculations in this [...] were carried out on the [...].

We [...] and [...] at Shanghai Institute of Organic Chemistry for [...].