Making a big impact with small datasets using machine-learning approaches

Abstract

As comprehensive electronic medical records capture increasingly more data points from large-scale clinical cohorts, machine-learning approaches have been applied to identify patterns that would otherwise be challenging to ascertain with traditional statistical methods. Although we often think of machine learning as a tool for the analysis of large, complex datasets, in The Lancet Rheumatology, George Robinson and colleagues (Robinson GA, Peng J, Dönnes P, et al. Disease-associated and patient-specific immune cell signatures in juvenile-onset systemic lupus erythematosus: patient stratification using a machine-learning approach. Lancet Rheumatol 2020; 2: e485-e496) successfully applied machine-learning algorithms to derive information from a small dataset in a rare disease. Using blood from 67 patients with juvenile-onset systemic lupus erythematosus (SLE), the authors identified dysregulation in multiple immune cell subsets using a combination of balanced random forest, sparse partial least squares-discriminant analysis, and traditional logistic regression, and subsequently validated the models using ten-fold cross-validation. They found an immune signature specific to juvenile-onset SLE, which was further stratified into four different immune profile cluster groups that were associated with disease activity over time. These unique immune profiles could be useful in predicting prognosis and response to treatment in juvenile-onset SLE. Development of a personalised approach to therapy is especially relevant because children tend to receive more intensive drug therapy and accrue more end-organ damage than patients with adult-onset SLE (Brunner HI, Gladman DD, Ibañez D, Urowitz MD, Silverman ED. Difference in disease features between childhood-onset and adult-onset systemic lupus erythematosus. Arthritis Rheum 2008; 58: 556-562).

The findings of this study not only contribute to our current understanding of juvenile-onset SLE, but they also outline a potential future machine-learning roadmap for untangling the complex immunological underpinnings of other rheumatic conditions. The data density in this analysis comes not from the cohort size, but rather from detailed analysis of peripheral blood mononuclear cells (PBMCs) by flow cytometry. Machine-learning algorithms were then applied to these data, and this strategy might be particularly useful for studying other rheumatic conditions such as vasculitis, myositis, or systemic sclerosis, which are relatively rare and highly heterogeneous, making accrual of multiple, large clinical cohorts challenging.

Nevertheless, the absence of an external validation dataset is an important limitation because it might result in problematic model overfitting and inappropriate generalisation. The generalisability issue is magnified when a small training dataset is used or when baseline parameters have highly skewed distributions, one of the recognised limitations of methods such as random forest. In this example, Robinson and colleagues acknowledge that their cohort consisted of few black patients and few patients with severely active disease, collected from a single centre. Although their findings are highly novel, readers should be cautious in interpreting them, because advanced statistical methods are an inadequate replacement for external validation.

How will Robinson and colleagues' findings change clinical practice? Our current methods for classifying patients with juvenile-onset SLE and other rheumatic diseases remain relatively crude (Petri M, Orbai AM, Alarcón GS, et al. Derivation and validation of the Systemic Lupus International Collaborating Clinics classification criteria for systemic lupus erythematosus. Arthritis Rheum 2012; 64: 2677-2686). Accordingly, our ability to identify subgroups of patients who are likely to have disease progression over time or to predict which patients might respond to a specific medical therapy is limited. Robinson and colleagues highlight two potential future directions for the field. First, we need to improve the classification of patients on the basis of their biological, rather than clinical, disease profiles. Identification of unique patient subgroups who have different clinical outcomes using PBMCs would represent an important advance over current classification methods that require multiple longitudinal assessments. Second, classification of patients according to their immunological signature might allow targeting of specific pathways that maximise therapeutic response and minimise off-target side-effects.

Similar lessons can be applied across disciplines to other inflammatory conditions. One example is inflammatory bowel disease. Like juvenile-onset SLE, the inflammatory cascade in patients with inflammatory bowel disease is intricate, with multiple inputs, including genetic susceptibility, environmental exposures, and faecal microbiota composition, interacting to influence a complex pro-inflammatory cytokine milieu (Lee SH, Kwon JE, Cho M-L. Immunological pathogenesis of inflammatory bowel disease. Intest Res 2018; 16: 26-42). However, inflammatory bowel disease has historically been categorised using clinical descriptors such as disease location, extent, and behaviour, without incorporating assessment of specific immunological aberrancies (Satsangi J, Silverberg MS, Vermeire S, Colombel JF. The Montreal classification of inflammatory bowel disease: controversies, consensus, and implications. Gut 2006; 55: 749-753). Consequently, treatment decisions are based on clinician and patient preference rather than disease biology. Clinicians simply do not have the requisite tools to distinguish which patients are likely to be responsive to different therapeutic classes, and conventional statistical methods applied in clinical trial development programmes and real-world cohorts have failed to produce a reliable, accurate companion diagnostic biomarker for treatment response. Therefore, adoption of machine-learning methods on PBMCs might offer additional insights into disease classification that will permit the future implementation of a more refined therapeutic algorithm.

As more studies use machine-learning methods, the European League Against Rheumatism (EULAR) has published recommendations on the optimal use of these applications (Gossec L, Kedra J, Servy H, et al. EULAR points to consider for the use of big data in rheumatic and musculoskeletal diseases. Ann Rheum Dis 2020; 79: 69-76). Importantly, the implementation of machine learning requires an interdisciplinary approach because health-care providers might not be familiar with these statistical methods, whereas data scientists might lack the clinical context to interpret the findings. However, it is clear that machine learning has the versatility and potential to unlock tremendous opportunities for health research in inflammatory conditions, even when using small datasets.

MYC is supported by the Gary S Gilkeson Career Development Award from the Lupus Foundation of America. CM declares no competing interests.

Linked article: Disease-associated and patient-specific immune cell signatures in juvenile-onset systemic lupus erythematosus: patient stratification using a machine-learning approach. Machine-learning models can define potential disease-associated and patient-specific immune characteristics in rare disease patient populations. Immunological association studies are warranted to develop data-driven personalised medicine approaches for treatment of patients with juvenile-onset SLE.
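To make the validation concern discussed above more concrete, the following is a minimal Python sketch of stratified ten-fold cross-validation on a cohort of 67 patients. It is not the analysis code used by Robinson and colleagues. The function names (`stratified_folds`, `cross_validation_splits`) and the 50/17 split between "low" and "high" disease activity are hypothetical, chosen only to mimic a small, skewed cohort. The sketch uses only the Python standard library; established libraries such as scikit-learn provide comparable functionality (for example, `StratifiedKFold`).

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal each class round-robin, continuing from where the previous
        # class stopped so that overall fold sizes stay balanced.
        for index in members:
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per held-out fold."""
    folds = stratified_folds(labels, k, seed)
    for held_out, test_indices in enumerate(folds):
        train_indices = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        yield train_indices, test_indices


# Hypothetical labels for a 67-patient cohort: the 50/17 split between
# "low" and "high" disease activity is invented purely for illustration.
labels = ["low"] * 50 + ["high"] * 17

splits = list(cross_validation_splits(labels, k=10, seed=0))
for fold_number, (train_indices, test_indices) in enumerate(splits, start=1):
    high_in_test = sum(labels[i] == "high" for i in test_indices)
    print(f"fold {fold_number}: train={len(train_indices)}, "
          f"test={len(test_indices)}, high-activity in test={high_in_test}")
```

Under these assumptions, each held-out fold contains only six or seven patients, of whom one or two belong to the smaller class, so any per-fold performance estimate rests on very few observations. Every fold is also drawn from the same cohort, which is why cross-validation of this kind is an internal check and does not replace testing on an external dataset.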
