News Magazine

Data-driven identification of ageing-related diseases from electronic health records

We grouped 278 high-burden diseases into nine main clusters using unsupervised machine-learning. Four of these clusters consisted of diseases that increased with age, albeit with strikingly different age trajectories and median ages of disease onset (82y, 77y, 69y and 57y for Clusters 1, 2, 3 and 4, respectively), indicating that different aetiologies may drive each cluster. Diseases in these four clusters spanned diverse organ systems and clinical specialties. Cluster 1 consisted of dementia, delirium, hip fracture, bifascicular and trifascicular heart blocks. Cardiovascular diseases were most highly represented in Cluster 2, cancers in Cluster 3, and diseases of the digestive system in Cluster 4. Benign neoplastic, skin and psychiatric disorders, the three disease categories with the lowest median age of disease onset (50y, 43y and 38y, respectively), were largely absent from these four clusters. Four clusters (Clusters 6, 7, 8 and 9) were clearly not ageing-related. Cluster 5 comprised diseases with varying age-related disease onset patterns.

Next, we applied actuarial techniques to assess whether diseases were ageing-related according to how well the rate of disease onset data fitted the Gompertz and Gompertz–Makeham models. While this method was based on very different principles from the clustering algorithm, the results were highly concordant (Table 2, Fig. 6) indicating that these two data-driven approaches can be used synergistically to identify ARDs.

All diseases in Clusters 1 and 2 were highly likely to be ageing-related. A small number of diseases in Clusters 3 and 4 fit slightly less well with the actuarial models. Unlike clustering techniques, parametric methods such as the Gompertz and GM models rely on sufficient sample sizes to assess how well the model fits a particular distribution. Where sample sizes are small (i.e. data is sparse), the goodness-of-fit statistics are lower, reflecting the lower degree of certainty with which the assumed model fits the data. The relationship with age for diseases in Cluster 5 was more complex than for diseases in the other clusters. Given the heterogeneity in the age-specific rate of disease onset curves in this cluster, the actuarial method was useful in differentiating diseases which were likely to be ageing-related, such as erectile dysfunction, from those that were not, such as irritable bowel syndrome (Supplementary Fig. S6).

Clustering of age density patterns of ICD-10 codes on medical claims from an insurance company in Brazil has been described previously28, but to our knowledge, this is the first report of clustering of age-specific rates of disease onset of curated disease phenotypes in a representative population set, with the results corroborated using an independent parametric method, namely actuarial models. Unlike data from a universal healthcare system such as the National Health Service (NHS) in England, insurance claims data may be biased and not representative of a population of interest as they exclude individuals without health insurance, and data collected primarily for financial purposes may not be suitable to assess epidemiological measures such as prevalence and incidence of disease29,30. Furthermore, the previous study did not provide details of which ICD-10 codes fell into each cluster, while in this study we present the age-specific rate of onset curves for 289 diseases and their respective clusters so that readers can observe how disease incidence progresses with age.

In its latest version of the International Classification of Diseases, ICD-11, the World Health Organisation (WHO) has implemented an extension code for “ageing-related” diseases (XT9T), defined as those “caused by pathological processes which persistently lead to the loss of organism’s adaptation and progress in older ages”31. This study provides an objective method for identifying candidate diseases to which this extension can be applied.

The ARDs we identified extend across the full range of conventional classifications of disease, which are based on organ systems, as reflected in the International Classification of Diseases. We introduce an alternative paradigm for the classification of ARDs based on the age of disease onset patterns. The analytic approaches in this study can be applied to any of the thousands of phenotyped health conditions in any representative population setting to identify and categorise ARDs according to the relationship between age and rate of disease onset. Our findings facilitate the organisation of clinical specialties, particularly geriatric medicine, around the prevention or care of clusters of ARDs.

The identification of ARDs, and the presentation of age incidence curves in particular, enable clinicians to assess the likelihood of different diseases occurring at different ages. This information can be used to formulate a list of differential diagnoses when assessing individual patients. Conditions in Cluster 1 such as dementia, delirium and hip fracture were more likely to occur in the most elderly patients, while conditions in Cluster 2, consisting mainly of cardiovascular diseases, occurred at a slightly younger age, and those in Cluster 3, such as cancers, occurred earlier yet. These findings have resource implications as well. Health care providers will need to allocate more resources to diseases in Clusters 1 and 2 as populations get older. These include increased funding towards social care and allied health professional support such as physiotherapists and occupational therapists to address the functional implications of cognitive loss in dementia. These findings should also prompt increased provision of cardiac rehabilitation services to improve the quality of life of individuals who experience heart failure and arrhythmias as a result of insults to the cardiovascular system at an earlier age. Our results can also guide health services to target preventive measures for ARDs in the different clusters at different ages over the lifecourse, such as providing occupational health assessments for individuals above the age of 80 years to prevent falls leading to hip fractures. The findings from this study also give basic science researchers a perspective on the incidence of ARDs over the lifecourse and demonstrate which ARDs have similar patterns of disease onset with age, thereby informing research into how long various hallmarks or mechanisms of ageing may take to cause ARDs in the different clusters. Future research is needed to investigate whether diseases in the same cluster share common mechanisms or risk factors of ageing.

ARDs that occur together more often than expected by chance may share common biological mechanisms. If so, existing drugs targeting these mechanisms could be repurposed for other ARDs with similar molecular pathways. For example, interleukin 6 (IL6), an inflammatory cytokine, has been implicated in the pathogenesis of rheumatoid arthritis32, coronary heart disease33, atrial fibrillation34 and abdominal aortic aneurysm35. Drugs such as tocilizumab, which inhibits the IL6-receptor and is already licensed for the treatment of rheumatoid arthritis and giant cell arteritis, might therefore be effective in treating these other diseases. New drugs can also be developed to modulate the biological pathways for multiple ARDs based on common genetic or other molecular risk factors.

ARDs such as alcoholic liver disease, COPD, cirrhosis, cancers, peptic ulcer, and actinic keratosis are caused by the cumulative damage of exogenous substances including alcohol, smoking, medications, deleterious dietary compounds, and radiation. Research into environmental causes and public health campaigns that target these are important to prevent ARDs amenable to lifestyle and public policy changes.

We identified ARDs using methods that relied on large population EHR datasets. Replication in independent representative population cohorts would validate the application of these methods to big data with defined disease phenotypes (not just ICD-10 or other billing codes) from other healthcare systems that are representative of the general population. This would pave the way to comparisons of how diseases may vary with age across high, medium and low-income countries, and countries with different population age structures.

One potential limitation of our analysis was that the age of disease onset was represented by the age of first recorded diagnosis for each individual11. This could introduce biases in the rate of disease onset for several reasons. Diseases such as chronic obstructive pulmonary disease (COPD) are clinically silent for long periods, leading to delays between each of the following events: disease onset, presentation to a clinician, diagnosis and documentation in the EHR. Other conditions such as hypertension, dyslipidaemia or obesity were more likely to be diagnosed in individuals aged 40–74 years because of the NHS Health Checks programme which began in 2009 with the aim of reducing CVD risks36. Conditions that are usually asymptomatic, such as chronic kidney disease, were more likely to be detected in individuals already diagnosed with co-existing morbidities than in individuals having no contact with health services. Other factors, such as screening, may also affect recorded diagnosis rates. An example is breast cancer, where small spikes in the rate of disease onset curve are apparent at the ages of 50 and 70, which correspond to the ages between which breast screening takes place (Supplementary Fig. S5a). However, given that disease onset is often latent with minimal clinical features, and that diagnosis from clinical manifestation in this current age of medicine in high-income countries such as England is usually time-efficient, EHRs present us with the best available proxy for age of disease onset, for the widest spectrum of disease, in the form of age at first recorded diagnosis.

Variable patterns of consultation could also affect the accuracy of the records. Disease frequency estimates for conditions which can be self-managed by over-the-counter medications or conditions affecting individuals at the mild end of the symptom spectrum may be underestimated using EHRs. Another limitation of this study is that we did not use free text comments to supplement the phenotyping algorithms for disease definition. This could have led to missing diagnoses for conditions that might not be well coded37. However, studies have shown that most diseases, including cancers, inflammatory bowel diseases, asthma, cataract, glaucoma and autism are reliably captured using diagnosis codes in primary care CPRD data linked to HES secondary care data38,39,40,41,42,43. Finally, we did not evaluate the data quality of the CPRD linked dataset44, but the use of diagnostic codes in the CPRD dataset for research purposes has previously been validated14,45.

In conclusion, we have developed a protocol to identify and classify ARDs from any EHR dataset representative of the general population. Our findings can be used to explore which ARDs co-occur more often than expected by chance and the common endogenous or environmental drivers behind them, leading to further research investigating the most suitable interventions to prevent or treat multiple ARDs effectively. This work is therefore the first, critical step towards tackling the challenges of ageing and ARDs, which are emerging as costly afflictions in the modern world.