Development and external validation of a breast cancer absolute risk prediction model in Chinese population

In contrast to developed countries, breast cancer in China is characterized by a rapidly escalating incidence rate over the past two decades, a lower survival rate, and vast geographic variation. However, there is not yet a validated risk prediction model in China to aid early detection. A large nationwide prospective cohort, the China Kadoorie Biobank (CKB), was used to evaluate relative and attributable risks of invasive breast cancer. A total of 300,824 women free of any prior cancer were recruited during 2004–2008 and followed up to Dec 31, 2016. Cox models were used to identify breast cancer risk factors and build a relative risk model. Absolute risks were calculated by incorporating national age- and residence-specific breast cancer incidence and non-breast cancer mortality rates. We used an independent large prospective cohort, the Shanghai Women’s Health Study (SWHS), with 73,203 women to externally validate the calibration and discriminating accuracy. During a median of 10.2 years of follow-up in the CKB, 2287 cases were observed. The final model included age, residence area, education, BMI, height, family history of overall cancer, parity, and age at menarche. The model was well-calibrated in both the CKB and the SWHS, yielding expected/observed (E/O) ratios of 1.01 (95% confidence interval (CI), 0.94–1.09) and 0.94 (95% CI, 0.89–0.99), respectively. After eliminating the effect of age and residence, the model maintained moderate discriminating accuracy, comparable with that of some previous externally validated models. The adjusted areas under the receiver operating characteristic curve (AUC) were 0.634 (95% CI, 0.608–0.661) and 0.585 (95% CI, 0.564–0.605) in the CKB and the SWHS, respectively. Based only on non-laboratory predictors, our model has good calibration and moderate discriminating capacity. The model may serve as a useful tool to raise individuals’ awareness and aid risk-stratified screening and prevention strategies.


Introduction
Breast cancer is the most common and most rapidly increasing female malignancy in China [1]. Compared with developed countries, breast cancer in China is characterized by a rapidly increasing incidence rate, a lower survival rate, and vast geographic variation. The annual percent increase in breast cancer incidence was 4.5% and 9.1% in urban and rural areas of China, respectively [2]. In 2015, there were 304,000 newly diagnosed cases and 70,000 deaths from breast cancer, with an incidence rate of 54.3 per 100,000 in urban areas and 34.5 per 100,000 in rural areas [3]. The 5-year relative survival rates during 2003-2015 ranged only from 73.1% to 82.0% in Chinese women (55.9% to 72.9% for rural women), much lower than the 90% for American women [4]. Early detection is the cornerstone of preventing morbidity and mortality due to breast cancer. However, it has been impeded by limited individual awareness and the lack of a national-scale screening program.
Following the pioneering model derived by Gail et al. in 1989 [5], multiple models have been developed [6]. However, most models were developed in Western populations and may not be applicable to Chinese women, even the Gail model modified for Chinese-Americans [7]. A previous meta-analysis showed that these models tended to overestimate the risk of Asian women [8], and some predictors, such as the number of prior breast biopsies, are not available for most Chinese women. Several models have also been developed in China [9][10][11][12][13][14][15]. However, most of them were developed using a case-control design, which is subject to selection and recall bias. Additionally, all these studies were conducted with participants from the eastern provinces of China, where breast cancer incidence rates are higher than those in other areas of China [1]. More importantly, of the seven models, only one, which was developed in Shandong province, has been externally validated, and only in a small cohort with 34 cases. Therefore, a validated breast cancer risk prediction model based on data from Chinese women and with good generalizability is timely and much needed.
In this study, we used data from a large nationwide prospective cohort, the China Kadoorie Biobank (CKB), as well as national age- and residence (urban and rural)-specific invasive breast cancer incidence rates and non-breast cancer mortality rates to develop a risk prediction model that accounts for competing risk, and used data from another large prospective cohort, the Shanghai Women's Health Study (SWHS), to independently validate the model.

Data for model development
Data from the CKB, a large-scale prospective study, were used to derive the relative risk (RR) model [16]. The study took place in 10 study sites, 5 in urban areas (Qingdao, Harbin, Haikou, Suzhou, Liuzhou) and 5 in rural areas (Pengzhou, Tianshui, Hui county, Tongxiang, Liuyang) of China. The regions were selected according to local disease patterns, exposure to certain risk factors, population stability, quality of death and disease registries, local commitment, and capacity. Potentially eligible participants were identified through official residential records. Invitation letters (with study information leaflets) were delivered door-to-door by local community leaders or health workers. The estimated population response rate was about 30% (26-38% in the five rural areas and 16-50% in the five urban areas). Overall, a total of 512,715 participants aged 30-79 years, including 302,510 (59.0%) women, were recruited during 2004-2008. All participants completed a questionnaire and had physical measurements taken.
Incident cases of invasive breast cancer and mortality were identified chiefly through the linkage with the national health insurance claim database and disease registries, supplemented with local residential records and annual active confirmation. The International Classification of Diseases, 10th Revision was used to code all breast cancer (C50) by trained staff who were blinded to baseline information. We excluded women who had missing data for any reproductive factors or who provided implausible data on age at menarche or age at first live birth. We further excluded women who reported previous histories of cancer at baseline or had missing data for body mass index (BMI), leaving 300,824 women in the analysis.

Data for external validation
Independent data from the SWHS were used to externally validate the model derived from CKB data [17]. In brief, 74,942 women were recruited from seven urban communities in Shanghai, China during 1996-2000.
At baseline, all information involved in the current analysis was collected through in-person interviews and anthropometric measures following standard protocol. Incident breast cancer cases (ICD-9 code 174) were identified by a combination of active re-surveys every 2 to 4 years and annual linkage with the Shanghai Cancer Registry and the Shanghai death certificate registry. The cancer diagnosis was verified through home visits and reviews of medical charts obtained from the hospitals where the patients were diagnosed. Applying the same exclusion criteria as the CKB data, 73,203 SWHS participants were included.

Relative risk prediction model
Participants were considered at risk from enrollment to the diagnosis of invasive breast cancer, death, loss to follow-up, or Dec 31, 2016, whichever came first. A Cox proportional hazards model was used to estimate hazard ratios as the metric of relative risk (RR) for each variable in the model, with age as the timescale, and stratified jointly by 10 study sites and age at enrollment in 5-year intervals (i.e., 100 strata to control for confounding by age and study site).
We initially considered the following variables to construct the model: education, tobacco smoking, alcohol drinking, total physical activity, consumption of soybean, BMI, height, first-degree family history of overall cancer, menopausal status, number of live births, age at menarche, total duration of breastfeeding, and use of contraceptives. Because we did not collect information on family history of breast cancer, we used the family history of overall cancer as a surrogate. The continuous variables were converted to categorical variables to reduce overfitting. Cutoffs for BMI were chosen according to the well-established criteria for Chinese adults [18], and quartiles of height were used as cutoffs for height. For other predictors, cutoffs were chosen so that the model achieved the smallest Bayesian Information Criterion (BIC). We assessed the proportional hazards assumption using Schoenfeld residuals. In line with previous studies [19,20], we found that only BMI was subject to time-varying effects. Therefore, we further split follow-up time into two age intervals at 50 years and added an interaction term between attained age and BMI. We first assessed all variables with P < 0.05 together in the model. Variable selection was repeated using stepwise backward elimination, which yielded the same result. The variables were converted to ordinal variables if their RRs were proportional to levels and no evidence of nonlinearity was detected using fractional polynomials. All first-order interactions were tested one by one using the likelihood ratio test comparing models with and without the interaction term. For all variables in the final model, the lowest risk category was taken as the reference group, to facilitate population attributable risk (PAR) computation.
Given the higher incidence rate of breast cancer in urban areas than in rural areas, we also attempted to build residence (urban/rural)-specific models, i.e., variable selection and coefficient estimation were performed separately in the urban and rural datasets. Interestingly, we found that the relative risks were similar between urban and rural areas, and there was no significant interaction between area and risk factors (see Additional file 1). Therefore, we used the same set of relative risk estimates for all participants in the CKB to maintain model parsimony and to estimate hazard ratios more reliably.

Absolute risk projection
We used an approach similar to that described by Gail et al. to project absolute risk from initial age to final age [5,21]. Briefly, the absolute risk that a woman who is age a and who has risk factors x will develop breast cancer by age a + τ is

$$P(a, \tau, x) = \int_{a}^{a+\tau} h_1(t, x) \exp\left\{-\int_{a}^{t} \left[h_1(u, x) + h_2(u)\right] du\right\} dt,$$

where h_1(t, x) is the age-specific hazard of developing breast cancer and h_2(t) is the age-specific hazard for competing causes at age t. We can estimate h_1(t, x) = h_10(t) RR(x) as the product of the age- and residence-specific baseline hazard h_10(t) and the relative risk RR(x) from the relative risk model described above. RR(x) is age-constant for all risk factors x except BMI, which has two different RRs for < 50 and ≥ 50 years old.
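The projection described above can be sketched numerically. The following is a minimal illustration, not the authors' SAS implementation: it assumes that both hazards are constant within each one-year age interval, so the integral has a closed form per interval. The function name `absolute_risk`, its arguments, and all numeric rates in the usage lines are hypothetical.

```python
import math
from typing import Callable


def absolute_risk(
    age_start: int,
    age_end: int,
    baseline_hazard: Callable[[int], float],   # h_10(t), per person-year
    competing_hazard: Callable[[int], float],  # h_2(t), per person-year
    relative_risk: Callable[[int], float],     # RR(x), may change with age (e.g. BMI at 50)
) -> float:
    """Probability of developing breast cancer between age_start and age_end.

    Assumption: hazards are piecewise constant over one-year age intervals.
    """
    surviving = 1.0  # probability of being alive and cancer-free at the interval start
    risk = 0.0
    for age in range(age_start, age_end):
        h1 = baseline_hazard(age) * relative_risk(age)
        h2 = competing_hazard(age)
        total = h1 + h2
        if total > 0:
            # Share of exits during this year that are breast cancer diagnoses
            risk += surviving * (h1 / total) * (1.0 - math.exp(-total))
        surviving *= math.exp(-total)
    return risk


# Hypothetical inputs: flat hazards and an RR that switches at age 50
ten_year_risk = absolute_risk(
    45, 55,
    baseline_hazard=lambda age: 20e-5,
    competing_hazard=lambda age: 300e-5,
    relative_risk=lambda age: 1.2 if age < 50 else 1.6,
)
print(f"10-year projected risk: {ten_year_risk:.4%}")
```

Passing the relative risk as a function of age is one way to accommodate the two BMI-related RRs below and above 50 years; in practice the hazards would come from tabulated age- and residence-specific rates.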
To obtain a robust and generalizable model, we calculated the baseline age- and residence-specific hazards h_10(t) by multiplying age-specific incidence rates in 2014 from the National Central Cancer Registry of China (NCCR) [22] by one minus the population attributable risk (PAR). The PAR was estimated using the formula described by Bruzzi et al. [23] and can be interpreted as the fraction of breast cancer incidence that would have been prevented during follow-up if all six predictors in the relative risk model (i.e., education, BMI, height, family history of overall cancer, parity, and age at menarche) had taken their lowest risk category. A PAR of 1 indicates that all breast cancer incidence is attributable to these factors, while a PAR of 0 indicates that none is. The distribution of risk factors differed across the four groups defined by attained age (below/above 50 years old) and residence (urban/rural), so we estimated the PAR separately in these four groups. Further, death from causes other than breast cancer, the risk of which increases with age, precludes the occurrence of breast cancer. To account for this competing risk, we calculated age- and residence-specific mortality rates from causes other than breast cancer, h_2(t), as age- and residence-specific all-cause mortality rates in 2014 from the Health Statistics Yearbook [24] minus age- and residence-specific breast cancer mortality rates in 2014 from the NCCR. These incidence and mortality rates are listed in Additional file 2.
As a sensitivity analysis, we built an absolute risk model using breast cancer incidence rates and non-breast cancer mortality rates from the CKB cohort to assess calibration in the internal validation. As another sensitivity analysis, we built an absolute risk model using breast cancer incidence rates and non-breast cancer mortality rates from Shanghai in the external validation (calibrated model) to evaluate whether robust local rates, if available, can improve model performance.
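As an illustration of this step, the sketch below computes a PAR in the form commonly attributed to Bruzzi et al., PAR = 1 − Σ_j ρ_j / RR_j, where ρ_j is the proportion of cases in joint risk-factor stratum j and RR_j is that stratum's relative risk against the all-reference stratum. Averaging 1/RR over individual cases is equivalent. The function names and the example values are hypothetical; the paper's own computation may differ in detail.

```python
from typing import Sequence


def bruzzi_par(case_relative_risks: Sequence[float]) -> float:
    """PAR from the model-based relative risk of each observed case.

    Each entry is the RR of one case relative to the lowest-risk
    category of every predictor (so every RR is >= 1).
    """
    mean_inverse_rr = sum(1.0 / rr for rr in case_relative_risks) / len(case_relative_risks)
    return 1.0 - mean_inverse_rr


def baseline_hazard(population_incidence: float, par: float) -> float:
    """Hazard for a woman in the reference category of all predictors."""
    return population_incidence * (1.0 - par)


# Hypothetical example: four cases with relative risks 1, 2, 2 and 4
par = bruzzi_par([1.0, 2.0, 2.0, 4.0])
print(par)  # 0.4375

# Hypothetical registry rate of 60 per 100,000 person-years
print(baseline_hazard(60e-5, par))
```

The rescaling by one minus PAR converts a population-average incidence rate into the rate expected for women with every predictor at its reference level, which is what the relative risks are then multiplied onto.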

Validation
The above development process was first carried out using the whole CKB dataset and then repeated in a random two-thirds of the CKB data (derivation subcohort). We found that the same set of predictors was selected and that the RRs for predictors were similar using these two approaches (Additional file 3). We used a data-splitting approach for internal validation, i.e., the model was fitted to a random two-thirds of the CKB data and evaluated on the remaining one-third (test subcohort). To obtain more precise estimates of model parameters, we still used the model developed from the whole CKB dataset for external validation in the SWHS dataset. We assessed calibration by comparing the expected number of breast cancer cases (E) with the observed number (O), overall and for subgroups defined by predictors. A calibration plot was drawn to examine the agreement across deciles of predicted risk in the total population. The projected probability of breast cancer was calculated from the age at enrollment to the younger of either the age at last follow-up or the age on Dec 31, 2016, for the CKB participants or Dec 31, 2014, for the SWHS participants. The 10-year projected risk was also estimated. The 95% confidence intervals (CIs) of E/O ratios were calculated based on the Poisson distribution. An E/O ratio above one indicates that the model overestimates cancer risk, and an E/O ratio below one indicates that the model underestimates cancer risk. Discrimination was quantified by calculating the area under the receiver-operating characteristic curve (AUC), also known as the c-statistic, for the 10-year risk model. The age- and residence-adjusted AUC was also assessed to eliminate the effect of age and residence. A higher AUC indicates higher discriminating ability, where random classification results in an AUC of 0.5 and perfect discrimination results in an AUC of 1. To further assess the discriminating accuracy, we estimated the RRs comparing different quintiles of predicted risk.
We also estimated a range of performance indices corresponding to a series of cut-offs ranging from 0.4% to 2% in both the CKB and the SWHS. The indices included the percentage of the population classified as high-risk, sensitivity, specificity, positive/negative predictive value (PPV/NPV), and the number needed to screen to confirm one case in the next 10 years (NNS, one divided by the PPV).
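The two main validation metrics can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the text says only that the CIs were "based on Poisson distribution", and the log-scale approximation E/O × exp(±1.96 × sqrt(1/O)) used here is one common choice that we assume for the example. The AUC is computed as the unadjusted Mann-Whitney probability; the age- and residence-adjusted version is not implemented. Function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def eo_ratio(expected: float, observed: int, z: float = 1.96) -> Tuple[float, float, float]:
    """E/O ratio with an approximate 95% CI treating O as a Poisson count.

    Assumed CI form: E/O * exp(+/- z * sqrt(1/O)).
    """
    ratio = expected / observed
    half_width = z * math.sqrt(1.0 / observed)
    return ratio, ratio * math.exp(-half_width), ratio * math.exp(half_width)


def auc(risks: Sequence[float], outcomes: Sequence[int]) -> float:
    """Probability that a random case has a higher predicted risk than a
    random non-case (ties count one half)."""
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    controls = [r for r, y in zip(risks, outcomes) if y == 0]
    wins = 0.0
    for case_risk in cases:
        for control_risk in controls:
            if case_risk > control_risk:
                wins += 1.0
            elif case_risk == control_risk:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# SWHS external validation: 1320 expected vs 1409 observed cases
ratio, lower, upper = eo_ratio(expected=1320, observed=1409)
print(f"E/O = {ratio:.2f} (95% CI {lower:.2f}-{upper:.2f})")  # E/O = 0.94 (95% CI 0.89-0.99)
```

With the SWHS counts reported in the Results, this approximation gives the same rounded E/O ratio and interval as the paper. The pairwise AUC loop is quadratic in cohort size, so a rank-based formulation would be preferable for cohorts as large as the CKB.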
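For a single cut-off, these indices can be derived from a two-by-two table of predicted risk group against observed outcome, as in the sketch below. The function name, the convention that a predicted risk at or above the cut-off counts as high-risk, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def screening_indices(risks: Sequence[float], outcomes: Sequence[int], cutoff: float) -> Dict[str, float]:
    """Performance indices when women with 10-year risk >= cutoff are flagged."""
    tp = fp = fn = tn = 0
    for risk, outcome in zip(risks, outcomes):
        flagged = risk >= cutoff
        if flagged and outcome == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif outcome == 1:
            fn += 1
        else:
            tn += 1

    def safe_div(numerator: float, denominator: float) -> float:
        # Undefined ratios (e.g. nobody flagged) are returned as NaN
        return numerator / denominator if denominator else float("nan")

    ppv = safe_div(tp, tp + fp)
    return {
        "percent_high_risk": safe_div(tp + fp, tp + fp + fn + tn),
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "ppv": ppv,
        "npv": safe_div(tn, tn + fn),
        "nns": safe_div(1.0, ppv),  # number needed to screen = 1 / PPV
    }


# Toy data: six women, two of whom developed breast cancer
toy_risks = [0.001, 0.005, 0.010, 0.020, 0.003, 0.015]
toy_outcomes = [0, 0, 1, 1, 0, 0]
print(screening_indices(toy_risks, toy_outcomes, cutoff=0.01))
```

Evaluating the function over a grid of cut-offs between 0.4% and 2% would trace out the trade-off between the share of women flagged as high-risk and the number needed to screen.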
The calculation of absolute risk was performed using SAS (version 9.4, SAS Institute Inc.), and all other statistical analyses were performed using Stata (version 14, StataCorp).

Results
Of the 300,824 women in the CKB cohort included in the RR model development, the mean age at recruitment was 51.4 years. Compared with those in rural areas, women in urban areas were older, more educated, more overweight or obese, taller, and were more likely to have positive overall cancer family history, early age at menarche, and less likely to have multiple children (Table 1). Compared with women in urban areas of the CKB, women in the SWHS had similar ages, BMI, and number of live births, but tended to be more educated, taller, to have more relatives diagnosed with cancer, and to have an earlier age at menarche.
During a median of 10.2 years of follow-up in the CKB, 2287 women developed invasive breast cancer. The final age- and study site-stratified model included education, BMI, height, family history of cancer, parity, and age at menarche (Table 2). The association between BMI and breast cancer risk was nonsignificant in women younger than 50 years and positive in women above this age (the test for interaction was significant). No other significant interaction between predictors was found. Based on the relative risk model and the distribution of risk factors, the PARs estimated in urban areas were 0.74 for women younger than 50 years and 0.76 for women 50 years and older. The corresponding PAR estimates in rural areas were 0.63 and 0.65, reflecting that fewer cases were attributable to the six predictors in the relative risk model in rural areas.
Of the 73,203 women in the SWHS, 1409 were diagnosed with breast cancer during a median of 16.1 years of follow-up. The CKB model predicted 1320 cases in the SWHS, yielding an E/O of 0.94 (95% CI, 0.89 to 0.99). The number of cases was statistically significantly underestimated among women aged 60 years and older, women with lower education, women shorter than 150.2 cm, women without a family history of overall cancer, women with multiple live births, and women with age at menarche of 15-16 years. The model statistically significantly overestimated risk for women with 2 or more affected first-degree relatives. For all other categories, there was good agreement between the observed and predicted numbers of breast cancers (Table 3). The calibration plot showed agreement across deciles of predicted risk, except for the second-lowest decile (Fig. 1b). We further recalculated the absolute risk using Shanghai local rates and found better calibration, with an overall E/O (95% CI) of 1.01 (0.96-1.06) (see Additional file 4).
As a reference, we also present calibration results for the test subcohort of the CKB study (Table 3 and Fig. 1a). Overall, the CKB model predicted 760 cases in the CKB test subcohort, yielding an E/O (95% CI) of 1.01 (0.94-1.09). The model statistically significantly overestimated the risk of women in rural areas but underestimated the risk in urban areas. In the sensitivity analysis, we recalculated the absolute risk using CKB rates (see Additional file 4), and found the calibrated E/Os were 1.03 (0.95-1.13) and 0.99 (0.88-1.12) for participants in the urban and rural areas, respectively. Discriminating accuracy of the 10-year risk model is presented in Table 4 and Fig. 1c, d. The overall AUC was 0.658 (95% CI, 0.631-0.684) in the CKB test subcohort and attenuated to 0.634 (95% CI, 0.608-0.661) after adjusting for age and residence. External validation resulted in an overall unadjusted AUC of 0.573 (95% CI, 0.553-0.593) and an age-adjusted AUC of 0.585 (95% CI, 0.564-0.605).
Compared with women in the lowest quintile of 10-year predicted risk, the adjusted RR for women in the highest quintile was 6.74 in the CKB (95% CI, 4.57-9.92) and 2.55 in the SWHS (95% CI, 2.06-3.16) (Table 5). Larger RRs were observed in women aged 50 years and older and in women in urban areas. The stratifying efficiency of our model at different 10-year predicted risk cut-offs in the CKB and SWHS is shown in Additional files 5 and 6.

Discussion
We developed a prediction model for invasive breast cancer among Chinese women aged 30 years and older using data from a large nationwide prospective cohort and validated its performance in an independent cohort in Shanghai. The model includes six factors in the relative risk prediction (education, BMI, height, family history of overall cancer, parity, and age at menarche) and two additional factors in the absolute risk prediction (age and residence area). The model was well-calibrated in both the CKB and SWHS cohorts, though there was under- or overestimation of risk in some risk factor strata. After eliminating the effect of age and residence, we found adjusted AUCs of 0.634 and 0.585 in the CKB and SWHS, respectively, which are comparable with those of some previous externally validated models [9,25].
Overall, our model fit well in the CKB and underestimated the risk by 6% among the urban women in the SWHS. To achieve good model generalization, we applied China's national age- and residence (urban/rural)-specific rates in the absolute risk calculation, instead of regional rates as in previous studies in China [9][10][11][12][13][14][15]. Therefore, the agreement of the national rates with rates in the validation datasets may play a major role in the calibration. CKB's cancer incidence and mortality rates were consistent with national rates during 2008-2013 [26], resulting in the excellent calibration in the CKB. Despite the overall concordance, the model overestimated the risk of women in rural areas but underestimated the risk in urban areas, reflecting higher incidence rates in urban areas and lower rates in rural areas in the CKB cohort than the corresponding national rates (see Additional file 2). Interestingly, although the SWHS women were recruited around 10 years before the CKB and in Shanghai, one of the most developed cities in China, the CKB model still provided acceptable calibration in the SWHS cohort. The slight underestimation was caused by higher incidence rates of breast cancer in Shanghai. In our sensitivity analyses recalculating the absolute risk using local rates, the above-mentioned calibration errors diminished, suggesting that our relative risk model was robust and that the errors were caused by the mismatch between national rates and local rates (see Additional file 4).
A previous meta-analysis showed that the Asian American Breast Cancer Study model (AABCS), or Gail model for Asian Americans, overestimated breast cancer risk for Asian women (pooled E/O = 1.82, 95% CI 1.31-2.51) [7,8]. This overestimation was also observed in a recent cohort study in China (E/O = 2.39, 95% CI 1.71-3.46) [9]. Similarly, we applied the AABCS model to the CKB and SWHS data and
In the external validation, we found a moderate AUC of 0.585, which was better than or equivalent to those of the AABCS model [8,9,25]. Matsuno et al. reported that the AUC of the AABCS model (including age at menarche, age at first live birth, number of affected mothers, sisters, and daughters with breast cancer, and number of previous benign biopsies) was 0.614 (95% CI 0.587-0.640) in the validation among Asian-Americans [7], but the AUC decreased to 0.54 in two independent validations conducted in China [9] and Korea [25]. We found that the age- and residence-adjusted AUCs of both the original AABCS model and the calibrated AABCS model in the CKB and the SWHS data were all around 0.54 (see Additional file 7). To our knowledge, only one model developed in China has been externally validated, with a higher AUC (0.64, 95% CI 0.55-0.72), but the small number of cases in its validation set and the shared location of its derivation and validation sets limited the robustness of the results [9]. Although several models in China achieved statistically significantly higher AUCs by additionally including genetic information, the lack of external validation precludes direct comparison with our model [11,14,15].
The development of the CKB risk prediction model has several public health implications. First, our model, with its moderate discriminating ability and good calibration, can facilitate allocation of preventive resources under monetary and medical constraints and aid risk-based screening strategies [27]. China's 2019 breast cancer screening guidelines recommended opportunistic screening for women at average risk aged 40-44 years and biennial screening for women aged 45-69 years, mainly by mammography and supplemented with breast ultrasonography and magnetic resonance imaging [28]. However, such an age-based screening strategy ignores the large variation in breast cancer risk in the population [29]. Given the limited medical and economic resources in China, it is more cost-effective to adopt a risk-based screening strategy that allocates resources to intensive screening for women at high risk and less frequent screening for women at low risk. Second, at the individual level, our model can be used for individual risk counseling and to promote a healthy lifestyle. Knowing their own cancer risk may motivate obese women to lose weight. Third, as described by Gail et al., our model can also aid in designing preventive trials and estimating the absolute disease burden in a specific population [27].
Our study has several strengths. We used data from the largest nationwide prospective cohort study in China to develop the relative risk model, augmented with China's national incidence and mortality rates, and validated it in another large prospective cohort study. These methods help make our model robust and potentially generalizable to both rural and urban areas in China. Also, all predictors in the model are non-invasive and easy to measure at low cost, which makes the model easily applicable to the general population. We plan to develop an online risk calculator to promote its use.
However, our study has limitations. First, several established risk factors were not included in the model. Although several studies included alcohol [29][30][31], the low prevalence of alcohol intake in the CKB (see Table 1) precluded its inclusion. Additionally, we did not have data on family history of breast cancer, so we used a family history of all cancers as a surrogate to capture the inherited susceptibility to breast cancer as much as possible. This surrogate may not be accurate, such that the risk was overestimated in women with two or more family members having cancer. The history of benign breast diseases was not collected in the CKB, and we think it might not be reliably collected in the general Chinese population. Second, cumulative evidence has shown heterogeneous associations of epidemiological factors with estrogen receptor (ER)-specific breast cancer, though some factors are common to both ER-positive and ER-negative breast cancers [32,33]. We did not build ER-specific models due to the lack of information on subtypes of breast cancer in the current database of the CKB cohort. Since the majority of breast cancers in Chinese women are ER-positive (80.3% in women < 50 years and 76.8% in women 50 or older) [34], our model might primarily apply to ER-positive breast cancer. Finally, we only externally validated our model in urban Shanghai, which has one of the highest incidence rates in China. Therefore, further validation of our model in other regions, especially in rural regions, is still needed.

Conclusions
In summary, we have developed and validated a breast cancer risk prediction model that relies only on non-laboratory predictors. The model has good calibration and moderate discriminating capacity. The model may serve as a useful tool to raise individuals' awareness and to identify women who may benefit from breast cancer screening in China. To improve the model's discriminating accuracy, further studies can add genetic and epigenetic predictors for breast cancer, as well as mammographic density. Validation of our model in other regions of China, especially rural areas, is also desirable to evaluate the robustness of the CKB model.