Patients and methods

Patients

Ethical approval was obtained from the institutional review boards for this retrospective analysis, and the need for informed consent was waived. From January 1 to February 8, 2020, 70 consecutive patients with COVID-19 admitted to 5 independent hospitals in 4 cities were enrolled in this study (mean age, 42.9 years; range, 16–69 years), including 41 men (mean age, 41.8 years; range, 16–69 years) and 29 women (mean age, 44.5 years; range, 16–66 years). SARS-CoV-2 infection was confirmed in all patients by real-time RT-PCR and next-generation sequencing. Of these patients, 24 were from Huizhou City, 25 from Shantou City, 15 from Yongzhou City, and the remaining 6 from Meizhou City. During the same period, another 66 patients with pneumonia but without COVID-19 from Meizhou People’s Hospital were recruited as controls (mean age, 46.7 years; range, 0.3–93 years), including 43 men (mean age, 46.0 years; range, 0.3–93 years) and 23 women (mean age, 48.0 years; range, 1–86 years). All controls were confirmed by consecutive negative RT-PCR assays. Figure E1 in the Supplementary Material shows the patient recruitment pathway for the control group, along with the inclusion and exclusion criteria.

Previous studies [19–21] with sample sizes comparable to ours used a ratio of 7:3 between the primary and validation cohorts. In this study, the 136 patients were divided into primary (n = 98) and validation (n = 38) cohorts, a ratio close to 7:3. The validation cohort comprised 19 COVID-19 patients from two hospitals (6 patients from Meizhou People’s Hospital and 13 patients from the First Affiliated Hospital of Shantou University Medical College) and 19 randomly selected controls from Meizhou City. The remaining patients were assigned to the primary cohort: 51 COVID-19 patients from Huizhou, Yongzhou, and Shantou cities and 47 controls from Meizhou City.
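The cohort sizes reported above can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch in Python: the counts are copied from the text, while the variable names are introduced here and are not part of the study.

```python
# Patient counts as reported in the text (variable names are illustrative).
covid_by_city = {"Huizhou": 24, "Shantou": 25, "Yongzhou": 15, "Meizhou": 6}
n_controls = 66

# Validation cohort: 6 (Meizhou) + 13 (Shantou) COVID-19 patients and 19 controls.
validation = {"covid": 6 + 13, "control": 19}

# Everyone not placed in the validation cohort belongs to the primary cohort.
n_covid = sum(covid_by_city.values())
primary = {
    "covid": n_covid - validation["covid"],
    "control": n_controls - validation["control"],
}

n_total = n_covid + n_controls
n_primary = sum(primary.values())
n_validation = sum(validation.values())
primary_fraction = n_primary / n_total

print(f"total={n_total}, primary={n_primary}, validation={n_validation}")
print(f"primary share of all patients: {primary_fraction:.3f}")
```

The primary share comes out slightly above 0.70, consistent with the description of the split as close to 7:3.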
The primary cohort was used to select the most valuable features and build the predictive model, and the validation cohort was used to evaluate and validate the performance of the model.

Image and clinical data collection

Chest CT images without contrast material enhancement were obtained from multiple hospitals with different CT systems, including GE CT Discovery 750 HD (General Electric Company), SCENARIA 64 CT (Hitachi Medical), Philips Ingenuity CT (PHILIPS), and Siemens SOMATOM Definition AS (Siemens). All images were reconstructed into 1-mm slices with a slice gap of 0.8 mm. Detailed acquisition parameters are summarized in the Supplementary Material (Table E1).

The clinical history, nursing records, and laboratory findings were reviewed for all patients. Clinical characteristics, including demographic information, daily body temperature, blood pressure, heart rate, clinical symptoms, and history of exposure to epidemic centers, were collected. Total white blood cell (WBC) count, lymphocyte count, lymphocyte ratio, neutrophil count, neutrophil ratio, procalcitonin (PCT), C-reactive protein (CRP) level, and erythrocyte sedimentation rate (ESR) were measured. All threshold values for laboratory metrics were based on the normal ranges set by each individual hospital.

Image analysis

To extract radiological semantic features, two senior radiologists (D.L. and X.C., more than 15 years of experience) evaluated the images by consensus, blinded to the clinical and laboratory findings. The radiological semantic features included both qualitative and quantitative imaging features. Lesions in the outer third of the lung were defined as peripheral, and lesions in the inner two-thirds of the lung were defined as central [22].
The progression of COVID-19 lesions within each lung lobe was evaluated by scoring each lobe from 0 to 4 [7], corresponding to normal, 1–25% infection, 26–50% infection, 51–75% infection, and more than 75% infection, respectively. The scores of all five lobes were summed to give a total score ranging from 0 to 20. A total of 41 radiological features (26 quantitative and 15 qualitative) were extracted for the analysis. The descriptions of the radiological semantic features are listed in the Supplementary Material (Table E2). Figure 1 shows an example of the evaluation of CT imaging.

Fig. 1 A 23-year-old female with a travel history to Wuhan presenting with fever. Axial noncontrast CT image shows a consolidation with ground-glass opacities in the peripheral region of the right upper lobe. An air bronchogram is present in the lesion. The maximum diameter of the lesion is 2.8 cm. The right upper lobe score is 1 because less than 1/4 of the lung parenchyma is involved.

Clinical and radiological feature selection

To obtain the most valuable clinical and radiological semantic features, statistical analysis, univariate analysis, and the least absolute shrinkage and selection operator (LASSO) method were performed in sequence. In the statistical analysis, the chi-square test, the Kruskal-Wallis H test, and the t test were used to compare the radiological semantic and clinical features between the COVID-19 and non-COVID-19 groups. Features with a p value smaller than 0.05 were selected. Univariate analysis was then performed on the clinical and radiological candidate features to determine the COVID-19 risk factors, and features with a p value smaller than 0.05 in this analysis were again selected. Finally, the LASSO method [23] was used to select the most useful features, with the penalty parameter tuned by 10-fold cross-validation based on the minimum criterion.
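For reference, LASSO-penalized logistic regression is conventionally written as the optimization problem below. This is the standard textbook form, given here as a sketch rather than as a statement of the exact software settings used in this study. The notation is introduced here for illustration: n patients, feature vector x_i, label y_i in {0, 1} for COVID-19 status, intercept β_0, coefficients β_1, ..., β_p, and penalty parameter λ ≥ 0.

```latex
\left(\hat{\beta}_0, \hat{\beta}\right)
  = \arg\min_{\beta_0,\, \beta}
    \left\{
      -\frac{1}{n} \sum_{i=1}^{n}
        \left[ y_i \left( \beta_0 + x_i^{\top} \beta \right)
               - \log\!\left( 1 + e^{\beta_0 + x_i^{\top} \beta} \right) \right]
      + \lambda \sum_{j=1}^{p} \left| \beta_j \right|
    \right\}
```

Under the usual reading of the minimum criterion, which is an assumption here, λ is the value that gives the smallest mean cross-validated error over the 10 folds, and the features retained are those whose coefficients are nonzero at that λ.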
Diagnostic models were then constructed by multivariate logistic regression with the selected features. The flowchart of the feature selection process for these models is presented in the Supplementary Material (Fig. E2).

Development and validation of the diagnostic model

To develop an optimal model, we built and compared three models by multivariate logistic regression analysis: (i) the clinical features model (C model), (ii) the radiological semantic features model (R model), and (iii) the combined clinical and radiological semantic features model (CR model). The classification performance of the models was evaluated by receiver operating characteristic (ROC) curve analysis, and the area under the curve (AUC), accuracy, sensitivity, and specificity were calculated. A decision curve analysis was conducted to determine the clinical usefulness of the diagnostic model by quantifying the net benefits at different threshold probabilities in the validation dataset [24]. The development of the decision curve is described in the Supplementary Material. Figure 2 depicts the flowchart of the analysis pipeline described above.

We also built a nomogram, a quantitative tool for predicting the individual probability of COVID-19 infection, based on the multivariate logistic analysis of the CR model in the primary cohort. Each value of each predictive factor was assigned points according to the coefficient of that factor in the multivariate logistic regression model. The total points were obtained by summing the points of all predictive factors. The scale in the nomogram also shows the relationship between the total points and the predicted probability. The corresponding calibration curves of the CR model in the primary and validation cohorts are shown in the Supplementary Material (Fig. E3).

Fig. 2 Workflow of data processing and analysis in this study.
Radiological semantic features, including qualitative and quantitative imaging features, are extracted from axial lung CT sections. The clinical manifestations and laboratory parameters are provided by the electronic case system. Statistical analysis is performed to compare the features between COVID-19 and non-COVID-19 patients. Univariate analysis and the least absolute shrinkage and selection operator (LASSO) are then applied to the features with p < 0.05 in the statistical analysis to determine the COVID-19 risk factors. Three models based on the selected features are established by multivariate logistic regression. These models are the radiological model (R model), the clinical model (C model), and the combined clinical and radiological model (CR model). The performance and clinical benefit of the prediction models are assessed by the area under the receiver operating characteristic (ROC) curve and the decision curve, respectively.

Statistical analysis

Statistical analysis was conducted with R software (version 3.6.4, http://www.r-project.org/). All reported significance levels were two-sided, and the statistical significance level was set to 0.05. The multivariate logistic regression analysis was performed with the “stats” package. Nomogram construction was performed using the “rms” package. Decision curve analysis was performed using the “dca.R” package.
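To illustrate the quantity that a decision curve plots, the sketch below computes the net benefit of a model at a given threshold probability, using the commonly cited definition: true positives per patient minus false positives per patient weighted by the odds of the threshold. It is written in Python for illustration only and is not the “dca.R” code used in the study. The function names and the toy labels and probabilities are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating every patient whose predicted probability
    is at or above `threshold` (labels: 1 = disease, 0 = no disease)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat every patient regardless of the model."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical toy data, not taken from the study.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]

for pt in (0.2, 0.5, 0.8):
    model_nb = net_benefit(y_true, y_prob, pt)
    all_nb = net_benefit_treat_all(y_true, pt)
    print(f"threshold={pt:.1f}  model={model_nb:.3f}  treat-all={all_nb:.3f}")
```

A decision curve plots these values over a range of thresholds. A model is clinically useful at a given threshold when its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is zero.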