Discussion

Our study demonstrated that data from Google Trends, the Baidu Index and the Sina Weibo Index on searches for the keywords ‘coronavirus’ and ‘pneumonia’ correlated with the published NHC data on the daily incidence of laboratory-confirmed and suspected cases of COVID-19, with the maximum r > 0.89. We also found that interest in these keywords in Internet search engines and social media peaked 10–14 days earlier than the incidence peak of COVID-19 published by the NHC. The lag correlation showed a maximum correlation at 8–12 days for laboratory-confirmed cases and 6–8 days for suspected cases.

COVID-19 is a rapidly spreading infectious disease with, at the time of submission, more than 80,000 cases and a reported mortality of 3.4% so far [10]. It is important to predict the development of this outbreak as early and as reliably as possible, in order to take action to prevent its spread. Our data showed that the two popularly used Internet search engines, Google and Baidu, and the social media platform Sina Weibo were able to predict the disease outbreak 1–2 weeks earlier than the traditional surveillance systems. The role of Internet surveillance tools in the early prediction of other epidemics has been reported previously, including for influenza [4], dengue fever [5], H1N1 [6], Zika [7], measles [8] and Middle East respiratory syndrome [9]. The availability of early information about infectious diseases through Internet search engines and social media will be helpful for making decisions related to disease control and prevention.

Internet search data have been shown to enable the monitoring of Middle East respiratory syndrome 3 days before laboratory confirmations [9]. However, our results showed a much longer lag between the digital surveillance data and the reported new laboratory-confirmed and suspected COVID-19 cases. There are several possible explanations. Firstly, COVID-19 is a novel disease that was only recently recognised.
The first version of a guideline for the diagnosis and management of COVID-19 was announced on 16 January 2020. It took time for medical professionals to learn about the virus and the disease in order to make a correct diagnosis. Secondly, the diagnosis of COVID-19 requires two independent confirmatory laboratory tests, which should be taken at least 1 day apart. Consistent with this, our results showed that the lag was shorter for suspected than for laboratory-confirmed cases. Thirdly, the supply of laboratory testing kits may have been insufficient in the early stages of the coronavirus outbreak, which would have limited the number of patients who could be confirmed. Finally, Internet searches and social media mentions are initiated not only by patients and their family members, but also globally by members of the general public who are concerned about this rapidly spreading disease.

In addition, we found that the data from the Baidu Index and the Sina Weibo Index could monitor the number of daily new confirmed and suspected cases from the NHC earlier than the data from Google Trends. A possible explanation is that Google is not a major search engine in China, where Baidu and Sina Weibo are widely used. The peak in the Sina Weibo Index was reached earlier than in Google Trends and the Baidu Index. This suggests that Sina Weibo, which also serves as a social medium, disseminated the information faster than traditional websites.

COVID-19 was first reported as ‘pneumonia of unknown aetiology’ or ‘pneumonia of unknown cause’ in late December 2019. On 8 January 2020, a novel coronavirus was identified as the cause of this disease. The disease was first named ‘novel coronavirus pneumonia’ by the NHC of China on 8 February and later ‘coronavirus disease 2019’ (abbreviated ‘COVID-19’) by the WHO on 11 February. Our search period was defined as 16 January to 11 February 2020.
Therefore, we think that the two keywords ‘pneumonia’ and ‘coronavirus’ were sufficient to capture most Internet content related to COVID-19 in this period. We also tried other terms such as ‘新冠’ (novel coronavirus) and ‘新型冠状病毒肺炎’ (novel coronavirus pneumonia) as keywords, but they returned much smaller numbers of queries and posts, and we therefore did not include them in the analysis.

It is also notable that the strength of the correlation differed between keywords. On Google, the keyword ‘coronavirus’ had the highest correlation coefficient (r = 0.958) with daily new laboratory-confirmed cases, and ‘pneumonia’ had the highest correlation coefficient with daily new suspected cases (r = 0.960). We found the same pattern in the Baidu Index and the Sina Weibo Index. An explanation could be that ‘coronavirus’ refers to the viral pathogen, which has to be identified by a laboratory test, while ‘pneumonia’ is a clinical term and should be more strongly linked to the suspected cases, which are based on clinical and imaging evidence.

A limitation of our study is its retrospective nature. If Internet search engine and social media data were used in a real-time surveillance system, finding the best lag time would be a challenge because we would not have any training data with which to calibrate the analysis for a new disease.
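To make the notion of a 'best lag time' concrete, the following is a minimal sketch of a lag correlation, not the analysis code used in this study. It assumes Pearson's r (this section does not state which correlation statistic was used), and the function names `pearson_r`, `lag_correlations` and `best_lag` are hypothetical. The two series are synthetic curves, not Google Trends, Baidu Index, Sina Weibo Index or NHC data. The sketch also shows why the limitation arises: the lag can only be chosen once the case series is available.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lag_correlations(search_index, daily_cases, max_lag):
    """Correlate search_index[t] with daily_cases[t + lag] for each lag in days."""
    results = {}
    for lag in range(max_lag + 1):
        # Drop the last `lag` days of searches and the first `lag` days of cases
        # so that each search value is paired with the case count `lag` days later.
        x = search_index[: len(search_index) - lag]
        y = daily_cases[lag:]
        results[lag] = pearson_r(x, y)
    return results


def best_lag(search_index, daily_cases, max_lag):
    """Return the lag (in days) with the highest correlation."""
    correlations = lag_correlations(search_index, daily_cases, max_lag)
    return max(correlations, key=correlations.get)


# Synthetic illustration only: a bell-shaped search curve peaking on day 12
# and an identical case curve peaking on day 22 (a 10-day delay by construction).
days = range(40)
search = [math.exp(-((d - 12) ** 2) / 30.0) for d in days]
cases = [math.exp(-((d - 22) ** 2) / 30.0) for d in days]

print(best_lag(search, cases, max_lag=14))
```

In a retrospective setting such as ours, the lag returned by a procedure like this is fitted to the full case series; in a prospective system that series would not yet exist for a new disease.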