PubAnnotation

CORD-19

Collection info

CORD-19 (COVID-19 Open Research Dataset) is a free, open resource for the global research community provided by the Allen Institute for AI: https://pages.semanticscholar.org/coronavirus-research.

As of 2020-03-20, it contains over 29,000 full-text articles. This CORD-19 collection at PubAnnotation was prepared to collect annotations of the texts so that they can be easily accessed and used.

To contribute your annotations:

1. Take the documents in the CORD-19_All_docs project.
2. Produce annotations for the texts using your annotation system.
3. Contribute the annotations back to PubAnnotation (HowTo).
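
The steps above can be sketched as follows. This is a minimal illustration, not PubAnnotation's own tooling: the toy dictionary `TERMS`, the `annotate` helper, and the sample sentence are hypothetical, and the `text`, `denotations`, `span`, and `obj` field names are assumed to follow the PubAnnotation JSON annotation format, so check them against the HowTo before uploading.

```python
import json
import re

# Hypothetical toy dictionary mapping surface terms to labels.
TERMS = {"fever": "Symptom", "pneumonia": "Disease"}


def annotate(text, terms=TERMS):
    """Return a dict in a PubAnnotation-style JSON shape for one text."""
    denotations = []
    for term, label in terms.items():
        # Whole-word, case-insensitive match; spans are character
        # offsets into the text, with an exclusive end.
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            denotations.append(
                {"span": {"begin": m.start(), "end": m.end()}, "obj": label}
            )
    # Order by position, then assign sequential ids T1, T2, ...
    denotations.sort(key=lambda d: d["span"]["begin"])
    for i, d in enumerate(denotations, start=1):
        d["id"] = f"T{i}"
    return {"text": text, "denotations": denotations}


doc = annotate("Patients presented with fever and viral pneumonia.")
print(json.dumps(doc, indent=2))
```

The resulting JSON would then be uploaded to your own project through whichever route the HowTo describes; the upload step itself is not shown here.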

All contributed annotations will become publicly available. Note that when uploading your annotation data, you do not need to worry about slight changes in the text: PubAnnotation will automatically detect them and adjust the annotation positions accordingly.
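
To give a feel for what adjusting positions after a slight text change involves, here is a small sketch. It is not PubAnnotation's actual alignment algorithm, which is not described on this page; it uses Python's standard `difflib` purely as an illustration, and the `remap_span` helper and sample texts are hypothetical.

```python
from difflib import SequenceMatcher


def remap_span(old_text, new_text, begin, end):
    """Map a [begin, end) span in old_text onto new_text, or return None."""
    matcher = SequenceMatcher(None, old_text, new_text, autojunk=False)
    for block in matcher.get_matching_blocks():
        # block.a and block.b are the start offsets of a matching run in
        # the old and new text; block.size is the length of that run.
        if block.a <= begin and end <= block.a + block.size:
            shift = block.b - block.a
            return begin + shift, end + shift
    # The annotated span itself was altered, so it is not carried over.
    return None


old = "Patients had fever and cough."
new = "The patients had  fever and cough."
begin = old.index("fever")
span = remap_span(old, new, begin, begin + len("fever"))
print(span, new[span[0]:span[1]])
```

The sketch only carries over spans whose text is unchanged; a span whose own text was edited returns `None` and would need manual review.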

Once you have uploaded your annotations, please notify admin@pubannotation.org so that they can be included in this collection, which will make them much easier to find.

Note that as the CORD-19 dataset grows, the documents in this collection will also be updated.

IMPORTANT: The CORD-19 license agreement requires that the dataset be used for text and data mining only.

Maintainer	Jin-Dong Kim

Projects

Name	Description	# Ann.	Maintainer	Updated_at	RDFized_at	Status

CORD-19_All_docs	All the documents in the whole CORD-19 dataset. The documents in this project will be updated as the CORD-19 dataset grows. See the COVID DATASET LICENSE AGREEMENT.	0	Jin-Dong Kim	2014-04-07	-	Released
CORD-19_bioRxiv_medRxiv_subset	The bioRxiv/medRxiv subset of the CORD-19 dataset: pre-prints that are not peer reviewed. The documents in this project will be updated as the CORD-19 dataset grows. See the COVID DATASET LICENSE AGREEMENT.	0	Jin-Dong Kim	2014-04-07	-	Released
CORD-19_Commercial_use_subset	The Commercial use subset of the CORD-19 dataset. The documents in this project will be updated as the CORD-19 dataset grows. See the COVID DATASET LICENSE AGREEMENT.	0	Jin-Dong Kim	2014-04-07	-	Released
CORD-19_Custom_license_subset	The Custom license subset of the CORD-19 dataset. The documents in this project will be updated as the CORD-19 dataset grows. See the COVID DATASET LICENSE AGREEMENT.	5.08 M	Jin-Dong Kim	2014-04-07	-	Released
CORD-19_Non-commercial_use_subset	The Non-commercial use subset of the CORD-19 dataset. The documents in this project will be updated as the CORD-19 dataset grows. See the COVID DATASET LICENSE AGREEMENT.	0	Jin-Dong Kim	2014-04-07	-	Released
CORD-19-PD-HP	PubDictionaries annotation for HP terms - updated on 2020-04-30. Disease term annotation based on HP, version 2020-04-20. The terms in HP are loaded into PubDictionaries, which is used to produce the annotations in this project. The parameter configuration used for this project is here. Note that this is an automatically generated dictionary-based annotation. It will be updated periodically as more documents are added and the dictionary is improved.	1.15 M	Jin-Dong Kim	2014-04-07	-	Released
CORD-19-PD-MONDO	PubDictionaries annotation for MONDO terms - updated on 2020-04-30. Disease term annotation based on MONDO, version 2020-04-20. The terms in MONDO are loaded into PubDictionaries, which is used to produce the annotations in this project. The parameter configuration used for this project is here. Note that this is an automatically generated dictionary-based annotation. It will be updated periodically as more documents are added and the dictionary is improved.	6.32 M	Jin-Dong Kim	2014-04-07	-	Released
CORD-19-PD-UBERON	PubDictionaries annotation for UBERON terms - updated on 2020-04-30. Disease term annotation based on Uberon. The terms in Uberon are uploaded to PubDictionaries (Uberon), which is used to produce the annotations in this project. The parameter configuration used for this project is here. Note that this is an automatically generated dictionary-based annotation. It will be updated periodically as more documents are added and the dictionary is improved.	1.42 M	Jin-Dong Kim	2014-04-07	-	Released
CORD-19-SciBite-sentences		11.2 K	Jin-Dong Kim	2014-04-07	-	Testing
CORD-19-Sentences		13.4 M	Jin-Dong Kim	2014-04-07	-	Testing
CORD-PICO	Automatic annotation of the CORD-19 dataset with PICO categories. The corpus was automatically labeled with an LSTM-CRF model trained on human-annotated PubMed abstracts from https://github.com/bepnye/EBM-NLP. Currently, only titles and abstracts are annotated, using Population, Intervention and Outcome labels as well as more fine-grained labels such as Age, Drug, Mortality and others.	69.6 K	ssuster	2014-04-07	-	Developing
Epistemic_Statements	The goal of this work is to identify epistemic statements in the scientific literature. An epistemic statement is a statement of unknowns, hypotheses, speculations, or uncertainties, including statements of claims, hypotheses, questions, explanations, future opportunities, surprises, issues, or concerns within a sentence. The unit of annotation is an automatically parsed sentence. The classification is binary: epistemic statement or not. Only epistemic statements are labeled, so an unlabeled statement can be assumed not to be an epistemic statement. The classifier is a CRF trained on gold-standard annotations of epistemic statements; the annotation effort is ongoing. We report an F-measure of 0.91 after 5-fold cross validation on a test set with 914 statements and an F-measure of 0.9 on a held-out document with 130 statements. This project is still under development and is submitted to be used for the CovidLit project and associated Hackathon. Please contact Mayla if you have any questions.	1.42 M	mboguslav	2014-04-07	-	Developing