Projects

Showing 1-20 of 323 projects.

Name	Description	# Ann.	Author	Maintainer	Updated_at	Status
NCBI-Disease-Train	The NCBI disease corpus is fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community.	5.15 K	Rezarta Islamaj Doğan,Robert Leaman,Zhiyong Lu	Kenkim	2025-01-17	Released
NCBI-Disease-Test	The NCBI disease corpus is fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community.	960	Rezarta Islamaj Doğan,Robert Leaman,Zhiyong Lu	Kenkim	2025-01-17	Released
NCBI-Disease-Develop	The NCBI disease corpus is fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community.	787	Rezarta Islamaj Doğan,Robert Leaman,Zhiyong Lu	Kenkim	2025-01-17	Released
bionlp-st-ge-2016-coref	Coreference annotation to the benchmark data set (reference and test) of BioNLP-ST 2016 GE task. For detailed information, please refer to the benchmark reference data set (bionlp-st-ge-2016-reference) and benchmark test data set (bionlp-st-ge-2016-test).	853	DBCLS	Jin-Dong Kim	2024-06-17	Released
NCBIDiseaseCorpus	The NCBI disease corpus is fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community.	6.85 K	Rezarta Islamaj Doğan,Robert Leaman,Zhiyong Lu	Chih-Hsuan Wei	2023-11-29	Released
RELISH-DB	Abstracts contained in the data of the RELISH-DB (https://relishdb.ict.griffith.edu.au) made available for download here. Data was downloaded from here: https://figshare.com/projects/RELISH-DB/60095 Related publication: https://academic.oup.com/database/article/doi/10.1093/database/baz085/5608006#200722023	0			2023-11-29	Released
craft-ca-core-dev	Development data for CRAFT CA shared task, core concepts only. This project contains the development (training) annotations for the Concept Annotation task of the CRAFT Shared Task 2019. This particular set of concept annotations is the "core" set. See the task description for details, but this set contains only annotations to concepts that appear in the original 10 Open Biomedical Ontologies used for annotation. (That is to say, it does not contain any annotations to extension classes).	59.8 K	University of Colorado Anschutz Medical Campus	craft-st	2023-11-29	Released
bionlp-st-ge-2016-test	This is the benchmark test data set of the BioNLP-ST 2016 GE task. It includes Genia-style event annotations to 14 full paper articles about NFκB proteins. For testing purposes, however, the annotations are all blinded, which means users cannot see the annotations in this project. Instead, annotations in any other project can be compared to the hidden annotations in this project, and the annotations in that project will be automatically evaluated based on the comparison. A participant in the GE task can get an evaluation of their automatic annotation results through the following process: (1) Create a new project. (2) Import documents from the project bionlp-st-2016-test-proteins to your project. (3) Import annotations from the project bionlp-st-2016-test-proteins to your project. At this point, you may want to compare your project to this project, the benchmark data set. It will show that the protein annotations in your project are 100% correct, but other annotations, e.g., events, are 0%. (4) Produce event annotations, using your system, upon the protein annotations. (5) Upload your event annotations to your project. (6) Compare your project to this project to get the evaluation. The GE 2016 benchmark data set is provided as multi-layer annotations, which include: bionlp-st-ge-2016-reference: benchmark reference data set; bionlp-st-ge-2016-test: benchmark test data set (this project); bionlp-st-ge-2016-test-proteins: protein annotation to the benchmark test data set. The following are supporting resources: bionlp-st-ge-2016-coref: coreference annotation; bionlp-st-ge-2016-uniprot: protein annotation with UniProt IDs; pmc-enju-pas: dependency parsing result produced by Enju; UBERON-AE: annotation for anatomical entities as defined in UBERON; ICD10: annotation for disease names as defined in ICD10; GO-BP: annotation for biological process names as defined in GO; GO-CC: annotation for cellular component names as defined in GO. A SPARQL-driven search interface is provided at http://bionlp.dbcls.jp/sparql.	7.99 K	DBCLS	Jin-Dong Kim	2023-11-29	Released
BioLarkPubmedHPO	228 abstracts manually annotated with Human Phenotype Ontology (HPO) concepts and harmonized by three curators, which can be used as a reference standard for free text annotation of human phenotypes. For more info, please see Groza et al. "Automatic concept recognition using the human phenotype ontology reference and test suite corpora", 2015.	7.16 K	Tudor Groza	simon	2023-11-29	Released
AnEM_abstracts	250 documents selected randomly from citation abstracts. Entity types: organism subdivision, anatomical system, organ, multi-tissue structure, tissue, cell, developing anatomical structure, cellular component, organism substance, immaterial anatomical entity and pathological formation. Together with AnEM_full-texts, it is probably the largest manually annotated corpus on anatomical entities.	1.91 K	NaCTeM	Yue Wang	2023-11-29	Released
GENIAcorpus	multi_cell (1,782) mono_cell (222) virus (2,136) protein_family_or_group (8,002) protein_complex (2,394) protein_molecule (21,290) protein_subunit (942) protein_substructure (129) protein_domain_or_region (1,044) protein_other (97) peptide (521) amino_acid_monomer (784) DNA_family_or_group (332) DNA_molecule (664) DNA_substructure (2) DNA_domain_or_region (39) DNA_other (16) RNA_family_or_group (1,545) RNA_molecule (554) RNA_substructure (106) RNA_domain_or_region (8,237) RNA_other (48) polynucleotide (259) nucleotide (243) lipid (2,375) carbohydrate (99) other_organic_compound (4,113) body_part (461) tissue (706) cell_type (7,473) cell_component (679) cell_line (4,129) other_artificial_source (211) inorganic (258) atom (342) other (21,056)	78.9 K	GENIA Project	Yue Wang	2023-11-29	Released
LocText	The manually annotated corpus consists of 100 PubMed abstracts annotated for proteins, subcellular localizations, organisms and relations between them. The focus of the corpus is on annotation of proteins and their subcellular localizations.	2.29 K	Goldberg et al	Shrikant Vinchurkar	2023-11-29	Released
LitCovid-v1-docs	A comprehensive literature resource on the subject of Covid-19 is collected by NCBI: https://www.ncbi.nlm.nih.gov/research/coronavirus/ The LitCovid project@PubAnnotation is a collection of the titles and abstracts of the LitCovid dataset, for people who want to perform text mining analysis. Please note that if you produce annotations to the documents in this project and contribute them back to PubAnnotation, they will become publicly available together with contributions from other people. If you want to contribute your annotations to PubAnnotation, please refer to the documentation page: http://www.pubannotation.org/docs/submit-annotation/ The list of PMIDs is sourced from here. The 6 entries of the following PMIDs could not be included because they were not available from PubMed: 32161394, 32104909, 32090470, 32076224, 32161394, 32188956, 32238946. Below is a notice from the original LitCovid dataset: PUBLIC DOMAIN NOTICE National Center for Biotechnology Information This software/database is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the author's official duties as a United States Government employee and thus cannot be copyrighted. This software/database is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction. Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose. Please cite the authors in any work or product based on this material: Chen Q, Allot A, & Lu Z. (2020) Keep up with the latest coronavirus research, Nature 579:193	0		Jin-Dong Kim	2023-11-29	Released
RDoCTask2SampleData	Each annotation file contains an annotated abstract with the most relevant sentence. The relevant sentence is annotated with the RDoC category name. The annotation data are formatted as json files. Please refer to the following page for a more detailed description of the json format http://www.pubannotation.org/docs/annotation-format/.	10		mmanani1s	2023-11-29	Released
RDoCTask1SampleData	Each annotation file contains an annotated abstract with an RDoC category. Each title span in these sample data is annotated with the corresponding related RDoC construct, although the RDoC category would apply for the entire abstract. The annotation data are formatted as json files. Please refer to the following page for a more detailed description of the json format http://www.pubannotation.org/docs/annotation-format/.	20		mmanani1s	2023-11-29	Released
PIR-corpus2	The protein tag was used to tag proteins, or protein-associated or -related objects, such as domains, pathways, expression of gene. Annotation guideline: http://pir.georgetown.edu/pirwww/about/doc/manietal.pdf	5.52 K	University of Delaware and Georgetown University Medical Center	Yue Wang	2023-11-29	Released
Wangshuguang	HZAU_bioinformatics_competition	603	wangshuguang	wangshuguang	2023-11-29	Released
bionlp-st-ge-2016-reference	This is the benchmark reference data set of the BioNLP-ST 2016 GE task. It includes Genia-style event annotations to 20 full paper articles about NFκB proteins. The task is to develop an automatic annotation system that produces annotations as similar as possible to the annotations in this data set. To evaluate the performance of a participating system, the system needs to produce annotations to the documents in the benchmark test data set (bionlp-st-ge-2016-test). The GE 2016 benchmark data set is provided as multi-layer annotations, which include: bionlp-st-ge-2016-reference: benchmark reference data set (this project); bionlp-st-ge-2016-test: benchmark test data set (annotations are blinded); bionlp-st-ge-2016-test-proteins: protein annotation to the benchmark test data set. The following are supporting resources: bionlp-st-ge-2016-coref: coreference annotation; bionlp-st-ge-2016-uniprot: protein annotation with UniProt IDs; pmc-enju-pas: dependency parsing result produced by Enju; UBERON-AE: annotation for anatomical entities as defined in UBERON; ICD10: annotation for disease names as defined in ICD10; GO-BP: annotation for biological process names as defined in GO; GO-CC: annotation for cellular component names as defined in GO. A SPARQL-driven search interface is provided at http://bionlp.dbcls.jp/sparql.	14.4 K	DBCLS	Jin-Dong Kim	2023-11-29	Released
bionlp-st-epi-2011-training	The training dataset from the Epigenetics and Post-translational Modifications (EPI) task in the BioNLP Shared Task 2011. The core entities of the task are genes and gene products (RNA and proteins), identified in the data simply as "Protein" annotations.	7.59 K	GENIA	Yue Wang	2023-11-29	Released
bionlp-st-2016-SeeDev-test	Entity annotations from the test set of the BioNLP-ST 2016 SeeDev task. The SeeDev task focuses on seed storage and reserve accumulation in the model organism Arabidopsis thaliana. The task is based on the knowledge model Gene Regulation Network for Arabidopsis (GRNA), which meets the needs of text mining (i.e. manual annotation of texts and automatic information extraction), experimental data indexing and retrieval, and reuse in other plant systems. It is also expected to meet the requirements of integrating text knowledge with knowledge derived from experimental data for modeling in systems biology. The GRNA model defines 16 different types of entities and 22 types of events (in five sets of event types) that may be combined into complex events. For more information, please refer to the task website. All annotations: Train set, Development set, Test set (without events).	184		EstelleChaix	2023-11-29	Released
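
Several of the project descriptions above (RDoCTask1SampleData, RDoCTask2SampleData) note that annotation data are formatted as JSON files and point to http://www.pubannotation.org/docs/annotation-format/ for details. The following is a minimal sketch of reading such a file, assuming the layout with a "text" field and a "denotations" list whose entries carry a character "span" (begin, end) and an "obj" label. The sample document and the helper name extract_mentions are hypothetical and are not taken from any project listed here.

```python
import json

# Hypothetical sample document, assuming a "text" field plus a "denotations"
# list of {"id", "span": {"begin", "end"}, "obj"} entries.
SAMPLE = """{
  "text": "IRF-4 expression in CML may be induced by IFN-alpha therapy",
  "denotations": [
    {"id": "T1", "span": {"begin": 0, "end": 5}, "obj": "Protein"},
    {"id": "T2", "span": {"begin": 42, "end": 51}, "obj": "Protein"}
  ]
}"""


def extract_mentions(annotation_json):
    """Return (surface text, label) pairs for each denotation in the document."""
    doc = json.loads(annotation_json)
    text = doc["text"]
    mentions = []
    for denotation in doc.get("denotations", []):
        span = denotation["span"]
        # Spans are treated as character offsets into "text", end-exclusive.
        surface = text[span["begin"]:span["end"]]
        mentions.append((surface, denotation["obj"]))
    return mentions


if __name__ == "__main__":
    for surface, label in extract_mentions(SAMPLE):
        print(f"{label}\t{surface}")
```

In the RDoC sample data the label would be an RDoC category name rather than "Protein"; the same loop applies as long as the file follows the assumed layout.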
