> top > docs > PMC:3475482 > spans > 5920-9410

PMC:3475482 / 5920-9410 JSON TXT

Annnotations TAB JSON ListView MergeView

BLAH6-GNI-CORPUS

{"project":"BLAH6-GNI-CORPUS","denotations":[{"id":"T144","span":{"begin":1190,"end":1193},"obj":"protein"},{"id":"T145","span":{"begin":933,"end":934},"obj":"protein"},{"id":"T146","span":{"begin":915,"end":916},"obj":"protein"},{"id":"T147","span":{"begin":869,"end":871},"obj":"protein"},{"id":"T148","span":{"begin":859,"end":861},"obj":"protein"},{"id":"T149","span":{"begin":779,"end":801},"obj":"protein"},{"id":"T150","span":{"begin":708,"end":709},"obj":"protein"},{"id":"T151","span":{"begin":690,"end":692},"obj":"protein"},{"id":"T152","span":{"begin":612,"end":614},"obj":"protein"},{"id":"T153","span":{"begin":607,"end":610},"obj":"protein"},{"id":"T154","span":{"begin":462,"end":496},"obj":"protein"},{"id":"T155","span":{"begin":312,"end":322},"obj":"DNA"},{"id":"T156","span":{"begin":230,"end":239},"obj":"DNA"},{"id":"T157","span":{"begin":205,"end":207},"obj":"protein"},{"id":"T158","span":{"begin":200,"end":203},"obj":"protein"},{"id":"T159","span":{"begin":3415,"end":3417},"obj":"protein"},{"id":"T160","span":{"begin":3181,"end":3188},"obj":"protein"},{"id":"T161","span":{"begin":3062,"end":3064},"obj":"protein"},{"id":"T162","span":{"begin":2984,"end":2996},"obj":"DNA"},{"id":"T163","span":{"begin":2975,"end":2977},"obj":"DNA"},{"id":"T164","span":{"begin":2966,"end":2968},"obj":"DNA"},{"id":"T165","span":{"begin":2900,"end":2907},"obj":"protein"},{"id":"T166","span":{"begin":2746,"end":2748},"obj":"DNA"},{"id":"T167","span":{"begin":2731,"end":2733},"obj":"DNA"},{"id":"T168","span":{"begin":2723,"end":2725},"obj":"DNA"},{"id":"T169","span":{"begin":2624,"end":2636},"obj":"DNA"},{"id":"T170","span":{"begin":2615,"end":2617},"obj":"DNA"},{"id":"T171","span":{"begin":2606,"end":2608},"obj":"DNA"},{"id":"T172","span":{"begin":2405,"end":2407},"obj":"protein"},{"id":"T173","span":{"begin":2334,"end":2336},"obj":"protein"},{"id":"T193","span":{"begin":1381,"end":1385},"obj":"protein"},{"id":"T194","span":{"begin":1338,"end":1350},"obj":"DNA"},{"id":"T195","span":{"begin":1329,"end":1331},"obj":"DNA"},{"id":"T196","span":{"begin":1320,"end":1322},"obj":"DNA"},{"id":"T174","span":{"begin":2323,"end":2330},"obj":"protein"},{"id":"T175","span":{"begin":2222,"end":2224},"obj":"protein"},{"id":"T176","span":{"begin":2172,"end":2175},"obj":"protein"},{"id":"T177","span":{"begin":1932,"end":1934},"obj":"protein"},{"id":"T178","span":{"begin":1922,"end":1924},"obj":"protein"},{"id":"T179","span":{"begin":1912,"end":1914},"obj":"protein"},{"id":"T180","span":{"begin":1902,"end":1904},"obj":"protein"},{"id":"T181","span":{"begin":1889,"end":1891},"obj":"protein"},{"id":"T182","span":{"begin":1803,"end":1818},"obj":"protein"},{"id":"T183","span":{"begin":1768,"end":1770},"obj":"protein"},{"id":"T184","span":{"begin":1687,"end":1699},"obj":"DNA"},{"id":"T185","span":{"begin":1655,"end":1657},"obj":"DNA"},{"id":"T186","span":{"begin":1646,"end":1648},"obj":"DNA"},{"id":"T187","span":{"begin":1452,"end":1454},"obj":"DNA"},{"id":"T188","span":{"begin":1447,"end":1451},"obj":"protein"},{"id":"T189","span":{"begin":1440,"end":1444},"obj":"protein"},{"id":"T190","span":{"begin":1430,"end":1432},"obj":"DNA"},{"id":"T191","span":{"begin":1424,"end":1428},"obj":"protein"},{"id":"T192","span":{"begin":1402,"end":1404},"obj":"DNA"}],"text":"Concepts and definitions\nLet Σ = {A, C, G, T} be a set of DNA alphabets, where A, C, G, and T are called DNA characters or bases. A DNA sequence S is an ordered list of DNA alphabets. S is denoted by 〈c1, c2, ..., cl〉, where ci ∈ Σ and |S| denotes the length of sequence S. A sequence with length n is called an n-sequence. A sequence database D is a set of tuples 〈sid, S〉, where sid is a sequence ID. The sum of the lengths of all sequences in D is denoted as |D|=|S1|+|S2|...|Sn|\n\nDefinition 1 (Patter)\nA pattern is a contiguous sub-sequence of DNA sequence S drawn from Σ = {A, C, G, T}. A sequence α = 〈a1, a2, ..., an〉 is called a contiguous sub-sequence of another sequence β = 〈b1, b2, ..., bm〉, and β is a contiguous super-sequence of α, denoted as α⊆β, if there exists integers 1 ≤ j1 ≤ j2 ≤ ... ≤ jn ≤ m and ji+1 = ji + 1 for 1 ≤ i ≤ n-1 such that a1 = bj1, a2 = bj2, ..., an = bjn. We can also say that α is contained by β. In our paper, we use the term \"(sub)-sequence\" to describe \"contiguous (sub)-sequence\" in brief.\n\nDefinition 2 (Support)\nGiven a pattern P and a sequence S, the number of occurrences of P in S is called the support of pattern P in sequence S, denoted as Sup(P, Si). For DNA sequence database D, the support of P in D is defined as .\n\nDefinition 3 (Confidence)\nGiven a pattern P = P1, P2, ..., Pq and a DNA sequence database D, the confidence of P1P2 with respect to P1 is defined as Conf(P1P2, P1) = Sup(P1P2,D)/Sup(P1,D).\nFor example, the character \"A\" occurs 10 times and \"AT\" occurs 7 times, and in database D in Table 1, Conf(AT,A) = 7/10 = 0.7.\n\nDefinition 4 (Pattern probability)\nGiven a pattern P = P1, P2, ..., Pq (Pi is a DNA alphabet) and a DNA sequence database D, the pattern probability of P in D is defined as , where Pr(Pi, D) = # of occurrences of an alphabet Pi/|D|.\nFor example, the pattern probability of pattern \"ATCG\" in Table 1 is Pr(ATCG,D) = Pr(A,D) × Pr(T,D) × Pr(C,D) × Pr(G,D) = (10/55) × (18/55) × (12/55) × (15/55) = 0.182 × 0.372 × 0.218 × 0.273 = 0.00403.\n\nDefinition 5 (Information)\nThe information carried by a DNA character or base in DNA sequence database D is defined as I(c) = -log|C|Pr(c,D), where |C| is the number of distinct characters in D and Pr(c) is the probability of c occurs in D.\nFor example, the occurrence probability of character A in Table 1 is Pr(A,D) = # of occurence(A)/|D|. So, the probability of character A is Pr(A,D) = 10/55 = 0.182 in our example database. Then, the information of character A in D is, I(A) = -log|C|Pr(A,D) = -log4(0.182) = 1.228.\n\nDefinition 6 (Pattern information)\nGiven a pattern P = P1, P2, ..., Pq and a DNA sequence database D, the pattern information of P in D is defined as I(P) = -log|C|Pr(P,D) = I(P1) + I(P2) +......+ I(Pq).\nFor example, the pattern information of pattern \"ATCG\" is I(ATCG,D) = I(A,D) + I(T,D) + I(C,D) + I(G,D) = 1.228 + 0.713 + 1.098 + 0.9365 = 3.9755 in Table 1.\n\nDefinition 7 (Information gain)\nGiven a pattern P = P1, P2, ..., Pq and a DNA sequence database D, the pattern information gain of P in D is defined as IG(P) = I(P) × Support(P).\nFor example, the information gain of pattern \"ATCG\" is IG(ATCG,D) = 3.9755 * 5 = 19.8775 in Table 1.\n\nDefinition 8 (Finding interesting patterns)\nGiven a sequence database D and user-specified min_conf and min_in_gain, the problem of finding interesting patterns is to find the complete set of interesting patterns, such that IG(P) and Conf(P) are greater than min_in_gain and min_conf , respectively."}

BLAH6-GNI-CORPUS2

{"project":"BLAH6-GNI-CORPUS2","denotations":[{"id":"T29","span":{"begin":3415,"end":3417},"obj":"protein"},{"id":"T30","span":{"begin":3376,"end":3379},"obj":"DNA"},{"id":"T31","span":{"begin":3243,"end":3260},"obj":"DNA"},{"id":"T32","span":{"begin":3181,"end":3188},"obj":"protein"},{"id":"T33","span":{"begin":3147,"end":3151},"obj":"protein"},{"id":"T34","span":{"begin":3144,"end":3146},"obj":"protein"},{"id":"T35","span":{"begin":3135,"end":3139},"obj":"protein"},{"id":"T36","span":{"begin":3062,"end":3064},"obj":"protein"},{"id":"T37","span":{"begin":2984,"end":2996},"obj":"DNA"},{"id":"T38","span":{"begin":2975,"end":2977},"obj":"DNA"},{"id":"T39","span":{"begin":2966,"end":2968},"obj":"DNA"},{"id":"T40","span":{"begin":2962,"end":2964},"obj":"DNA"},{"id":"T41","span":{"begin":2962,"end":2968},"obj":"protein"},{"id":"T42","span":{"begin":2900,"end":2907},"obj":"protein"},{"id":"T43","span":{"begin":2841,"end":2842},"obj":"protein"},{"id":"T44","span":{"begin":2811,"end":2815},"obj":"protein"},{"id":"T45","span":{"begin":2800,"end":2804},"obj":"protein"},{"id":"T46","span":{"begin":2746,"end":2748},"obj":"DNA"},{"id":"T47","span":{"begin":2731,"end":2733},"obj":"DNA"},{"id":"T48","span":{"begin":2723,"end":2725},"obj":"DNA"},{"id":"T49","span":{"begin":2711,"end":2713},"obj":"protein"},{"id":"T50","span":{"begin":2709,"end":2710},"obj":"protein"},{"id":"T51","span":{"begin":2624,"end":2636},"obj":"DNA"},{"id":"T52","span":{"begin":2615,"end":2617},"obj":"DNA"},{"id":"T53","span":{"begin":2606,"end":2608},"obj":"DNA"},{"id":"T54","span":{"begin":2602,"end":2604},"obj":"DNA"},{"id":"T55","span":{"begin":2602,"end":2608},"obj":"protein"},{"id":"T56","span":{"begin":2514,"end":2516},"obj":"protein"},{"id":"T57","span":{"begin":2512,"end":2513},"obj":"protein"},{"id":"T58","span":{"begin":2500,"end":2504},"obj":"protein"},{"id":"T59","span":{"begin":2415,"end":2417},"obj":"protein"},{"id":"T60","span":{"begin":2405,"end":2407},"obj":"protein"},{"id":"T61","span":{"begin":2334,"end":2336},"obj":"protein"},{"id":"T62","span":{"begin":2323,"end":2330},"obj":"protein"},{"id":"T63","span":{"begin":2250,"end":2251},"obj":"protein"},{"id":"T64","span":{"begin":2225,"end":2226},"obj":"protein"},{"id":"T65","span":{"begin":2222,"end":2224},"obj":"protein"},{"id":"T66","span":{"begin":2173,"end":2174},"obj":"protein"},{"id":"T67","span":{"begin":2160,"end":2161},"obj":"protein"},{"id":"T68","span":{"begin":2157,"end":2159},"obj":"protein"},{"id":"T69","span":{"begin":2155,"end":2156},"obj":"protein"},{"id":"T70","span":{"begin":2145,"end":2146},"obj":"protein"},{"id":"T71","span":{"begin":2105,"end":2117},"obj":"DNA"},{"id":"T72","span":{"begin":2080,"end":2083},"obj":"DNA"},{"id":"T73","span":{"begin":1943,"end":1945},"obj":"protein"},{"id":"T74","span":{"begin":1932,"end":1934},"obj":"protein"},{"id":"T75","span":{"begin":1925,"end":1926},"obj":"protein"},{"id":"T76","span":{"begin":1922,"end":1924},"obj":"protein"},{"id":"T77","span":{"begin":1912,"end":1914},"obj":"protein"},{"id":"T78","span":{"begin":1902,"end":1904},"obj":"protein"},{"id":"T79","span":{"begin":1892,"end":1896},"obj":"protein"},{"id":"T80","span":{"begin":1889,"end":1891},"obj":"protein"},{"id":"T81","span":{"begin":1878,"end":1885},"obj":"protein"},{"id":"T82","span":{"begin":1869,"end":1873},"obj":"protein"},{"id":"T83","span":{"begin":1803,"end":1817},"obj":"protein"},{"id":"T84","span":{"begin":1771,"end":1773},"obj":"DNA"},{"id":"T85","span":{"begin":1771,"end":1773},"obj":"protein"},{"id":"T86","span":{"begin":1768,"end":1770},"obj":"protein"},{"id":"T87","span":{"begin":1687,"end":1699},"obj":"DNA"},{"id":"T88","span":{"begin":1667,"end":1670},"obj":"DNA"},{"id":"T89","span":{"begin":1659,"end":1661},"obj":"DNA"},{"id":"T90","span":{"begin":1659,"end":1661},"obj":"protein"},{"id":"T91","span":{"begin":1655,"end":1657},"obj":"DNA"},{"id":"T92","span":{"begin":1646,"end":1648},"obj":"DNA"},{"id":"T93","span":{"begin":1642,"end":1644},"obj":"DNA"},{"id":"T94","span":{"begin":1642,"end":1648},"obj":"protein"},{"id":"T95","span":{"begin":1576,"end":1578},"obj":"protein"},{"id":"T96","span":{"begin":1552,"end":1559},"obj":"protein"},{"id":"T97","span":{"begin":1497,"end":1499},"obj":"protein"},{"id":"T98","span":{"begin":1452,"end":1454},"obj":"DNA"},{"id":"T99","span":{"begin":1448,"end":1451},"obj":"protein"},{"id":"T100","span":{"begin":1440,"end":1444},"obj":"protein"},{"id":"T101","span":{"begin":1436,"end":1439},"obj":"protein"},{"id":"T102","span":{"begin":1430,"end":1432},"obj":"DNA"},{"id":"T103","span":{"begin":1424,"end":1428},"obj":"protein"},{"id":"T104","span":{"begin":1402,"end":1404},"obj":"DNA"},{"id":"T105","span":{"begin":1381,"end":1385},"obj":"protein"},{"id":"T106","span":{"begin":1338,"end":1350},"obj":"DNA"},{"id":"T107","span":{"begin":1329,"end":1331},"obj":"DNA"},{"id":"T108","span":{"begin":1320,"end":1322},"obj":"DNA"},{"id":"T109","span":{"begin":1316,"end":1318},"obj":"DNA"},{"id":"T110","span":{"begin":1316,"end":1322},"obj":"protein"},{"id":"T111","span":{"begin":1206,"end":1218},"obj":"DNA"},{"id":"T112","span":{"begin":1190,"end":1193},"obj":"protein"},{"id":"T113","span":{"begin":996,"end":1021},"obj":"DNA"},{"id":"T114","span":{"begin":961,"end":965},"obj":"protein"},{"id":"T115","span":{"begin":869,"end":871},"obj":"protein"},{"id":"T116","span":{"begin":859,"end":861},"obj":"protein"},{"id":"T117","span":{"begin":845,"end":848},"obj":"protein"},{"id":"T118","span":{"begin":779,"end":799},"obj":"protein"},{"id":"T119","span":{"begin":690,"end":692},"obj":"protein"},{"id":"T120","span":{"begin":686,"end":688},"obj":"protein"},{"id":"T121","span":{"begin":648,"end":660},"obj":"DNA"},{"id":"T122","span":{"begin":612,"end":614},"obj":"protein"},{"id":"T123","span":{"begin":608,"end":610},"obj":"protein"},{"id":"T124","span":{"begin":579,"end":586},"obj":"protein"},{"id":"T125","span":{"begin":548,"end":560},"obj":"DNA"},{"id":"T126","span":{"begin":532,"end":544},"obj":"DNA"},{"id":"T127","span":{"begin":484,"end":496},"obj":"protein"},{"id":"T128","span":{"begin":472,"end":474},"obj":"protein"},{"id":"T129","span":{"begin":467,"end":469},"obj":"protein"},{"id":"T130","span":{"begin":429,"end":432},"obj":"protein"},{"id":"T131","span":{"begin":351,"end":354},"obj":"DNA"},{"id":"T132","span":{"begin":326,"end":343},"obj":"DNA"},{"id":"T133","span":{"begin":312,"end":322},"obj":"DNA"},{"id":"T134","span":{"begin":225,"end":227},"obj":"protein"},{"id":"T135","span":{"begin":205,"end":207},"obj":"protein"},{"id":"T136","span":{"begin":201,"end":203},"obj":"protein"},{"id":"T137","span":{"begin":169,"end":172},"obj":"DNA"},{"id":"T138","span":{"begin":132,"end":144},"obj":"DNA"},{"id":"T139","span":{"begin":105,"end":108},"obj":"DNA"},{"id":"T140","span":{"begin":79,"end":86},"obj":"protein"},{"id":"T141","span":{"begin":58,"end":61},"obj":"DNA"},{"id":"T142","span":{"begin":51,"end":54},"obj":"DNA"},{"id":"T143","span":{"begin":34,"end":41},"obj":"protein"}],"text":"Concepts and definitions\nLet Σ = {A, C, G, T} be a set of DNA alphabets, where A, C, G, and T are called DNA characters or bases. A DNA sequence S is an ordered list of DNA alphabets. S is denoted by 〈c1, c2, ..., cl〉, where ci ∈ Σ and |S| denotes the length of sequence S. A sequence with length n is called an n-sequence. A sequence database D is a set of tuples 〈sid, S〉, where sid is a sequence ID. The sum of the lengths of all sequences in D is denoted as |D|=|S1|+|S2|...|Sn|\n\nDefinition 1 (Patter)\nA pattern is a contiguous sub-sequence of DNA sequence S drawn from Σ = {A, C, G, T}. A sequence α = 〈a1, a2, ..., an〉 is called a contiguous sub-sequence of another sequence β = 〈b1, b2, ..., bm〉, and β is a contiguous super-sequence of α, denoted as α⊆β, if there exists integers 1 ≤ j1 ≤ j2 ≤ ... ≤ jn ≤ m and ji+1 = ji + 1 for 1 ≤ i ≤ n-1 such that a1 = bj1, a2 = bj2, ..., an = bjn. We can also say that α is contained by β. In our paper, we use the term \"(sub)-sequence\" to describe \"contiguous (sub)-sequence\" in brief.\n\nDefinition 2 (Support)\nGiven a pattern P and a sequence S, the number of occurrences of P in S is called the support of pattern P in sequence S, denoted as Sup(P, Si). For DNA sequence database D, the support of P in D is defined as .\n\nDefinition 3 (Confidence)\nGiven a pattern P = P1, P2, ..., Pq and a DNA sequence database D, the confidence of P1P2 with respect to P1 is defined as Conf(P1P2, P1) = Sup(P1P2,D)/Sup(P1,D).\nFor example, the character \"A\" occurs 10 times and \"AT\" occurs 7 times, and in database D in Table 1, Conf(AT,A) = 7/10 = 0.7.\n\nDefinition 4 (Pattern probability)\nGiven a pattern P = P1, P2, ..., Pq (Pi is a DNA alphabet) and a DNA sequence database D, the pattern probability of P in D is defined as , where Pr(Pi, D) = # of occurrences of an alphabet Pi/|D|.\nFor example, the pattern probability of pattern \"ATCG\" in Table 1 is Pr(ATCG,D) = Pr(A,D) × Pr(T,D) × Pr(C,D) × Pr(G,D) = (10/55) × (18/55) × (12/55) × (15/55) = 0.182 × 0.372 × 0.218 × 0.273 = 0.00403.\n\nDefinition 5 (Information)\nThe information carried by a DNA character or base in DNA sequence database D is defined as I(c) = -log|C|Pr(c,D), where |C| is the number of distinct characters in D and Pr(c) is the probability of c occurs in D.\nFor example, the occurrence probability of character A in Table 1 is Pr(A,D) = # of occurence(A)/|D|. So, the probability of character A is Pr(A,D) = 10/55 = 0.182 in our example database. Then, the information of character A in D is, I(A) = -log|C|Pr(A,D) = -log4(0.182) = 1.228.\n\nDefinition 6 (Pattern information)\nGiven a pattern P = P1, P2, ..., Pq and a DNA sequence database D, the pattern information of P in D is defined as I(P) = -log|C|Pr(P,D) = I(P1) + I(P2) +......+ I(Pq).\nFor example, the pattern information of pattern \"ATCG\" is I(ATCG,D) = I(A,D) + I(T,D) + I(C,D) + I(G,D) = 1.228 + 0.713 + 1.098 + 0.9365 = 3.9755 in Table 1.\n\nDefinition 7 (Information gain)\nGiven a pattern P = P1, P2, ..., Pq and a DNA sequence database D, the pattern information gain of P in D is defined as IG(P) = I(P) × Support(P).\nFor example, the information gain of pattern \"ATCG\" is IG(ATCG,D) = 3.9755 * 5 = 19.8775 in Table 1.\n\nDefinition 8 (Finding interesting patterns)\nGiven a sequence database D and user-specified min_conf and min_in_gain, the problem of finding interesting patterns is to find the complete set of interesting patterns, such that IG(P) and Conf(P) are greater than min_in_gain and min_conf , respectively."}