PMC:3543929 / 940-3945
Annnotations
{"target":"https://pubannotation.org/docs/sourcedb/PMC/sourceid/3543929","sourcedb":"PMC","sourceid":"3543929","source_url":"https://www.ncbi.nlm.nih.gov/pmc/3543929","text":"Introduction\nIn formal language theory a language is simply a set of strings of characters drawn from some alphabet, where the alphabet (terminal) is a set of symbols. When we consider biological sequences simply as a language in the context of formal language theory (treating DNA, RNA, or protein sequences just as strings of alphabets of four nucleotides or 20 amino acids), a grammatical inference method based on formal language theory can be applied [1-3].\nNevill-Manning and Witten [4] pioneered the attempt to produce the context-free grammarof biological sequences in an automatic way. This task can be formalized as the problem of finding the smallest context-free grammar by recursively replacing the repeats by a new symbol. The algorithm builds a hierarchy of phrases by forming a new rule out of existing pairs of symbols, including non-terminal symbols.\nFor example, if we consider the string \"atattattatt,\" the simplest way to represent the string by context-free grammar is the following:\n\u003cGrammar 0\u003e\nS → atattattatt\nThe most frequently occurring sequence in the string is \"at,\" which occurs four times. Thus, introducing a new nonterminal symbol, 'A,' and creating a new rule for this yields the following modified grammar:\n\u003cGrammar 1\u003e\nS → AAtAtAt\nA → at,\nwhere the grammar consists of a start symbol (i.e., S), two terminal symbols (i.e., a, t) represented by lowercase letters, two non-terminal symbols (i.e., S, A) represented by uppercase letters, and two production rules (i.e., S → AAtAtAt, A → at) with a left- and a right-hand side consisting of a sequence of these symbols.\nRepeatedly replacing the frequently occurring patterns \"At,\" again to a new nonterminal symbol, B, gives the following modified grammar:\n\u003cGrammar 2\u003e\nS → ABBB\nA → at\nB→ At,\nwhere the grammar consists of a start symbol (i.e., S), two terminal symbols (i.e., a, t), three non-terminal symbols (i.e., S, A, B), and three production rules (i.e., S → ABBB, A → at, B → At).\nBy applying the three production rules by replacing an occurrence of the nonterminals on the left-hand side of the production rule with those that appear on the right-hand side, the string \"atattattatt\" can be derived from the non-terminal S by constantly applying a series of derivations: S → ABBB → atBBB → atAtBB → atattBB → atattAtB → atattattB → atattattAt → atattattatt.\nBased on this concept, grammar-based compression algorithms have shown some success for various applications [4-7]. Especially, grammar can capture distant repetitions occurring far apart, which was a limitation of sliding window approaches. However, grammar-based compression algorithms at this moment do not show the best performance for compression itself [6]. Thus, our motivation of this study is not to develop a new algorithm or find the most efficient way to compress biological sequences for storage purposes. Our sole purpose of developing a new tool is to investigate any grammatical traits of biological sequences, based on formal language theory.","divisions":[{"label":"title","span":{"begin":0,"end":12}},{"label":"p","span":{"begin":13,"end":462}},{"label":"p","span":{"begin":463,"end":868}},{"label":"p","span":{"begin":869,"end":1005}},{"label":"p","span":{"begin":1006,"end":1033}},{"label":"p","span":{"begin":1006,"end":1017}},{"label":"p","span":{"begin":1018,"end":1033}},{"label":"p","span":{"begin":1034,"end":1241}},{"label":"p","span":{"begin":1242,"end":1273}},{"label":"p","span":{"begin":1242,"end":1253}},{"label":"p","span":{"begin":1254,"end":1265}},{"label":"p","span":{"begin":1266,"end":1273}},{"label":"p","span":{"begin":1274,"end":1600}},{"label":"p","span":{"begin":1601,"end":1737}},{"label":"p","span":{"begin":1738,"end":1772}},{"label":"p","span":{"begin":1738,"end":1749}},{"label":"p","span":{"begin":1750,"end":1758}},{"label":"p","span":{"begin":1759,"end":1765}},{"label":"p","span":{"begin":1766,"end":1772}},{"label":"p","span":{"begin":1773,"end":1968}},{"label":"p","span":{"begin":1969,"end":2345}}],"tracks":[{"project":"2_test","denotations":[{"id":"23346041-16013753-44836246","span":{"begin":457,"end":458},"obj":"16013753"}],"attributes":[{"subj":"23346041-16013753-44836246","pred":"source","obj":"2_test"}]}],"config":{"attribute types":[{"pred":"source","value type":"selection","values":[{"id":"2_test","color":"#ec93e8","default":true}]}]}}