Sequence analysis
cDNA sequences were base-called and quality-trimmed using phred (trim_cutoff = 0.05) [47], and vector sequences were removed using cross_match [48]. Sequences shorter than 50 bp after trimming were discarded. 3' UTR lengths were estimated by combining approximate insert sizes determined by PCR with 5' sequence data where possible (if the 5' sequence did not extend into the coding region, we could not estimate 3' UTR size). We counted cDNAs from a given gene as showing alternative polyadenylation site usage if their 3' UTR length estimates varied by at least 400 bp; smaller variation could be real but might not be distinguishable from error in our size estimates.
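The 400 bp criterion for alternative polyadenylation can be illustrated with a minimal sketch. The function name and the use of `None` for cDNAs whose 3' UTR length could not be estimated are assumptions made for this example, not part of the original pipeline.

```python
def shows_alternative_polyadenylation(utr_lengths_bp, min_spread_bp=400):
    """Return True if a gene's 3' UTR length estimates differ by >= min_spread_bp.

    utr_lengths_bp: one estimate per cDNA for a single gene; None marks a cDNA
    whose 5' read did not reach the coding region (no estimate possible).
    """
    estimates = [length for length in utr_lengths_bp if length is not None]
    if len(estimates) < 2:
        # A single estimate cannot show variation.
        return False
    return max(estimates) - min(estimates) >= min_spread_bp
```

For example, estimates of 300 bp and 750 bp for the same gene would be flagged, whereas 300 bp and 650 bp would not.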
To assign cDNAs to their corresponding olfactory receptor genes, we first defined a genomic 'territory' for each gene, with the following attributes: strand, start position (100 kb upstream of the start codon or 1 kb after the previous gene upstream on the same strand, whichever is closer) and end position (1 kb downstream of the stop codon). Trimmed sequences were compared with genomic sequences using sim4 [30] (settings P = 1 to remove polyA tails and N = 1 to perform an intensive search for small exons). The sim4 algorithm uses splice-site consensus sequences to refine alignments. Only matches of 96% or greater nucleotide identity were considered. RepeatMasked sequences [49] were also compared to genomic sequences; cDNA:genomic sequence pairings not found in both masked and unmasked alignments were rejected. Coordinates from the unmasked alignment were used for further analysis. Any cDNA sequence matching entirely within a territory was assigned to that gene. If a cDNA matched more than one gene territory, the best match was chosen (that is, the one with highest 'score', where score is the total of all exons' lengths multiplied by their respective percent identities). We found 27 cDNAs that spanned a larger genomic range than one gene territory and flagged them for more careful analysis. Of these, six cDNAs showed unusual splicing within the 3' UTR, but the remaining 'territory violators' were found to be artifacts of the analysis process which fell into three types. These included: cDNAs where the insert appeared to be cloned in the reverse orientation (six cDNAs); sequences from recently duplicated gene pairs, where sim4 assigned coding region and upstream exons to different members of the pair, although exons could equally well have been aligned closer to one another (six cDNAs); and artifacts due to use of sim4's N = 1 parameter (nine cDNAs). 
The N = 1 parameter instructs sim4 to make extra effort to match small upstream exons, allowing a greater total length of EST sequence to be matched. Occasionally, however, it caused the program to assign very short sequences (1-4 bp) to distant upstream exons when they probably match nearer to the corresponding coding sequence.
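The territory definition and the alignment score used to choose between competing gene assignments can be sketched as follows. This is an illustration under stated assumptions, not the original scripts: the names `Territory`, `gene_territory`, `alignment_score` and `within_territory` are hypothetical, and "1 kb after the previous gene" is assumed to mean 1 kb beyond that gene's stop codon.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Territory:
    strand: str  # '+' or '-'
    start: int   # upstream boundary (genomic coordinate)
    end: int     # downstream boundary (genomic coordinate)


def gene_territory(strand: str, start_codon: int, stop_codon: int,
                   prev_gene_stop: Optional[int] = None,
                   max_upstream: int = 100_000, gap: int = 1_000,
                   downstream: int = 1_000) -> Territory:
    """Build a territory for one gene.

    prev_gene_stop is the stop codon of the nearest upstream gene on the same
    strand, if any (assumption: "after the previous gene" is measured from it).
    """
    sign = 1 if strand == '+' else -1
    candidates = [start_codon - sign * max_upstream]
    if prev_gene_stop is not None:
        candidates.append(prev_gene_stop + sign * gap)
    # "Whichever is closer" to the start codon: the largest coordinate on the
    # plus strand, the smallest on the minus strand.
    start = max(candidates) if sign == 1 else min(candidates)
    return Territory(strand, start, stop_codon + sign * downstream)


def alignment_score(exons: List[Tuple[int, float]]) -> float:
    """Sum of exon length (bp) times percent identity over all aligned exons."""
    return sum(length * pct_identity for length, pct_identity in exons)


def within_territory(territory: Territory, aln_start: int, aln_end: int) -> bool:
    """True if the whole alignment lies inside the territory."""
    lo, hi = sorted((territory.start, territory.end))
    return lo <= min(aln_start, aln_end) and max(aln_start, aln_end) <= hi
```

When a cDNA falls within more than one territory, the candidate with the largest `alignment_score` would be kept, for instance with `max(candidates, key=alignment_score)`.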
The expected distribution shown in Figure 2 was calculated using the equation $P(x) = e^{-\mu}\mu^{x}/x!$, where P(x) is the Poisson probability of observing x cDNAs per gene, and μ is the mean number of cDNAs observed per gene (μ = 1,176/983: 1,176 cDNAs matching olfactory receptor genes in our dataset and 983 intact class II olfactory receptors). In our analysis of expressed pseudogenes, we ignored two olfactory receptor pseudogenes found very near the ends of genomic sequences and thus likely to be error-prone. A protein sequence alignment of intact mouse olfactory receptors was generated using CLUSTALW [50], edited by hand, and used to produce the phylogenetic tree shown in Figure 1 using PAUP's neighbor-joining algorithm (v4.0b6 Version 4, Sinauer Associates, Sunderland, MA). The tree was colored using a custom script. Information content (the measure of sequence conservation shown in Figure 6) was calculated for each position in the alignment using alpro [51].
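The Poisson expectation described above can be reproduced with a short sketch. The variable names are illustrative, and scaling the probabilities by the number of genes to obtain expected gene counts is an assumption about how the distribution was plotted.

```python
import math

N_CDNAS = 1176  # cDNAs matching olfactory receptor genes
N_GENES = 983   # intact class II olfactory receptor genes
mu = N_CDNAS / N_GENES  # mean number of cDNAs per gene


def poisson_pmf(x: int, mu: float) -> float:
    """Poisson probability of observing exactly x cDNAs for a gene."""
    return math.exp(-mu) * mu ** x / math.factorial(x)


# Expected number of genes represented by x cDNAs, for x = 0..10
# (assumes the expected curve is expressed as gene counts).
expected_genes = {x: N_GENES * poisson_pmf(x, mu) for x in range(11)}

for x, n in expected_genes.items():
    print(f"{x:2d} cDNAs: {n:7.1f} genes expected")
```

Comparing these expected counts with the observed number of genes at each x shows how far cDNA representation departs from random sampling of equally expressed genes.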
To determine the number of transcriptional isoforms for each gene, we examined the sim4 output for every matching cDNA in decreasing order of number of exons. The first cDNA was counted as one splice form, and for each subsequent cDNA, we determined whether exon structure was mutually exclusive to isoforms already counted. We were conservative in our definition of mutually exclusive, and thus our count represents the minimum number of isoforms represented in the cDNA collection.
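The counting procedure can be sketched as a greedy pass over the cDNAs of one gene. The text does not spell out the exact test for "mutually exclusive", so the rule used here is an assumption: two exon structures are treated as mutually exclusive only if, within the genomic range covered by both, their sets of introns differ. All function names are hypothetical.

```python
def introns(exons):
    """Introns implied by a list of (start, end) exon coordinates."""
    exons = sorted(exons)
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


def mutually_exclusive(a, b):
    """Assumed rule: structures conflict if their introns differ where both align."""
    a, b = sorted(a), sorted(b)
    lo = max(a[0][0], b[0][0])    # start of the shared genomic range
    hi = min(a[-1][1], b[-1][1])  # end of the shared genomic range
    if lo >= hi:
        # No shared range: no evidence of a different isoform (conservative).
        return False

    def shared(exons):
        return {(s, e) for s, e in introns(exons) if s < hi and e > lo}

    return shared(a) != shared(b)


def min_isoform_count(cdnas):
    """Count isoforms, examining cDNAs in decreasing order of exon number."""
    counted = []
    for exons in sorted(cdnas, key=len, reverse=True):
        # A cDNA adds an isoform only if it conflicts with every one counted so far.
        if not counted or all(mutually_exclusive(exons, iso) for iso in counted):
            counted.append(exons)
    return len(counted)
```

Under this rule a partial cDNA that agrees with a longer one over their shared range is not counted separately, which is what makes the result a minimum estimate. Real sim4 coordinates can wobble by a few bases at exon boundaries, so a practical version would likely compare intron positions with some tolerance.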