Prediction of transcriptional units

A set of Perl scripts was used to implement the algorithm described below. Genomic clones were ordered and oriented using the fingerprint map and draft assembly. Within unfinished clones, sequence contigs were further ordered and oriented according to Ensembl's assembly [83]. This mapping produced the positional context necessary for consolidating fragmented exon units. Where necessary, small sequencing gaps (100 bp or fewer) were ignored and genomic clones were considered contiguous except where a large gap was indicated in the draft assembly (> 50 kb). ORFs were determined using the program getorf [84].

The exon index is an integrated table generated by a Sybase relational database, consisting of chromosome number, fingerprinted contig (FPC) ID, FPC contig order, BAC contig ID, BAC contig order, BAC contig orientation, starting position of exon on BAC contig, end position of exon on BAC contig, exon orientation, transcript orientation (available from GenBank, IMAGE, UniGene, HINT and dbEST), evidence (transcript, protein, gene prediction, ORF, Pfam), database name (Table 1), feature (poly(A) signal, CpG island, Genscan boundary), starting position on exon (or feature), end position on exon (or feature), and score (BlastN, BlastX).

The index was first ordered and oriented for the individual BAC contigs according to Ensembl or UCSC [85] maps. The resulting contigs were then ordered and oriented according to the FPC order and orientation information in the UCSC genome assembly, resulting in a numeric sorting order for all the individual contigs. In addition, large gap information (> 50 kb) available from the UCSC assembly was incorporated into the same index, where no overlapping information was available between presumably adjacent BACs.

The consolidation algorithm follows a hierarchy in which unit boundaries are respected for the highest-ranking feature.
The features in descending ranking were: UTRs based on known UTR indices; exons containing no ORFs or incomplete ORFs; boundaries of known full-length cDNAs (HTDB-based indices); EST orientation information (5' or 3' origin from the original IMAGE, UniGene/HINT, and dbEST databases); and Genscan-predicted poly(A) signals. When clear boundary indicators were not available, information from the transcript indices HINT (assembled from UniGene) [19] and EG [17] was used directly as secondary evidence for potential gene boundaries. The rationale is that each UniGene cluster contains at least one known gene, or two sets of ESTs representing both the 5' and 3' termini of a gene, or at least one EST containing a poly(A) signal [81]. Similarly stringent criteria were used in Ewing and Green's EST assemblies [17]. Multiple exons not residing in intact ORFs were consolidated until an exon in a partial or complete ORF was encountered. Multiple in-frame exons in a continuous ORF were always considered part of a single gene. To prevent overconsolidation caused by a lack of transitional exons (in partial or complete ORFs) between adjacent genes, CpG islands, large gaps (> 50 kb) between exons, and Genscan predictions were used as gene boundaries when higher-ranking boundary information was unavailable. In such instances, HINT and EG index identity was respected. Although a variety of criteria were used for determining transcriptional unit boundaries, the vast majority of the consolidation was achieved on the basis of terminal information from gene indices and on the transition and continuation of open reading frames.
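As an illustration only, the ordering of the exon index and one of the fallback boundary rules can be sketched in a few lines of Python. This is not the original Perl implementation. The record layout, the field names, and the `index_id` field (standing in for a HINT or EG cluster identifier) are hypothetical. The sketch models only the numeric sort order and the large-gap rule (> 50 kb), and it assumes that "index identity was respected" means two exons sharing an index entry are never separated by a fallback boundary. Higher-ranking features such as UTRs, full-length cDNA boundaries, and ORF transitions are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional

LARGE_GAP_BP = 50_000  # fallback boundary threshold taken from the text (> 50 kb)


@dataclass(frozen=True)
class Exon:
    chromosome: int
    fpc_order: int                  # order of the FPC contig on the chromosome
    bac_order: int                  # order of the BAC contig within the FPC contig
    start: int                      # start position on the BAC contig
    end: int                        # end position on the BAC contig
    index_id: Optional[str] = None  # hypothetical HINT/EG cluster identifier


def sort_key(exon: Exon):
    """Numeric sorting order: chromosome, FPC contig, BAC contig, position."""
    return (exon.chromosome, exon.fpc_order, exon.bac_order, exon.start)


def large_gap_between(prev: Exon, nxt: Exon) -> bool:
    """True if both exons lie on the same BAC contig, more than 50 kb apart."""
    same_contig = (prev.chromosome, prev.fpc_order, prev.bac_order) == (
        nxt.chromosome, nxt.fpc_order, nxt.bac_order)
    return same_contig and (nxt.start - prev.end) > LARGE_GAP_BP


def consolidate(exons: List[Exon]) -> List[List[Exon]]:
    """Group sorted exons into units, splitting at fallback boundaries."""
    units: List[List[Exon]] = []
    for exon in sorted(exons, key=sort_key):
        if not units:
            units.append([exon])
            continue
        prev = units[-1][-1]
        # Assumption: exons sharing an index entry are kept in one unit.
        same_index = prev.index_id is not None and prev.index_id == exon.index_id
        new_chromosome = prev.chromosome != exon.chromosome
        if new_chromosome or (large_gap_between(prev, exon) and not same_index):
            units.append([exon])
        else:
            units[-1].append(exon)
    return units


exons = [
    Exon(1, 1, 1, 120_000, 120_300, "Hs.2"),
    Exon(1, 1, 1, 1_000, 1_200, "Hs.1"),
    Exon(1, 1, 1, 2_000, 2_150, "Hs.1"),
    Exon(1, 1, 1, 60_000, 60_400, "Hs.1"),
    Exon(2, 1, 1, 500, 900),
]
units = consolidate(exons)
print([len(u) for u in units])  # [3, 1, 1]
```

In this toy input, the third exon lies more than 50 kb from the second but shares the index entry `Hs.1`, so it stays in the first unit. The exon tagged `Hs.2` lies beyond a large gap with a different index entry and therefore starts a new unit, as does the exon on the second chromosome.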