PMC:548683 / 5735-30801
Annotations
2_test
{"project":"2_test","denotations":[{"id":"15676076-10702227-10449211","span":{"begin":337,"end":338},"obj":"10702227"},{"id":"15676076-10702227-10449212","span":{"begin":394,"end":395},"obj":"10702227"},{"id":"15676076-12135661-10449213","span":{"begin":396,"end":397},"obj":"12135661"},{"id":"15676076-12135661-10449214","span":{"begin":3562,"end":3563},"obj":"12135661"},{"id":"15676076-15042337-10449215","span":{"begin":5854,"end":5855},"obj":"15042337"},{"id":"15676076-12850442-10449216","span":{"begin":5856,"end":5858},"obj":"12850442"}],"text":"Results\n\nIdentification of differentially expressed genes by representational difference analysis (RDA)\nTotal RNA was prepared from EML cells grown in the presence of SCF alone (0 hour) or in medium supplemented with IL-3 and atRA for 72 hours. The RNA was converted to cDNA and subjected to three rounds of RDA as previously described [6]. Six differentially-expressed clones were identified [6,8]. Clone number 1623 was chosen for further analysis and the differential expression of this gene was confirmed by Northern blot analysis. 1623 mRNA was essentially undetectable in the 0 hour sample but readily detectable in the 72 hour sample (data not shown and see figure 10).\n\nSequence analysis of clone 1623\nThe initial cDNA isolated by RDA was a 273 bp fragment that appeared to contain the coding sequence of the C-terminus of a protein that was not present at that time in public databases. To identify the full open reading frame of this cDNA, we first performed rapid amplification of cDNA ends (RACE) in both the 5' and 3' directions. Extension in the 3' direction revealed the presence of a consensus polyadenylation signal located 154 nucleotides downstream of the putative translation stop codon. Extension in the 5' direction yielded an additional 443 bp of sequence containing a contiguous ORF. 
Comparison of this extended sequence to public databases identified a cDNA (NM_017565) that was identical to clone 1623 in the region of overlap. This cDNA was isolated from a mouse mammary tumor but no functional analysis had been performed. We designed PCR primers based on the published sequence and confirmed that the cDNA isolated from EML cells was identical to the published sequence. The full length ORF was 1623 bp in length and encoded a protein of 541 amino acids (Figure 1A). The protein did not contain any recognizable motifs when examined using domain mapping software such as SMART or Profilescan (see Methods). However, a putative amino terminal signal sequence was identified using the SignalP analysis program. The cDNA mapped to an 11 exon gene located on mouse chromosome 11E1 (Figure 1B). The gene spanned approximately 60,000 bp of genomic sequence and the relatively large size of the gene is primarily due to the fact that the first intron is greater than 44,000 bp in length (Figure 1C).\nFigure 1 Characterization of the full length mouse 1623 (Fam20a) cDNA and genomic sequence. A. The full length cDNA derived from the original RDA clone was isolated using a combination of 5' and 3' rapid amplification of cDNA ends (RACE) procedures, comparisons to public databases, and amplification of putative full length clones by PCR. The full open reading frame was 1623 bp in length and encoded a 541 amino acid protein. The locations of regions conserved within the subsequently identified FAM20 family are indicated using underlines. Eight cysteine residues that are also conserved within the family are indicated in bold and four putative N-glycosylation sites are indicated in red type. B. The distribution of the 11 exons of the mouse Fam20a gene is shown with the exons indicated using numbers. A consensus polyadenylation signal is located downstream of the terminal exon. C. 
The sizes of the 11 exons and 10 introns of the Fam20a gene are shown.\n\nIdentification of a family of related genes\nThe full length cDNA and the encoded protein were compared to sequences in public databases. We reported previously a weak similarity to a protein named Fjx1, which is the mouse orthologue of a Drosophila protein named four-jointed [8]. However, the degree of sequence identity between these proteins was low (16%) and thus the search was extended to include uncharacterized proteins. We first identified two other mouse proteins that displayed significant similarity to, but were distinct from, the query sequence. One protein (Accession number NP_663388 aka Riken C530043G21) was 409 amino acids in length and displayed 27% identity to the query sequence while the second (NP_085042) was a truncated version of a 579 amino acid protein that displayed 40% identity to the query sequence. We have subsequently discovered that these proteins are members of a highly related family and this family has received the official name \"family with sequence similarity 20\" (FAM20) from the Human Genome Organization Gene Nomenclature Committee. The protein derived from our original 1623 cDNA is named Fam20a in mouse and the other two family members are named Fam20b and Fam20c, respectively. Continued searches of public databases revealed the existence of related proteins in several other species. Each mammalian genome contains genes encoding three members that are orthologous to the three mouse proteins mentioned above. The accession numbers for the relevant cDNAs in human and rat are listed in Table 1 and we have also identified the same number of related sequences in other mammalian genomes, including the pig, cow and dog (data not shown). However, most of these sequences are incomplete and will not be described in detail here.\nTable 1 Accession numbers for vertebrate FAM20 family members1 1. 
Accession numbers refer to nucleotide sequences in GenBank except in the Fugu and Zebrafish listing where the accession numbers refer either to predicted peptides in the Ensembl database (entries beginning SINFRUP or ENSDARP) or Genbank (entry beginning CAI).\nEntries beginning with BK refer to predicted sequences in the Third Party Annotation database arising from this study. ND: None detected. In order to gain further information concerning the origin of the FAM20 family, we searched for related sequences in several other invertebrate and vertebrate organisms. The ascidian Ciona intestinalis is a model for a basal chordate organism and has emerged as a powerful model for evolutionary and developmental studies [9,10]. In particular, many gene families or subfamilies are represented by single members in C. intestinalis and thus the identification of an orthologue in this organism can provide useful information for evaluating the evolutionary origin of the members of a gene family. Consistent with this concept, we identified a single cDNA and the corresponding genomic locus in C. intestinalis that displayed significant sequence similarity to the mammalian FAM20 genes and proteins (Table 1). Complete genome sequences are also available for several invertebrate species and two related sequences were identified in Drosophila melanogaster and Anopheles gambiae with one family member in Caenorhabditis elegans (Table 2). Finally, analysis of genomic cDNA and protein databases for the pufferfish (Fugu rubripes) and zebrafish (Danio rerio) revealed the presence of six and five family representatives, respectively (Table 1). The gene numbers in these various species are listed on an idealized evolutionary tree in Figure 2 and suggest that the FAM20 gene family has undergone a complex set of gene duplications in both the invertebrate and chordate lineages.\nTable 2 Accession numbers for invertebrate FAM20 family members1 1. 
Accession numbers refer to nucleotide sequences in GenBank. ND: None detected.\nFigure 2 Evolutionary distribution of FAM20 gene number. An idealized evolutionary tree (modified from [10]) is shown with the number of FAM20 genes identified in several genomes as described in the text. The gene numbers are supportive of a single gene duplication event occurring in invertebrates (at least in insects) and multiple gene duplication events occurring in higher vertebrates.\n\nAssignment of subfamily relationships\nTo elucidate the nature of these putative gene duplications, we sought to assign the individual sequences from the various species into subfamilies based on protein sequence and gene structure, specifically using the number and size of exons in the latter case. Initially, the exon distribution of each of the Fam20 members in three mammalian species (human, mouse and rat) was compared and revealed obvious inter and intra orthologue similarities (figure 3A). Each FAM20A gene contained 11 exons and exon sizes were identical in these three species. Likewise, each FAM20B gene contained 7 exons that were identical in size in human, mouse and rat. The FAM20C genes each contained 10 exons and only exon 1 displayed any variation in size amongst these three species. In intra-orthologue comparisons, the exons in FAM20B and FAM20C genes clearly aligned with exons in the FAM20A genes, with small variations (in multiples of three bases) in the size of the internal exons in FAM20B. FAM20B lacks exons corresponding to exons 2–4 of FAM20A while FAM20C lacks exon 3. In addition, exons 8 and 9 in FAM20A and FAM20C are represented by a single exon in FAM20B that is identical in size to the combined exons in the other two genes. Thus, the three mammalian genes are highly evolutionarily related and presumably are derived from a common ancestral gene.\nFigure 3 Assignment of FAM20 family members to subfamilies. A. Exon size and distribution of mammalian FAM20 members. 
The exons within each FAM20 gene in human, mouse and rat are indicated with the number of base pairs indicated within each exon. The sizes of exons that differ in size from the FAM20A genes are indicated. B. A dendrogram showing the relationships between FAM20 proteins from human (Hs), mouse (Mm), rat (Rn), Fugu rubripes (Fr), Danio rerio (Dr), D. melanogaster (Dm), A. gambiae (Ag), C. intestinalis (Ci) and C. elegans. The accession numbers of the cDNA sequences from which each protein sequence was derived are shown in parentheses except in the case of the mosquito family members where the accession number is used as the gene/protein name. Accession numbers for zebrafish peptide sequences are listed in Table 1. The FAM20 nomenclature has not been extended to the invertebrate sequences and the previous gene names have been used for Drosophila and C. elegans family members. The subfamily assignment of each family member is shown on the right. C. Exon number and size distribution within Fugu Fam20 members. The accession number of each sequence within the Third Party Annotation database is shown at left and family assignment based on dendrogram position and exon distribution is shown on the right. To assign the genes identified in other species to these three subfamilies, we performed a global comparison of the peptide sequences derived from 25 of the identified family members listed in Tables 1 and 2. One zebrafish protein (from FAM20A) was omitted as its sequence is incomplete. A dendrogram showing the results of this comparison is presented in Figure 3B. As expected, the mammalian orthologues clustered together and thus defined the subfamilies. All of the invertebrate proteins and the single protein identified in C. intestinalis clustered with FAM20B proteins, suggesting that this represents the ancestral branch of the FAM20 family. 
A single protein from Fugu and zebrafish clustered with the FAM20A and FAM20B family members while two Fugu and two zebrafish proteins clustered with FAM20C members. However, two Fugu proteins and one zebrafish protein clustered on a separate branch between FAM20A and FAM20B. In order to determine the subfamily to which these proteins belonged, we made use of the high degree of conservation of exon size and number noted in the mammalian genes (Figure 3C). The exon number and size of the Fugu and zebrafish genes encoding the two proteins assigned to FAM20A and FAM20B were consistent with their membership in these families. The only variations noted were a slightly larger exon 2 in the Fugu Fam20a gene and the division of exon 1 into two exons in the Fugu Fam20b gene. As in the mammalian family members, the sizes of the terminal exons varied more than the internal exons. The other four Fugu genes displayed exon distributions consistent with membership in FAM20C, despite the clustering of two of the encoded proteins between FAM20A and FAM20B. We have assigned each of these proteins to FAM20C with number suffixes (c1, c2, etc.) to designate individual genes and proteins. Each of these genes maps to a distinct genomic locus and thus represents an independent gene rather than a splicing variant of a smaller number of genes (data not shown). The gene structures of the three zebrafish family members were also consistent with this family assignment (data not shown). Comparisons of the derived protein sequences within each subfamily are shown in figures 4, 5 and 6.\nFigure 4 Sequence alignment of FAM20A protein sequences. The complete protein sequences of FAM20A members were compared using the AlignX component of the VectorNTI sequence analysis suite of programs. Identical amino acids are outlined in yellow, and similar residues are indicated in light blue. Conserved regions 1, 2 and 3 are underlined (see below). 
Gaps are indicated with dashes and the sequences are from human (H), mouse (M), rat (R) and pufferfish (F).\nFigure 5 Sequence alignment of FAM20B protein sequences. The complete protein sequences of FAM20B members are presented as described in figure 4. The sequences are from human (H), mouse (M), rat (R), pufferfish (F), zebrafish (D) and C. intestinalis (Ci).\nFigure 6 Sequence alignment of FAM20C protein sequences. The complete protein sequences of FAM20C members are presented as described in figure 4. The sequences are from human (H), mouse (M), rat (R), pufferfish (Fc1-4) and zebrafish (Dc1-3).\n\nFeatures of FAM20 proteins\nAll of the identified FAM20 protein sequences contain putative signal sequences at their amino termini but no other functional domains were unambiguously detected using several different annotation search software programs. In order to search for potential functional domains, we compared the sequences of all family members. These comparisons revealed that the greatest similarity was located within the carboxy-terminal two thirds of each protein (Figure 7A). We have named this region the conserved C-terminal domain (CCD) and it overlaps with a domain listed in the CDD database at NCBI as DUF1193. The CCD contains three distinct regions that are more highly conserved within all members of the family than the surrounding sequences (named conserved regions 1, 2 and 3 in figure 7A) and the consensus sequences for each conserved region were derived (figure 7B). Amino acids that are essentially invariant in all family members have been indicated in bold type and the hexapeptide DRHHYE in CR2 is the longest contiguous sequence that is conserved in all members of the family. A set of eight cysteine residues that may participate in inter- or intramolecular disulphide bond formation is also perfectly conserved within the CCD of each family member.\nFigure 7 Schematic representation of the structural features of FAM20 family members. A. 
Structural features of FAM20A showing domains and residues conserved within the entire family. Key: SS: signal sequence; CCD: conserved C-terminal domain; CR: conserved region; Cys: cysteine residues conserved within CCD (indicated with asterisk). B. Consensus sequences were derived for CR1, CR2 and CR3 using a global comparison of all the family members listed in Tables 1 and 2. Residues that are invariant or only differ in one sequence are indicated in bold. Non-conserved residues are indicated with an x and positions with more than one common residue are shown below the main sequence.\n\nFam20a is a secreted protein\nAs the putative signal sequence was the only known domain identified in all family members, we next tested whether this sequence is functional. Signal sequences are commonly found on proteins that are directed to the endoplasmic reticulum (ER) and either retained there or processed and transported into the Golgi apparatus and secreted from the cell. Many proteins are glycosylated during their transit through the ER and Golgi apparatus and the mouse Fam20a protein contains four potential sites for N-glycosylation (indicated in red type in figure 1). As Fam20a does not contain an ER retention signal, we predicted that it should be detected in the medium of expressing cells. A mammalian expression vector was constructed that contained the full length mouse Fam20a coding sequence fused to a C-terminal Myc epitope tag and a hexahistidine sequence to permit purification. The plasmid was transfected into monkey kidney COS-1 cells and total protein was isolated from both the cells and the cell medium. Proteins in the cell medium were first processed on a Nickel column to isolate and concentrate the recombinant Fam20a protein and both protein samples were analyzed by immunoblotting using an antiserum specific for the Myc epitope. 
The predicted molecular weights of the full length and processed forms of Fam20a are 61,500 and 57,500, respectively, and a recombinant form of the protein synthesized in rabbit reticulocyte lysates was run alongside as a molecular size marker. The recombinant protein migrated just below the 62,000 mol.wt. size marker (Figure 8A and 8B, lane 5); however, the proteins detected in both the cell medium and cell extract migrated slower (lane 3). To test whether this slower migrating band represented a glycosylated form of Fam20a, the protein samples were treated with the enzyme N-glycosidase F (PNGaseF). The protein detected after enzyme treatment migrated more rapidly than the untreated protein and comigrated with the recombinant form of the protein (compare lanes 3, 4 and 5). We noted a second band that migrated slightly more slowly than the recombinant protein in the PNGase F treated cell extracts that may represent an alternatively modified form of Fam20a (Figure 8B, lane 4). To confirm that Fam20a is a secreted protein, we also exposed Fam20a-expressing cells to Brefeldin A, a fungal metabolite that specifically blocks transport from the ER to the Golgi apparatus, and examined the effects on Fam20a secretion. Brefeldin A treatment resulted in a consistent decrease in the amount of Fam20a detected in the cell medium (Figure 8C, compare lanes 5 and 6). Thus, Fam20a is a secreted glycoprotein.\nFigure 8 Fam20a is a secreted protein. COS-1 cells were transfected with either an empty expression vector (-) or one encoding mouse Fam20a with a C-terminal myc epitope tag and proteins were isolated from either the medium (panel A) or the cells (panel B). The proteins were analyzed by immunoblotting using a Myc tag-specific antiserum. Samples in lanes 2 and 4 of each blot were pre-treated with protein N-glycosidase prior to analysis to remove glycosyl groups. 
A recombinant form of Fam20a synthesized in rabbit reticulocyte lysates (TnT) was included on each gel as a size marker. The position of glycosylated and deglycosylated Fam20a is indicated using arrowheads and cross-reacting material detected in the medium is indicated using asterisks. The location of molecular size markers is shown on the left of each gel. C. Protein samples from the medium of transfected cells that were untreated or treated with Brefeldin A were analyzed by immunoblotting using the Myc tag-specific antiserum. As Brefeldin A was resuspended in DMSO, the untreated cells were exposed to DMSO alone as a vehicle control. The amount of Fam20a detected in the medium of Brefeldin A treated cells was consistently lower than that observed in untreated cells (indicated using an arrowhead).\n\nFam20a secretion requires a functional signal sequence\nWe next tested whether the integrity of the signal sequence was required for Fam20a secretion. Signal sequences typically contain a high proportion of hydrophobic amino acids and 19 of the first 34 amino acids of Fam20a are hydrophobic (Figure 9A). Therefore, we expressed a Fam20a protein lacking the first 23 amino acids (Fam20a(Δ23)) in COS-1 cells and examined secreted and intracellular proteins by immunoblotting (Figure 9B). Glycosylated Fam20a(Δ23) protein was not detected in the medium (compare lanes 2 and 3) and immunoreactivity that comigrated with the unglycosylated recombinant protein was detected in the cell extract (lane 6). We also compared the subcellular location of the Fam20a(Δ23) protein to the wild type protein using GFP fusion proteins (figure 9C). The wild type Fam20a-GFP proteins displayed perinuclear and cytoplasmic staining consistent with ER localization. In contrast, the Fam20a(Δ23)-GFP protein was absent from the cytoplasm and appeared to be exclusively localized within the nucleus. 
To ensure that this effect was not a consequence of a gross change in protein structure due to the deletion of 23 amino acids, we also constructed an expression vector encoding a Fam20a protein with a two amino acid substitution within the putative signal sequence (Figure 9A). These changes (Leu14–Leu15 to Asp-Glu) were predicted to disrupt the signal sequence without grossly altering the protein structure. Again, the mutant protein displayed nuclear staining and was absent from the ER (Figure 9D). These results confirm that an intact signal sequence was necessary for secretion of Fam20a and that secretion was accompanied by prominent localization of the protein to the ER.\nFigure 9 Secretion of Fam20a requires an intact signal sequence. A. Schematic representation of the putative signal sequence of Fam20a. The predicted cleavage site is indicated with a red arrowhead. The two amino acid substitutions introduced in the SSmut construct and the sequence remaining in the Δ23 mutant construct are shown. B. Immunoblot analysis of Fam20a and Fam20a(Δ23) protein levels in transfected COS-1 cells. The position of the glycosylated form of Fam20a (which is absent in Fam20a(Δ23) transfected cells) is indicated with an arrowhead. C. Fluorescence images of COS-1 cells expressing either Fam20a-GFP or Fam20a(Δ23)-GFP. The wild type protein was observed within the cytoplasm, predominantly in a structure that is likely to be the ER. The mutant protein was primarily localized to the nucleus. D. Immunofluorescence images of Fam20a and Fam20a(SSmut) proteins as detected by antiserum directed against the C-terminal Myc epitope. The wild type protein was again detected in the ER and the mutant protein primarily in the nucleus. The cells have been counterstained with DAPI to delineate the nucleus.\nFigure 10 RT-PCR analysis of mouse Fam20 mRNA levels during differentiation of EML and MPRO cells. 
Total RNAs were prepared from EML (panel A) or MPRO (panel B) cells at the indicated timepoints during myeloid and granulocytic differentiation. cDNAs prepared from each sample were amplified using primer pairs specific to each mouse family member. The PCR products were analyzed by agarose gel electrophoresis and stained using Gelstar SYBR Green DNA stain. GAPDH was used as a loading control.\n\nExpression patterns of FAM20 genes during myeloid differentiation\nWe originally identified Fam20a as a differentially expressed mRNA in EML cells induced to differentiate along the myeloid lineage. To determine whether Fam20b and Fam20c are also expressed during hematopoiesis, we performed RT-PCR analysis of cDNAs prepared at various times during experimentally-induced differentiation of EML and MPRO cells using primers specific to each family member. Fam20a mRNA levels were low in uninduced EML cells maintained in the presence of SCF and increased during the subsequent 72 hours of incubation in atRA and IL-3 (Figure 10A). EML cells mature to the promyelocyte stage of neutrophil differentiation under these conditions and can subsequently be differentiated into neutrophils by adding GM-CSF in place of SCF and IL-3. Fam20a mRNA levels decreased during terminal neutrophil differentiation in EML cells and also in MPRO cells induced to undergo the same differentiation process in the presence of atRA (Figure 10A and 10B). Fam20b and Fam20c mRNAs were readily detected in both cell lines and their levels did not vary dramatically during the differentiation process in either cell line (Figure 10A and 10B).\n\nExpression patterns of FAM20 genes in human tissues\nAlthough we originally isolated Fam20a from a hematopoietic cell line, cDNAs and ESTs derived from each of the FAM20 family members have been isolated from non-hematopoietic tissues (data not shown). Therefore, we examined the expression patterns of the three genes in a panel of cDNAs derived from various human tissues. 
FAM20A displayed the most restricted expression pattern with high levels in lung and liver and intermediate levels in thymus and ovary (Figure 11). Low levels of FAM20A mRNA were detected in several other tissues. FAM20B and FAM20C were expressed in a wider variety of tissues and their expression patterns were very similar.\nFigure 11 RT-PCR analysis of human FAM20 mRNA levels in human tissues. A panel of commercially available human cDNAs prepared from the indicated tissues was analyzed by PCR using primer pairs specific for each of the human FAM20 family members. GAPDH was again used as a loading control although large variations were observed in the GAPDH signal in the different tissues.\n\nDi"}