PMC:3514855

Microevolution in Cyanobacteria: Re-sequencing a Motile Substrain of Synechocystis sp. PCC 6803

Abstract

Synechocystis sp. PCC 6803 is a widely used model cyanobacterium for studying photosynthesis, phototaxis, the production of biofuels and many other aspects. Here we present a re-sequencing study of the genome and seven plasmids of one of the most widely used Synechocystis sp. PCC 6803 substrains, the glucose-tolerant and motile Moscow or ‘PCC-M’ strain, revealing considerable evidence for recent microevolution. Seven single nucleotide polymorphisms (SNPs) specifically shared between ‘PCC-M’ and the ‘PCC-N’ and ‘PCC-P’ substrains indicate that ‘PCC-M’ belongs to the ‘PCC’ group of motile strains. The identified indels and SNPs in ‘PCC-M’ are likely to affect glucose tolerance, motility, phage resistance, certain stress responses as well as functions in the primary metabolism, potentially relevant for the synthesis of alkanes. Three SNPs in intergenic regions could affect the promoter activities of two protein-coding genes and one cis-antisense RNA. Two deletions in ‘PCC-M’ affect parts of clustered regularly interspaced short palindromic repeats-associated spacer-repeat regions on plasmid pSYSA, in one case by an unusual recombination between spacer sequences.

1. Introduction

With currently >4000 publications available from PubMedCentral alone, ‘Synechocystis’ is the most widely used photoautotrophic prokaryotic model organism. Synechocystis sp. PCC 6803 is a unicellular cyanobacterium that was isolated from a freshwater pond in Oakland, California.1 The high popularity of Synechocystis sp. PCC 6803 stems from two facts: it was the first phototrophic and the third organism overall for which a complete genome sequence was determined,2 and it easily takes up exogenous DNA and integrates it into its chromosome by homologous recombination.3–5 Synechocystis sp.
PCC 6803 is known to occur in several distinct substrains, all going back to the same isolate deposited in the Pasteur Culture Collection.6 Indeed, several studies reported differences between the genome sequence of Synechocystis sp. PCC 6803 published in 1996 (called here the ‘GT-Kazusa’ substrain) and the actual sequence found in different laboratories.7–10 A strain history has been proposed by Ikeuchi and Tabata8 with an early branching into the motile PCC strain and the non-motile ATCC 27184 strain. The latter lost motility due to a 1-bp insertion in the spkA gene coding for a eukaryotic-type Ser/Thr protein kinase11 and represents the origin of the glucose-tolerant (GT) strains,5 to which the ‘GT-Kazusa’ substrain also belongs.

For decades, Synechocystis sp. PCC 6803 has served as a simple model in photosynthesis research and to solve fundamental questions in microbial and plant physiology. More recently, cyanobacteria have increasingly been recognized as a promising resource for the production of biofuels such as hydrogen,12 ethanol,13 isobutyraldehyde and isobutanol,14 ethylene15 and alkanes.16 Synechocystis sp. PCC 6803 is being developed further as a model in these biotechnology- and systems biology-oriented studies. These facts as well as the search for motility-associated genes prompted several re-sequencing studies of Synechocystis sp. PCC 6803 substrains, namely of the substrains GT-S,10 PCC-P, PCC-N and GT-I,9 and YF.17 However, these studies have not included the widely used GT and motile ‘Moscow’ substrain, which we propose to call ‘PCC-M’. Furthermore, thus far no attention has been paid to possible sequence variations in the seven plasmids, which have a total sequence length of 383 486 bp, almost 10% of the total coding capacity of Synechocystis sp. PCC 6803. This analysis provides new and reliable sequence data for the Synechocystis sp.
PCC 6803 substrain ‘PCC-M’, revealing several differences from the published sequence that can be interpreted as the traces of microevolution during cultivation in the laboratory.

2. Materials and methods

2.1. Origin of strain, isolation of DNA and PCR analysis

Synechocystis sp. PCC 6803 substrains ‘Moscow’ (here called ‘PCC-M’), ‘Kazusa’ (GT-Kazusa) and ‘Vermaas’ (GT-V) were cultivated by Prof. Annegret Wilde (University of Freiburg, Germany) and maintained as frozen stocks. The ‘PCC-M’ substrain was originally obtained from the laboratory of S. Shestakov (Moscow State University) in 1993 and over the years carefully propagated for motile colonies. The ‘GT-V’ strain originates from the laboratory of W. Vermaas (Arizona State University).

Genomic DNA for deep sequencing analysis was isolated from 80 ml cultures harvested on a glass microfiber filter (GF/C, 47 mm i.d., Whatman) by vacuum filtration. The frozen filter was ground in a mixer mill (Dismembrator MM301, Retsch, Germany) and the powder transferred into 1 ml SET buffer on ice (25% (w/v) sucrose, 1 mM EDTA, 50 mM Tris pH 7.5). One-fourth volume of 0.5 M EDTA, 2% SDS and 1.5 mg proteinase K (Sigma) were added for cell lysis at 50°C overnight. Following phenol/chloroform extraction, one volume of 2-propanol (Roth, Germany) was added to precipitate the DNA at room temperature for 30 min. The precipitate was washed once in H2O/2-propanol 1:1 and once in 2-propanol, followed by 10 min centrifugation at 10 000 g, 4°C. The pellet was washed with 70% EtOH, dried for 10 min and re-suspended in 50 µl H2O. One microlitre of RNase A (Sigma) was added and the tube incubated at 37°C and 260 rpm overnight. RNase was removed by another round of phenolic extraction and precipitation as described above. The DNA was re-suspended in 75 µl H2O, its concentration was measured photometrically and its quality checked on a gel (0.8% agarose).

Genomic DNA for PCR was isolated from the cell pellet of 1 ml Synechocystis liquid culture.
The pellet was washed once with a 1:10 dilution of TE buffer (10 mM Tris HCl pH 8; 1 mM EDTA) and re-suspended in 70 µl of the same buffer. Cells were broken by incubation at 98°C for 10 min. After centrifugation at 14 000 g and 4°C for 5 min, the supernatant was collected and kept on ice. Two microlitres of it were used for PCR. For PCR reactions, Phusion® DNA polymerase (Finnzymes, New England Biolabs) was used according to the manufacturer's instructions. To verify single nucleotide polymorphisms (SNPs) between the different substrains, ∼500 bp fragments containing the SNP position were amplified. PCR products were excised from an agarose gel, purified (illustra GFX PCR DNA and Gel Band Purification Kit, GE Healthcare) and sent for Sanger sequencing to GATC Biotech (Konstanz, Germany). For sequencing of the small plasmids, several PCR reactions were performed to obtain overlapping sequences, and contigs were assembled using the software ContigExpress (Vector NTI Advance 11, Invitrogen). Alignments of the sequences were performed using AlignX (Vector NTI Advance 11, Invitrogen).

2.2. Sequencing methods and mapping

Sequencing of genomic DNA was carried out on an Illumina Genome Analyzer IIx system. Prior to sequencing, the DNA was sheared by ultrasonication (Covaris, Woburn, MA, USA), resulting in fragments with an average length of 300 bp. Paired-end sequencing of these fragments was carried out according to the manufacturer's protocol, resulting in 42 143 495 reads of 101 nt length. These reads were analysed with two methods in order to identify SNPs, deletions and insertions. For the first approach, we used the DNA sequence data assembler algorithm MIRA (Mimicking Intelligent Read Assembly)18 to perform an assembly of the reads using the ‘GT-Kazusa’ genome as the reference. In the assembly process, MIRA generates tables of candidate SNPs, insertions and deletions.
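As a rough plausibility check on the sequencing depth implied by these read numbers, the expected mean coverage can be estimated as total sequenced bases divided by target length. The sketch below is illustrative only: the read count, read length and combined plasmid length are taken from the text, whereas the chromosome length is an assumption (the commonly cited size of the ‘GT-Kazusa’ chromosome, which is not stated in this section). It also assumes that every read maps, so it is an upper bound rather than the coverage actually obtained.

```python
def mean_coverage(n_reads: int, read_len: int, target_len: int) -> float:
    """Expected mean depth: total sequenced bases divided by target length."""
    return n_reads * read_len / target_len


N_READS = 42_143_495        # number of reads, from the text
READ_LEN = 101              # nt per read, from the text
PLASMIDS_BP = 383_486       # combined length of the seven plasmids, from the text
CHROMOSOME_BP = 3_573_470   # ASSUMPTION: chromosome length, not given in this section

coverage = mean_coverage(N_READS, READ_LEN, CHROMOSOME_BP + PLASMIDS_BP)
print(f"approximate mean coverage: {coverage:.0f}-fold")
```

Under these assumptions the estimate is on the order of 1000-fold, which is the same order of magnitude as a deep re-sequencing run of a bacterial genome of this size would be expected to give.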
We verified these results independently by mapping all sequencing reads to the assembled chromosome and plasmid sequences. This was done using segemehl,19 requiring at least 85% accuracy and reporting only the best hit. It should be noted that segemehl reports co-optimal best hits.

3. Results

3.1. Overview

Sequencing of the Synechocystis sp. PCC 6803 ‘Moscow’ substrain ‘PCC-M’ by Illumina (Solexa) yielded an average 1100-fold coverage of the chromosome and five of the seven plasmids. The existence of the two remaining plasmids was verified individually by PCR. Following assembly of sequences, mapping to the reference strain sequences and annotation, the obtained genome and plasmid sequences were deposited in the GenBank database with the accession numbers CP003265–CP003272.

Altogether, we found 45 differences (36 SNPs and 9 indels >1 bp) between the investigated substrain ‘PCC-M’ and the published sequences of the ‘GT-Kazusa’ chromosome2 and plasmids20 used here as references (Table 1). Of these differences, 41 are located in the chromosome and four in the plasmids pSYSA, pSYSM and pCA2.4. For verification, about one-third of these differences were randomly chosen and checked independently by PCR and Sanger sequencing of the respective regions in substrain ‘PCC-M’; no misidentified mutations were found. These DNA regions were, in addition, amplified and compared with the sequences from substrains ‘GT-Kazusa’ and ‘GT-V’ for control and comparison, respectively. The GT substrain ‘GT-V’ was chosen for comparison as it is widely used for the dissection and analysis of photosynthetic mutants. Fully segregated PSI, PSII and Chl biosynthesis mutants were successfully generated in this genetic background21,22 and some of these mutants could not be obtained in other substrains.23

Table 1.
Location and effects of SNPs and indels found in ‘PCC-M’ compared with the nucleotide sequence of ‘GT-Kazusa’ in the database. The events are numbered (column #), the type of mutation (M) is indicated as S, substitution, D, deletion or I, insertion, together with the respective start and end positions in the ‘GT-Kazusa’ reference sequence. For each event the respective nucleotide change is indicated on the forward strand, together with the resulting codon modification (Ref. → Mut) and amino acid change, if any. Highlighted in italics are four instances of missing ISY203 copies and in bold all SNPs affecting intergenic spacer regions (IGR). a Indicates errors in the database.

The number of differences between ‘PCC-M’ and ‘GT-Kazusa’ is almost twice that reported by Tajima et al.10 for the GT ‘Kazusa’ strain (GT-S), where a total of 22 differences from the published sequence were found.10 All but 3 of those 22 differences were also detected in the ‘PCC-M’ strain studied here. The three differences unique to ‘GT-S’ and the 26 differences between ‘PCC-M’ and ‘GT-Kazusa’ that are not shared with ‘GT-S’ underline the existence of lineage splitting in the Synechocystis substrains. Moreover, we found seven SNPs (#5, 13, 15, 16, 27, 32 and 33 in Tables 1 and 2) and one larger indel (#6 in Tables 1 and 2) specifically shared between ‘PCC-M’ and the ‘PCC-N’ and ‘PCC-P’ substrains, indicating that ‘PCC-M’ belongs to the ‘PCC’ group of motile substrains.9 ‘PCC-M’ and ‘PCC-P’ both exhibit the native positive phototaxis, whereas the ‘PCC-N’ strain shows negative phototaxis.24

Table 2. Comparison of SNPs and indels found in the chromosome of ‘PCC-M’ with sequences from other substrains. All events are numbered (column #) as in Table 1. The presence of the respective ‘PCC-M’ mutation in the different substrains is indicated by the check marks. a The deletion of 0.6 kb in the gene slr1753 compared with the reference was also verified here in ‘GT-Kazusa’.

3.2. SNPs in protein-coding genes

Of the total of 36 SNPs in ‘PCC-M’ compared with ‘GT-Kazusa’, all except one are located in the chromosome. The single base substitution that was found on the plasmid pCA2.4 within the repA gene (#42 in Table 1) appears to be not a mutation but an error in the published sequence of ‘GT-Kazusa’, since in our PCR control experiments the sequence was identical in the three strains ‘GT-Kazusa’, ‘PCC-M’ and ‘GT-V’. Of the 35 chromosomal SNPs compared with ‘GT-Kazusa’, 5 are silent base substitutions, 14 substitutions lead to amino acid substitutions, in 6 cases a single basepair is deleted and in 2 cases (#23 and #28) one basepair was inserted within an ORF, causing a frameshift mutation. Furthermore, five substitutions, two single basepair insertions and one single basepair deletion were observed in intergenic regions (IGR) of ‘PCC-M’ compared with the reference (Table 1).

Seven SNPs are specifically shared between the ‘PCC-M’, ‘PCC-N’ and ‘PCC-P’ substrains. These are in slr1865 (#13), encoding a hypothetical protein, in sll1951 (#15), encoding a haemolysin-like protein, in slr1983 (#16), encoding a two-component hybrid sensor and regulator protein, in slr0222 (#27), encoding the histidine kinase Hik25, a silent mutation in slr0302 (#32), encoding a PAS/PAC and GAF sensors-containing diguanylate cyclase, one missing basepair, leaving the spkA gene intact (#5) and, finally, in ssr1176 (#33), encoding a transposase (Tables 1 and 2). The gene for a cell surface-localized haemolysin-like protein, HlyA (sll1951), reported to function as a barrier against the adsorption of toxic compounds,25,26 is lacking one nucleotide in ‘PCC-M’ compared with the reference (difference #15).
In the ‘GT-Kazusa’, ‘GT-V’ as well as the ‘GT-I’ and ‘GT-S’ strains,9 the presence of the additional A leads to the fusion of two ORFs that are separate in the ‘PCC-M’, ‘PCC-N’ and ‘PCC-P’ substrains.9 As a result, Sll1951 is 1741 amino acids in the former and only 1437 residues in the latter.

Our data confirm some other previously published mutations.8,10 For instance, spkA (sll1574; #5), a regulator of cellular motility via phosphorylation of membrane proteins,11,27 is disrupted by a 1-bp insertion in the non-motile ‘GT-Kazusa’ and ‘GT-V’ strains, whereas it is intact in the motile ‘PCC-M’ strain (Table 1). Similarly, the pilC gene (slr0162/3) required for pili assembly has been reported to carry a frameshift mutation in the ‘GT-Kazusa’ and ‘GT-S’ sequences.8,10,28 We found an intact pilC gene in ‘PCC-M’ (#20), as well as in the ‘GT-V’ substrain. Another SNP (G–A) exists in psaA (slr1834; #9), encoding the photosynthetic P700 apoprotein subunit Ia; however, in accordance with Tajima et al.,10 we believe this is an annotation error in the database as we found an A in the respective position in all three strains dealt with in this work (Table 1). Similarly, ycf22 (sll0751; #26) is here suggested to be fused to the downstream reading frame sll0752. Indeed, in blastp comparisons, both proteins together match against a single, widely distributed, larger protein of 452 amino acids. This protein possesses a Ttg2C domain (COG1463), which is found in an ABC-type transport system involved in resistance to organic solvents. The acronym ycf stands for hypothetical chloroplast reading frames, meaning proteins conserved in chloroplasts and also cyanobacteria. The 1-bp shorter version, which is split into sll0751/sll0752, is a database error in the case of ‘GT-Kazusa’ as well.

3.2.1. SNPs unique to ‘PCC-M’

Six of the 10 SNPs unique to ‘PCC-M’ are located within coding regions and cause amino acid substitutions or alter the length of the respective reading frame.
A single basepair transversion in the gene sigF (slr1564; #39 in Table 1) leads to an M231K substitution within the −35 element DNA-binding region29 of a group 3 sigma factor required for phototactic movement30 and salt-stress response.31 This SNP cannot lead to impaired motility as ‘PCC-M’ is motile, but it might influence the DNA–protein interaction of SigF because positively charged residues such as lysine located in this part of the σ4.2 region can directly interact with DNA.29 Another transversion, in argB (slr1898; #8 in Table 1), leads to an S2N amino acid substitution in N-acetylglutamate kinase, the enzyme performing the first committed step of Arg biosynthesis. Transitions in sll1359 and slr1609 (#11 and #3 in Table 1) result in an N–K substitution at a very conserved position within a predicted cytochrome and an L608S (L548S) substitution in the long-chain acyl-CoA-synthetase Slr1609 that has been found to be crucial for fatty acid activation and the biosynthesis of alkanes.32 Interestingly, an unrelated SNP exists at position 488 923 within the slr1609 coding sequence in a strain ‘YF’, leading to a G546L (G486L) substitution.17 It should be noted that the slr1609 reading frame has been annotated 60 codons shorter (636 instead of 696 amino acids) during recent re-sequencing analyses,9,10 compared with the original annotation of ‘GT-Kazusa’; residue numbers given in brackets refer to this shorter annotation. The shorter Slr1609 protein of 636 amino acids is also consistent with the mapped start site of transcription at position 487 352 (ref. 33), located 115 nt upstream of the revised start codon. A transition in slr0753 (#41 in Table 1) leads to a P113L substitution in a putative chloride efflux transport protein involved in maintaining the chloride ion concentration homoeostasis as required for a functional photosystem II.34 A single basepair deletion in sll1496 (#38 in Table 1), encoding mannose-1-phosphate guanyltransferase, causes a frameshift and premature stop of the gene in ‘PCC-M’.
With 515 instead of 643 amino acids, the resulting protein is severely truncated and may be non-functional.

3.3. Point mutations in IGRs

Compared with the reference, eight SNPs are located in IGRs; three of these (#7, 24 and 36) are ‘PCC-M’ specific. One of these SNPs (#36 in Table 1) is predicted to affect one of the recently reported cis-antisense RNAs.33 The additional A between positions 3 194 022 and 3 194 023 is located in the IGR between genes slr0533 and slr0534, encoding histidine kinase 10 (Hik10) and the soluble lytic transglycosylase Slt. On the reverse strand, the additional T falls within the predicted −10 element of the slr0534_as3 promoter. Instead of the high-scoring CATAAT,33 the motif is changed to ATTAAT. Hence, a modulation of slr0534_as3 expression compared with the reference is possible. In contrast to its designation, this cis-antisense RNA overlaps the 3′ end of gene slr0533 (hik10), owing to an error in the annotation used as the reference. In microarray analyses, slr0534_as3 of strain ‘PCC-M’ was found to be moderately to highly expressed under four tested conditions. Compared with the accumulation of the hik10 mRNA, it appeared even stronger.33 A function for Hik10 has been found in the perception of salt stress or transduction of the signal.35 The slr0534_as3 transcript may play a silencing role with regard to hik10 under non-inducing conditions. Mutation of its promoter element may hence cause a physiological effect in the salt stress response.

Two other SNPs (at positions 831 647 and 2 400 722; #7 and #24 in Table 1) could have an impact on the promoter strength or the regulation of the genes infA and glcP. For glcP, the initiation site of transcription was mapped to position 2 400 666 (ref. 33) and for infA to position 831 635 (unpublished). Thus, these two SNPs are located 12 and 56 nt upstream of the respective initiation site of transcription.
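The two distances quoted above follow directly from the genomic coordinates given in the text. The short sketch below reproduces that arithmetic; the function and dictionary names are illustrative and not part of the original analysis. It computes only the absolute distance between a SNP and the mapped transcription start site. Whether that distance lies upstream or downstream depends on the strand of the gene, which is taken from the text rather than derived here.

```python
def distance_to_tss(snp_pos: int, tss_pos: int) -> int:
    """Absolute distance in nt between a SNP and a transcription start site."""
    return abs(snp_pos - tss_pos)


# Coordinates in the 'GT-Kazusa' chromosome, as given in the text.
sites = {
    "infA": {"snp": 831_647, "tss": 831_635},
    "glcP": {"snp": 2_400_722, "tss": 2_400_666},
}

distances = {gene: distance_to_tss(s["snp"], s["tss"]) for gene, s in sites.items()}
print(distances)  # {'infA': 12, 'glcP': 56}
```

A distance of 12 nt places the infA SNP in the region where a −10 element is typically found, whereas 56 nt lies beyond the core promoter, in a region where transcription factor binding sites are plausible.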
In the case of the infA promoter, the transition replaces a nucleotide within the putative −10 element, changing it from TGTGAT to TATGAT, a much more typical motif for a −10 element in Synechocystis.33 The mutation 56 nt upstream of the initiation site of transcription of glcP might be functionally relevant as well. The gene product, a glucose transporter, is directly relevant for the physiological ability to use glucose; its gene expression is affected by mutation of the gene for the AbrB-type transcription factor Sll0822.36 The region at position −56 might well be part of the recognized sequence.

3.4. Larger indels and plasmids

In addition to this relatively large number of SNPs, only seven larger deletions were found on the chromosome and two plasmids. Compared with the reference, a deletion of 0.6 kb exists in the gene slr1753 (#4 in Table 1), which encodes, according to our data, a giant protein comprising 1549 amino acids that is probably transported to the cell surface. However, in our verification we found this deletion also in ‘GT-Kazusa’ and ‘GT-V’. Moreover, the deleted/inserted region consists of long series of DNA repeats (Fig. 1), evidence for a possible assembly or annotation error in the original sequence analysis.

Figure 1. Alignment of the possible indel region in gene slr1753. The sequence obtained in the verification experiment is aligned with that of the ‘GT-Kazusa’ reference. Two types of DNA repeats are indicated by the filled and non-filled lozenges.

Given the very scarce available information concerning biological functions of the plasmids in Synechocystis sp. PCC 6803, it was interesting that all seven plasmids were detected during our analysis. Two, pCC5.2 and pCB2.4, were initially not found. Because they were amplified easily by PCR, we re-inspected the unmapped sequencing reads, but still could not detect a single read matching these plasmids.
This observation may relate to a lower copy number of these compared with the other plasmids, but this was not tested in the current study.

Analysing the plasmid sequences, we observed a remarkable genetic stability. In addition to a single-base substitution in the plasmid pCA2.4 that might rather constitute an error in the reference sequence37 (see above) and a missing mobile element on the plasmid pSYSM, two mutations were observed, both in the plasmid pSYSA. These two major mutations affect the clustered regularly interspaced short palindromic repeats (CRISPR)-CRISPR-associated proteins (Cas) system located on the plasmid pSYSA. CRISPR-Cas systems provide in many archaea and bacteria an adaptive immunity against invading DNA.38–44 The plasmid pSYSA encodes the three independent systems CRISPR1, CRISPR2 and CRISPR3. A 2399-bp deletion encompassing the spacer-repeat regions 15–47 of CRISPR1 was detected in ‘PCC-M’ (#43), which also eliminated the relatively short genes ssr7018, ssl7019, ssl7020 and ssl7021, annotated within the spacer-repeat array of CRISPR1. However, the theoretical protein sequences of these gene products show no conservation at all and might not constitute real genes. Nevertheless, the deletion of spacer-repeat regions 15–47 of CRISPR1 is severe, since compared with the reference, it has eliminated two-thirds, 33 of its 49 spacer-repeat units. The sequence analysis suggests that the recombination events leading to the deletion of spacer-repeat regions 15–47 must have occurred within the direct repeats. Thus, this recombination is in agreement with previous observations that the downstream ends of the repeat clusters are conserved such that deletions and recombination events occur internally.45

A very different type of deletion was noticed for the CRISPR2 system located on the same plasmid. In this case, 159 bp were deleted (event #44 in Table 1). These 159 deleted bases correspond to positions 71 499–71 657 in the reference.
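The sizes reported for the two pSYSA deletions can be cross-checked with simple interval arithmetic. The sketch below is illustrative (the helper name is hypothetical) and uses only numbers stated in this article: the CRISPR2 coordinates, the CRISPR1 spacer-repeat range and deletion size, and the pSYSA lengths the paper reports for the reference (103 307 bp) and for ‘PCC-M’ (100 749 bp).

```python
def inclusive_length(start: int, end: int) -> int:
    """Number of positions in a closed interval [start, end]."""
    return end - start + 1


# CRISPR2: deleted positions 71 499-71 657 in the reference.
crispr2_deleted_bp = inclusive_length(71_499, 71_657)

# CRISPR1: spacer-repeat units 15-47 lost, out of 49 units in the reference.
crispr1_units_lost = inclusive_length(15, 47)
crispr1_fraction_lost = crispr1_units_lost / 49

# pSYSA length in 'PCC-M' = reference length minus both deletions.
PSYSA_REFERENCE_BP = 103_307
CRISPR1_DELETED_BP = 2_399
psysa_pccm_bp = PSYSA_REFERENCE_BP - CRISPR1_DELETED_BP - crispr2_deleted_bp

print(crispr2_deleted_bp, crispr1_units_lost, round(crispr1_fraction_lost, 2), psysa_pccm_bp)
```

The three values agree with the text: 159 bp for the CRISPR2 deletion, 33 of 49 units (about two-thirds) for CRISPR1, and 100 749 bp for pSYSA in ‘PCC-M’.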
The deletion encompasses two repeats including the spacer 41 in between. It is very surprising that the recombination did not occur within the repeat sections but in the adjacent spacers 40 and 42, thus generating a new ‘hybrid’ spacer 40 at positions 69 082–69 111 in the pSYSA plasmid of ‘PCC-M’ (Fig. 2). As a result, spacers 40, 41 and 42 of the original sequence are missing and have been replaced by this hybrid sequence. The vast majority of described deletions in the CRISPR system occur between the direct repeats.45 Non-homologous recombination between two different spacers is rare; the deletion observed here in CRISPR2 of the plasmid pSYSA thus generates additional sequence diversity in the CRISPR system. Due to the two deletions in the plasmid pSYSA, we determined its total length as 100 749 bp, compared with 103 307 bp for the reference.

Figure 2. Non-homologous recombination in the plasmid pSYSA affecting spacers 40, 41 and 42 of CRISPR2. As a result of the 159-bp deletion in ‘PCC-M’ compared with ‘GT-Kazusa’, a novel hybrid spacer 40 was generated. The direct repeats are presented as squares and the nucleotide positions in ‘GT-Kazusa’ are given according to the GenBank file NC_005230.

3.5. Mobile elements

As can be seen in Tables 1 and 2 (differences #12, 17, 40 and 45), the ‘PCC-M’ substrain lacks four insertion elements of the ISY203 type present in ‘GT-Kazusa’.7 These elements are ISY203b, e and g on the chromosome and ISY203j on the plasmid pSYSM. Three of these four indels have exactly the same size of 1183 bp; the fourth is 1185 bp. In the ‘GT-S’ substrain re-sequenced by Tajima et al.,10 one of these four elements, ISY203e, is already present, placing this strain (in accordance with Ikeuchi and Tabata8) before ‘GT-Kazusa’ in the strain history.
The absence of ISY203b, e and g in ‘PCC-M’ is further shared with the strains ‘GT-I’, ‘PCC-N’ and ‘PCC-P’,9 whereas no statement is possible with regard to the possible presence of ISY203j on the plasmid pSYSM in the latter. With respect to the described mobile elements, ‘PCC-M’ appears as one of the least-derived substrains.

4. Discussion

4.1. Strain history

‘PCC-M’ shows sequence differences in several genes compared with the reference sequence of ‘GT-Kazusa’ and also with the recently sequenced ‘GT-S’ strain. Kanesaki et al.9 concluded that 15 differences between the resequenced strains and the published GT-Kazusa sequence were annotation errors in the latter due to sequencing artefacts, a list to which we add two more putative errors in the database, differences #4 and #42 in Table 1. According to the proposed strain history in Ikeuchi and Tabata,8 the early division of Synechocystis sp. PCC 6803 into two branches occurred due to an insertion in spkA. Thus, our data suggest that the motile ‘PCC-M’ strain belongs to the motile PCC 6803 branch, whereas the non-motile ‘GT-Kazusa’, ‘GT-S’ and ‘GT-V’ strains are more closely related to each other and belong to the ATCC 27184 branch. However, the 1-bp insertion in pilC leading to ‘GT-Kazusa’ as described in the proposed strain history8 is not present in either ‘GT-S’ or ‘GT-V’, characterizing ‘GT-Kazusa’ as a more derived substrain. That ‘PCC-M’ belongs to the motile PCC 6803 branch is further reinforced by our finding of six SNPs specifically shared between ‘PCC-M’ and the ‘PCC-N’ and ‘PCC-P’ substrains (Tables 1 and 2).9 These six SNPs are in slr1865, in sll1951, encoding a haemolysin-like protein, in ssr1176, encoding a transposase and, interestingly, in genes encoding sensor and/or regulatory proteins (slr1983, slr0222 and slr0302) (Tables 1 and 2), and must already have been present in the progenitor strain of ‘PCC-M’, ‘PCC-N’ and ‘PCC-P’.
Additional support comes from the analysis of two larger indels (#2 and #6 in Table 1). The preceding paper, Kanesaki et al.,9 described difficulties in finding indels between direct repeat sequences such as those in slr1084 and slr2031 using short-read re-sequencing data. Therefore, these two regions were analysed by PCR and Sanger sequencing in addition to the re-sequencing analysis. Indeed, finding indels between direct repeat sequences in genes slr1084 and slr2031 turned out not to be straightforward in our analysis either. Compared with the reference, we found in both cases the additional sequences of 102 and 154 bp to be present in ‘PCC-M’. This result is relevant for lineage relationships among substrains. The additional 102 bp in gene slr1084 are shared between ‘PCC-M’ and the other substrains ‘PCC-P’, ‘PCC-N’ and ‘GT-I’. Therefore, this must be a deletion in the lineage leading to GT-Kazusa and GT-S. In contrast, the additional 154 bp within and upstream of gene slr2031 are shared between ‘PCC-M’, ‘PCC-P’ and ‘PCC-N’ and are absent from all studied GT substrains. These 154 bp comprise the conserved start codon of slr2031 and extend the gene by 29 codons compared with ‘GT-Kazusa’. Hence, the lack of these 154 bp in GT strains indicates a functionally adverse deletion there. In fact, the 154-bp deletion in GT substrains was noticed before,46 as was the activity of slr2031 in the original Synechocystis sp. PCC 6803 substrains.47

From these considerations, the tree shown in Fig. 3 can be derived. In this tree, ‘GT-Kazusa’ is displayed as the strain with the longest evolutionary distance from the original isolate, whereas the ‘PCC-M’ substrain belongs to the ‘PCC’ group of substrains and is probably close to the original characteristics.
All strains belonging to the ‘PCC’ group of substrains exhibit twitching motility, as was also shown for the original PCC strain deposited in the Pasteur Culture Collection,6 with variations in the motility behaviour.48,49 Since ‘PCC-M’ shows motility and is tolerant to glucose, it appears physiologically as a sort of intermediate between the two major branches, the motile and GT branches, consistent with its characterization as being close to the original characteristics.

Figure 3. Visualization of phylogenetic relationships between various strains of Synechocystis sp. PCC 6803. The occurrence of the identified SNPs and other known events are indicated along the branches. The eight events separating the ‘GT’ and ‘PCC’ strains from each other are given at the branch point where these two lineages split or on the respective branches where they occurred. Putative insertions and deletions are labelled ‘Ins.’ and ‘Del.’, respectively.

4.2. Re-sequencing studies of Synechocystis sp. PCC 6803

The analysis of genome sequences of cyanobacteria has had a large impact on photosynthesis, ecology and biotechnology research.50 The present re-sequencing project delivers the new and complete sequence of the Synechocystis sp. PCC 6803 ‘PCC-M’, a substrain used in many laboratories and in several aspects close to the original isolate. Altogether, there are now chromosomal sequences for seven substrains of Synechocystis sp. PCC 6803 available: ‘PCC-M’ (this study); ‘PCC-P’ (positive phototaxis) and ‘PCC-N’ (negative phototaxis), both based on single colonies isolated from the PCC strain and designated according to their direction of phototactic movement;24 ‘GT-I’, the standard strain in Dr Ikeuchi's group;8 ‘YF’17 and ‘GT-S’,10 a current derivative of the original stock of Synechocystis sp.
PCC 6803 from which the chromosomal reference sequence for ‘GT-Kazusa’ was determined in 1996 (ref. 2) and for the large plasmids in 2003,20 whereas the three small plasmids had already been sequenced before.37,51,52

4.3. Mutations potentially linked to phenotype

It is likely that most of the identified differences between the sequenced substrains result from differences in the cultivation conditions in the different laboratories, which have selected for the fixation of one or the other mutation. That also implies that the majority of identified mutations are not silent but linked to a certain effect. Indeed, most mutations in coding regions are not silent, as might be expected, but lead to frameshifts, amino acid substitutions or the truncation of reading frames. Similarly, SNPs in non-coding regions are probably biologically meaningful, too. This idea received support here by linking three ‘PCC-M’-specific SNPs in IGRs to the promoter regions controlling the expression of two protein-coding genes and one antisense RNA. For all these reasons, it appears likely that several of the mutations specific to ‘PCC-M’ or shared with ‘PCC-P’ and ‘PCC-N’ may be related to the known phenotypes of these strains. For example, the truncation of sll1951 (haemolysin) and possible truncation of slr1753 (surface protein) may contribute to a stress-induced clumping phenotype. Several other mutations might cause alterations in glucose tolerance or phototactic behaviour of these substrains. Differences at other loci may affect the phage resistance, stress response or functions in the primary metabolism, potentially relevant for the synthesis of alkanes or the N and C metabolism.
The absence of ISY203g in the sll1473–5 regions in PCC substrains leads to an intact photoreceptor that regulates the expression of an alternative phycobilisome linker gene.53 Regarding phenotypic differences among motile PCC substrains, it might be noteworthy that ‘PCC-M’, despite its general ability to be motile, is not phototactic towards blue light (see direct comparison of strains in Fig. 1 of Fiedler et al.48). Here, the SNP #39 in the sigF gene, known to be involved in the control of phototactic movement,30 might be considered, as the resulting M231K substitution could influence the DNA–protein interaction of this group 3 sigma factor in a very subtle way.

Clearly, the subtle differences in genome sequences have to be considered when choosing a particular substrain for certain experiments and when comparing phenotypes of mutant lines from different laboratories with the wild-type strain. Information on the re-sequenced genome and plasmid sequences including precisely annotated SNPs can be found in the eight sequence files available from GenBank under the accession numbers CP003265–CP003272.

Funding

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7-ENERGY-2010-1) under grant agreement no. 256808, from the German Research Foundation (DFG) project FOR1680 ‘Unravelling the Prokaryotic Immune System’ (grant HE 2544/8-1) to WRH and from grant AL 892/1-4 to SAB.

article-title	Microevolution in Cyanobacteria: Re-sequencing a Motile Substrain of Synechocystis sp. PCC 6803
abstract	Synechocystis sp. PCC 6803 is a widely used model cyanobacterium for studying photosynthesis, phototaxis, the production of biofuels and many other aspects. Here we present a re-sequencing study of the genome and seven plasmids of one of the most widely used Synechocystis sp. PCC 6803 substrains, the glucose tolerant and motile Moscow or ‘PCC-M’ strain, revealing considerable evidence for recent microevolution. Seven single nucleotide polymorphisms (SNPs) specifically shared between ‘PCC-M’ and the ‘PCC-N and PCC-P’ substrains indicate that ‘PCC-M’ belongs to the ‘PCC’ group of motile strains. The identified indels and SNPs in ‘PCC-M’ are likely to affect glucose tolerance, motility, phage resistance, certain stress responses as well as functions in the primary metabolism, potentially relevant for the synthesis of alkanes. Three SNPs in intergenic regions could affect the promoter activities of two protein-coding genes and one cis-antisense RNA. Two deletions in ‘PCC-M’ affect parts of clustered regularly interspaced short palindrome repeats-associated spacer-repeat regions on plasmid pSYSA, in one case by an unusual recombination between spacer sequences.
p	Synechocystis sp. PCC 6803 is a widely used model cyanobacterium for studying photosynthesis, phototaxis, the production of biofuels and many other aspects. Here we present a re-sequencing study of the genome and seven plasmids of one of the most widely used Synechocystis sp. PCC 6803 substrains, the glucose tolerant and motile Moscow or ‘PCC-M’ strain, revealing considerable evidence for recent microevolution. Seven single nucleotide polymorphisms (SNPs) specifically shared between ‘PCC-M’ and the ‘PCC-N and PCC-P’ substrains indicate that ‘PCC-M’ belongs to the ‘PCC’ group of motile strains. The identified indels and SNPs in ‘PCC-M’ are likely to affect glucose tolerance, motility, phage resistance, certain stress responses as well as functions in the primary metabolism, potentially relevant for the synthesis of alkanes. Three SNPs in intergenic regions could affect the promoter activities of two protein-coding genes and one cis-antisense RNA. Two deletions in ‘PCC-M’ affect parts of clustered regularly interspaced short palindrome repeats-associated spacer-repeat regions on plasmid pSYSA, in one case by an unusual recombination between spacer sequences.
body	1. Introduction With currently >4000 publications available from PubMedCentral alone, ‘Synechocystis’ is the most widely used photoautotrophic prokaryotic model organism. Synechocystis sp. PCC 6803 is a unicellular cyanobacterium that was isolated from a freshwater pond in Oakland, California.1 The high popularity of Synechocystis sp. PCC 6803 stems from the two facts that it was the first phototrophic and the third organism overall, for which a complete genome sequence was determined,2 and that it easily takes up exogenous DNA and integrates it into its chromosome by homologous recombination.3–5 Synechocystis sp. PCC6803 is known to occur in several distinct substrains, all going back to the same isolate deposited in the Pasteur Culture Collection.6 Indeed, several studies reported the differences between the genome sequence of Synechocystis sp. PCC 6803 published in 1996 (called here the ‘GT-Kazusa’ substrain) and the actual sequence found in different laboratories.7–10 A strain history has been proposed by Ikeuchi and Tabata8 with an early branching into the motile PCC strain and the non-motile ATCC 27184 strain. The latter lost motility due to a 1-bp insertion in the spkA gene coding for a eukaryotic-type Ser/Thr protein kinase11 and represents the origin of the glucose-tolerant (GT) strains5 to which also the ‘GT-Kazusa’ substrain belongs. For decades, Synechocystis sp. PCC 6803 has served as a simple model in photosynthesis research and to solve fundamental questions in microbial and plant physiology. More recently, cyanobacteria are increasingly being recognized as a promising resource for the production of biofuels such as hydrogen,12 ethanol,13 isobutyraldehyde and isobutanol,14 ethylene15 and alkanes.16 Synechocystis sp. PCC 6803 is being developed further as a model in these biotechnology- and systems biology-oriented studies. These facts as well as the search for motility-associated genes prompted several re-sequencing studies of Synechocystis sp. 
PCC 6803 substrains, namely of the substrains GT-S,10 PCC-P, PCC-N, GT-I9 and YF.17 However, these studies have not included the widely used GT and motile ‘Moscow’ substrain, which we here suggest to call ‘PCC-M’. Furthermore, thus far no attention has been paid to the possible sequence variations in the seven plasmids, which constitute a total sequence length of 383 486 bp almost 10% of the total coding capacity of Synechocystis sp. PCC 6803. This analysis provides new and reliable sequence data for the Synechocystis sp. PCC 6803 substrain ‘PCC-M’, revealing several differences from the published sequence that can be interpreted as the traces of microevolution during cultivation in the laboratory. 2. Materials and methods 2.1. Origin of strain, isolation of DNA and PCR analysis Synechocystis sp. PCC 6803 substrains ‘Moscow’ here called ‘PCC-M, Kazusa (GT-Kazusa) and Vermaas’ (GT-V) were cultivated by Prof. Annegret Wilde (University of Freiburg, Germany) and maintained as frozen stocks. The ‘PCC-M’ substrain was originally obtained from the laboratory of S. Shestakov (Moscow State University) in 1993 and over the years carefully propagated for motile colonies. The ‘GT-V’ strain originates from the laboratory of W. Vermaas (Arizona State University). Genomic DNA for deep sequencing analysis was isolated from 80 ml cultures harvested on a glass microfiber filter (GF/C, 47 mm i.d. Whatman) by vacuum filtration. The frozen filter was ground in a mixer mill (Dismembrator MM301, Retsch, Germany) and the powder transferred into 1 ml SET buffer on ice (25% (w/v) sucrose, 1 mM EDTA, 50 mM Tris pH 7.5). One-fourth volume of 0.5 M EDTA, 2% SDS and 1.5 mg proteinase K (Sigma) were added for cell lysis at 50°C overnight. Following phenol/chloroform extraction, one volume of 2-propanol (Roth, Germany) was added for precipitating the DNA at room temperature for 30 min. 
The precipitate was washed once in H2O/2-propanol 1:1 and once in 2-propanol, followed by 10 min centrifugation at 10 000 g, 4°C. The pellet was washed with 70% EtOH, dried for 10 min and re-suspended in 50 µl H2O. One microlitre of RNase A (Sigma) was added and the tube incubated at 37°C and 260 rpm overnight. RNase was removed by another round of phenolic extraction and precipitation as described above. The DNA was re-suspended in 75 µl H2O, concentration was measured photometrically and DNA quality checked on a gel (0.8% agarose). Genomic DNA for PCR was isolated from the cell pellet of 1 ml Synechocystis liquid culture. The pellet was washed once with a 1:10 dilution of TE buffer (10 mM Tris HCl pH 8; 1 mM EDTA) and re-suspended in 70 µl of the same buffer. Cells were broken by incubation at 98°C for 10 min. After centrifugation at 14 000 g and 4°C for 5 min, the supernatant was collected and kept on ice. Two microlitres of it were used for PCR. For PCR reactions, Phusion® DNA polymerase (Finnzymes, New England Biolabs) was used according to the manufacturer's instructions. To verify single nucleotide polymorphisms (SNPs) between the different substrains, ∼500 bp fragments containing the SNP position were amplified. PCR products were excised from an agarose gel, purified (illustra GFX PCR DNA and Gel Band Purification Kit, GE Healthcare) and sent for Sanger sequencing to GATC Biotech (Konstanz, Germany). For sequencing of the small plasmids, several PCR reactions were performed to get overlapping sequences and contigs were assembled using the software ContigExpress (Vector NTI Advance 11, Invitrogen). Alignments of the sequences were performed using AlignX (Vector NTI Advance 11, Invitrogen). 2.2. Sequencing methods and mapping Sequencing of genomic DNA was carried out on an Illumina Genome Analyzer IIx system. Prior to sequencing, the DNA was sheared by ultrasonication (Covaris, Woburn, MA, USA), resulting in fragments of 300 bp length on average. 
For these fragments paired-end sequencing according to the manufacturer's protocol was carried out, resulting in 42 143 495 million 101 nt long reads. These reads were analysed with two methods in order to identify SNPs, deletions and insertions. For the first approach, we used the DNA sequence data assembler algorithm MIRA (Mimicking Intelligent Read Assembly)18 to perform an assembly of the reads using the ‘GT-Kazusa’ genome as the reference. In the assembly process, MIRA generates tables of candidate SNPs, insertions and deletions. We verified these results independently by mapping all sequencing reads to the assembled chromosome and plasmid sequences. This was done using segemehl,19 requiring at least 85% accuracy and reporting only the best hit. It should be noted that segemehl reports co-optimal best hits. 3. Results 3.1. Overview Sequencing of the Synechocystis sp. PCC 6803 ‘Moscow’ substrain ‘PCC-M’ by Illumina (Solexa) yielded an average 1100-fold coverage of the chromosome and five of the seven plasmids. The existence of the two remaining plasmids was verified individually by PCR. Following assembly of sequences, mapping to the reference strain sequences and annotation, the obtained genome and plasmid sequences were deposited in the GenBank database with the accession numbers CP003265–CP003272. Altogether, we found 45 differences (36 SNPs and 9 indels >1 bp) between the investigated substrain ‘PCC-M’ and the published sequences of the ‘GT-Kazusa’ chromosome2 and plasmids20 used here as references (Table 1). From these differences, 41 are located in the chromosome and four in the plasmids pSYSA, pSYSM and pCA2.4. For verification, about one-third of these differences were randomly chosen and confirmed independently by PCR and Sanger sequencing of the respective regions in substrain ‘PCC-M’, but no misidentified mutations were found. 
These DNA regions were, in addition, amplified and compared with the sequences from substrains ‘GT-Kazusa’ and ‘GT-V’ for control and comparison, respectively. The GT ‘GT-V’ was chosen for comparison as is widely used for the dissection and analysis of photosynthetic mutants. Fully segregated PSI, PSII and Chl biosynthesis mutants were successfully generated in this genetic background21,22 and some of these mutants could not be obtained in other substrains.23 Table 1. Location and effects of SNPs and indels found in ‘PCC-M’ compared with the nucleotide sequence of ‘GT-Kazusa’ in the database The events are numbered (column #), the type of mutation (M) is indicated as S, substitution, D, deletion or I, insertion, together with the respective start and end positions in the ‘GT-Kazusa’ reference sequence. For each event the respective nucleotide change is indicated on the forward strand, together with the resulting codon modification (Ref. → Mut) and amino acid change, if any. Highlighted in italics are four instances of missing ISY203 copies and in bold all SNPs affecting intergenic spacer regions (IGR). aIndicate errors in the database. The number of differences between ‘PCC-M’ and ‘GT-Kazusa’ are almost twice as many as reported by Tajima et al.10 for the GT (GT-S) ‘Kazusa’ strain, where a total of 22 differences from the published sequence were found.10 All but 3 of those 22 differences were also detected in the ‘PCC-M’ strain studied here. The three unique differences in the ‘GT-S’ and 26 differences between ‘PCC-M’ and ‘GT-Kazusa’ underline the existence of lineage splitting in the Synechocystis substrains. 
Moreover, we found seven SNPs (#5, 13, 15, 16, 27, 32 and 33 in Tables 1 and 2) and one larger indel (#6 in Tables 1 and 2) specifically shared between the ‘PCC-M’ and the ‘PCC-N and PCC-P’ substrains, indicating that ‘PCC-M’ belongs to the ‘PCC’ group of motile substrains.9 ‘PCC-M and PCC-P’ are strains that both exhibit the native positive phototaxis, whereas ‘PCC-N’ strain shows negative phototaxis.24 Table 2. Comparison of SNPs and indels found in the chromosome of ‘PCC-M’ with sequences from other substrains All events are numbered (column #) as in Table 1. The presence of the respective ‘PCC-M’ mutation in the different substrains is indicated by the check marks. aThe deletion of 0.6 kb in the gene slr1753 compared with the reference was also verified here in ‘GT-Kazusa’. 3.2. SNPs in protein-coding genes Of the total of 36 SNPs in ‘PCC-M’ compared with ‘GT-Kazusa’, all except 1 are located in the chromosome. The single base substitution that was found on the plasmid pCA2.4 within the repA gene (#42 in Table 1) seems to be no mutation but an error in the published sequence of ‘GT-Kazusa’, since in our PCR-control experiments, the sequence was identical in the three strains ‘GT-Kazusa’, ‘PCC-M’ and ‘GT-V’. Of the 35 chromosomal SNPs compared with ‘GT-Kazusa’, 5 are silent base substitutions, 14 substitutions lead to amino acid substitutions, in 6 cases a single basepair is deleted and in 2 cases (#23 and #28) one basepair was inserted within an ORF, causing a frameshift mutation. Furthermore, five substitutions, two single basepair insertions and one single basepair deletion were observed in intergenic regions (IGR) of ‘PCC-M’ compared with the reference (Table 1). Seven SNPs are specifically shared between the ‘PCC-M’, ‘PCC-N and PCC-P’ substrains. 
These are in slr1865 (#13), encoding a hypothetical protein, in sll1951 (#15), encoding a haemolysin-like protein, in slr1983 (#16), encoding a two-component hybrid sensor and regulator protein, in slr0222 (#27), encoding the histidine kinase Hik25, a silent mutation in slr0302 (#32), encoding a PAS/PAC and GAF sensors-containing diguanylate cyclase, one missing basepair, leaving the spkA gene intact (#5) and, finally, in ssr1176 (#33), encoding a transposase (Tables 1 and 2). The gene for a cell surface-localized haemolysin-like protein, HlyA (sll1951), reported to function as a barrier against the adsorption of toxic compounds,25,26 is lacking one nucleotide in ‘PCC-M’ compared with the reference (difference #15). In the ‘GT-Kazusa’, ‘GT-V’ as well as the ‘GT-I’ and ‘GT-S’ strains,9 the presence of the additional A leads to the fusion of two ORFs that are separate in ‘PCC-M’, ‘PCC-N’ and ‘PCC-P’ substrains.9 As a result, Sll1951 is 1741 amino acids in the former and only 1437 residues in the latter. In our data, some other previously published mutations8,10 are confirmed. For instance, spkA (sll1574; #5), a regulator of cellular motility via phosphorylation of membrane proteins,11,27 is disrupted by a 1-bp insertion in the non-motile ‘GT-Kazusa’ and ‘GT-V’ strains, whereas it is intact in the motile ‘PCC-M’ strain (Table 1). Similarly, the pilC gene (slr0162/3) required for pili assembly has been reported to carry a frameshift mutation in the ‘GT-Kazusa’ and ‘GT-S’ sequences.8,10,28 We found an intact pilC gene in ‘PCC-M’ (#20), as well as in the ‘GT-V’ substrain. Another SNP (G–A) exists in psaA (slr1834; #9), encoding the photosynthetic P700 apoprotein subunit Ia; however, in accordance with Tajima et al.10, we believe this is an annotation error in the database as we found an A in the respective position in all three strains dealt with in this work (Table 1). Similarly, ycf22 (sll0751; #26) is here suggested to be fused to the downstream reading frame sll0752. 
Indeed, in blastp comparisons, both proteins together match against a single, widely distributed, larger protein of 452 amino acids. This protein possesses a Ttg2C domain (COG1463), which is found in an ABC-type transport system involved in resistance to organic solvents. The acronym ycf stands for hypothetical chloroplast reading frames, meaning proteins conserved in chloroplasts and also cyanobacteria. The 1-bp shorter version, which is splitted into sll0751/sll0752, is a database error in the case of ‘GT-Kazusa’ as well. 3.2.1. SNPs unique to ‘PCC-M’ Six of the 10 SNPs unique to ‘PCC-M’ are located within coding regions and cause amino acid substitutions or alter the length of the respective reading frame. A single basepair transversion in the gene sigF (slr1564; #39 in Table 1) is leading to a M231K substitution within the −35 element DNA-binding region29 of a group 3 sigma factor required for phototactic movement30 and salt-stress response.31 This SNP cannot lead to impaired motility as ‘PCC-M’ is motile but it might influence the DNA–protein interaction of SigF because positively charged residues such as lysine located in this part of the σ4.2 region can directly interact with DNA.29 Another transversion, in argB (slr1898; #8 in Table 1), leads to an S2N amino acid substitution in N-acetylglutamate kinase, the enzyme performing the first committed step of Arg biosynthesis. 
Transitions in sll1359 and slr1609 (#11 and #3 in Table 1) result in an N–K substitution at a very conserved position within a predicted cytochrome and an L608S (L548S) substitution in the long-chain acyl-CoA-synthetase Slr1609 that has been found crucial for fatty acid activation and the biosynthesis of alkanes.32 Interestingly, an unrelated SNP exists at position 488 923 within the slr1609 coding sequence in a strain ‘YF’, leading to a G546L (G486L) substitution.17 It should be noted that the slr1609 reading frame has been annotated 60 codons shorter (636 instead of 696 amino acids) during recent re-sequencing analyses,9,10 compared with the original annotation of ‘GT-Kazusa’ (numbers in brackets). The shorter Slr1609 protein of 636 amino acids is also consistent with the mapped start site of transcription at position 487 352,33 located 115 nt upstream of the revised start codon. A transition in slr0753 (#41 in Table 1) leads to a P113L substitution in a putative chloride efflux transport protein involved in maintaining the chloride ion concentration homoeostasis as required for a functional photosystem II.34 A single basepair deletion in sll1496 (#38 in Table 1), encoding mannose-1-phosphate guanyltransferase, causes a frameshift and premature stop of the gene in ‘PCC-M’. The resulting protein is with 515 instead of 643 amino acids severely truncated and may be rendered function-less. 3.3. Point mutations in IGRs Compared with the reference, eight SNPs are located in IGRs, three of these (#7, 24 and 36) are ‘PCC-M’ specific. One of these (#36 in Table 1) SNPs is predicted to affect one of the recently reported cis-antisense RNAs.33 The additional A between positions 3194022 and 3194023 is located in the IGR between genes slr0533 and slr0534, encoding histidine kinase 10 (Hik10) and the soluble lytic transglycosylase Slt. On the reverse strand, the additional T falls within the predicted −10 element of the slr0534_as3 promoter. 
Instead of the high-scoring CATAAT,33 the motif is changed to ATTAAT. Hence, a modulation of slr0534_as3 expression compared with the reference is possible. In contrast to its designation, this cis-antisense RNA overlaps the 3′ end of genes slr0533 and hik10 (due to an error in the annotation used as the reference). In microarray analyses, slr0534_as3 of strain ‘PCC-M’ was found to be moderately to highly expressed under four tested conditions. Compared with the accumulation of the hik10 mRNA, it appeared even stronger.33 A function for Hik10 has been found in the perception of salt stress or transduction of the signal.35 The slr0534_as3 transcript may play a silencing role with regard to hik10 under non-inducing conditions. Mutation of its promoter element may hence cause a physiological effect in the salt stress response. Two other SNPs (at positions 831 647 and 2 400 722; #7 and #24 in Table 1) could have an impact on the promoter strength or the regulation of the genes infA and glcP. For glcP, the initiation site of transcription was mapped to position 2 400 66633 and for infA to position 831 635 (unpublished). Thus, these two SNPs are located 12 and 56 nt upstream of the respective initiation site of transcription. In the case of the infA promoter, the transition replaces a nucleotide within the putative −10 element, changing it from TGTGAT to TATGAT, a much more typical motif for a −10 element in Synechocystis.33 The mutation 56 nt upstream of the initiation site of transcription of glcP might be functionally relevant as well. The gene product, a glucose transporter, is directly relevant for the physiological ability to use glucose; its gene expression is affected by mutation of the gene for the AbrB-type transcription factor Sll0822.36 The region at position −56 might well be part of the recognized sequence. 3.4. 
Larger indels and plasmids In addition to this relatively large number of SNPs, only seven larger deletions were found on the chromosome and two plasmids. Compared with the reference, a deletion of 0.6 kb exists in the gene slr1753 (#4 in Table 1), which encodes, according to our data, a giant protein comprising 1549 amino acids that probably is transported to the cell surface. However, we found this deletion in our verification also in ‘GT-Kazusa’ and ‘GT-V’. Moreover, the deleted/inserted region consists of long series of DNA repeats (Fig. 1), an evidence for a possible assembly or annotation error in the original sequence analysis. Figure 1. Alignment of the possible indel region in gene slr1753. The sequence obtained in the verification experiment is aligned with that of the ‘GT-Kazusa’ reference. Two types of DNA repeats are indicated by the filled and non-filled lozenges. Given the very scarce available information concerning biological functions of the plasmids in Synechocystis sp. PCC 6803, it was interesting that all seven plasmids were detected during our analysis. Two, pCC5.2 and pCB2.4, were initially not found. However, as they were amplified easily by PCR, we re-inspected the unmapped sequencing reads, but still could not detect a single read matching these plasmids. This observation may relate to a lower copy number of these compared with the other plasmids, but this was not tested in the current study. Analysing the plasmid sequences, we observed a remarkable genetic stability. In addition to a single-base substitution in the plasmid pCA2.4 that might rather constitute an error in the reference sequence37 (see above) and a missing mobile element on the plasmid pSYSM, two mutations were observed, both in the plasmid pSYSA. Two major mutations affect the clustered regularly interspaced short palindrome repeats-CRISPR-associated proteins (CRISPR-Cas) system, located on the plasmid pSYSA. 
CRISPR-Cas systems provide in many archaea and bacteria an adaptive immunity against invading DNA.38–44 The plasmid pSYSA encodes the three independent systems CRISPR1, CRISPR2 and CRISPR3. A 2399-bp deletion encompassing the spacer-repeat regions 15–47 of CRISPR1 was detected in ‘PCC-M’ (#43), which also eliminated the relatively short genes ssr7018, ssl7019, ssl7020 and ssl7021, annotated within the spacer-repeat array of CRISPR1. However, the theoretical protein sequences of these gene products show no conservation at all and might not constitute real genes. Nevertheless, the deletion of spacer-repeat regions 15–47 of CRISPR1 is severe, since compared with the reference, it has eliminated two-thirds, 33 of its 49 spacer-repeat units. The sequence analysis suggests that the recombination events leading to the deletion of spacer-repeat regions 15–47 must have occurred within the direct repeats. Thus, this recombination is in agreement with previous observations that the downstream ends of the repeat clusters are conserved such that deletions and recombination events occur internally.45 A very different type of deletion was noticed for the CRISPR2 system located on the same plasmid. In this case, 159 bp were deleted (event #44 in Table 1). These 159 deleted bases correspond to positions 71 499–71 657 in the reference. The deletion encompasses two repeats including the spacer 41 in between. It is very surprising that the recombination did not occur within the repeat sections but in the adjacent spacers 40 and 42, thus generating a new ‘hybrid’ spacer 40 at positions 69 082–69 111 in the pSYSA plasmid of ‘PCC-M’ (Fig. 2). As a result, spacers 40, 41 and 42 of the original sequence are missing and became replaced by this hybrid sequence. 
The vast majority of described deletions in the CRISPR system occur between the direct repeats.45 Non-homologous recombination between two different spacers is rare, the deletion observed here in CRISPR2 of the plasmid pSYSA is generating additional sequence diversity in the CRISPR system. Due to the two deletions in the plasmid pSYSA, we determined its total length as 100 749 bp, compared with 103 307 bp for the reference. Figure 2. Non-homologous recombination in the plasmid pSYSA affecting spacers 40, 41 and 42 of CRISPR2. As a result of the 159-bp deletion in ‘PCC-M’ compared with ‘GT-Kazusa’, a novel hybrid spacer 40 was generated. The direct repeats are presented as squares and the nucleotide positions in the ‘GT-Kazusa’ are given according to the GenBank file NC_005230. 3.5. Mobile elements As can be seen in Tables 1 and 2 (differences #12, 17, 40 and 45), the ‘PCC-M’ substrain lacks four insertion elements of the ISY203 type present in ‘GT-Kazusa’.7 These elements are ISY203b, e and g on the chromosome and ISY203j on the plasmid pSYSM. These four indels have the exact same size of 1183 bp, only one is 1185 bp. In the ‘GT-S’ substrain re-sequenced by Tajima et al.10 one of these four elements, ISY203e, is already present, placing this strain (in accordance with Ikeuchi and Tabata)8 before ‘GT-Kazusa’ in the strain history. The absence of ISY203b, e and g in ‘PCC-M’ is further shared with the strains ‘GT-I’, ‘PCC-N’ and ‘PCC-P’,9 whereas no statement is possible with regard to the possible presence of ISY203j on the plasmid pSYSM in the latter. With respect to the described mobile elements, ‘PCC-M’ appears as one of the least-derived substrains. 4. Discussion 4.1. Strain history ‘PCC-M’ shows sequence differences in several genes compared with the reference sequence of ‘GT-Kazusa’ and also to the recently sequenced ‘GT-S’ strain. 
Kanesaki et al.9 concluded that 15 differences between the resequenced strains and the published GT-Kazusa sequence were annotation errors in the latter due to sequencing artefacts, a list to which we add two more putative errors in the database, differences #4 and #42 in Table 1. According to the proposed strain history in Ikeuchi and Tabata,8 the early division of Synechocystis sp. PCC 6803 into two branches occurred due to an insertion in spkA. Thus, our data suggest that the motile ‘PCC-M’ strain belongs to the motile PCC 6803 branch, whereas the non-motile ‘GT-Kazusa’, ‘GT-S’ and ‘GT-V’ strains are more closely related to each other and belong to the ATCC 27 184 branch. However, the 1-bp insertion in the pilC leading to ‘GT-Kazusa’ as described in the proposed strain history8 is not present in either ‘GT-S’ or ‘GT-V’, characterizing ‘GT-Kazusa’ as a more derived substrain. That ‘PCC-M’ belongs to the motile PCC 6803 branch is further reinforced by our finding of six SNPs specifically shared between the ‘PCC-M’ and the ‘PCC-N and PCC-P’ substrains (Tables 1 and 2).9 These six SNPs are in slr1865, in sll1951, encoding a haemolysin-like protein, in ssr1176, encoding a transposase and, interestingly, in genes encoding sensor and/or regulatory proteins (slr1983, slr0222 and slr0302) (Tables 1 and 2) and must already have been present in the progenitor strain to ‘PCC-M’, ‘PCC-N’ and ‘PCC-P’. Additional support comes from the analysis of two larger indels (#2 and #6 in Table 1). The preceding paper, Kanesaki et al.,9 described difficulties in finding indels between direct repeat sequences such as slr1084 and slr2031 by short read type re-sequencing data. Therefore, these two regions were analysed by PCR and Sanger sequencing in addition to the re-sequencing analysis. Indeed, the finding of indels between direct repeat sequences in genes slr1084 and slr2031 turned out as not been straightforward in our analysis as well. 
Compared with the reference, we found in both cases the additional sequences of 102 and 154 bp to be present in ‘PCC-M’. This result is relevant for lineage relationships among substrains. The additional 102 bp in gene slr1084 are shared between ‘PCC-M’ and the other substrains ‘PCC-P’, ‘PCC-N’ and ‘GT-I’. Therefore, this must be a deletion in the lineage leading to GT-Kazusa and GT-S. In contrast, the additional 154 bp within and upstream of gene slr2031 are shared between ‘PCC-M’, ‘PCC-P’ and ‘PCC-N’ and are absent from all studied GT substrains. These 154 bp comprise the conserved start codon of slr2031 and extend the gene by 29 codons compared with ‘GT-Kazusa’. Hence, the lack of these 154 bp in GT strains indicate a functionally adverse deletion there. In fact, the 154-bp deletion in GT substrains was noticed before,46 as well as the activity of slr2031 in the original Synechocystis sp. PCC 6803 substrains.47 From these considerations, the tree shown in Fig. 3 can be derived. In this tree, ‘GT-Kazusa’ is displayed as the strain with the longest evolutionary distance from the original isolate, whereas the ‘PCC-M’ substrain belongs to the ‘PCC’ group of substrains and is probably close to the original characteristics. All strains belonging to the ‘PCC’ group of substrains exhibit twitching motility as was shown also for the original PCC strain deposited in the Pasteur Culture Collection6 with variations in the motility behaviour.48,49 Since ‘PCC-M’ shows motility and is tolerant to glucose, it appears physiologically as a sort of intermediate between the two major branches: the motile and GT branches, consistent with its characterization as being close to the original characteristics. Figure 3. Visualization of phylogenetic relationships between various strains of Synechocystis sp. PCC 6803. The occurrence of the identified SNPs and other known events are indicated along the branches. 
The eight events separating the ‘GT’ and ‘PCC’ strains from each other are given at the branch point where these two lineages split or on the respective branches where they occurred. Putative insertions and deletions are labelled ‘Ins’. and ‘Del’., respectively. 4.2. Re-sequencing studies of Synechocystis sp. PCC 6803 The analysis of genome sequences of cyanobacteria has had a large impact on photosynthesis, ecology and biotechnology research.50 The present re-sequencing project delivers the new and complete sequence of the Synechocystis sp. PCC 6803 ‘PCC-M’, a substrain used in many laboratories and in several aspects close to the original isolate. Altogether, there are now chromosomal sequences for seven substrains of Synechocystis sp. PCC 6803 available: ‘PCC-M’ (this study); ‘PCC-P’ (positive phototaxis) and ‘PCC-N’ (negative phototaxis), both based on single colonies isolated from the PCC strain and designated according to their direction of phototactic movement;24 ‘GT-I’, the standard strain in Dr Ikeuchi's group;8 ‘YF’17 and ‘GT-S’,10 a current derivative of the original stock of Synechocystis sp. PCC 6803 from which the chromosomal reference sequence for ‘GT-Kazusa’ was determined in 19962 and for the large plasmids in 2003,20 whereas the three small plasmids had been sequenced already before.37,51,52 4.3. Mutations potentially linked to phenotype It is likely that most of the identified differences between the sequenced substrains result from distinct differences in the cultivation conditions in the different laboratories that have selected for fixing one or the other mutation. That also implies that the majority of identified mutations are not silent but linked to a certain effect. Indeed, most mutations in coding regions are not silent as might be expected but lead to frameshifts, amino acid substitutions or the truncation of reading frames. Similarly, SNPs in non-coding regions are probably biologically meaningful, too. 
This idea received support here by linking three ‘PCC-M’-specific SNPs in IGRs to the promoter regions controlling the expression of two protein-coding genes and one antisense RNA. For all these reasons, it appears likely that several of the mutations specific to ‘PCC-M’ or shared with ‘PCC-P’ and ‘PCC-N’ may be related to the known phenotypes of these strains. For example, the truncation of sll1951 (haemolysin) and possible truncation of slr1753 (surface protein) may contribute to a stress-induced clumping phenotype. Several other mutations might cause alterations in glucose tolerance or phototactic behaviour of these substrains. Differences at other loci may affect the phage resistance, stress response or functions in the primary metabolism, potentially relevant for the synthesis of alkanes or the N and C metabolism. The absence of ISY203g in the sll1473–5 regions in PCC substrains leads to an intact photoreceptor that regulates the expression of an alternative phycobilisome linker gene.53 Regarding phenotypic differences among motile PCC substrains, it might be noteworthy that ‘PCC-M’, despite its general ability to be motile, is not phototactic towards blue light (see direct comparison of strains in Fig. 1 of Fiedler et al.48). Here, the SNP #39 in the sigF gene, known to be involved in the control of phototactic movement,30 might be considered, as the resulting M231K substitution could influence the DNA–protein interaction of this group 3 sigma factor in a very subtle way. Clearly, the subtle differences in genome sequences have to be considered when choosing a particular substrain for certain experiments and when comparing phenotypes of mutant lines from different laboratories with the wild-type strain. Information on the re-sequenced genome and plasmid sequences including precisely annotated SNPs can be found in the eight sequence files available from GenBank under the accession numbers CP003265–CP003272. 
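As a small illustration of how the accession range quoted above maps onto the eight sequence files, the following sketch expands the range into individual identifiers. The helper name `expand_accessions` is hypothetical, and the text does not state which accession belongs to which replicon, so no such assignment is made here.

```python
def expand_accessions(first: str, last: str) -> list[str]:
    """Expand an inclusive accession range such as CP003265-CP003272."""
    prefix = first.rstrip("0123456789")       # alphabetic part, e.g. "CP"
    if not last.startswith(prefix):
        raise ValueError("both accessions must share the same prefix")
    width = len(first) - len(prefix)          # keep zero padding of the numeric part
    start, end = int(first[len(prefix):]), int(last[len(prefix):])
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


accessions = expand_accessions("CP003265", "CP003272")

# One chromosome plus seven plasmids should give the eight files mentioned in the text.
assert len(accessions) == 1 + 7
print(accessions)
```

The count check only confirms that the range is consistent with one chromosome and seven plasmids; the individual records would still have to be inspected to learn which replicon each one holds.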
Funding The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7-ENERGY-2010-1) under grant agreement no. [256808] and from the German Research Foundation (DFG) project FOR1680 ‘Unravelling the Prokaryotic Immune System’ (grant HE 2544/8-1) to WRH and from grant AL 892/1-4 to SAB.
sec	1. Introduction With currently >4000 publications available from PubMedCentral alone, ‘Synechocystis’ is the most widely used photoautotrophic prokaryotic model organism. Synechocystis sp. PCC 6803 is a unicellular cyanobacterium that was isolated from a freshwater pond in Oakland, California.1 The high popularity of Synechocystis sp. PCC 6803 stems from the two facts that it was the first phototrophic and the third organism overall, for which a complete genome sequence was determined,2 and that it easily takes up exogenous DNA and integrates it into its chromosome by homologous recombination.3–5 Synechocystis sp. PCC6803 is known to occur in several distinct substrains, all going back to the same isolate deposited in the Pasteur Culture Collection.6 Indeed, several studies reported the differences between the genome sequence of Synechocystis sp. PCC 6803 published in 1996 (called here the ‘GT-Kazusa’ substrain) and the actual sequence found in different laboratories.7–10 A strain history has been proposed by Ikeuchi and Tabata8 with an early branching into the motile PCC strain and the non-motile ATCC 27184 strain. The latter lost motility due to a 1-bp insertion in the spkA gene coding for a eukaryotic-type Ser/Thr protein kinase11 and represents the origin of the glucose-tolerant (GT) strains5 to which also the ‘GT-Kazusa’ substrain belongs. For decades, Synechocystis sp. PCC 6803 has served as a simple model in photosynthesis research and to solve fundamental questions in microbial and plant physiology. More recently, cyanobacteria are increasingly being recognized as a promising resource for the production of biofuels such as hydrogen,12 ethanol,13 isobutyraldehyde and isobutanol,14 ethylene15 and alkanes.16 Synechocystis sp. PCC 6803 is being developed further as a model in these biotechnology- and systems biology-oriented studies. These facts as well as the search for motility-associated genes prompted several re-sequencing studies of Synechocystis sp. 
PCC 6803 substrains, namely of the substrains GT-S,10 PCC-P, PCC-N, GT-I9 and YF.17 However, these studies have not included the widely used GT and motile ‘Moscow’ substrain, which we here suggest to call ‘PCC-M’. Furthermore, thus far no attention has been paid to the possible sequence variations in the seven plasmids, which constitute a total sequence length of 383 486 bp, almost 10% of the total coding capacity of Synechocystis sp. PCC 6803. This analysis provides new and reliable sequence data for the Synechocystis sp. PCC 6803 substrain ‘PCC-M’, revealing several differences from the published sequence that can be interpreted as the traces of microevolution during cultivation in the laboratory.
title	Introduction
p	With currently >4000 publications available from PubMedCentral alone, ‘Synechocystis’ is the most widely used photoautotrophic prokaryotic model organism. Synechocystis sp. PCC 6803 is a unicellular cyanobacterium that was isolated from a freshwater pond in Oakland, California.1 The high popularity of Synechocystis sp. PCC 6803 stems from the two facts that it was the first phototrophic and the third organism overall, for which a complete genome sequence was determined,2 and that it easily takes up exogenous DNA and integrates it into its chromosome by homologous recombination.3–5
p	Synechocystis sp. PCC 6803 is known to occur in several distinct substrains, all going back to the same isolate deposited in the Pasteur Culture Collection.6 Indeed, several studies reported differences between the genome sequence of Synechocystis sp. PCC 6803 published in 1996 (called here the ‘GT-Kazusa’ substrain) and the actual sequence found in different laboratories.7–10 A strain history has been proposed by Ikeuchi and Tabata8 with an early branching into the motile PCC strain and the non-motile ATCC 27184 strain. The latter lost motility due to a 1-bp insertion in the spkA gene coding for a eukaryotic-type Ser/Thr protein kinase11 and represents the origin of the glucose-tolerant (GT) strains,5 to which the ‘GT-Kazusa’ substrain also belongs.
p	For decades, Synechocystis sp. PCC 6803 has served as a simple model in photosynthesis research and to solve fundamental questions in microbial and plant physiology. More recently, cyanobacteria are increasingly being recognized as a promising resource for the production of biofuels such as hydrogen,12 ethanol,13 isobutyraldehyde and isobutanol,14 ethylene15 and alkanes.16 Synechocystis sp. PCC 6803 is being developed further as a model in these biotechnology- and systems biology-oriented studies. These facts as well as the search for motility-associated genes prompted several re-sequencing studies of Synechocystis sp. PCC 6803 substrains, namely of the substrains GT-S,10 PCC-P, PCC-N, GT-I9 and YF.17 However, these studies have not included the widely used GT and motile ‘Moscow’ substrain, which we here suggest to call ‘PCC-M’. Furthermore, thus far no attention has been paid to the possible sequence variations in the seven plasmids, which constitute a total sequence length of 383 486 bp, almost 10% of the total coding capacity of Synechocystis sp. PCC 6803. This analysis provides new and reliable sequence data for the Synechocystis sp. PCC 6803 substrain ‘PCC-M’, revealing several differences from the published sequence that can be interpreted as the traces of microevolution during cultivation in the laboratory.
sec	2. Materials and methods 2.1. Origin of strain, isolation of DNA and PCR analysis Synechocystis sp. PCC 6803 substrains ‘Moscow’ (here called ‘PCC-M’), ‘Kazusa’ (GT-Kazusa) and ‘Vermaas’ (GT-V) were cultivated by Prof. Annegret Wilde (University of Freiburg, Germany) and maintained as frozen stocks. The ‘PCC-M’ substrain was originally obtained from the laboratory of S. Shestakov (Moscow State University) in 1993 and over the years carefully propagated for motile colonies. The ‘GT-V’ strain originates from the laboratory of W. Vermaas (Arizona State University). Genomic DNA for deep sequencing analysis was isolated from 80 ml cultures harvested on a glass microfiber filter (GF/C, 47 mm i.d., Whatman) by vacuum filtration. The frozen filter was ground in a mixer mill (Dismembrator MM301, Retsch, Germany) and the powder transferred into 1 ml SET buffer on ice (25% (w/v) sucrose, 1 mM EDTA, 50 mM Tris pH 7.5). One-fourth volume of 0.5 M EDTA, 2% SDS and 1.5 mg proteinase K (Sigma) were added for cell lysis at 50°C overnight. Following phenol/chloroform extraction, one volume of 2-propanol (Roth, Germany) was added for precipitating the DNA at room temperature for 30 min. The precipitate was washed once in H2O/2-propanol 1:1 and once in 2-propanol, followed by 10 min centrifugation at 10 000 g, 4°C. The pellet was washed with 70% EtOH, dried for 10 min and re-suspended in 50 µl H2O. One microlitre of RNase A (Sigma) was added and the tube incubated at 37°C and 260 rpm overnight. RNase was removed by another round of phenolic extraction and precipitation as described above. The DNA was re-suspended in 75 µl H2O, concentration was measured photometrically and DNA quality checked on a gel (0.8% agarose). Genomic DNA for PCR was isolated from the cell pellet of 1 ml Synechocystis liquid culture. The pellet was washed once with a 1:10 dilution of TE buffer (10 mM Tris HCl pH 8; 1 mM EDTA) and re-suspended in 70 µl of the same buffer. 
Cells were broken by incubation at 98°C for 10 min. After centrifugation at 14 000 g and 4°C for 5 min, the supernatant was collected and kept on ice. Two microlitres of it were used for PCR. For PCR reactions, Phusion® DNA polymerase (Finnzymes, New England Biolabs) was used according to the manufacturer's instructions. To verify single nucleotide polymorphisms (SNPs) between the different substrains, ∼500 bp fragments containing the SNP position were amplified. PCR products were excised from an agarose gel, purified (illustra GFX PCR DNA and Gel Band Purification Kit, GE Healthcare) and sent for Sanger sequencing to GATC Biotech (Konstanz, Germany). For sequencing of the small plasmids, several PCR reactions were performed to get overlapping sequences and contigs were assembled using the software ContigExpress (Vector NTI Advance 11, Invitrogen). Alignments of the sequences were performed using AlignX (Vector NTI Advance 11, Invitrogen). 2.2. Sequencing methods and mapping Sequencing of genomic DNA was carried out on an Illumina Genome Analyzer IIx system. Prior to sequencing, the DNA was sheared by ultrasonication (Covaris, Woburn, MA, USA), resulting in fragments of 300 bp length on average. Paired-end sequencing of these fragments was carried out according to the manufacturer's protocol, resulting in 42 143 495 reads of 101 nt length. These reads were analysed with two methods in order to identify SNPs, deletions and insertions. For the first approach, we used the DNA sequence data assembler algorithm MIRA (Mimicking Intelligent Read Assembly)18 to perform an assembly of the reads using the ‘GT-Kazusa’ genome as the reference. In the assembly process, MIRA generates tables of candidate SNPs, insertions and deletions. We verified these results independently by mapping all sequencing reads to the assembled chromosome and plasmid sequences. This was done using segemehl,19 requiring at least 85% accuracy and reporting only the best hit. 
It should be noted that segemehl reports co-optimal best hits.
title	Materials and methods
sec	2.1. Origin of strain, isolation of DNA and PCR analysis Synechocystis sp. PCC 6803 substrains ‘Moscow’ (here called ‘PCC-M’), ‘Kazusa’ (GT-Kazusa) and ‘Vermaas’ (GT-V) were cultivated by Prof. Annegret Wilde (University of Freiburg, Germany) and maintained as frozen stocks. The ‘PCC-M’ substrain was originally obtained from the laboratory of S. Shestakov (Moscow State University) in 1993 and over the years carefully propagated for motile colonies. The ‘GT-V’ strain originates from the laboratory of W. Vermaas (Arizona State University). Genomic DNA for deep sequencing analysis was isolated from 80 ml cultures harvested on a glass microfiber filter (GF/C, 47 mm i.d., Whatman) by vacuum filtration. The frozen filter was ground in a mixer mill (Dismembrator MM301, Retsch, Germany) and the powder transferred into 1 ml SET buffer on ice (25% (w/v) sucrose, 1 mM EDTA, 50 mM Tris pH 7.5). One-fourth volume of 0.5 M EDTA, 2% SDS and 1.5 mg proteinase K (Sigma) were added for cell lysis at 50°C overnight. Following phenol/chloroform extraction, one volume of 2-propanol (Roth, Germany) was added for precipitating the DNA at room temperature for 30 min. The precipitate was washed once in H2O/2-propanol 1:1 and once in 2-propanol, followed by 10 min centrifugation at 10 000 g, 4°C. The pellet was washed with 70% EtOH, dried for 10 min and re-suspended in 50 µl H2O. One microlitre of RNase A (Sigma) was added and the tube incubated at 37°C and 260 rpm overnight. RNase was removed by another round of phenolic extraction and precipitation as described above. The DNA was re-suspended in 75 µl H2O, concentration was measured photometrically and DNA quality checked on a gel (0.8% agarose). Genomic DNA for PCR was isolated from the cell pellet of 1 ml Synechocystis liquid culture. The pellet was washed once with a 1:10 dilution of TE buffer (10 mM Tris HCl pH 8; 1 mM EDTA) and re-suspended in 70 µl of the same buffer. Cells were broken by incubation at 98°C for 10 min. 
After centrifugation at 14 000 g and 4°C for 5 min, the supernatant was collected and kept on ice. Two microlitres of it were used for PCR. For PCR reactions, Phusion® DNA polymerase (Finnzymes, New England Biolabs) was used according to the manufacturer's instructions. To verify single nucleotide polymorphisms (SNPs) between the different substrains, ∼500 bp fragments containing the SNP position were amplified. PCR products were excised from an agarose gel, purified (illustra GFX PCR DNA and Gel Band Purification Kit, GE Healthcare) and sent for Sanger sequencing to GATC Biotech (Konstanz, Germany). For sequencing of the small plasmids, several PCR reactions were performed to get overlapping sequences and contigs were assembled using the software ContigExpress (Vector NTI Advance 11, Invitrogen). Alignments of the sequences were performed using AlignX (Vector NTI Advance 11, Invitrogen).
title	Origin of strain, isolation of DNA and PCR analysis
p	Synechocystis sp. PCC 6803 substrains ‘Moscow’ (here called ‘PCC-M’), ‘Kazusa’ (GT-Kazusa) and ‘Vermaas’ (GT-V) were cultivated by Prof. Annegret Wilde (University of Freiburg, Germany) and maintained as frozen stocks. The ‘PCC-M’ substrain was originally obtained from the laboratory of S. Shestakov (Moscow State University) in 1993 and over the years carefully propagated for motile colonies. The ‘GT-V’ strain originates from the laboratory of W. Vermaas (Arizona State University). Genomic DNA for deep sequencing analysis was isolated from 80 ml cultures harvested on a glass microfiber filter (GF/C, 47 mm i.d., Whatman) by vacuum filtration. The frozen filter was ground in a mixer mill (Dismembrator MM301, Retsch, Germany) and the powder transferred into 1 ml SET buffer on ice (25% (w/v) sucrose, 1 mM EDTA, 50 mM Tris pH 7.5). One-fourth volume of 0.5 M EDTA, 2% SDS and 1.5 mg proteinase K (Sigma) were added for cell lysis at 50°C overnight. Following phenol/chloroform extraction, one volume of 2-propanol (Roth, Germany) was added for precipitating the DNA at room temperature for 30 min. The precipitate was washed once in H2O/2-propanol 1:1 and once in 2-propanol, followed by 10 min centrifugation at 10 000 g, 4°C. The pellet was washed with 70% EtOH, dried for 10 min and re-suspended in 50 µl H2O. One microlitre of RNase A (Sigma) was added and the tube incubated at 37°C and 260 rpm overnight. RNase was removed by another round of phenolic extraction and precipitation as described above. The DNA was re-suspended in 75 µl H2O, concentration was measured photometrically and DNA quality checked on a gel (0.8% agarose).
p	Genomic DNA for PCR was isolated from the cell pellet of 1 ml Synechocystis liquid culture. The pellet was washed once with a 1:10 dilution of TE buffer (10 mM Tris HCl pH 8; 1 mM EDTA) and re-suspended in 70 µl of the same buffer. Cells were broken by incubation at 98°C for 10 min. After centrifugation at 14 000 g and 4°C for 5 min, the supernatant was collected and kept on ice. Two microlitres of it were used for PCR. For PCR reactions, Phusion® DNA polymerase (Finnzymes, New England Biolabs) was used according to the manufacturer's instructions. To verify single nucleotide polymorphisms (SNPs) between the different substrains, ∼500 bp fragments containing the SNP position were amplified. PCR products were excised from an agarose gel, purified (illustra GFX PCR DNA and Gel Band Purification Kit, GE Healthcare) and sent for Sanger sequencing to GATC Biotech (Konstanz, Germany). For sequencing of the small plasmids, several PCR reactions were performed to get overlapping sequences and contigs were assembled using the software ContigExpress (Vector NTI Advance 11, Invitrogen). Alignments of the sequences were performed using AlignX (Vector NTI Advance 11, Invitrogen).
sec	2.2. Sequencing methods and mapping Sequencing of genomic DNA was carried out on an Illumina Genome Analyzer IIx system. Prior to sequencing, the DNA was sheared by ultrasonication (Covaris, Woburn, MA, USA), resulting in fragments of 300 bp length on average. Paired-end sequencing of these fragments was carried out according to the manufacturer's protocol, resulting in 42 143 495 reads of 101 nt length. These reads were analysed with two methods in order to identify SNPs, deletions and insertions. For the first approach, we used the DNA sequence data assembler algorithm MIRA (Mimicking Intelligent Read Assembly)18 to perform an assembly of the reads using the ‘GT-Kazusa’ genome as the reference. In the assembly process, MIRA generates tables of candidate SNPs, insertions and deletions. We verified these results independently by mapping all sequencing reads to the assembled chromosome and plasmid sequences. This was done using segemehl,19 requiring at least 85% accuracy and reporting only the best hit. It should be noted that segemehl reports co-optimal best hits.
title	Sequencing methods and mapping
p	Sequencing of genomic DNA was carried out on an Illumina Genome Analyzer IIx system. Prior to sequencing, the DNA was sheared by ultrasonication (Covaris, Woburn, MA, USA), resulting in fragments of 300 bp length on average. Paired-end sequencing of these fragments was carried out according to the manufacturer's protocol, resulting in 42 143 495 reads of 101 nt length. These reads were analysed with two methods in order to identify SNPs, deletions and insertions. For the first approach, we used the DNA sequence data assembler algorithm MIRA (Mimicking Intelligent Read Assembly)18 to perform an assembly of the reads using the ‘GT-Kazusa’ genome as the reference. In the assembly process, MIRA generates tables of candidate SNPs, insertions and deletions. We verified these results independently by mapping all sequencing reads to the assembled chromosome and plasmid sequences. This was done using segemehl,19 requiring at least 85% accuracy and reporting only the best hit. It should be noted that segemehl reports co-optimal best hits.
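The read count and read length given above allow a rough plausibility check of the sequencing depth. The sketch below divides the total number of sequenced bases by the reference length. The chromosome length of 3 573 470 bp is an assumption not stated in this section (it is the commonly cited size of the ‘GT-Kazusa’ chromosome), whereas the combined plasmid length of 383 486 bp is taken from the Introduction. The result is an upper bound, since it treats every read as mapped.

```python
READ_COUNT = 42_143_495      # reads reported in the text
READ_LENGTH_NT = 101         # read length reported in the text
PLASMIDS_BP = 383_486        # combined length of the seven plasmids (Introduction)
CHROMOSOME_BP = 3_573_470    # ASSUMPTION: GT-Kazusa chromosome length, not given here


def mean_depth(read_count: int, read_length: int, reference_bp: int) -> float:
    """Expected mean depth if every read mapped exactly once to the reference."""
    return read_count * read_length / reference_bp


total_bases = READ_COUNT * READ_LENGTH_NT
depth = mean_depth(READ_COUNT, READ_LENGTH_NT, CHROMOSOME_BP + PLASMIDS_BP)

print(f"{total_bases:,} bases sequenced, roughly {depth:.0f}-fold mean depth")
```

Under these assumptions the estimate falls in the same range as the approximately 1100-fold average coverage reported in the Results. Differences are to be expected because plasmid copy numbers vary and two plasmids yielded no reads.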
sec	3. Results 3.1. Overview Sequencing of the Synechocystis sp. PCC 6803 ‘Moscow’ substrain ‘PCC-M’ by Illumina (Solexa) yielded an average 1100-fold coverage of the chromosome and five of the seven plasmids. The existence of the two remaining plasmids was verified individually by PCR. Following assembly of sequences, mapping to the reference strain sequences and annotation, the obtained genome and plasmid sequences were deposited in the GenBank database with the accession numbers CP003265–CP003272. Altogether, we found 45 differences (36 SNPs and 9 indels >1 bp) between the investigated substrain ‘PCC-M’ and the published sequences of the ‘GT-Kazusa’ chromosome2 and plasmids20 used here as references (Table 1). Of these differences, 41 are located in the chromosome and four in the plasmids pSYSA, pSYSM and pCA2.4. For verification, about one-third of these differences were randomly chosen and confirmed independently by PCR and Sanger sequencing of the respective regions in substrain ‘PCC-M’, and no misidentified mutations were found. These DNA regions were, in addition, amplified and compared with the sequences from substrains ‘GT-Kazusa’ and ‘GT-V’ for control and comparison, respectively. The GT substrain ‘GT-V’ was chosen for comparison as it is widely used for the dissection and analysis of photosynthetic mutants. Fully segregated PSI, PSII and Chl biosynthesis mutants were successfully generated in this genetic background21,22 and some of these mutants could not be obtained in other substrains.23 Table 1. Location and effects of SNPs and indels found in ‘PCC-M’ compared with the nucleotide sequence of ‘GT-Kazusa’ in the database The events are numbered (column #), the type of mutation (M) is indicated as S, substitution, D, deletion or I, insertion, together with the respective start and end positions in the ‘GT-Kazusa’ reference sequence. 
For each event the respective nucleotide change is indicated on the forward strand, together with the resulting codon modification (Ref. → Mut) and amino acid change, if any. Highlighted in italics are four instances of missing ISY203 copies and in bold all SNPs affecting intergenic spacer regions (IGR). aIndicate errors in the database. The number of differences between ‘PCC-M’ and ‘GT-Kazusa’ is almost twice that reported by Tajima et al.10 for the GT (GT-S) ‘Kazusa’ strain, where a total of 22 differences from the published sequence were found.10 All but 3 of those 22 differences were also detected in the ‘PCC-M’ strain studied here. The three unique differences in the ‘GT-S’ and 26 differences between ‘PCC-M’ and ‘GT-Kazusa’ underline the existence of lineage splitting in the Synechocystis substrains. Moreover, we found seven SNPs (#5, 13, 15, 16, 27, 32 and 33 in Tables 1 and 2) and one larger indel (#6 in Tables 1 and 2) specifically shared between the ‘PCC-M’ and the ‘PCC-N’ and ‘PCC-P’ substrains, indicating that ‘PCC-M’ belongs to the ‘PCC’ group of motile substrains.9 ‘PCC-M’ and ‘PCC-P’ are strains that both exhibit the native positive phototaxis, whereas the ‘PCC-N’ strain shows negative phototaxis.24 Table 2. Comparison of SNPs and indels found in the chromosome of ‘PCC-M’ with sequences from other substrains All events are numbered (column #) as in Table 1. The presence of the respective ‘PCC-M’ mutation in the different substrains is indicated by the check marks. aThe deletion of 0.6 kb in the gene slr1753 compared with the reference was also verified here in ‘GT-Kazusa’. 3.2. SNPs in protein-coding genes Of the total of 36 SNPs in ‘PCC-M’ compared with ‘GT-Kazusa’, all except 1 are located in the chromosome. 
The single base substitution that was found on the plasmid pCA2.4 within the repA gene (#42 in Table 1) seems to be no mutation but an error in the published sequence of ‘GT-Kazusa’, since in our PCR-control experiments, the sequence was identical in the three strains ‘GT-Kazusa’, ‘PCC-M’ and ‘GT-V’. Of the 35 chromosomal SNPs compared with ‘GT-Kazusa’, 5 are silent base substitutions, 14 substitutions lead to amino acid substitutions, in 6 cases a single basepair is deleted and in 2 cases (#23 and #28) one basepair was inserted within an ORF, causing a frameshift mutation. Furthermore, five substitutions, two single basepair insertions and one single basepair deletion were observed in intergenic regions (IGR) of ‘PCC-M’ compared with the reference (Table 1). Seven SNPs are specifically shared between the ‘PCC-M’, ‘PCC-N and PCC-P’ substrains. These are in slr1865 (#13), encoding a hypothetical protein, in sll1951 (#15), encoding a haemolysin-like protein, in slr1983 (#16), encoding a two-component hybrid sensor and regulator protein, in slr0222 (#27), encoding the histidine kinase Hik25, a silent mutation in slr0302 (#32), encoding a PAS/PAC and GAF sensors-containing diguanylate cyclase, one missing basepair, leaving the spkA gene intact (#5) and, finally, in ssr1176 (#33), encoding a transposase (Tables 1 and 2). The gene for a cell surface-localized haemolysin-like protein, HlyA (sll1951), reported to function as a barrier against the adsorption of toxic compounds,25,26 is lacking one nucleotide in ‘PCC-M’ compared with the reference (difference #15). In the ‘GT-Kazusa’, ‘GT-V’ as well as the ‘GT-I’ and ‘GT-S’ strains,9 the presence of the additional A leads to the fusion of two ORFs that are separate in ‘PCC-M’, ‘PCC-N’ and ‘PCC-P’ substrains.9 As a result, Sll1951 is 1741 amino acids in the former and only 1437 residues in the latter. In our data, some other previously published mutations8,10 are confirmed. 
For instance, spkA (sll1574; #5), a regulator of cellular motility via phosphorylation of membrane proteins,11,27 is disrupted by a 1-bp insertion in the non-motile ‘GT-Kazusa’ and ‘GT-V’ strains, whereas it is intact in the motile ‘PCC-M’ strain (Table 1). Similarly, the pilC gene (slr0162/3) required for pili assembly has been reported to carry a frameshift mutation in the ‘GT-Kazusa’ and ‘GT-S’ sequences.8,10,28 We found an intact pilC gene in ‘PCC-M’ (#20), as well as in the ‘GT-V’ substrain. Another SNP (G–A) exists in psaA (slr1834; #9), encoding the photosynthetic P700 apoprotein subunit Ia; however, in accordance with Tajima et al.10, we believe this is an annotation error in the database as we found an A in the respective position in all three strains dealt with in this work (Table 1). Similarly, ycf22 (sll0751; #26) is here suggested to be fused to the downstream reading frame sll0752. Indeed, in blastp comparisons, both proteins together match against a single, widely distributed, larger protein of 452 amino acids. This protein possesses a Ttg2C domain (COG1463), which is found in an ABC-type transport system involved in resistance to organic solvents. The acronym ycf stands for hypothetical chloroplast reading frames, meaning proteins conserved in chloroplasts and also cyanobacteria. The 1-bp shorter version, which is split into sll0751/sll0752, is a database error in the case of ‘GT-Kazusa’ as well. 3.2.1. SNPs unique to ‘PCC-M’ Six of the 10 SNPs unique to ‘PCC-M’ are located within coding regions and cause amino acid substitutions or alter the length of the respective reading frame. 
A single basepair transversion in the gene sigF (slr1564; #39 in Table 1) leads to an M231K substitution within the −35 element DNA-binding region29 of a group 3 sigma factor required for phototactic movement30 and salt-stress response.31 This SNP cannot lead to impaired motility, as ‘PCC-M’ is motile, but it might influence the DNA–protein interaction of SigF because positively charged residues such as lysine located in this part of the σ4.2 region can directly interact with DNA.29 Another transversion, in argB (slr1898; #8 in Table 1), leads to an S2N amino acid substitution in N-acetylglutamate kinase, the enzyme performing the first committed step of Arg biosynthesis. Transitions in sll1359 and slr1609 (#11 and #3 in Table 1) result in an N–K substitution at a very conserved position within a predicted cytochrome and an L608S (L548S) substitution in the long-chain acyl-CoA-synthetase Slr1609 that has been found crucial for fatty acid activation and the biosynthesis of alkanes.32 Interestingly, an unrelated SNP exists at position 488 923 within the slr1609 coding sequence in a strain ‘YF’, leading to a G546L (G486L) substitution.17 It should be noted that the slr1609 reading frame has been annotated 60 codons shorter (636 instead of 696 amino acids) during recent re-sequencing analyses,9,10 compared with the original annotation of ‘GT-Kazusa’ (numbers in brackets). The shorter Slr1609 protein of 636 amino acids is also consistent with the mapped start site of transcription at position 487 352,33 located 115 nt upstream of the revised start codon. A transition in slr0753 (#41 in Table 1) leads to a P113L substitution in a putative chloride efflux transport protein involved in maintaining the chloride ion concentration homoeostasis as required for a functional photosystem II.34 A single basepair deletion in sll1496 (#38 in Table 1), encoding mannose-1-phosphate guanyltransferase, causes a frameshift and premature stop of the gene in ‘PCC-M’. 
With 515 instead of 643 amino acids, the resulting protein is severely truncated and may be rendered non-functional. 3.3. Point mutations in IGRs Compared with the reference, eight SNPs are located in IGRs; three of these (#7, 24 and 36) are ‘PCC-M’ specific. One of these SNPs (#36 in Table 1) is predicted to affect one of the recently reported cis-antisense RNAs.33 The additional A between positions 3194022 and 3194023 is located in the IGR between genes slr0533 and slr0534, encoding histidine kinase 10 (Hik10) and the soluble lytic transglycosylase Slt. On the reverse strand, the additional T falls within the predicted −10 element of the slr0534_as3 promoter. Instead of the high-scoring CATAAT,33 the motif is changed to ATTAAT. Hence, a modulation of slr0534_as3 expression compared with the reference is possible. In contrast to its designation, this cis-antisense RNA overlaps the 3′ end of genes slr0533 and hik10 (due to an error in the annotation used as the reference). In microarray analyses, slr0534_as3 of strain ‘PCC-M’ was found to be moderately to highly expressed under four tested conditions. Compared with the accumulation of the hik10 mRNA, it appeared even stronger.33 A function for Hik10 has been found in the perception of salt stress or transduction of the signal.35 The slr0534_as3 transcript may play a silencing role with regard to hik10 under non-inducing conditions. Mutation of its promoter element may hence cause a physiological effect in the salt stress response. Two other SNPs (at positions 831 647 and 2 400 722; #7 and #24 in Table 1) could have an impact on the promoter strength or the regulation of the genes infA and glcP. For glcP, the initiation site of transcription was mapped to position 2 400 66633 and for infA to position 831 635 (unpublished). Thus, these two SNPs are located 12 and 56 nt upstream of the respective initiation site of transcription. 
In the case of the infA promoter, the transition replaces a nucleotide within the putative −10 element, changing it from TGTGAT to TATGAT, a much more typical motif for a −10 element in Synechocystis.33 The mutation 56 nt upstream of the initiation site of transcription of glcP might be functionally relevant as well. The gene product, a glucose transporter, is directly relevant for the physiological ability to use glucose; its gene expression is affected by mutation of the gene for the AbrB-type transcription factor Sll0822.36 The region at position −56 might well be part of the recognized sequence. 3.4. Larger indels and plasmids In addition to this relatively large number of SNPs, only seven larger deletions were found on the chromosome and two plasmids. Compared with the reference, a deletion of 0.6 kb exists in the gene slr1753 (#4 in Table 1), which encodes, according to our data, a giant protein comprising 1549 amino acids that probably is transported to the cell surface. However, we found this deletion in our verification also in ‘GT-Kazusa’ and ‘GT-V’. Moreover, the deleted/inserted region consists of long series of DNA repeats (Fig. 1), which points to a possible assembly or annotation error in the original sequence analysis. Figure 1. Alignment of the possible indel region in gene slr1753. The sequence obtained in the verification experiment is aligned with that of the ‘GT-Kazusa’ reference. Two types of DNA repeats are indicated by the filled and non-filled lozenges. Given the very scarce available information concerning biological functions of the plasmids in Synechocystis sp. PCC 6803, it was interesting that all seven plasmids were detected during our analysis. Two, pCC5.2 and pCB2.4, were initially not found. However, as they were amplified easily by PCR, we re-inspected the unmapped sequencing reads, but still could not detect a single read matching these plasmids. 
This observation may relate to a lower copy number of these two plasmids compared with the other plasmids, but this was not tested in the current study. Analysing the plasmid sequences, we observed a remarkable genetic stability. In addition to a single-base substitution in the plasmid pCA2.4 that might rather constitute an error in the reference sequence37 (see above) and a missing mobile element on the plasmid pSYSM, two mutations were observed, both in the plasmid pSYSA. Both affect the clustered regularly interspaced short palindromic repeats-CRISPR-associated proteins (CRISPR-Cas) system located on this plasmid. CRISPR-Cas systems provide adaptive immunity against invading DNA in many archaea and bacteria.38–44 The plasmid pSYSA encodes the three independent systems CRISPR1, CRISPR2 and CRISPR3. A 2399-bp deletion encompassing the spacer-repeat regions 15–47 of CRISPR1 was detected in ‘PCC-M’ (#43), which also eliminated the relatively short genes ssr7018, ssl7019, ssl7020 and ssl7021, annotated within the spacer-repeat array of CRISPR1. However, the predicted protein sequences of these gene products show no conservation at all, and these ORFs might not constitute real genes. Nevertheless, the deletion of spacer-repeat regions 15–47 of CRISPR1 is severe, since compared with the reference, it has eliminated two-thirds (33 of 49) of its spacer-repeat units. The sequence analysis suggests that the recombination events leading to the deletion of spacer-repeat regions 15–47 must have occurred within the direct repeats. Thus, this recombination is in agreement with previous observations that the downstream ends of the repeat clusters are conserved such that deletions and recombination events occur internally.45 A very different type of deletion was noticed for the CRISPR2 system located on the same plasmid. In this case, 159 bp were deleted (event #44 in Table 1). These 159 deleted bases correspond to positions 71 499–71 657 in the reference. 
The deletion encompasses two repeats including the spacer 41 in between. Surprisingly, the recombination did not occur within the repeat sections but in the adjacent spacers 40 and 42, thus generating a new ‘hybrid’ spacer 40 at positions 69 082–69 111 in the pSYSA plasmid of ‘PCC-M’ (Fig. 2). As a result, spacers 40, 41 and 42 of the original sequence are missing and have been replaced by this hybrid sequence. The vast majority of described deletions in the CRISPR system occur between the direct repeats.45 Non-homologous recombination between two different spacers is rare; the deletion observed here in CRISPR2 of the plasmid pSYSA thus generates additional sequence diversity in the CRISPR system. Due to the two deletions in the plasmid pSYSA, we determined its total length as 100 749 bp, compared with 103 307 bp for the reference. Figure 2. Non-homologous recombination in the plasmid pSYSA affecting spacers 40, 41 and 42 of CRISPR2. As a result of the 159-bp deletion in ‘PCC-M’ compared with ‘GT-Kazusa’, a novel hybrid spacer 40 was generated. The direct repeats are presented as squares and the nucleotide positions in the ‘GT-Kazusa’ are given according to the GenBank file NC_005230. 3.5. Mobile elements As can be seen in Tables 1 and 2 (differences #12, 17, 40 and 45), the ‘PCC-M’ substrain lacks four insertion elements of the ISY203 type present in ‘GT-Kazusa’.7 These elements are ISY203b, e and g on the chromosome and ISY203j on the plasmid pSYSM. Three of these four indels have exactly the same size of 1183 bp; the fourth is 1185 bp. In the ‘GT-S’ substrain re-sequenced by Tajima et al.10 one of these four elements, ISY203e, is already present, placing this strain (in accordance with Ikeuchi and Tabata)8 before ‘GT-Kazusa’ in the strain history. 
The absence of ISY203b, e and g in ‘PCC-M’ is further shared with the strains ‘GT-I’, ‘PCC-N’ and ‘PCC-P’,9 whereas no statement is possible with regard to the possible presence of ISY203j on the plasmid pSYSM in the latter. With respect to the described mobile elements, ‘PCC-M’ appears as one of the least-derived substrains.
title	Results
sec	3.1. Overview
title	Overview
p	Sequencing of the Synechocystis sp. PCC 6803 ‘Moscow’ substrain ‘PCC-M’ by Illumina (Solexa) yielded an average 1100-fold coverage of the chromosome and five of the seven plasmids. The existence of the two remaining plasmids was verified individually by PCR. Following assembly of sequences, mapping to the reference strain sequences and annotation, the obtained genome and plasmid sequences were deposited in the GenBank database with the accession numbers CP003265–CP003272.
p	Altogether, we found 45 differences (36 SNPs and 9 indels >1 bp) between the investigated substrain ‘PCC-M’ and the published sequences of the ‘GT-Kazusa’ chromosome2 and plasmids20 used here as references (Table 1). Of these differences, 41 are located in the chromosome and four in the plasmids pSYSA, pSYSM and pCA2.4. For verification, about one-third of these differences were randomly chosen and confirmed independently by PCR and Sanger sequencing of the respective regions in substrain ‘PCC-M’; no misidentified mutations were found. These DNA regions were, in addition, amplified and compared with the sequences from substrains ‘GT-Kazusa’ and ‘GT-V’ for control and comparison, respectively. The GT substrain ‘GT-V’ was chosen for comparison as it is widely used for the dissection and analysis of photosynthetic mutants. Fully segregated PSI, PSII and Chl biosynthesis mutants were successfully generated in this genetic background21,22 and some of these mutants could not be obtained in other substrains.23
table caption	Table 1. Location and effects of SNPs and indels found in ‘PCC-M’ compared with the nucleotide sequence of ‘GT-Kazusa’ in the database. The events are numbered (column #); the type of mutation (M) is indicated as S, substitution, D, deletion or I, insertion, together with the respective start and end positions in the ‘GT-Kazusa’ reference sequence. For each event the respective nucleotide change is indicated on the forward strand, together with the resulting codon modification (Ref. → Mut) and amino acid change, if any. Highlighted in italics are four instances of missing ISY203 copies and in bold all SNPs affecting intergenic spacer regions (IGR). a Indicates errors in the database.
p	Location and effects of SNPs and indels found in ‘PCC-M’ compared with the nucleotide sequence of ‘GT-Kazusa’ in the database
p	The events are numbered (column #); the type of mutation (M) is indicated as S, substitution, D, deletion or I, insertion, together with the respective start and end positions in the ‘GT-Kazusa’ reference sequence. For each event the respective nucleotide change is indicated on the forward strand, together with the resulting codon modification (Ref. → Mut) and amino acid change, if any. Highlighted in italics are four instances of missing ISY203 copies and in bold all SNPs affecting intergenic spacer regions (IGR).
p	a Indicates errors in the database.
p	The number of differences between ‘PCC-M’ and ‘GT-Kazusa’ is almost twice that reported by Tajima et al.10 for the GT ‘Kazusa’ substrain (GT-S), where a total of 22 differences from the published sequence were found. All but 3 of those 22 differences were also detected in the ‘PCC-M’ strain studied here. The three differences unique to ‘GT-S’ and the 26 differences between ‘PCC-M’ and ‘GT-Kazusa’ that are not shared with ‘GT-S’ underline the existence of lineage splitting in the Synechocystis substrains. Moreover, we found seven SNPs (#5, 13, 15, 16, 27, 32 and 33 in Tables 1 and 2) and one larger indel (#6 in Tables 1 and 2) specifically shared between ‘PCC-M’ and the ‘PCC-N’ and ‘PCC-P’ substrains, indicating that ‘PCC-M’ belongs to the ‘PCC’ group of motile substrains.9 ‘PCC-M’ and ‘PCC-P’ both exhibit the native positive phototaxis, whereas the ‘PCC-N’ strain shows negative phototaxis.24
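p	The counts quoted above can be cross-checked with a few lines of arithmetic. The following is only a sketch: all variable names are illustrative, and reading the 26 as the ‘PCC-M’ differences that remain after subtracting those also reported for ‘GT-S’ is an assumption that happens to match the numbers in the text.

```python
# Counts taken from the text; all variable names are illustrative.
total_differences = 45            # 'PCC-M' versus the 'GT-Kazusa' reference
snps, larger_indels = 36, 9
on_chromosome, on_plasmids = 41, 4

gt_s_differences = 22             # reported for 'GT-S' by Tajima et al.
gt_s_unique = 3                   # not detected in 'PCC-M'

# Assumption: the remaining 'GT-S' differences are all shared with 'PCC-M'.
shared_with_gt_s = gt_s_differences - gt_s_unique
not_in_gt_s = total_differences - shared_with_gt_s

assert snps + larger_indels == total_differences
assert on_chromosome + on_plasmids == total_differences
print(shared_with_gt_s, not_in_gt_s)  # prints "19 26"
```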
table caption	Table 2. Comparison of SNPs and indels found in the chromosome of ‘PCC-M’ with sequences from other substrains. All events are numbered (column #) as in Table 1. The presence of the respective ‘PCC-M’ mutation in the different substrains is indicated by the check marks. a The deletion of 0.6 kb in the gene slr1753 compared with the reference was also verified here in ‘GT-Kazusa’.
p	Comparison of SNPs and indels found in the chromosome of ‘PCC-M’ with sequences from other substrains
p	All events are numbered (column #) as in Table 1. The presence of the respective ‘PCC-M’ mutation in the different substrains is indicated by the check marks.
p	The deletion of 0.6 kb in the gene slr1753 compared with the reference was also verified here in ‘GT-Kazusa’.
sec	3.2. SNPs in protein-coding genes
title	SNPs in protein-coding genes
p	Of the total of 36 SNPs in ‘PCC-M’ compared with ‘GT-Kazusa’, all except 1 are located in the chromosome. The single base substitution that was found on the plasmid pCA2.4 within the repA gene (#42 in Table 1) seems to be no mutation but an error in the published sequence of ‘GT-Kazusa’, since in our PCR-control experiments, the sequence was identical in the three strains ‘GT-Kazusa’, ‘PCC-M’ and ‘GT-V’. Of the 35 chromosomal SNPs compared with ‘GT-Kazusa’, 5 are silent base substitutions, 14 substitutions lead to amino acid substitutions, in 6 cases a single basepair is deleted and in 2 cases (#23 and #28) one basepair was inserted within an ORF, causing a frameshift mutation. Furthermore, five substitutions, two single basepair insertions and one single basepair deletion were observed in intergenic regions (IGR) of ‘PCC-M’ compared with the reference (Table 1).
p	Seven SNPs are specifically shared between the ‘PCC-M’, ‘PCC-N’ and ‘PCC-P’ substrains. These are located in slr1865 (#13), encoding a hypothetical protein; in sll1951 (#15), encoding a haemolysin-like protein; in slr1983 (#16), encoding a two-component hybrid sensor and regulator protein; in slr0222 (#27), encoding the histidine kinase Hik25; in slr0302 (#32, a silent mutation), encoding a PAS/PAC and GAF sensors-containing diguanylate cyclase; in spkA (#5), where one missing basepair leaves the gene intact; and, finally, in ssr1176 (#33), encoding a transposase (Tables 1 and 2).
p	The gene for a cell surface-localized haemolysin-like protein, HlyA (sll1951), reported to function as a barrier against the adsorption of toxic compounds,25,26 is lacking one nucleotide in ‘PCC-M’ compared with the reference (difference #15). In the ‘GT-Kazusa’, ‘GT-V’ as well as the ‘GT-I’ and ‘GT-S’ strains,9 the presence of the additional A leads to the fusion of two ORFs that are separate in ‘PCC-M’, ‘PCC-N’ and ‘PCC-P’ substrains.9 As a result, Sll1951 is 1741 amino acids in the former and only 1437 residues in the latter.
p	In our data, some other previously published mutations8,10 are confirmed. For instance, spkA (sll1574; #5), a regulator of cellular motility via phosphorylation of membrane proteins,11,27 is disrupted by a 1-bp insertion in the non-motile ‘GT-Kazusa’ and ‘GT-V’ strains, whereas it is intact in the motile ‘PCC-M’ strain (Table 1). Similarly, the pilC gene (slr0162/3) required for pili assembly has been reported to carry a frameshift mutation in the ‘GT-Kazusa’ and ‘GT-S’ sequences.8,10,28 We found an intact pilC gene in ‘PCC-M’ (#20), as well as in the ‘GT-V’ substrain.
p	Another SNP (G–A) exists in psaA (slr1834; #9), encoding the photosynthetic P700 apoprotein subunit Ia; however, in accordance with Tajima et al.,10 we believe this is an annotation error in the database as we found an A in the respective position in all three strains dealt with in this work (Table 1). Similarly, ycf22 (sll0751; #26) is here suggested to be fused to the downstream reading frame sll0752. Indeed, in blastp comparisons, both proteins together match against a single, widely distributed, larger protein of 452 amino acids. This protein possesses a Ttg2C domain (COG1463), which is found in an ABC-type transport system involved in resistance to organic solvents. The acronym ycf stands for hypothetical chloroplast reading frames, meaning proteins conserved in chloroplasts and also cyanobacteria. The 1-bp shorter version, which is split into sll0751/sll0752, is a database error in the case of ‘GT-Kazusa’ as well.
sec	3.2.1. SNPs unique to ‘PCC-M’
title	SNPs unique to ‘PCC-M’
p	Six of the 10 SNPs unique to ‘PCC-M’ are located within coding regions and cause amino acid substitutions or alter the length of the respective reading frame.
p	A single basepair transversion in the gene sigF (slr1564; #39 in Table 1) leads to an M231K substitution within the −35 element DNA-binding region29 of a group 3 sigma factor required for phototactic movement30 and salt-stress response.31 This SNP evidently does not abolish motility, as ‘PCC-M’ is motile, but it might influence the DNA–protein interaction of SigF because positively charged residues such as lysine located in this part of the σ4.2 region can directly interact with DNA.29
p	Another transversion, in argB (slr1898; #8 in Table 1), leads to an S2N amino acid substitution in N-acetylglutamate kinase, the enzyme performing the first committed step of Arg biosynthesis. Transitions in sll1359 and slr1609 (#11 and #3 in Table 1) result in an N–K substitution at a very conserved position within a predicted cytochrome and an L608S (L548S) substitution in the long-chain acyl-CoA-synthetase Slr1609 that has been found crucial for fatty acid activation and the biosynthesis of alkanes.32 Interestingly, an unrelated SNP exists at position 488 923 within the slr1609 coding sequence in a strain ‘YF’, leading to a G546L (G486L) substitution.17 It should be noted that the slr1609 reading frame has been annotated 60 codons shorter (636 instead of 696 amino acids) during recent re-sequencing analyses,9,10 compared with the original annotation of ‘GT-Kazusa’ (numbers in brackets). The shorter Slr1609 protein of 636 amino acids is also consistent with the mapped start site of transcription at position 487 352,33 located 115 nt upstream of the revised start codon.
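p	The two residue numberings given for Slr1609 differ by the 60-codon length difference between the annotations. The sketch below makes that conversion explicit; it assumes that the shorter reading frame lacks the first 60 codons of the longer one, so that the larger number of each pair belongs to the 696-codon frame. The function and constant names are hypothetical.

```python
ORIGINAL_LENGTH_AA = 696   # longer annotation of Slr1609
REVISED_LENGTH_AA = 636    # shorter annotation with the revised start codon
OFFSET = ORIGINAL_LENGTH_AA - REVISED_LENGTH_AA  # 60 codons


def to_shorter_frame(residue: int) -> int:
    """Renumber a residue from the 696-codon frame to the 636-codon frame."""
    if not OFFSET < residue <= ORIGINAL_LENGTH_AA:
        raise ValueError("residue lies outside the shorter reading frame")
    return residue - OFFSET


for label, residue in (("L608S", 608), ("G546L", 546)):
    print(label, "->", to_shorter_frame(residue))
# L608S -> 548
# G546L -> 486
```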
p	A transition in slr0753 (#41 in Table 1) leads to a P113L substitution in a putative chloride efflux transport protein involved in maintaining the chloride ion concentration homoeostasis as required for a functional photosystem II.34
p	A single basepair deletion in sll1496 (#38 in Table 1), encoding mannose-1-phosphate guanyltransferase, causes a frameshift and a premature stop codon in ‘PCC-M’. With 515 instead of 643 amino acids, the resulting protein is severely truncated and may be non-functional.
sec	3.3. Point mutations in IGRs
title	Point mutations in IGRs
p	Compared with the reference, eight SNPs are located in IGRs; three of these (#7, 24 and 36) are ‘PCC-M’ specific. One of these SNPs (#36 in Table 1) is predicted to affect one of the recently reported cis-antisense RNAs.33 The additional A between positions 3 194 022 and 3 194 023 is located in the IGR between genes slr0533 and slr0534, encoding histidine kinase 10 (Hik10) and the soluble lytic transglycosylase Slt. On the reverse strand, the additional T falls within the predicted −10 element of the slr0534_as3 promoter. Instead of the high-scoring CATAAT,33 the motif is changed to ATTAAT. Hence, a modulation of slr0534_as3 expression compared with the reference is possible. Contrary to what its designation suggests, this cis-antisense RNA overlaps the 3′ end of gene slr0533 (hik10); the name stems from an error in the annotation used as the reference. In microarray analyses, slr0534_as3 of strain ‘PCC-M’ was found to be moderately to highly expressed under four tested conditions. Its accumulation appeared even stronger than that of the hik10 mRNA.33 A function for Hik10 has been found in the perception of salt stress or transduction of the signal.35 The slr0534_as3 transcript may play a silencing role with regard to hik10 under non-inducing conditions. Mutation of its promoter element may hence cause a physiological effect in the salt stress response.
p	Two other SNPs (at positions 831 647 and 2 400 722; #7 and #24 in Table 1) could have an impact on the promoter strength or the regulation of the genes infA and glcP. For glcP, the initiation site of transcription has been mapped33 to position 2 400 666 and for infA to position 831 635 (unpublished). Thus, these two SNPs are located 12 and 56 nt upstream of the respective initiation site of transcription. In the case of the infA promoter, the transition replaces a nucleotide within the putative −10 element, changing it from TGTGAT to TATGAT, a much more typical motif for a −10 element in Synechocystis.33 The mutation 56 nt upstream of the initiation site of transcription of glcP might be functionally relevant as well. The gene product, a glucose transporter, is directly relevant for the physiological ability to use glucose; its gene expression is affected by mutation of the gene for the AbrB-type transcription factor Sll0822.36 The region at position −56 might well be part of the recognized sequence.
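p	The distances of 12 and 56 nt and the classification of the infA change as a transition follow directly from the coordinates and motifs quoted above. A minimal sketch of that reasoning is given below; the helper names are hypothetical, and the distance is taken as an absolute value so that no assumption about strand orientation is needed.

```python
PURINES = {"A", "G"}


def distance_to_tss(snp_pos: int, tss_pos: int) -> int:
    """Distance in nt between a SNP and a transcription start site."""
    return abs(snp_pos - tss_pos)


def single_difference(ref_motif: str, alt_motif: str) -> tuple[int, str, str]:
    """Return (index, ref_base, alt_base) for motifs differing at one site."""
    if len(ref_motif) != len(alt_motif):
        raise ValueError("motifs must have equal length")
    diffs = [(i, r, a) for i, (r, a) in enumerate(zip(ref_motif, alt_motif)) if r != a]
    if len(diffs) != 1:
        raise ValueError("expected exactly one differing position")
    return diffs[0]


def substitution_class(ref: str, alt: str) -> str:
    """Purine<->purine or pyrimidine<->pyrimidine is a transition."""
    if ref == alt:
        raise ValueError("not a substitution")
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


print(distance_to_tss(831_647, 831_635))        # infA: 12
print(distance_to_tss(2_400_722, 2_400_666))    # glcP: 56
idx, ref, alt = single_difference("TGTGAT", "TATGAT")
print(idx, ref, alt, substitution_class(ref, alt))  # 1 G A transition
```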
sec	3.4. Larger indels and plasmids
title	Larger indels and plasmids
p	In addition to this relatively large number of SNPs, only seven larger deletions were found on the chromosome and two plasmids. Compared with the reference, a deletion of 0.6 kb exists in the gene slr1753 (#4 in Table 1), which, according to our data, encodes a giant protein of 1549 amino acids that is probably transported to the cell surface. However, in our verification we also found this deletion in ‘GT-Kazusa’ and ‘GT-V’. Moreover, the deleted/inserted region consists of long series of DNA repeats (Fig. 1), which suggests a possible assembly or annotation error in the original sequence analysis.
figure caption	Figure 1. Alignment of the possible indel region in gene slr1753. The sequence obtained in the verification experiment is aligned with that of the ‘GT-Kazusa’ reference. Two types of DNA repeats are indicated by the filled and non-filled lozenges.
p	Given how little is known about the biological functions of the plasmids in Synechocystis sp. PCC 6803, it was interesting that all seven plasmids were detected during our analysis. Two, pCC5.2 and pCB2.4, were initially not found in the sequencing data. Because both were amplified easily by PCR, we re-inspected the unmapped sequencing reads, but still could not detect a single read matching these plasmids. This observation may relate to a lower copy number of these two plasmids compared with the others, but this was not tested in the current study. Analysing the plasmid sequences, we observed a remarkable genetic stability. In addition to a single-base substitution in the plasmid pCA2.4 that might rather constitute an error in the reference sequence37 (see above) and a missing mobile element on the plasmid pSYSM, two mutations were observed, both in the plasmid pSYSA.
p	Two major mutations affect the clustered regularly interspaced short palindromic repeats (CRISPR)-CRISPR-associated proteins (Cas) system located on the plasmid pSYSA. In many archaea and bacteria, CRISPR-Cas systems provide an adaptive immunity against invading DNA.38–44 The plasmid pSYSA encodes the three independent systems CRISPR1, CRISPR2 and CRISPR3. A 2399-bp deletion encompassing the spacer-repeat regions 15–47 of CRISPR1 was detected in ‘PCC-M’ (#43). It also eliminated the relatively short genes ssr7018, ssl7019, ssl7020 and ssl7021, which are annotated within the spacer-repeat array of CRISPR1. However, the theoretical protein sequences of these gene products show no conservation at all, so they might not constitute real genes. Nevertheless, the deletion of spacer-repeat regions 15–47 of CRISPR1 is severe: compared with the reference, it has eliminated 33 of the 49 spacer-repeat units, about two-thirds of the array. The sequence analysis suggests that the recombination events leading to the deletion of spacer-repeat regions 15–47 must have occurred within the direct repeats. Thus, this recombination is in agreement with previous observations that the downstream ends of the repeat clusters are conserved such that deletions and recombination events occur internally.45
p	A very different type of deletion was noticed for the CRISPR2 system located on the same plasmid. In this case, 159 bp were deleted (event #44 in Table 1). These 159 deleted bases correspond to positions 71 499–71 657 in the reference. The deletion encompasses two repeats including the spacer 41 in between. Surprisingly, the recombination did not occur within the repeat sections but in the adjacent spacers 40 and 42, thus generating a new ‘hybrid’ spacer 40 at positions 69 082–69 111 in the pSYSA plasmid of ‘PCC-M’ (Fig. 2). As a result, spacers 40, 41 and 42 of the original sequence are missing and have been replaced by this hybrid sequence. The vast majority of described deletions in the CRISPR system occur between the direct repeats.45 Non-homologous recombination between two different spacers is rare; the deletion observed here in CRISPR2 of the plasmid pSYSA thus generates additional sequence diversity in the CRISPR system. Due to the two deletions in the plasmid pSYSA, we determined its total length as 100 749 bp, compared with 103 307 bp for the reference.
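p	The numbers reported for the two pSYSA deletions are mutually consistent, which can be checked with a short Python sketch. All figures below are copied from the text; the variable names are hypothetical. The final coordinate shift additionally assumes that the CRISPR1 deletion lies at lower plasmid coordinates than CRISPR2 and that no other length differences precede the CRISPR2 junction, which the text does not state explicitly.

```python
# Values quoted in the text for plasmid pSYSA.
REFERENCE_PSYSA_BP = 103_307
CRISPR1_DELETION_BP = 2_399
CRISPR1_UNITS_IN_REFERENCE = 49

# Spacer-repeat units 15..47 (inclusive) were lost from CRISPR1.
crispr1_units_deleted = len(range(15, 47 + 1))
crispr1_fraction_deleted = crispr1_units_deleted / CRISPR1_UNITS_IN_REFERENCE

# The CRISPR2 deletion spans reference positions 71 499-71 657 (inclusive).
crispr2_start, crispr2_end = 71_499, 71_657
crispr2_deletion_bp = crispr2_end - crispr2_start + 1

# Expected plasmid length in 'PCC-M' after both deletions.
pcc_m_psysa_bp = REFERENCE_PSYSA_BP - CRISPR1_DELETION_BP - crispr2_deletion_bp

# ASSUMPTION: CRISPR1 lies upstream of CRISPR2 in plasmid coordinates, so the
# last base retained before the CRISPR2 junction shifts by the CRISPR1 deletion.
last_retained_in_pcc_m = (crispr2_start - 1) - CRISPR1_DELETION_BP
hybrid_spacer = (69_082, 69_111)
junction_inside_hybrid = hybrid_spacer[0] <= last_retained_in_pcc_m < hybrid_spacer[1]

print(crispr1_units_deleted, round(crispr1_fraction_deleted, 2))
print(crispr2_deletion_bp, pcc_m_psysa_bp, junction_inside_hybrid)
```

p	Under these assumptions, the computed plasmid length matches the reported 100 749 bp, and the CRISPR2 junction falls inside the reported coordinates of the hybrid spacer 40.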
figure caption	Figure 2. Non-homologous recombination in the plasmid pSYSA affecting spacers 40, 41 and 42 of CRISPR2. As a result of the 159-bp deletion in ‘PCC-M’ compared with ‘GT-Kazusa’, a novel hybrid spacer 40 was generated. The direct repeats are presented as squares and the nucleotide positions in the ‘GT-Kazusa’ are given according to the GenBank file NC_005230.
sec	3.5. Mobile elements
title	Mobile elements
p	As can be seen in Tables 1 and 2 (differences #12, 17, 40 and 45), the ‘PCC-M’ substrain lacks four insertion elements of the ISY203 type present in ‘GT-Kazusa’.7 These elements are ISY203b, e and g on the chromosome and ISY203j on the plasmid pSYSM. These indels all have the same size of 1183 bp, except for one of 1185 bp.
p	In the ‘GT-S’ substrain re-sequenced by Tajima et al.,10 one of these four elements, ISY203e, is already present, placing this strain (in accordance with Ikeuchi and Tabata)8 before ‘GT-Kazusa’ in the strain history. The absence of ISY203b, e and g in ‘PCC-M’ is further shared with the strains ‘GT-I’, ‘PCC-N’ and ‘PCC-P’,9 whereas no statement can be made about the presence of ISY203j on the plasmid pSYSM in the latter strains.
p	With respect to the described mobile elements, ‘PCC-M’ appears as one of the least-derived substrains.
sec	4. Discussion
title	Discussion
sec	4.1. Strain history
title	Strain history
p	‘PCC-M’ shows sequence differences in several genes compared with the reference sequence of ‘GT-Kazusa’ and also with the recently sequenced ‘GT-S’ strain. Kanesaki et al.9 concluded that 15 differences between the resequenced strains and the published GT-Kazusa sequence were annotation errors in the latter due to sequencing artefacts, a list to which we add two more putative errors in the database, differences #4 and #42 in Table 1. According to the strain history proposed by Ikeuchi and Tabata,8 the early division of Synechocystis sp. PCC 6803 into two branches occurred due to an insertion in spkA. Thus, our data suggest that the motile ‘PCC-M’ strain belongs to the motile PCC 6803 branch, whereas the non-motile ‘GT-Kazusa’, ‘GT-S’ and ‘GT-V’ strains are more closely related to each other and belong to the ATCC 27184 branch. However, the 1-bp insertion in pilC leading to ‘GT-Kazusa’, as described in the proposed strain history,8 is not present in either ‘GT-S’ or ‘GT-V’, characterizing ‘GT-Kazusa’ as a more derived substrain.
p	That ‘PCC-M’ belongs to the motile PCC 6803 branch is further reinforced by our finding of six SNPs specifically shared between ‘PCC-M’ and the ‘PCC-N’ and ‘PCC-P’ substrains (Tables 1 and 2).9 These six SNPs are in slr1865, in sll1951, encoding a haemolysin-like protein, in ssr1176, encoding a transposase and, interestingly, in genes encoding sensor and/or regulatory proteins (slr1983, slr0222 and slr0302) (Tables 1 and 2). They must already have been present in the progenitor strain of ‘PCC-M’, ‘PCC-N’ and ‘PCC-P’. Additional support comes from the analysis of two larger indels (#2 and #6 in Table 1). Kanesaki et al.9 described difficulties in detecting indels between direct repeat sequences, such as those in slr1084 and slr2031, from short-read re-sequencing data. Therefore, these two regions were analysed by PCR and Sanger sequencing in addition to the re-sequencing analysis. Indeed, detecting the indels between direct repeat sequences in genes slr1084 and slr2031 was not straightforward in our analysis either. Compared with the reference, we found additional sequences of 102 and 154 bp, respectively, to be present in ‘PCC-M’. This result is relevant for lineage relationships among substrains. The additional 102 bp in gene slr1084 are shared between ‘PCC-M’ and the other substrains ‘PCC-P’, ‘PCC-N’ and ‘GT-I’. Therefore, this must be a deletion in the lineage leading to GT-Kazusa and GT-S. In contrast, the additional 154 bp within and upstream of gene slr2031 are shared between ‘PCC-M’, ‘PCC-P’ and ‘PCC-N’ and are absent from all studied GT substrains. These 154 bp comprise the conserved start codon of slr2031 and extend the gene by 29 codons compared with ‘GT-Kazusa’. Hence, the lack of these 154 bp in GT strains indicates a functionally adverse deletion there. In fact, the 154-bp deletion in GT substrains was noticed before,46 as well as the activity of slr2031 in the original Synechocystis sp. PCC 6803 substrains.47 From these considerations, the tree shown in Fig. 3 can be derived. In this tree, ‘GT-Kazusa’ is displayed as the strain with the longest evolutionary distance from the original isolate, whereas the ‘PCC-M’ substrain belongs to the ‘PCC’ group of substrains and probably retains characteristics close to those of the original isolate. All strains belonging to the ‘PCC’ group of substrains exhibit twitching motility, as was shown also for the original PCC strain deposited in the Pasteur Culture Collection,6 with variations in the motility behaviour.48,49 Since ‘PCC-M’ shows motility and is tolerant to glucose, it appears physiologically as a sort of intermediate between the two major branches, the motile and the GT branch, consistent with its characterization as being close to the original isolate.
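p	The lineage argument based on the two indels can be tabulated as a presence/absence matrix. The following Python sketch simply groups substrains by their indel profile; it is an illustration of the reasoning, not the method used to build Fig. 3. The presence/absence states are taken from the text, with ‘GT-Kazusa’ as the reference lacking both sequences; the locus labels and the function name group_by_profile are hypothetical, and substrains whose state is not given in the text are left out.

```python
from collections import defaultdict

# True = additional sequence present relative to the 'GT-Kazusa' reference.
indel_states = {
    "slr1084_102bp": {
        "PCC-M": True, "PCC-P": True, "PCC-N": True,
        "GT-I": True, "GT-S": False, "GT-Kazusa": False,
    },
    "slr2031_154bp": {
        "PCC-M": True, "PCC-P": True, "PCC-N": True,
        "GT-I": False, "GT-S": False, "GT-Kazusa": False,
    },
}

def group_by_profile(table):
    """Group strains that share the same presence/absence pattern."""
    loci = sorted(table)
    strains = sorted({strain for states in table.values() for strain in states})
    groups = defaultdict(list)
    for strain in strains:
        profile = tuple(table[locus][strain] for locus in loci)
        groups[profile].append(strain)
    return dict(groups)

groups = group_by_profile(indel_states)
for profile, members in groups.items():
    print(profile, members)
```

p	The three resulting profiles are nested: the PCC substrains carry both sequences, ‘GT-I’ carries only the 102 bp in slr1084, and ‘GT-S’ and ‘GT-Kazusa’ carry neither. This pattern is compatible with two successive deletions along the GT lineage.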
figure caption	Figure 3. Visualization of phylogenetic relationships between various strains of Synechocystis sp. PCC 6803. The identified SNPs and other known events are indicated along the branches. The eight events separating the ‘GT’ and ‘PCC’ strains from each other are given at the branch point where these two lineages split or on the respective branches where they occurred. Putative insertions and deletions are labelled ‘Ins.’ and ‘Del.’, respectively.
sec	4.2. Re-sequencing studies of Synechocystis sp. PCC 6803
title	Re-sequencing studies of Synechocystis sp. PCC 6803
p	The analysis of genome sequences of cyanobacteria has had a large impact on photosynthesis, ecology and biotechnology research.50 The present re-sequencing project delivers the new and complete sequence of Synechocystis sp. PCC 6803 ‘PCC-M’, a substrain used in many laboratories and in several aspects close to the original isolate. Altogether, chromosomal sequences are now available for seven substrains of Synechocystis sp. PCC 6803: ‘PCC-M’ (this study); ‘PCC-P’ (positive phototaxis) and ‘PCC-N’ (negative phototaxis), both based on single colonies isolated from the PCC strain and designated according to their direction of phototactic movement;24 ‘GT-I’, the standard strain in Dr Ikeuchi's group;8 ‘YF’;17 ‘GT-S’,10 a current derivative of the original stock of Synechocystis sp. PCC 6803; and ‘GT-Kazusa’, for which the chromosomal reference sequence was determined from that stock in 1996 (ref. 2) and the sequences of the large plasmids in 2003,20 whereas the three small plasmids had been sequenced earlier.37,51,52
sec	4.3. Mutations potentially linked to phenotype
title	Mutations potentially linked to phenotype
p	Most of the differences identified between the sequenced substrains probably result from differences in cultivation conditions among laboratories, which have selected for the fixation of one mutation or another. This implies that the majority of the identified mutations are not silent but have a functional effect. Indeed, rather than being silent, as might be expected, most mutations in coding regions lead to frameshifts, amino acid substitutions or truncated reading frames. Similarly, SNPs in non-coding regions are probably biologically meaningful as well. This idea is supported here by the assignment of three ‘PCC-M’-specific SNPs in IGRs to promoter regions controlling the expression of two protein-coding genes and one antisense RNA.
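p	The distinction drawn above between silent mutations and those causing amino acid substitutions or truncated reading frames can be illustrated with a minimal sketch. It is not part of the original analysis: the function name classify_snp, its arguments and the example codons are hypothetical, and the standard genetic code is assumed. The first example uses ATG to AAG because this is the only single-nucleotide change that converts a Met codon into a Lys codon under that code; the codons in the other two examples are invented for illustration.

```python
from itertools import product

# Standard genetic code, built from the usual compact layout
# (bases in the order T, C, A, G; third codon position varies fastest).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(ref_codon, pos, alt_base):
    """Classify a single-base change inside one codon.

    ref_codon: reference codon on the coding strand (hypothetical input).
    pos:       0-based position of the SNP within the codon.
    alt_base:  base found in the re-sequenced strain.
    Returns (kind, ref_aa, alt_aa); '*' denotes a stop codon.
    """
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:pos] + alt_base.upper() + ref_codon[pos + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "silent"
    elif alt_aa == "*":
        kind = "nonsense"   # premature stop, i.e. a truncated reading frame
    elif ref_aa == "*":
        kind = "stop-lost"  # stop codon replaced by a sense codon
    else:
        kind = "missense"   # amino acid substitution
    return kind, ref_aa, alt_aa


if __name__ == "__main__":
    print(classify_snp("ATG", 1, "A"))  # Met -> Lys
    print(classify_snp("TGG", 2, "A"))  # Trp -> stop
    print(classify_snp("CTG", 2, "A"))  # Leu -> Leu
```

p	Frameshifts caused by indels are deliberately left out of this sketch, since they cannot be judged from a single codon and require the full reading frame.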
p	For all these reasons, it appears likely that several of the mutations specific to ‘PCC-M’ or shared with ‘PCC-P’ and ‘PCC-N’ are related to the known phenotypes of these strains. For example, the truncation of sll1951 (haemolysin) and the possible truncation of slr1753 (surface protein) may contribute to a stress-induced clumping phenotype. Several other mutations might alter the glucose tolerance or phototactic behaviour of these substrains. Differences at other loci may affect phage resistance, stress responses or functions in primary metabolism that are potentially relevant for the synthesis of alkanes or for N and C metabolism. The absence of ISY203g in the sll1473–5 region in PCC substrains leaves intact a photoreceptor that regulates the expression of an alternative phycobilisome linker gene.53 Regarding phenotypic differences among motile PCC substrains, it is noteworthy that ‘PCC-M’, despite its general ability to be motile, is not phototactic towards blue light (see the direct comparison of strains in Fig. 1 of Fiedler et al.48). Here, SNP #39 in the sigF gene, which is known to be involved in the control of phototactic movement,30 might be relevant, as the resulting M231K substitution could subtly influence the DNA–protein interaction of this group 3 sigma factor. Clearly, such subtle differences in genome sequence have to be considered when choosing a particular substrain for an experiment and when comparing phenotypes of mutant lines from different laboratories with the wild-type strain. The re-sequenced genome and plasmid sequences, including precisely annotated SNPs, are available in eight sequence files from GenBank under the accession numbers CP003265–CP003272.
title	Funding
p	The research leading to these results received funding from the European Union Seventh Framework Programme (FP7-ENERGY-2010-1) under grant agreement no. 256808, from the German Research Foundation (DFG) project FOR1680 ‘Unravelling the Prokaryotic Immune System’ (grant HE 2544/8-1) to WRH, and from grant AL 892/1-4 to SAB.
