Generalized enhanced suffix array construction in external memory Abstract Background Suffix arrays, augmented by additional data structures, allow solving efficiently many string processing problems. The external memory construction of the generalized suffix array for a string collection is a fundamental task when the size of the input collection or the data structure exceeds the available internal memory. Results In this article we present and analyze eGSA [introduced in CPM (External memory generalized suffix and LCP arrays construction. In: Proceedings of CPM. pp 201–10, 2013)], the first external memory algorithm to construct generalized suffix arrays augmented with the longest common prefix array for a string collection. Our algorithm relies on a combination of buffers, induced sorting and a heap to avoid direct string comparisons. We performed experiments that covered different aspects of our algorithm, including running time, efficiency, external memory access, internal phases and the influence of different optimization strategies. On real datasets of size up to 24 GB and using 2 GB of internal memory, eGSA showed a competitive performance when compared to eSAIS and SAscan, which are efficient algorithms for a single string according to the related literature. We also show the effect of disk caching managed by the operating system on our algorithm. Conclusions The proposed algorithm was validated through performance tests using real datasets from different domains, in various combinations, and showed a competitive performance. Our algorithm can also construct the generalized Burrows-Wheeler transform of a string collection with no additional cost except by the output time. Introduction Suffix arrays [40] (also known as PAT arrays [23]) may be used for the solution of string processing problems in several areas, including pattern matching, data compression and information retrieval [24, 39, 47]. Combining a suffix array with the longest common prefix (LCP) array and with the Burrows–Wheeler transform (BWT) [12] provides a data structure, an enhanced suffix array (ESA) [2], that enables solving many string processing problems in optimal time and space. Using such structures in the solution of problems involving strings is usually done in two steps: the structure is first constructed and then it is queried. This article is about the construction of generalized enhanced suffix arrays for a collection of strings using external memory. This is motivated by the rising number of applications that deal with huge sets of strings, such as those in Bioinformatics and Internet searching. Moreover, recent advancements in non-volatile storage technologies have substantially narrowed the gap between internal and external memory access times, making the querying of external suffix arrays significantly faster. Different algorithms have been proposed for internal memory suffix array construction (see [17, 49]), including algorithms with linear running time [30, 33, 46]. Gonnet et al.  [23] proposed the first external memory algorithm for constructing suffix arrays. Later, Crauser and Ferragina [14] adapted internal memory algorithms to work in external memory. Dementiev et al.  [16] observed that these algorithms do not scale well and presented a pipelined version of the internal memory algorithm DC3 [30] to external memory. Nong et al.  [44, 45] adapted the internal memory algorithms SA-DS and SA-IS [46] to external memory, and Liu et al.  [34] presented an enhanced version of SA-IS to external memory. Kärkkäinen and Kempa [25] presented the SAscan algorithm, improving on the earlier proposal by Gonnet et al.  [23], and later, Kärkkäinen et al.  presented a parallel external version of SAscan algorithm [27]. BWT can be either obtained from the suffix array or constructed directly in internal memory in linear time [48]. Ferragina et al.  [18] proposed an external memory algorithm to construct the BWT for a single string, and Bauer et al.  [5] presented external memory algorithms to compute and decode the BWT for a string collection. LCP construction in internal memory is also possible in linear time during the suffix array construction [19, 35] or afterwards, given the suffix array [29, 31, 41] or the BWT as input [7, 22]. Kärkkäinen and Kempa [26] presented the LCPscan, an external memory algorithm to construct LCP arrays given the suffix array as input, and Bauer et al.  [6] proposed the extLCP algorithm to construct both BWT and LCP arrays for large collections of equally sized strings in external memory, and later, Cox et al.  [13] presented an extended version of extLCP to deal with strings with different sizes. The suffix and the LCP arrays are constructed together in external memory by eSAIS, proposed by Bingmann et al. [10], one of the most efficient external memory algorithm to date. There exists alternatives to compute suffix and LCP arrays in parallel [20] and using small space [37, 42]. In this article we present and analyze the algorithm eGSA (introduced in [38]) in depth. To our knowledge this is the first algorithm to construct generalized enhanced suffix arrays in external memory. We compared eGSA with the most efficient related algorithms in the literature, eSAIS [10] and SAscan [25]. Although eSAIS and SAscan can easily be applied to the concatenation of a string collection, our method is shown to run faster in practice. In addition to the LCP array, our method also constructs the BWT for the collection. eGSA uses a heap and a combination of optimization procedures that are shown to be very effective in practice. The optimizing strategies that we propose in this article are based on nice properties of strings and their relation with the LCP array, and are applied across the nodes of a heap. Theoreticallly, eSAIS runs in O(nlogM/B(n/B)) time and O((n/B)logM/B(n/B)) I/Os, where n is the length of the input string, B is the disk block size and M is the RAM size. SAscan runs in O((n2/M)log(2+logσ/loglogn)) time and O(n2logσ/(MBlogn)+(n/B)logM/B(n/B)) I/Os. Our algorithm runs in O((Nlogm)maxlcp) time and O(Nlogm|Tℓ|) I/Os, where N is the sum of the m string lengths in the input, maxlcp is the length of the longest common prefix between suffixes of the input strings, |Tℓ| is the length of the longest string in the collection. The rest of the article is organized as follows.  "Background" section introduces concepts and notation, "eGSA" section describes the algorithm and presents a theoretical analysis, "Performance evaluation" section details the experiments, results and investigates limitations of the algorithm. "Conclusions" section concludes the article. Background Let Σ be an ordered alphabet of symbols. We denote the set of every string of symbols in Σ by Σ∗ and the concatenation of strings or symbols by the dot operator (·). Let $ be a symbol not in Σ that precedes every symbol in Σ with respect to the alphabetical order. We define Σ$={T·$|T∈Σ∗}. We use the symbol < for the lexicographic order relation between strings. The ith symbol in a string T of length n is denoted T[i], 1≤i≤n. A substring of T is denoted T[i,j]=T[i]·…·T[j], 1≤i≤j≤n. A prefix of T is a substring of the form T[1, k] and a suffix is a substring of the form T[k, n], 1≤k≤n. A suffix array for a string T∈Σ$ of size n, denoted SA, is an array of integers SA=[i1,i2,…,in] such that T[i1,n]lcp(S1,S3) then S2S3 (case 2). Otherwise, if lcp(S1,S2)=lcp(S1,S3)=ℓ then lcp(S2,S3)≥ℓ (case 3). Let X, Y and Z be nodes in the binary heap storing Ea[i], Eb[j] and Ec[k], respectively. Let X, Y and Z be also the suffixes stored by such heap nodes. Suppose that node X is the parent of Y and Z. Because Xlcp(suff2(3),suff1(4)) and that suff2(4)=AGAGA$ is less than suff1(4)=ATAGA$. Suffix induction The induced sorting principle corresponds to deduce the order of unsorted suffixes from already sorted suffixes. This strategy is used by many suffix array construction algorithms [49]. We apply an induced sorting approach that relies on the following lemma. Let a suffix starting with a symbol α be denoted α-suffix and let suffT be the set of all suffixes of strings in T. Lemma 2 If Ti[j,ni] is the smallest suffix in suffT then Ti[j-1,ni]=α·Ti[j,ni] is the smallest α -suffix in suffT\{Ti[j,ni]}. Proof Suppose that there is a α -suffix Tℓ[k,nℓ] in suffT that precedes Ti[j-1,ni]. Then Tℓ[k+1,nℓ] must be smaller than Ti[j,ni], a contradiction. Lemma 2 can be used for sorting the suffixes of a string T of length n as follows. Let an α-bucket be a block of a partition of SA that contains only α-suffixes. suffT is initialized with every suffix of T and an empty bucket for each symbol in Σ is created. While suffT is not empty, the smallest suffix T[j,n]=α·T[j+1,n] in suffT is moved to the leftmost available position in the α-bucket and, if α<β then T[j-1,n]=β·T[j,n] is added to the leftmost available position in the β-bucket (it is induced). The induced suffix T[j-1,n] cannot be removed from suffT yet because it may induce T[j-2,ni] as well. When a suffix that is already in a bucket is also the smallest in suffT, the suffix itself and those that succeed it in the bucket are used to induce another suffix and are removed from suffT at once. Note that if α>β then the suffix T[j-1,ni] was already sorted and if α=β then reading induced suffixes from the β-bucket can cause the induction of already induced suffixes. So no induction is done when α≥β. This approach is not efficient to sort the suffixes of a single string T, since it is often necessary to find a smallest suffix. But in merging previously sorted suffixes the smallest one can be determined efficiently using the heap. Suppose that Ei[k] is at the root of the heap. Then Ti[j,ni] is the smallest suffix in suffT and Ti[j-1,ni] can be induced if Ti[j]A, T1[6-1=5,n1]=GA$ is induced as the smallest G-suffix in suffT. Then the pair (1, 0) is written to the buffer IG to indicate that a suffix from string 1 was induced with lcp=0. The lcp value in GESA between T1[5,n1] and the next induced G-suffix (T2[5,n2]) is computed by the minimum lcp value from the suffixes passing through the heap until T2[5,n2] is induced. This happens when T2[6,n2] is the smallest element in the heap and T2[5,n2] is induced together with the lcp(T1[6,n1],T2[6,n2])+1=2, obtained by the current minimum lcp value. When T1[5,n1] is the smallest suffix in the heap, FG is read sequentially and the induced G-suffixes are recovered together with their lcp values. Using prefix assembly together with induction requires additional care. Since induced suffixes are not compared in the heap, they do not participate in the prefix assembly. Thus during the evaluation of PREFIX in Phase 1, hj must be equal to 0 for every last α-suffix that will be induced, then the prefix of the first non-induced α-suffix will start at its initial position. To this end, we set 0 as the LCP[pos(Ti[j,ni])] of every suffix Ti[j,ni] that will be induced, i.e. when Ti[j]>Ti[j+1]. Recall that all such lcp values will be also induced. For instance, Table 5 illustrates the construction of ESA1 in the first phase of eGSA, for j=2. When j=2, SA[j=2]=6 and T1[6]>T1[6+1], then the suffix T1[6,n1] will be induced and LCP1[2+1=3] receives 0. Next, j=3, SA[j=3]=4 and T1[4]