Results
The following benchmarks repeat those performed by Rognes [8]; the reader is encouraged to refer to Rognes’s original manuscript. The database sequences, accession numbers, score matrices and gap penalties are identical to those used previously; however, they are repeated in later sections for convenience.
Further, the figures and tables are intentionally similar in color, layout, and styling to allow easier comparison with the previous evaluation. However, the tests performed here differ with respect to the software selected for evaluation. BLAST [4], BLAST+ [19], and SWIPE [8] were previously evaluated; the latest available versions were selected for the current evaluation. SWPS3 [16] is not evaluated. The latest implementation of Farrar’s striped method [7] is evaluated in the context of the SSW library [9]. The new implementation of the striped method within Parasail’s library is evaluated as “Parasail”. No previous evaluation of global and semi-global alignment performance has been performed. Since opal [14] and libssa [15] support global alignments, their performance is compared against Parasail’s implementations. When comparing the performance of the various approaches and implementations, speed is reported in billion (giga) cell updates per second (GCUPS), where a cell is a value in the dynamic programming table.
Great care was taken to compare the various software libraries and applications fairly. All tests were run three times and the average GCUPS performance was recorded. The loading times for the database sequences were excluded from this study in order to focus on the algorithmic performance of the alignment routines. The following software reported the GCUPS performance directly for the alignments performed: Parasail, opal, SWIPE, and SSW. For the remaining software (BLAST, BLAST+, and libssa), a wall clock timer was patched into the code to report the alignment times only, and the wall clock time was used to calculate the GCUPS result based on the known amount of work for each test case.

Software
Table 1 lists the software packages that were evaluated, including their version numbers and the command line options used.
The SSW library provides its own test program for performing alignments, but it was intentionally not used for benchmarking due to its additional overhead. Instead, the parasail_aligner was duplicated and modified to use the routines within the SSW library. Though opal and libssa are both software libraries and not stand-alone tools, they each provide an example application in addition to the library. The example applications are evaluated here, namely ‘opal_aligner’ and ‘libssa_example’. Opal’s aligner was only available as a single-threaded application.
Table 1 Software included in performance testing. Command line variables: threads ($T), score matrix file name ($M), gap open ($O) and extension ($E) penalties (positive values), query file name ($Q), database file basename ($D), Parasail alignment function name ($A)
The parasail_aligner can be run using any one of the many alignment routines the Parasail library provides. The initial benchmarks compare other local alignment implementations against Parasail’s local alignment implementation. The latter benchmarks compare Parasail’s local alignment performance against its global and semi-global performance for the striped and scan vectorized approaches. The following Parasail functions were evaluated:
parasail_sw_striped_profile_sse41_128_sat,
parasail_sw_striped_profile_avx2_256_sat,
parasail_nw_striped_profile_sse41_128_16,
parasail_nw_striped_profile_avx2_256_16,
parasail_nw_scan_profile_sse41_128_16,
parasail_nw_scan_profile_avx2_256_16,
parasail_sg_striped_profile_sse41_128_16,
parasail_sg_striped_profile_avx2_256_16,
parasail_sg_scan_profile_sse41_128_16, and
parasail_sg_scan_profile_avx2_256_16.

Hardware
Results were taken on compute nodes of the Constance cluster, part of Pacific Northwest National Laboratory’s Institutional Computing. Each Constance node contains dual Intel Haswell E5-2670 CPUs (2.3 GHz) giving 24 cores and 64 GB of 2133 MHz DDR4 memory per node.
Intel’s Haswell CPUs support the AVX2 instruction set. Although these processors are capable of hyper-threading, which would have given each node 48 logical cores, hyper-threading is intentionally disabled as a matter of policy: Constance is a general-use cluster and hyper-threading benefits are strongly application dependent [20]. Consequently, only up to 24 cores were utilized.

Database and query sequences
The UniProt Knowledgebase Release 11.0 [21] (consisting of both Swiss-Prot release 53.0 and TrEMBL release 36.0) was chosen because it duplicates the benchmark evaluation performed by Rognes [8]. The dataset was originally chosen for being a realistic dataset less than 2 GB in size because some of the software originally tested would fail for larger file sizes. The software tested here does not have the same input size limits; however, the same dataset is used for consistency with the previous evaluation. This validates Rognes’s original intent of selecting a dataset that would remain available for download in the foreseeable future, since the new evaluation presented here occurs over eight years later.
The query sequences used here are identical to the ones used by Rognes. The 32 accession numbers are [Swiss-Prot:P56980, Swiss-Prot:O29181, Swiss-Prot:P03630, Swiss-Prot:P02232, Swiss-Prot:P01111, Swiss-Prot:P05013, Swiss-Prot:P14942, Swiss-Prot:P00762, Swiss-Prot:P53765, Swiss-Prot:Q8ZGB4, Swiss-Prot:P03989, Swiss-Prot:P07327, Swiss-Prot:P01008, Swiss-Prot:P10635, Swiss-Prot:P58229, Swiss-Prot:P25705, Swiss-Prot:P03435, Swiss-Prot:P42357, Swiss-Prot:P21177, Swiss-Prot:Q8LLD0, Swiss-Prot:O60341, Swiss-Prot:P27895, Swiss-Prot:P07756, Swiss-Prot:P04775, Swiss-Prot:P19096, Swiss-Prot:P28167, Swiss-Prot:P0C6B8, Swiss-Prot:P20930, Swiss-Prot:P08519, Swiss-Prot:Q7TMA5, Swiss-Prot:P33450 and Swiss-Prot:Q9UKN1]. Note that [Swiss-Prot:Q8LLD0] replaces [Swiss-Prot:Q38941].
The queries range in length from 24 to 5478 residues. As done previously by Rognes, some of the tests were performed only with the 375-residue P07327 query, representing a protein of roughly average length [8].

Score matrices and gap penalties
The score matrices and gap penalties selected for this evaluation duplicate those used by Rognes [8] so that a direct comparison can be made between the two evaluations. Previously, 82 different combinations of amino acid substitution score matrices and gap penalties accepted by BLAST were tested, including BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, and BLOSUM90 from the BLOSUM series [1] as well as PAM30, PAM70, and PAM250 from the PAM series [2]. These matrices represent the ones supported by earlier BLAST versions, though there are currently 84 combinations of amino acid substitution score matrices and gap penalties accepted by the ‘blastp’ website [22]. Matrices were obtained from the NCBI FTP site. Again duplicating the evaluation criteria of Rognes, some of the tests were performed only with the BLOSUM62 matrix and gap open and extension penalties of 11 and 1, respectively, which is the BLAST default [8].

Threading evaluation
Figure 1a shows the performance characteristics of all software as the number of threads increases from 1 to 24. The query sequence was fixed at the 375-residue [Swiss-Prot:P07327]. Additionally, the BLOSUM62 matrix and gap open and extension penalties of 11 and 1 were used.
Fig. 1 Performance dependency on number of threads and query length. The speed in billion cell updates per second (GCUPS) of BLAST (red), BLAST+ (orange), SWIPE (black), SSW (green), Parasail’s striped SW implementation using SSE4.1 (light blue) and AVX2 (dark blue) instruction sets, as well as libssa using SSE4.1 (purple) and AVX2 (gray), using a variable number of threads and queries of varying length.
Opal is only evaluated in c since the application was single-threaded (SSE4.1 as pink, AVX2 as brown). a Number of threads ranging from 1 to 24 and the 375-residue-long P07327 query sequence. b Query sequences ranging from 24 to 5478 residues in length and 24 threads. c Query sequences of varying length and 1 thread. All scales are logarithmic. The BLOSUM62 matrix and gap open and extension penalties of 11 and 1, respectively, were used in all cases. This figure corresponds to Figure 6 in Rognes [8]
Compared to the previous evaluation in [8], all striped implementations perform significantly better than previously reported. The striped implementation (SSW) previously had a performance range of 3.7 to 15.0 GCUPS and now runs from 4.1 to 96.6 GCUPS. This large difference is attributed to both improved sequence database processing and the use of better workload scheduling. The performance of Parasail’s SSE4.1 implementation ranges from 5.0 to 107.5 GCUPS while the AVX2 implementation ranges from 6.6 to 137.7 GCUPS. SWIPE previously ran from 9.1 to 106.2 GCUPS and now runs from 7.2 to 163.6 GCUPS, indicating a relatively small drop in performance for a single thread while significantly improving on multithreaded performance. libssa’s SSE4.1 implementation ranges from 3.4 to 79.4 GCUPS while the AVX2 implementation ranges from 5.8 to 135.1 GCUPS. BLAST reaches an early peak of 116.5 GCUPS at 13 threads while BLAST+ continues to scale upwards of 261.8 GCUPS. Overall, BLAST+ performs the best while the striped implementations are narrowly below SWIPE in performance.

Query length evaluation
Figure 1b and c show the performance characteristics of all software as the query lengths vary, while keeping the number of threads fixed at 24 threads (b) or one thread (c). Additionally, the BLOSUM62 matrix and gap open and extension penalties of 11 and 1 were used.
The query lengths ranged from 24 to 5478 amino acid residues.
Using 24 threads, and similar to the threading evaluation above, all striped implementations perform significantly better than previously reported. SSW ranged from 21.7 to 156.9 GCUPS while previously ranging from 1.2 to 46.6 GCUPS. The Parasail implementations start off somewhat slower at 9.5 and 9.1 GCUPS, and peak at 164.1 and 291.5 GCUPS, for SSE4.1 and AVX2, respectively. Generally, longer queries performed better for all software evaluated. SWIPE was an exception, having a rather flat performance curve but quickly maxing out at 184.0 GCUPS. libssa was the other exception, having peaked at 82.0 and 148.4 GCUPS for SSE4.1 and AVX2, respectively, but its performance quickly dropped off for sequences longer than 1000 amino acids. Parasail’s AVX2 implementation begins to outperform SWIPE for query sequences longer than approximately 500 amino acids. BLAST+ again was the strongest performer, topping out at 5654.9 GCUPS.
Using a single thread instead of all 24 threads, similar performance characteristics are noted. Performance peaks of 7.9, 6.6, 6.9, 12.3, 4.3, and 7.4 GCUPS are noted for SWIPE, SSW, Parasail SSE4.1, Parasail AVX2, libssa SSE4.1, and libssa AVX2, respectively. BLAST and BLAST+ perform similarly and peak near 27 GCUPS. Opal is evaluated using a single thread, peaking at 3.8 and 4.9 GCUPS for its SSE4.1 and AVX2 implementations, respectively, the slowest of any implementation evaluated.

Scoring system evaluation
Figure 2 shows the performance characteristics of all software as the scoring conditions are varied. All 82 combinations of matrices and gap penalties previously evaluated by Rognes [8] are repeated here. The 375-residue-long P07327 query sequence and 24 threads were used.
Fig. 2 Performance with different scoring systems.
The speed in billion cell updates per second (GCUPS) (logarithmic scale) is shown for BLAST (red), BLAST+ (orange), SWIPE (black), SSW (green), Parasail’s striped SW implementation using SSE4.1 (light blue) and AVX2 (dark blue) instruction sets, as well as libssa using SSE4.1 (purple) and AVX2 (gray), using different scoring systems. All combinations of scoring matrices and gap penalties accepted by BLAST were tested. The matrix name is indicated above each graph, while the open and extension gap penalties are indicated on the x-axis. The query sequence was P07327 and 24 threads were running. This figure corresponds to Figure 7 in Rognes [8]
The striped implementations again perform better than previously reported. Previously, striped ran at an almost constant 14 GCUPS, while SSW now shows a dependence on the scoring criteria with an average of 96.5±9.0 GCUPS. Both Parasail implementations show a similar dependence on the scoring criteria, averaging 109.5±10.0 GCUPS and 130.4±12.4 GCUPS for SSE4.1 and AVX2, respectively. SWIPE runs better at 166.9±2.3 GCUPS compared to the previous 104±2 GCUPS. libssa performs better than its SWIPE counterpart when using AVX2 instructions, running at 125.7±15.5 GCUPS, while running at only 85.5±6.7 GCUPS using SSE4.1 instructions. The performance of BLAST and BLAST+ was highly dependent on the scoring matrix used, with GCUPS of 146.8±137.5 and 247.1±134.9, respectively.

Evaluation of global and semi-global implementations
The threading and query length evaluations were additionally performed using global and semi-global alignment routines.
libssa’s global alignment capabilities were evaluated. Due to the lack of an available multithreaded application, opal’s global alignment evaluation was limited to a single-threaded evaluation. In addition to Parasail’s striped vector approach, the prefix scan approach is also evaluated.
Unlike the evaluation of local alignments, where the 8-bit saturation-checking routines were used, the 16-bit element versions of the global and semi-global alignment routines were used here because the 8-bit versions almost always saturated, resulting in poor performance and wasted computation. This was true for both Parasail and libssa.
Figure 3 evaluates the threading performance of the global alignment routines, while Fig. 4 evaluates their performance across scoring systems. Parasail’s prefix scan implementation outperforms the implementation of Farrar’s striped approach. Though not shown in Figs. 1 and 2, the prefix scan implementation only outperforms the striped implementation for global alignments; striped is faster for local as well as semi-global alignments. This is attributed to the higher number of corrective passes that must be made during the striped computation for global alignments [13].
Fig. 3 Performance dependency on number of threads and query length for global alignments. The speed in billion cell updates per second (GCUPS) of striped SSE4.1 vectors (red), prefix scan SSE4.1 vectors (orange), striped AVX2 vectors (black), and prefix scan AVX2 vectors (green), as well as libssa using SSE4.1 (purple) and AVX2 (gray) instruction sets, using a variable number of threads and queries of varying length. a Number of threads ranging from 1 to 24 and the 375-residue-long P07327 query sequence. b Query sequences ranging from 24 to 5478 residues in length and 24 threads. c Query sequences of varying length and 1 thread. All scales are logarithmic. The BLOSUM62 matrix and gap open and extension penalties of 11 and 1, respectively, were used in all cases
Fig. 4 Performance with different scoring systems for global alignments.
The speed in billion cell updates per second (GCUPS) (logarithmic scale) is shown for striped SSE4.1 vectors (red), prefix scan SSE4.1 vectors (orange), striped AVX2 vectors (black), and prefix scan AVX2 vectors (green), as well as libssa using SSE4.1 (purple) and AVX2 (gray), using different scoring systems. Note that the libssa results completely overlap. All combinations of scoring matrices and gap penalties accepted by BLAST were tested. The matrix name is indicated above each graph, while the open and extension gap penalties are indicated on the x-axis. The query sequence was P07327 and 24 threads were running
Unfortunately, libssa does not perform as expected for global alignments. libssa was previously reported to be nearly two times faster than opal [15], yet here opal outperforms all other implementations by a wide margin for single-threaded global alignments. Parasail also outperforms libssa for all alignments evaluated. Examining the output of these runs showed that libssa was detecting overflow of the 16-bit calculations for all query and database sequence combinations, forcing libssa to perform a 64-bit evaluation instead. This is clearly an error, since Parasail does not detect overflow during its 16-bit calculations. This unfortunate result is attributed to libssa being research code; the results presented in [15] validate the approach even though the current evaluation was unable to reproduce favorable results.
Semi-global results are similar to global results for opal. libssa does not implement semi-global alignment and as such is not evaluated. For Parasail’s striped implementation, the semi-global routines are slightly faster than the global routines but significantly slower than the local ones. Parasail’s scan implementation is slower than the striped implementation for semi-global alignments.
For Parasail’s prefix scan implementations, each column of the dynamic programming table is iterated over twice.
This leads to stable, predictable performance for the prefix scan vector approach, which is especially evident for the global alignment classes in Fig. 4. Even so, the striped approach is always faster for local and semi-global alignments.

Discussion
The Parasail library is an improvement over earlier SIMD intra-sequence implementations. The performance reported here is also better than what was previously reported for intra-sequence alignments [8].
The intent of the Parasail library is to be integrated into other software packages, not necessarily to replace the already high-performing database search tools such as BLAST [4], SWIPE [8], or libssa [15]; database search on its own is an important problem with satisfactory solutions. As a software library that focuses on individual pairwise alignments, Parasail can be more readily adapted into other software packages needing such a capability, such as those described by [9], or as part of other programming language packages such as scikit-bio [23].
The Parasail library represents the first time global, semi-global, and local alignment routines are available as a high-performance software library. The routines have been written to utilize the latest x86 CPU instruction sets while remaining compatible with any platform. The modular implementation and code generation process will easily allow Parasail to be adapted to future instruction sets with wider vector units as they become available.
Future versions of Parasail will add the capability of returning alignment tracebacks. Though Parasail already has the option of returning additional alignment statistics, the full traceback is important in some cases. As open source software, the intent is to welcome feature requests, enhancements, and fixes from the growing Parasail community.