Results and Discussion We compare the performance of the proposed approach with Zerin et al. [13] for searching contiguous subsequences. For this purpose, we use a randomly generated DNA sequence database and a real DNA sequence database. We first generated a DNA sequence database by means of C++ program that randomly generates the variable length datasets. This database contains 5,000 DNA sequences with lengths from 100 to 1,000. The real DNA sequence database is downloaded from the bio-mirror portal (http://www.bio-mirror.net), which contains 19,979 DNA sequences, with average sequence length 1,024. All programs are written in Microsoft Visual C++ 6.0 and run with the Windows XP operating system on a Pentium dual core 2.13 GHz CPU with 1 GB main memory. In the first experiment, we compare the memory usage of our approach and Zerin et al. [13] by varying the sequence length. Fig. 5 shows the memory usage according to the sequence length change, where we used our generated database for construction of the spanning tree with a fixed length 7. Here, we used information gain threshold, min_in_gain = 35(%) and minimum confidence threshold, min_conf = 30(%). From this experiment, we can see that the proposed approach has efficient improvement over Zerin et al. [13]. When the sequence length becomes longer, it shows better performance in comparison with the existing algorithm. The memory consumption of our proposed approach and [13] for different values of min_in_gain over a real DNA sequence database is shown in Fig. 6. The x-axis in the graph indicates the change in min_in_gain as a percentage of the data point. A tree with a fixed length-10 was constructed using the aforementioned real datasets, and min_conf value 0.35 is taken. Fig. 6 indicates that for increasing the value of min_in_gain for both approaches, fewer candidates are generated and less memory is required. The second experiment examined mining time performance according to change in sequence length. Fig. 7A shows the mining time of the surprising contiguous patterns, starting from length-4 to length-9, in a randomly generated DNA sequence database, where we used information gain threshold min_in_gain = 35% and min_conf = 30%. On the other hand, we performed the same experiment with real DNA sequence datasets, where we used the value of min_in_gain = 45% and min_conf = 40%, which is shown in Fig. 7B. From Fig. 7, we can see that our proposed approach could mine the surprising contiguous patterns within a reasonable computation cost. The third experiment shows the effect of information gain threshold on mining time to find out the surprising contiguous patterns up to length 8. In this experiment, we take min_conf 0.3 and 0.4 for the random and real datasets, respectively. Fig. 8 indicates that increasing the information gain threshold decreases mining time for both random and real datasets. The fourth experiment studied the impacts of varying minimum confidence from 0.2 to 0.5 for random datasets and 0.25 to 0.55 for real datasets. This time, we take min_in_gain 0.3 and 0.4 for random and real datasets, respectively. Fig. 9 shows our proposed performance with different values for minimum confidence and illustrates a non-linear effect. The greatest change along the confidence axis is from 0.2 to 0.3 for random datasets and 0.25 to 0.35 real datasets. In most cases, the characters are evenly distributed. This means that the A, C, G, and T occur at almost the same ratio in the dataset as the frequency of each character, which is approximately 25%. So, if the min_conf is set to 0.3, most random patterns will be filtered out, and real patterns occurring more frequently than 25% of the time will survive and be extended. To summarize, we have developed an index-based method, where we need to scan the database only once to mine the surprising contiguous patterns in biological data sequences, which are considered very important in bioinformatics and computational biology. Our aim is not to discover the patterns that occur often but rather to find patterns that are surprising by introducing a new threshold information gain. It has been shown by the experimental results that the proposed approach is very efficient in finding interesting patterns within a reasonable computation cost. For future work, we will try to optimize the proposed approach by considering a variety of environments with different parameters and also consider promoting new measurement parameters, which is very feasible for describing the sequences in a biological substance.