|
Title
|
The challenges of delivering bioinformatics training in the analysis of high-throughput data
|
|
Abstract
|
High-throughput technologies are widely used in the field of functional genomics and are being applied in an increasing number of contexts. For many ‘wet lab’ scientists, the analysis of the large amount of data generated by such technologies is a major bottleneck that can only be overcome through very specialized training in advanced data analysis methodologies and the use of dedicated bioinformatics software tools. In this article, we discuss the challenges of delivering training in the analysis of high-throughput sequencing data and how we have addressed these challenges in the hands-on training courses that we have developed at the European Bioinformatics Institute.
|
|
Body
|
INTRODUCTION
Over the last two decades, the field of functional genomics has been revolutionized by the introduction of high-throughput (HT) technologies, such as microarray and next-generation sequencing (NGS), which allow for the study of many thousands of genomic targets and their functions at the molecular level. NGS technologies [1] are now routinely used in many applications including genome sequencing/re-sequencing, small RNA discovery [2], deep SNP discovery [3], chromatin immunoprecipitation sequencing (ChIP-seq) [4], ribonomics [5], transcriptome analysis for discovery and characterization of alternative splicing [6] and expression profiling (RNA-seq) [7, 8].
These applications are generating a wealth of data that require increasingly sophisticated statistical and computational analyses to extract biologically meaningful information [9]. Bench scientists, who generate the data, often do not have the computational and statistical knowledge required to analyse them properly and have to rely on the support of a statistician or bioinformatician.
This is often problematic for a variety of reasons: bench scientists and bioinformaticians have different backgrounds, and the interaction between these two groups can be difficult. In addition, bench scientists often seek support from a statistician after the data have already been generated, instead of at the experiment-planning stage, resulting in poor experimental design and, consequently, statistically weak data analysis output.
In recent years, we have witnessed an increasing demand, from bench scientists, for training on the analysis of HT data, reflecting their desire to become more independent in the analysis of their own data. To achieve this, they require specialized hands-on training in the latest analytical methodologies, which is not provided by many institutions, especially for researchers at later stages of their career. Only through such training will they develop the crucial interdisciplinary skills that underpin modern science and are becoming fundamental in the fast-growing area of genomics and its many applications.
For the past 6 years, the authors of this article have been responsible for developing and delivering advanced courses on the analysis of microarray and high-throughput sequencing (HTS) data aimed at bench scientists. In this article, we discuss the challenges that we have faced in developing training solutions that fit the needs of a very specialized user community and the best practices that we have embraced to tackle such challenges. Although we have organized many courses on analysis of microarray data, this article will primarily focus on the challenges related to delivering training in HTS data analysis.
CHALLENGES
Diversified audience
The trainees that we typically target in our HTS data analysis courses fall into four main categories that we here summarize in the form of use cases (Table 1). These are based on the profiles of the scientists that apply to our courses.
Table 1: Main profiles of the users applying to our courses on the analysis of HTS data
It becomes clear, when reading these use cases, that our audience is very diversified. We are dealing with different backgrounds (biologists, bioinformaticians, etc.), different levels of statistical knowledge and different levels of familiarity with programming languages and scripting, as well as different learning styles.
Over the course of the last decade, biomedical research has become a multi-disciplinary environment, and scientists need to develop new skills to bridge the gap between statistics, mathematics, computer science and biology. Currently, our users struggle in such an environment, as they are highly specialized in one field and inexperienced in another: statisticians with little biological knowledge, for example, or biologists unequipped for the statistical challenge ahead. This is a much wider issue, linked to the limited quantity and quality of bioinformatics education for undergraduate or master’s students in life science curricula, and is beyond the scope of this article, but it is a reality that we have to take into consideration when developing our hands-on courses.
Although academic curricula might be changing and adapting to address the needs of modern science, we believe that the development of cross-discipline communication skills and interdisciplinary working experience will be as crucial in the future as it is now. In recent years, we have noticed that the applicants to our courses are gaining some cross-disciplinary skills, but in many cases as a result of self-teaching. Therefore, they still require appropriate training to understand essential concepts from outside their own field, allowing effective communication with collaborators from different areas of expertise and, consequently, efficient data handling.
Topic complexity and the software choice dilemma
Analysis of HTS data is a complex topic, and the analytical pipelines required for processing this kind of data include many steps [9–11]. Figure 1 shows the fundamental steps in a typical RNA-seq pipeline for assessing differential expression, going from raw sequence reads to a list of differentially expressed genes, and it lists some of the popular tools used to perform individual steps of the analysis.
Figure 1: A typical RNA-seq data analysis workflow: the major steps involved in this pipeline are indicated, alongside some of the tools used to carry out individual steps. Quality assessment is first performed on the sequence reads before mapping them to a reference genome. The reads are then quantified into counts and normalized to minimize technical variability. Then statistical models for count data are applied to infer differential expression or differential exon usage.
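To make the workflow in Figure 1 more concrete, the following is a minimal sketch, in R/Bioconductor, of two of its steps: quality assessment of the raw reads and gene-level read counting on externally produced alignments. It is only an illustration of the kind of code participants work with, not the course material itself; file names are placeholders, and the TxDb.Dmelanogaster.UCSC.dm3.ensGene annotation package is assumed here purely as an example gene model.

```r
## Minimal sketch of two steps from Figure 1 (all file names are placeholders).
library(ShortRead)      # quality assessment of raw sequence reads
library(Rsamtools)      # access to BAM files produced by an external aligner
library(GenomicRanges)  # summarizeOverlaps() (moved to GenomicAlignments in later releases)
library(GenomicFeatures)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)  # example gene model (assumed)

## 1. Quality assessment: summarize every FASTQ file in a directory and
##    write an HTML report to inspect before mapping.
qa_summary <- qa("fastq_dir", pattern = "*.fastq", type = "fastq")
report(qa_summary, dest = "qc_report")

## 2. Counting: tabulate reads overlapping each gene, starting from BAM
##    files produced outside R with an aligner such as Bowtie/TopHat.
exons_by_gene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, by = "gene")
bam_files <- BamFileList(c("untreated1.bam", "treated1.bam"))
se <- summarizeOverlaps(exons_by_gene, bam_files,
                        mode = "Union", ignore.strand = TRUE)
counts <- assay(se)  # gene-by-sample count matrix passed to the statistics step
```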
In the scenario depicted in Figure 1, users are left to decide which tools are appropriate for their analysis needs, but most of them do not have the knowledge required to make an informed decision [12]. Additionally, the software tools that comprise this analytical pipeline are developed by different research groups, and the resulting solutions are often heterogeneous, imposing different requirements on the user. It is not uncommon for these tools to use different data formats, forcing the user to perform format conversions that can introduce additional problems. Also, the majority of our trainees are familiar with MS-Windows environments, while software developers usually provide tools for Linux-based systems; this is a major source of concern for many of our users, as many of the Linux-based solutions do not have user-friendly interfaces and require basic familiarity with programming languages. Combined with the lack of publications that provide unbiased comparisons of the many tools available to researchers embarking on HTS data analysis [13–15], these factors set a steep learning curve for the majority of our users.
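As a small example of the format-conversion burden just mentioned: an aligner may emit SAM output while most downstream tools expect sorted, indexed BAM. A hedged sketch of that conversion with the Rsamtools package is shown below (file names are placeholders).

```r
## Convert an aligner's SAM output into sorted, indexed BAM, the form expected
## by many downstream tools (file names are placeholders).
library(Rsamtools)
bam <- asBam("aligned.sam", destination = "aligned")  # by default also sorts and indexes
```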
Compared with the well-established microarray-based applications, HTS is still an emerging technology, and new algorithms to handle the sheer size of HTS data and to model them statistically are still under development. Therefore, to deliver training on state-of-the-art analytical methods, we need to use open-source, stable, actively developed and well-maintained software tools.
Balance course content and practical outcome
When developing courses on the analysis of HTS data, as on any other topic, we must keep in mind what the realistic outcome of our courses is. Given the complexity of the topic presented and the short duration of our courses, participants will not have sufficient time to absorb all the information given to them. We need to focus the training on the crucial steps in the analysis of HTS data, provide the information needed to run each step of the analysis and clearly connect the theory to hands-on exercises on real data. We will not expose our trainees to all HTS applications, but we will provide them with a solid data-processing framework and stimulate their critical thinking so that they can adapt this framework to other HTS applications.
It is unrealistic to believe that, after attending one course, participants will be able to analyse their data completely on their own, but we aim at: (i) training them on how to interpret HTS data; (ii) equipping them with the fundamental knowledge required to understand what the data analysis entails; (iii) providing the means to critically evaluate the data analysis tools that are made available to them and (iv) enabling them to establish a strategic partnership with their statistician and/or bioinformatician collaborators, based on mutual understanding. In this way, trainees gain the essential ‘instruments’ required to achieve more effective communication between bench and data scientists, overcoming the obstacles created by field-specific working languages and mutual negative preconceptions.
Provide hands-on experience on concrete biological examples
A large portion of a course should be dedicated to hands-on sessions, where participants are given the opportunity to practice what they are learning. Although these sessions require a large number of teaching assistants, they offer participants the opportunity to handle real data and run analysis tasks that put into practice the theory illustrated in the lectures. This is of great importance, as trainees often fail to appreciate how what is explained in the lectures can be directly applied to the data. Given the technical nature of the teaching that we deliver during these courses and the non-technical background of our audience, we risk presenting high-level concepts that the audience fails to relate to their experiments and/or biological applications. Consequently, we must ensure that the connection with concrete biological examples is always evident.
Computing infrastructure
There are technical challenges associated with teaching HTS data analysis. The average size of an HTS dataset is on the order of tens of gigabytes, imposing higher requirements on the computational infrastructure and increasing the need for clusters or cloud computing resources to run tasks like sequence alignment and model fitting in an acceptable amount of time. This is not a typical set-up for a training venue, and such requirements need to be taken into consideration when developing training courses or when planning new training facilities.
The facility available at EMBL-EBI is equipped with 40 desktops (Intel® Core™ Quad CPU Q9550 @ 2.83 GHz, 8 GB RAM and a 500 GB HDD) running 64-bit operating systems (MS Windows 7 Professional or CentOS 5.5 Linux). This set-up allows all the required tasks to be performed in a reasonable time using small datasets, which are split among the students for the more computationally demanding tasks. Considering that sequencing technology is rapidly improving and generating increasingly larger datasets, machines of higher performance or, preferably, clusters/parallel environments should be used if we want to deliver training of a high standard. To improve scalability, cloud computing can be considered, but it should be noted that data transfer is often a limiting factor.
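As a sketch of how a large sequencing run could be divided among participants for the more demanding steps, the example below streams a FASTQ file in fixed-size chunks and writes one compressed file per chunk. The file names and chunk size are placeholders, and the actual course material may split the data differently.

```r
## Sketch: split one large FASTQ file into per-participant chunks by streaming
## it in blocks of one million reads (file names and chunk size are placeholders).
library(ShortRead)

streamer <- FastqStreamer("full_run.fastq.gz", n = 1e6)
chunk_id <- 0
repeat {
    reads <- yield(streamer)       # read the next block of up to 1e6 reads
    if (length(reads) == 0) break  # no reads left: end of file
    chunk_id <- chunk_id + 1
    writeFastq(reads, sprintf("participant_%02d.fastq.gz", chunk_id))
}
close(streamer)
```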
OUR SOLUTION
Over the course of the last 6 years, we have developed an extremely successful series of hands-on courses dedicated to the analysis of HT data. Since 2007, we have organized 25 training events on this topic as part of the ‘Hands-on EMBL-EBI User Training Program’ (http://www.ebi.ac.uk/training/); many of these events have taken place at EMBL-EBI as well as at universities around the world. Similarly, the University of Cambridge offers students enrolled in the MPhil program in Computational Biology a number of courses presenting both the theoretical and practical sides of HT data analysis. These courses are designed to introduce the MPhil students to programming, quantification technologies for different applications and the most up-to-date analysis solutions, using software tools that are publicly available, the majority of which are open-source.
Demand for such advanced training is steadily increasing (Figure 2), and, at EMBL-EBI, we are already planning 15 training events, focusing on HTS analysis, for 2013.
Figure 2: Number of applications (solid line) received since 2009 for HT data analysis courses at EMBL-EBI and number of participants in such courses (dashed line).
Here, we wish to discuss in detail one particular course dedicated to the analysis of HTS data that we organize at EMBL-EBI and how we tried to tackle the training challenges listed above when developing this course.
The ‘EMBO practical course on the analysis of high-throughput sequencing data’—overview
The ‘EMBO practical course on the analysis of high-throughput sequencing data’ is now in its third edition and is the most oversubscribed event of the entire EMBL-EBI training calendar, with an average of 250 applications per course. This course is a well-balanced mixture of lectures (41%), which aim at providing the necessary knowledge required to understand the fundamental concepts in the analysis of HTS data, and hands-on sessions (48%), which allow the students to practice how to run analysis of HTS data on real datasets. The remaining time (11%) is set aside for ‘Questions & Answers’ and poster sessions.
Participants
Forty participants are accepted onto this course. The audience typically consists of 40% PhD students, 40% postdocs, 15% senior academics and 5% master’s students or research assistants. In the last course, 62.5% of the participants had a background in biology, 25% in bioinformatics, 7.5% in biotechnology, 2.5% in mathematics and 2.5% in medicine. Careful participant selection is of fundamental importance for the success of this course. We try to select a relatively homogeneous audience, particularly with respect to the level of familiarity with the programming languages used during the course (R and Unix). A mixed audience of beginners, who can run scripts, and intermediate users, who already feel more confident at manipulating scripts, typically strikes the right balance and encourages interactions among course participants. To ensure a balanced audience, we also circulate pre-course materials consisting of targeted exercises that should bring all participants to the same basic level of confidence with running simple scripts.
Program
The course is 6 days long, and each day is dedicated to a particular aspect of the HTS data analysis pipeline. The learning objectives for each session of the course can be found in Table 2. We believe 6 days to be the optimal length for such a course, as it allows adequate coverage of the fundamental aspects of the analysis and sufficient time for the students to practice. Based on the feedback from the 2012 edition of this course, 89% of the participants felt that the duration of the course was appropriate, while the remaining 11% found it either too short or slightly too long.
Table 2: Learning objectives for lectures (L) and practicals (P) of the ‘EMBO practical course on the analysis of high-throughput sequencing data’. For organizers who wish to run shorter courses, we mark with ‘a’ the sessions that can be shortened and with ‘b’ the sessions that can be excluded. For more information on any of the Bioconductor packages listed in this table, please refer to the individual package pages available at http://bioconductor.org/packages/release/.
When developing the scientific content of this event, we paid particular attention to logically connecting all the different sessions of the course. The current format reflects the order of steps that a person analysing the data should follow when working through the pipeline. To make the connection between modules even more evident, for each HTS application we perform all analysis steps on a single dataset that is available in the public domain and is associated with a scientific publication, which provides additional information on the experimental design and the biological questions asked by the authors of the study. For the analysis of RNA-seq data, we use the pasilla dataset, derived from [41]. The authors investigated conservation of RNA regulation between Drosophila melanogaster and mammals. Part of their study used RNAi and RNA-seq to identify exons regulated by Pasilla, the D. melanogaster ortholog of mammalian NOVA1 and NOVA2. Their assessment investigated differential exon usage, but in our worked example we also focus on gene-level differences. For the analysis of ChIP-seq data, we use a dataset that consists of ChIPs against the transcription factor ERα in five breast cancer cell lines [38]. To perform the mapping practical in a reasonable amount of time, and with the computational power available, we split a complete dataset among the participants. In this way, each person is responsible for aligning 1/40th of the data, and the dataset is reassembled after the mapping to perform the downstream analysis steps. This approach allows for the use of the entire dataset, rather than just a few chromosomes, achieving a much more biologically meaningful analysis output.
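To give a flavour of the gene-level part of the RNA-seq practical, the sketch below runs a basic two-group comparison on the count table distributed with the pasilla Bioconductor package, using the original DESeq workflow. The exact file name, sample layout and packages used in the course may differ, so this should be read as an outline rather than as the course protocol.

```r
## Sketch of a gene-level differential expression test on the pasilla counts,
## using the original DESeq package (the course may use other/newer packages).
library(DESeq)
library(pasilla)

## Count table shipped with the pasilla package (file name assumed).
count_file <- system.file("extdata", "pasilla_gene_counts.tsv",
                          package = "pasilla", mustWork = TRUE)
counts <- as.matrix(read.table(count_file, header = TRUE, row.names = 1))

## Derive each library's condition from its column name (untreated*/treated*).
condition <- factor(ifelse(grepl("^untreated", colnames(counts)),
                           "untreated", "treated"))

cds <- newCountDataSet(counts, condition)  # counts plus experimental design
cds <- estimateSizeFactors(cds)            # library-size normalization
cds <- estimateDispersions(cds)            # per-gene variability estimates
res <- nbinomTest(cds, "untreated", "treated")
head(res[order(res$padj), ])               # genes ranked by adjusted P-value
```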
Faculty
The course involves a core faculty of 10 lecturers and 3 teaching assistants who support the faculty during the practical sessions. All instructors are established investigators in the area of genomics and computational biology, or experienced research scientists, deeply involved in the analysis of HTS data. This is of fundamental importance, as only hands-on experience in the analysis of HTS data can provide the knowledge required to train others. The majority of the faculty members are also authors of, or key contributors to, the software being used during the course, giving the students the opportunity to interact with the experts who are shaping the HTS data analysis field.
All instructors are excellent communicators, passionate about training and willing to collaborate with each other to ensure that there is a smooth transition between course sessions and that the content of lectures and practicals is not redundant, unless necessary.
Practical sessions
The popularity of this course rests largely on the significant amount of time dedicated to practical sessions (48% of the entire course). These sessions are often the main reason people apply to our courses, and they are regarded as the most valuable part of a training event. During these sessions, students are given step-by-step tutorials that allow them to practice running specific analysis steps, seeking the help of faculty members when struggling with the exercises. Practical sessions are also an excellent opportunity for one-on-one interactions with the course participants, but the faculty is always encouraged to engage the audience throughout the course, stimulating discussion and drawing out the issues that participants encounter as the course progresses.
In previous courses focusing on the analysis of microarray data, we introduced a practical session dedicated to the analysis of trainees’ own data, which was highly successful. This session is not part of the courses dedicated to the analysis of HTS data, mostly because of the technical challenges discussed above. To solve this issue, we are considering allocating some nodes of the EMBL-EBI cluster to run computationally intense tasks (e.g. short read alignment) during our training courses, as well as using cloud computing services to decentralize the execution of tasks. Both options would allow us to cope with the increasing size of HTS datasets and make the analysis of participants’ data feasible over the span of a few practical sessions.
Software choice
It is crucial that the software used during the course is open-source, easy to install, well maintained and well documented. This ensures that the software will be accessible to all participants after the course and reliably kept up to date. For this reason, we have chosen to use software solutions such as Bowtie [18] and TopHat [19], for the alignment of short reads, and statistical packages available through Bioconductor [16], for the downstream analysis steps. All these software products are widely used and fully supported. In particular, we concentrate on the use of Bioconductor tools for the representation, manipulation and visualization of alignments, including quantification, annotation and statistical modelling of the data. Bioconductor is a free, open-source and open-development software project for the analysis and comprehension of genomic data. It is based primarily on the R statistical programming language, and its latest release comprises 610 software packages. It is under active development by a dedicated team of researchers with a strong commitment to good documentation and software design. In addition, the Bioconductor mailing list is a great forum in which to post questions about problems with Bioconductor as well as to discuss topics of interest to the community, providing post-course support that the course faculty would not otherwise be able to provide, owing to time constraints and work commitments.
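For reference, this is roughly what installing such a toolkit looked like with the Bioconductor installation route in use at the time of writing (the biocLite script); the packages listed are illustrative, and more recent Bioconductor releases use the BiocManager package instead.

```r
## Install Bioconductor packages via the biocLite() script, the recommended
## route at the time of writing (recent releases use BiocManager instead).
source("http://bioconductor.org/biocLite.R")
biocLite(c("ShortRead", "Rsamtools", "GenomicFeatures", "DESeq", "edgeR"))
```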
Bowtie, TopHat and Bioconductor are command line-based applications, as opposed to workflow-based ones. Workflow-based solutions are more suitable for audiences that are less familiar with programming languages and command-line applications, as they provide web interfaces through which users can run computational analysis tools with minimal input. In our opinion, the risk with workflows is that users will simply press a button to obtain the results of the analysis, without understanding what is being done at each step, which parameters influence the analysis outcome and how these parameters should be modified according to the biological question being asked. Therefore, we prefer command-line approaches, in which the user is exposed to an environment where instructions for each data analysis step must be given explicitly, forcing a critical assessment of the choice of parameters and algorithms.
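To illustrate what making every choice explicit means in practice, consider a command-line alignment: the user has to state, among other things, how many mismatches are tolerated and what happens to reads that map to multiple locations. The sketch below calls Bowtie (version 1) from R purely as an illustration; index and file names are placeholders, and the flag values shown are examples rather than recommendations.

```r
## Illustrative Bowtie (version 1) invocation with the alignment choices
## spelled out explicitly; index/file names are placeholders and the
## parameter values are examples, not recommendations.
bowtie_cmd <- paste(
    "bowtie",
    "-q",           # reads are in FASTQ format
    "-v 2",         # allow at most 2 mismatches per read
    "-m 1",         # discard reads mapping to more than one location
    "-p 4",         # use 4 CPU threads
    "-S",           # write output in SAM format
    "dm3_index",    # pre-built Bowtie index of the reference genome
    "reads.fastq",  # input reads
    "aligned.sam"   # output alignments
)
system(bowtie_cmd)
```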
An alternative solution that course organizers could consider when targeting users with little or no familiarity with programming languages is software such as Galaxy [42] and RStudio (http://www.rstudio.com/). These projects have developed user-friendly interfaces that do not expose the user to command-line environments yet still provide the opportunity to explore what is happening behind the scenes, giving access to the code being used and documenting all the analytical steps, thus ensuring transparency and reproducibility.
CONCLUSIONS
The number of courses being organized around the world on the topic of HTS data analysis is increasing, and the solution that we have presented here has been the source of inspiration for many training events planned in collaboration with various institutions, including University College London (http://www.ucl.ac.uk), the National Institute of Medical Research (http://www.mrc.nimr.ac.uk) and the MRC Functional Genomics Unit (http://www.mrcfgu.ox.ac.uk/) in the UK, the University of Helsinki (http://www.helsinki.fi) in Finland, the National Institute of Biomedical Genomics (http://www.nibmg.ac.in/) and the National Centre for Biological Sciences (http://www.ncbs.res.in/) in India, the Okinawa Institute of Science and Technology (http://www.oist.jp/) and Kyoto University (http://www.kyoto-u.ac.jp/) in Japan and the State University of Campinas (http://www.unicamp.br) in Brazil. In addition, with the support of EMBO, we are planning to organize similar courses at the Beijing Genomics Institute (http://www.genomics.cn) in China and at the University of the Witwatersrand (http://www.bioinf.wits.ac.za) in South Africa. Although such courses might be shorter than the EMBO course presented here, they are designed with the same aims. In addition, they often include local experts in the analysis of HTS data in the course faculty, to establish a collaboration between them and the external faculty and to ensure that the topics presented by external trainers become part of future courses run by the hosting institution.
For the future, we should also consider providing training for different audiences. So far, the main target audience of our courses has been scientists with a background in biology, but recently we have started to develop training solutions that address the needs of scientists with different expertise. For example, we are currently organizing a course targeting bioinformaticians who want to learn how to use high-performance computing efficiently in the analysis of HTS data.
As shown in Figure 2, since 2009 the number of applicants to our HT data analysis courses has almost doubled. Over the same period of time, the number of participants that we were able to accept on such courses has remained unchanged, owing to the size of the EMBL-EBI training facility. Demand is constantly increasing, and it is unlikely that we will be able to accommodate all applications in the current format. This suggests the need for a change in the paradigm of teaching, including the decentralization of our courses, in a scenario where participants are trained to train their respective local communities. This would allow the demand to be dispersed over a network of training centers, enabling them to provide more customized services. This approach has been successfully piloted in a collaboration between EMBL-EBI and Bioplatforms Australia (http://www.bioplatforms.com.au/), encouraging us to apply it again in the near future.
We are also working together with several members of the BioQUEST Curriculum Consortium (http://www.bioquest.org) towards developing undergraduate open curricula for teaching analysis of HTS data in American universities, allowing us to train much larger groups of students.
Additionally, we need to further develop e-learning courses that will allow us to reach an even wider community. Towards this goal, we have already converted two of our HTS data analysis courses into on-line courses. These are available through the EMBL-EBI e-Learning portal, Train Online (http://www.ebi.ac.uk/training/online/). Course materials from the EMBO course can also be reached through the Bioinformatics Training Network website (http://www.biotnet.org/).
The success of the ‘EMBO practical course on the analysis of high-throughput sequencing data’, as well as of the many other courses that we organize each year on similar topics, is largely due to the dedication and expertise of the faculty involved in delivering them. Their deep knowledge of the field, combined with excellent communication skills, is key to achieving the high training standard for which we strive. The work that is done behind the scenes to prepare and test all course materials, and to ensure the smooth running of an event, requires a much longer preparation time than the actual delivery. For this reason, training should receive far more peer recognition, both for those involved in delivering it and for those benefiting from it.
Key Points
- The lack of statistical knowledge required to carry out analysis of high-throughput data often results in poorly designed experiments and statistically weak data analysis output.
- Training solutions are needed to equip researchers with the fundamental knowledge required to interpret high-throughput sequencing data, to understand how to perform analysis of such data and to critically evaluate the analysis software tools that are available to them.
- All software used in training sessions must be open-source, stable, actively developed, well maintained and documented.
- A significant proportion of any training event dedicated to high-throughput data analysis should consist of hands-on sessions where trainees can practice, on real data, what they are learning; all hands-on exercises must be well documented and easily reproducible.
- A change in the paradigm of teaching is needed to meet the high demand for training. We need to train scientists to become trainers at their respective institutions as well as develop new teaching resources such as e-learning courses.
|
|
Title
|
OUR SOLUTION
|
|
Figure caption
|
Figure 2: Number of applications (solid line) received since 2009 for HT data analysis courses at EMBL-EBI and number of participants in such courses (dashed line).
|
|
Section
|
The ‘EMBO practical course on the analysis of high-throughput sequencing data’—overview
The ‘EMBO practical course on the analysis of high-throughput sequencing data’ is now in its third edition and is the most oversubscribed event on the entire EMBL-EBI training calendar, with an average of 250 applications per course. This course is a well-balanced mixture of lectures (41%), which aim to provide the knowledge required to understand the fundamental concepts in the analysis of HTS data, and hands-on sessions (48%), which allow the students to practice running analyses of HTS data on real datasets. The remaining time (11%) is set aside for ‘Questions & Answers’ and poster sessions.
|
|
Title
|
The ‘EMBO practical course on the analysis of high-throughput sequencing data’—overview
|
|
Section
|
Participants
Forty participants are accepted onto this course. The audience typically consists of 40% PhD students, 40% postdocs, 15% senior academics and 5% master’s students or research assistants. In the last course, 62.5% of the participants had a background in biology, 25% in bioinformatics, 7.5% in biotechnology, 2.5% in mathematics and 2.5% in medicine. Careful participant selection is of fundamental importance for the success of this course. We try to select a relatively homogeneous audience, particularly with respect to the level of familiarity with the programming languages used during the course (R and Unix). A mix of beginners, who can run scripts, and intermediate users, who already feel confident manipulating scripts, typically strikes the right balance and encourages interactions among course participants. To ensure a balanced audience, we also circulate pre-course materials consisting of targeted exercises that bring all participants to the same basic level of confidence with running simple scripts.
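To give a concrete sense of this baseline, a pre-course warm-up exercise might look like the following minimal R sketch; the simulated data and object names are ours and purely illustrative, not part of the circulated materials.

## A minimal warm-up exercise of the kind we have in mind (illustrative only:
## the expression values are simulated, not taken from the course materials).
set.seed(1)

## Simulate log2 expression values for 1000 genes in two conditions
expr <- data.frame(
  gene      = paste0("gene", 1:1000),
  untreated = rnorm(1000, mean = 8, sd = 2),
  treated   = rnorm(1000, mean = 8.5, sd = 2)
)

## Basic operations participants should be comfortable with before arriving
head(expr)
summary(expr$untreated)

## Compute a per-gene difference and inspect its distribution
expr$diff <- expr$treated - expr$untreated
hist(expr$diff, main = "Treated minus untreated", xlab = "log2 difference")

## List the 10 genes with the largest absolute difference
expr[order(abs(expr$diff), decreasing = TRUE), ][1:10, ]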
|
|
Title
|
Participants
|
|
Section
|
Program
The course is 6 days long, and each day is dedicated to a particular aspect of the HTS data analysis pipeline. The learning objectives for each session of the course can be found in Table 2. We believe 6 days to be the optimal length for such a course, as it allows adequate coverage of the fundamental aspects of the analysis and sufficient time for the students to practice. Based on the feedback from the 2012 edition of this course, 89% of the participants felt that the duration of the course was appropriate, while the remaining 11% thought the course either too short or a bit too long.
When developing the scientific content of this event, we paid particular attention to logically connecting all the different sessions of the course. The current format reflects the order of steps that a person analysing the data should follow when working through the pipeline. To make the connection between modules even more evident, for each HTS application we perform all analysis steps on a single dataset that is available in the public domain and is associated with a scientific publication, which provides additional information on the experimental design and the biological questions asked by the authors of that study. For the analysis of RNA-seq data, we use the pasilla dataset, derived from [41]. The authors investigated conservation of RNA regulation between Drosophila melanogaster and mammals. Part of their study used RNAi and RNA-seq to identify exons regulated by Pasilla, the D. melanogaster ortholog of mammalian NOVA1 and NOVA2. Their assessment investigated differential exon usage, but in our worked example we also focus on gene-level differences. For the analysis of ChIP-seq data, we use a dataset that consists of ChIP experiments against the transcription factor ERα in five breast cancer cell lines [38]. To complete the mapping practical in a reasonable amount of time, and with the computational power available, we split a complete dataset among the participants. In this way, each person is responsible for aligning 1/40th of the data, and the dataset is reassembled after the mapping to perform the downstream analysis steps. This approach allows for the use of the entire dataset, rather than just a few chromosomes, achieving a much more biologically meaningful analysis output.
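To give a flavour of the gene-level part of this worked example, the sketch below shows one way in which the pasilla counts can be tested for differential expression in Bioconductor, using the pasilla data package together with DESeq2; the course materials are not tied to these exact calls, so this should be read as an illustrative minimal pipeline rather than the course code.

## Illustrative gene-level differential expression analysis on the pasilla
## dataset, using the Bioconductor pasilla data package and DESeq2.
library(pasilla)
library(DESeq2)

## Gene-level read counts shipped with the pasilla package
count_file <- system.file("extdata", "pasilla_gene_counts.tsv",
                          package = "pasilla", mustWork = TRUE)
counts <- as.matrix(read.delim(count_file, row.names = "gene_id"))

## Derive the experimental condition (RNAi knock-down vs. control)
## from the sample names
condition <- factor(ifelse(grepl("untreated", colnames(counts)),
                           "untreated", "treated"),
                    levels = c("untreated", "treated"))
col_data <- data.frame(row.names = colnames(counts), condition = condition)

## Build the DESeq2 dataset and run the standard analysis
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = col_data,
                              design    = ~ condition)
dds <- DESeq(dds)

## Genes most strongly affected by the Pasilla knock-down
res <- results(dds)
head(res[order(res$padj), ])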
|
|
Title
|
Program
|
|
Table caption
|
Table 2: Learning objectives for lectures (L) and practicals (P) of the ‘EMBO practical course on the analysis of high-throughput sequencing data’. For organizers who wish to run shorter courses, we have marked with ‘a’ the sessions that can be shortened and with ‘b’ the sessions that can be excluded. For more information on any of the Bioconductor packages listed in this table, please refer to the individual package pages available at http://bioconductor.org/packages/release/.
|
|
Section
|
Faculty
The course involves a core faculty of 10 lecturers and 3 teaching assistants who support the faculty during the practical sessions. All instructors are established investigators in the area of genomics and computational biology, or experienced research scientists, deeply involved in the analysis of HTS data. This is of fundamental importance, as only hands-on experience in the analysis of HTS data can provide the knowledge required to train others. The majority of the faculty members are also authors of, or key contributors to, the software being used during the course, giving the students the opportunity to interact with the experts who are shaping the HTS data analysis field.
All instructors are excellent communicators, passionate about training and willing to collaborate with each other to ensure that there is a smooth transition between course sessions and that the content of lectures and practicals is not needlessly redundant.
|
|
Title
|
Faculty
|
|
Section
|
Practical sessions
The popularity of this course rests on the significant amount of time dedicated to practical sessions (48% of the entire course). These sessions are often the main reason why people apply to our courses, and they are regarded as the most valuable part of a training event. During these sessions, students are given step-by-step tutorials that allow them to practice running specific analysis steps, seeking the help of faculty members when struggling with the exercises. Practical sessions are also an excellent opportunity for one-on-one interactions with the course participants, but the faculty are always encouraged to engage the audience throughout the course, stimulating discussion and drawing out the issues that the participants encounter as the course progresses.
In previous courses focusing on the analysis of microarray data, we introduced a practical session dedicated to the analysis of trainees’ own data, which was highly successful. This session is not part of the courses dedicated to the analysis of HTS data, mostly because of the technical challenges previously discussed. To solve this issue, we are considering allocating some of the EMBL-EBI cluster’s nodes to run computationally intensive tasks (e.g. short read alignment) during our training courses, as well as using cloud computing services to decentralize the execution of tasks. Both options would allow us to cope with the increasing size of HTS datasets and make the analysis of participants’ data feasible over the span of a few practical sessions.
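Within the R environment already used on the course, one way such distributed execution could look is sketched below: per-sample tasks are farmed out to several workers with BiocParallel, and a cluster or cloud back-end would simply replace the multicore back-end shown here. The BAM file names are placeholders, and the counting step stands in for any per-participant task.

## Illustrative only: distributing a per-sample task (here, counting mapped
## reads in BAM files) across several workers with BiocParallel. On a cluster
## or cloud deployment, MulticoreParam would be replaced by a back-end
## appropriate to that infrastructure.
library(BiocParallel)
library(Rsamtools)

## Hypothetical per-participant BAM files produced by the alignment step
bam_files <- c("participant01.bam", "participant02.bam", "participant03.bam")

## Count the mapped reads in a single BAM file
count_mapped <- function(bam) {
  param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery = FALSE))
  countBam(bam, param = param)$records
}

## Run the task for all samples in parallel
register(MulticoreParam(workers = 3))
mapped_counts <- bplapply(bam_files, count_mapped)
setNames(unlist(mapped_counts), bam_files)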
|
|
Title
|
Practical sessions
|
|
Section
|
Software choice
It is crucial that the software used during the course is open-source, easy to install, well maintained and well documented. This ensures that the software will remain accessible to all participants after the course and will be reliably kept up to date. For this reason, we have chosen software such as Bowtie [18] and TopHat [19] for the alignment of short reads, and statistical packages available through Bioconductor [16] for the downstream analysis steps. All these software products are widely used and fully supported. In particular, we concentrate on the use of Bioconductor tools for the representation, manipulation and visualization of alignments, including quantification, annotation and statistical modelling of the data. Bioconductor is a free, open-source and open-development software project for the analysis and comprehension of genomic data. It is based primarily on the R statistical programming language, and its latest release comprises 610 software packages. It is under active development by a dedicated team of researchers with a strong commitment to good documentation and software design. In addition, the Bioconductor mailing list is an excellent forum in which to post questions about problems with Bioconductor, as well as to discuss topics of interest to the community, providing post-course support that the course faculty would not otherwise be able to provide, owing to time constraints and work commitments.
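As an indication of what the representation, manipulation and quantification of alignments looks like in practice, the sketch below reads aligned reads from a BAM file and produces a gene-level count table with core Bioconductor infrastructure; the annotation and BAM file names are placeholders, and the exact packages taught on the course may differ.

## Illustrative only: from aligned reads to a gene-level count table using
## core Bioconductor infrastructure. File names are placeholders.
library(Rsamtools)
library(GenomicFeatures)
library(GenomicAlignments)

## Gene models built from an annotation file
txdb  <- makeTxDbFromGFF("annotation.gtf", format = "gtf")
genes <- exonsBy(txdb, by = "gene")

## Aligned reads produced by the mapping step
bam <- BamFile("sample1.bam", yieldSize = 1e6)

## Count reads overlapping each gene; every choice is stated explicitly
counts <- summarizeOverlaps(features = genes,
                            reads = bam,
                            mode = "Union",
                            singleEnd = TRUE,
                            ignore.strand = TRUE)
head(assay(counts))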
Bowtie, TopHat and Bioconductor are command line- and script-based tools, as opposed to workflow-based solutions. Workflow-based solutions are more suitable for audiences that are less familiar with programming languages and the command line, as they provide web interfaces through which users can run computational analysis tools with minimal input. In our opinion, the risk with workflows is that the user will simply press a button to obtain the results of the analysis, without understanding what is being done at each step, which parameters influence the analysis outcome and how these parameters should be modified according to the biological question being asked. We therefore prefer command line approaches, in which the user is exposed to an environment where the instructions for each data analysis step must be given explicitly, forcing a critical assessment of the choice of parameters and algorithms.
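The contrast we have in mind can be made concrete with a toy example: the same reads counted against the same gene models give different totals depending on the choices passed explicitly to the counting function. The ranges below are invented purely for illustration.

## Toy example: explicit parameter choices change the result of a counting
## step. The coordinates are invented for illustration only.
library(GenomicRanges)

genes <- GRanges("chr1", IRanges(c(100, 300), width = 150),
                 strand = c("+", "-"))
reads <- GRanges("chr1", IRanges(c(120, 240, 310, 330), width = 50),
                 strand = c("+", "+", "-", "+"))

## Three explicit sets of choices, three different answers
countOverlaps(genes, reads, ignore.strand = TRUE)   # strand-blind counting
countOverlaps(genes, reads, ignore.strand = FALSE)  # strand-aware counting
countOverlaps(genes, reads, minoverlap = 25)        # require >= 25 bp overlap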
An alternative solution that course organizers could consider when targeting users with little or no familiarity with programming languages is software such as Galaxy [42] and RStudio (http://www.rstudio.com/). These projects have developed user-friendly interfaces that do not expose the user to a command line environment yet still provide the opportunity to explore what is happening behind the scenes, giving access to the code being used and documenting all the analytical steps, thereby ensuring transparency and reproducibility.
|
|
Title
|
Software choice
|
|
Section
|
CONCLUSIONS
The number of courses being organized around the world on the topic of HTS data analysis is increasing, and the solution that we have presented here has been the source of inspiration for many training events planned in collaboration with various institutions, including University College London (http://www.ucl.ac.uk), the National Institute of Medical Research (http://www.mrc.nimr.ac.uk) and the MRC Functional Genomics Unit (http://www.mrcfgu.ox.ac.uk/) in the UK, the University of Helsinki (http://www.helsinki.fi) in Finland, the National Institute of Biomedical Genomics (http://www.nibmg.ac.in/) and the National Centre for Biological Sciences (http://www.ncbs.res.in/) in India, the Okinawa Institute of Science and Technology (http://www.oist.jp/) and Kyoto University (http://www.kyoto-u.ac.jp/) in Japan and the State University of Campinas (http://www.unicamp.br) in Brazil. In addition, with the support of EMBO, we are planning to organize similar courses at the Beijing Genomics Institute (http://www.genomics.cn) in China and at the University of the Witwatersrand (http://www.bioinf.wits.ac.za) in South Africa. Although such courses might be shorter than the EMBO course presented here, they are designed with the same aims. In addition, they often include local experts in HTS data analysis in the course faculty, both to establish collaborations between them and the external faculty and to ensure that the topics presented by external trainers become part of future courses run by the hosting institution.
Looking to the future, we should also consider providing training for different audiences. So far, the main target audience of our courses has been scientists with a background in biology, but recently we have started to develop training solutions that address the needs of scientists with different expertise. For example, we are currently organizing a course targeting bioinformaticians who want to learn how to use high-performance computing efficiently in the analysis of HTS data.
As shown in Figure 2, since 2009 the number of applicants to our HT data analysis courses has almost doubled. Over the same period of time, the number of participants that we have been able to accept on such courses has remained unchanged, owing to the size of the EMBL-EBI training facility. Demand is constantly increasing, and it is unlikely that we will be able to accommodate all applications in the current format. This suggests the need for a change in the paradigm of teaching, including the decentralization of our courses, in a scenario where participants are trained to train their respective local communities. This would allow the demand to be dispersed over a network of training centers, enabling them to provide more customized services. This approach has been successfully piloted in a collaboration between EMBL-EBI and Bioplatforms Australia (http://www.bioplatforms.com.au/), encouraging us to apply it again in the near future.
We are also working together with several members of the BioQUEST Curriculum Consortium (http://www.bioquest.org) towards developing undergraduate open curricula for teaching analysis of HTS data in American universities, allowing us to train much larger groups of students.
Additionally, we need to further develop e-learning courses that will allow us to reach an even wider community. Towards this goal, we have already converted two of our HTS data analysis courses into on-line courses. These are available through the EMBL-EBI e-Learning portal, Train Online (http://www.ebi.ac.uk/training/online/). Course materials from the EMBO course can also be reached through the Bioinformatics Training Network website (http://www.biotnet.org/).
The success of the ‘EMBO practical course on the analysis of high-throughput sequencing data’, as well as of the many other courses that we organize each year on similar topics, is largely due to the dedication and expertise of the faculty involved in delivering such courses. Their deep knowledge of the field, combined with excellent communication skills, is key to achieving the high training standard for which we strive. The work that is done behind the scenes to prepare and test all course materials, and to ensure the smooth running of an event, requires a much longer preparation time than the actual delivery. For this reason, training should receive much more peer recognition, both for those involved in delivering it and for those benefiting from it.
Key Points
- The lack of statistical knowledge required to carry out analysis of high-throughput data often results in poorly designed experiments and statistically weak data analysis output.
- Training solutions are needed to equip researchers with the fundamental knowledge required to interpret high-throughput sequencing data, to understand how to perform analysis of such data and to critically evaluate the analysis software tools that are available to them.
- All software used in training sessions must be open-source, stable, actively developed, well maintained and documented.
- A significant proportion of any training event dedicated to high-throughput data analysis should consist of hands-on sessions where trainees can practice, on real data, what they are learning; all hands-on exercises must be well documented and easily reproducible.
- A change in the paradigm of teaching is needed to meet the high demand for training. We need to train scientists to become trainers at their respective institutions, as well as develop new teaching resources such as e-learning courses.
|
|
Title
|
CONCLUSIONS
|
|
Title
|
Key Points
|