Proposed data analysis and data management infrastructure

The data analysis and data management components represent an integral cornerstone of the Open AUC Project. These elements will provide all required data analysis routines, offer Internet-capable database access for efficient data exchange, storage and management, and interface with the AU-AOS operating system to assist with data collection. Furthermore, a data interchange module provides a platform-independent XML interface between the database, the data acquisition operating system, and third-party data analysis software. The system is based on the existing open source UltraScan data analysis platform (http://www.ultrascan.uthscsa.edu) and the UltraScan Laboratory Information Management System (US LIMS, http://uslims.uthscsa.edu). An overview of the proposed structure is shown in Fig. 5.

Fig. 5 Proposed Open AUC data analysis and data management infrastructure: shown are the UltraScan data analysis and LIMS database, the supercomputer module, the AU-AOS data collection and instrument operating system containing the sample service module (SSM), the protocol service module (PSM), the machine service module (MSM), and the optical system operations (OSO), and tools for the integration of legacy data. The network data server provides an XML interface between the AU-AOS and the LIMS system, and facilitates LIMS access by third-party data analysis and interpretation software. Sednterp3 is a solution definition and calculation program. The US LIMS component stores experimental data, results, analyte information, the subscriber list, and other ancillary data. The UltraScan grid control module is responsible for managing analysis queues on the TeraGrid infrastructure and for communicating results with the US LIMS.

UltraScan data analysis

The UltraScan platform (Demeler 2008) provides an ideal open source environment for further development of the Open AUC Project.
It is based on platform-independent, portable development tools and open source programming languages (Qt, qwt, qwtplot3d, GNU C/C++, Perl, PHP, HTML, MySQL). UltraScan has been placed under Subversion (http://subversion.tigris.org) version control, providing support for multi-user development and branching. A Trac bug tracking system and wiki (http://trac.edgewall.org) are used to manage development goals and user feedback via a ticketing system (http://wiki.bcf.uthscsa.edu/ultrascan/). UltraScan has been successfully ported to Windows (Windows 2000, NT4, XP, Vista), Linux/X11 (all hardware platforms), Macintosh OS-X and Darwin/X11 (Intel and G3/G4), Sun Solaris (7–10), SGI Irix (6.4 and higher), Open-, Net-, and Free-BSD, as well as other X11-based Unix versions. UltraScan itself consists of an extensible C++ class library and a Qt-based multi-threaded graphical user interface (GUI), with independent binaries linking to the UltraScan class library. This arrangement has proven to give the best memory management and robustness: should multiple modules be running and one of them encounter a fatal error, all other open modules are unaffected and continue to function normally. The library offers classes for many popular data analysis methods, such as van Holde–Weischet (Demeler and van Holde 2004), dc/dt (Stafford 1992), second moment (van Holde 1985; Demeler 2005), c(s) (Schuck 2000), nonlinear and linear least squares optimization (Demeler and Saber 1998), and modeling with ASTFEM solutions of the Lamm equation (Cao and Demeler 2005, 2008) by two-dimensional spectrum analysis (Brookes et al. 2006, 2009), by genetic algorithm optimization (Brookes and Demeler 2006, 2007), and by Monte Carlo analysis (Demeler and Brookes 2008).

We propose a binary data standard for all experimental AUC data that replaces the inefficient ASCII storage currently in use.
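The details of the proposed binary format are left open here. As an illustration of why a fixed binary layout is more compact than ASCII storage, the following Python sketch packs and unpacks a hypothetical single-scan record; every field name and the layout itself are assumptions for illustration, not the actual standard.

```python
import struct

# Hypothetical little-endian layout for one scan:
# scan time (uint32, s), rotor speed (uint32, rpm),
# temperature (float32, deg C), wavelength (float32, nm),
# point count (uint32), then <count> (radius, reading) float32 pairs.
HEADER = struct.Struct("<IIffI")

def pack_scan(time_s, rpm, temp_c, wavelength_nm, points):
    """Serialize one scan to bytes; points is a list of (radius, value)."""
    body = struct.pack("<%df" % (2 * len(points)),
                       *(v for pair in points for v in pair))
    return HEADER.pack(time_s, rpm, temp_c, wavelength_nm, len(points)) + body

def unpack_scan(blob):
    """Inverse of pack_scan: recover the scan fields from bytes."""
    time_s, rpm, temp_c, wl, n = HEADER.unpack_from(blob, 0)
    flat = struct.unpack_from("<%df" % (2 * n), blob, HEADER.size)
    return {"time_s": time_s, "rpm": rpm, "temp_c": temp_c,
            "wavelength_nm": wl, "points": list(zip(flat[0::2], flat[1::2]))}
```

At 8 bytes per data point, such a layout typically takes well under half the space of the equivalent ASCII representation, and parsing requires no text scanning.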
An intrinsic absorbance spectrum can be derived by globally fitting wavelength scans taken at multiple concentrations. The spectra can be used as basis functions in the spectral decomposition fitting of spectra from mixtures (Demeler 2005). UltraScan further offers a complete data editing environment for absorbance, intensity, fluorescence, Rayleigh interference, and multiwavelength absorbance velocity and equilibrium data, and contains a module for importing legacy data and for converting binary formatted experimental data back to the legacy ASCII format. Intensity data can be converted to pseudo-absorbance data (in effect, A = log10(I0/I) against a reference intensity I0) and corrected for time-invariant noise contributions with the two-dimensional spectrum analysis. A visualization toolkit provides comprehensive two- and three-dimensional visualization capabilities. A reporting mechanism dynamically generates HTML reports from an experiment for all analyses performed by the user. Hydrodynamic correction routines automatically adjust hydrodynamic data to standard conditions by calculating buffer densities and viscosities from user-supplied compositions, and by estimating partial specific volumes from protein sequence data. A wide range of simulation modules assists the user in the design of experiments and the modeling of experimental results. Multi-speed velocity and equilibrium experiments can be simulated with arbitrary precision, for arbitrary models involving reversible self- and hetero-associating reactions. Kinetic rate constants can be defined to simulate the reaction kinetics. A new bead modeling program, US-SOMO (based on SOlution MOdeler, Rai et al. 2005), facilitates rigid body modeling of NMR and X-ray crystallography structures (described below).

Bead modeling

The US-SOMO approach, fully described in another article in this issue (Brookes et al.
2009), is based on building a direct correspondence between groups of atoms within each residue of a biomacromolecule and the beads used to represent them. The bead models can be used to estimate the translational diffusion coefficient, the sedimentation coefficient, the Stokes radius, the rotational correlation time, the intrinsic viscosity, and the radius of gyration for comparison with results from hydrodynamic experiments and related techniques. The bead models can also serve as scattering centers for the simulation of small-angle X-ray and neutron scattering data. US-SOMO generates a model of a macromolecule as an ensemble of rigid, non-overlapping spheres (beads) of different radii, utilizing a well-developed computational approach to calculate the hydrodynamic parameters (reviewed in García de la Torre and Bloomfield 1981; Spotorno et al. 1997; Carrasco and García de la Torre 1999). For instance, amino acids in proteins are usually represented with two beads, one for the atoms of the main chain and one for those of the side chain. The beads' initial volumes are determined by the volumes of the atoms assigned to each bead and the volume of the theoretically bound water of hydration (Kuntz and Kauzmann 1974). Their positions are determined by rules outlined in Rai et al. (2005), and several options are available to remove bead overlaps while maintaining the original surface envelope as much as possible (Rai et al. 2005; Brookes et al. 2009). To improve the accuracy of the computations and reduce the computational load, an accessible surface area scan is performed on the original structure to identify buried and exposed residues; only the beads representing the exposed residues are used in the hydrodynamic computations. The residue definitions and their associated parameters reside in user-modifiable tables, affording great flexibility in modeling. The program loads structures from protein data bank (PDB; Berman et al.
2000) formatted files, recognizing properly coded residues and prompting the user when new residues are encountered. Currently, 64 residues comprising ~300 different atom types are defined in the US-SOMO tables, including all standard amino acids, ribo- and deoxyribonucleosides and nucleotides, carbohydrates, and several co-factors. The program uses dynamic memory allocation, so the size of the structure is theoretically limited only by the memory available in the computer. The original structures and the generated bead models can be visualized with the integrated molecular visualization program RasMol (Sayle and Milner-White 1995; http://openrasmol.org/#Software). Several extensions are planned within the Open AUC Project, including a mechanism to describe flexible structures and the application of grid procedures to treat very large structures and complexes.

US LIMS features

The US LIMS provides a user-friendly web portal to the data storage and supercomputer analysis methods. All routines are programmed in XHTML, PHP and Perl, and adhere to the W3C strict coding standard (http://www.w3.org/TR/xhtml1) to assure platform and browser independence. Each user is authenticated against a MySQL database; a separate database instance belongs to each participating institution. Users are assigned a permission level that determines their role in the system (i.e., administrator, user, collaborator, or analyst) and which data in the database they can access. Any user can choose to share selected datasets with other users to facilitate collaborations. The LIMS further offers web forms for the entry of peptide and nucleic acid sequences and buffer compositions, for the upload of ancillary image data, and a virtual laboratory notebook for annotating projects and providing experimental details. An interface for the retrieval of experimental and ancillary data, as well as data analysis results, is also available.
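The access rule combining permission levels with per-dataset sharing can be sketched as follows. Only the role names are taken from the text; the function, its signature, and the data structures are hypothetical illustrations, not the LIMS implementation (which is written in PHP/Perl against MySQL).

```python
# Hypothetical sketch of the LIMS access rule: administrators see all data,
# owners always see their own datasets, and other users need an explicit share.
ROLES = {"administrator", "user", "collaborator", "analyst"}

def can_access(dataset, requester, role):
    """dataset: {'owner': str, 'shared_with': set of user names}."""
    if role not in ROLES:
        raise ValueError("unknown role: " + role)
    if role == "administrator":
        return True  # administrators may access everything
    # owners always see their own data; others require an explicit share
    return requester == dataset["owner"] or requester in dataset["shared_with"]
```

In the real system this check would of course be enforced server-side in the SQL queries, so that unshared rows are never returned to the client at all.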
The supercomputer interface is used to enter analysis parameters and to submit experimental analysis projects to the supercomputer clusters. Database tables in the US LIMS collect analysis results from the supercomputer, including the submission parameters, so that analyses can be repeated in case of a malfunction. One of several clusters can be chosen for the submission of compute-intensive jobs, and a detailed job queue informs users about the status of their jobs and permits job resubmission or cancellation. Supercomputer statistics, such as CPU time used and performance, are stored in database tables to permit accounting of computing allocations. Data analysis results are stored as compressed archives whose contents are parsed by a web application; the application automatically generates a dynamically coded HTML file that presents the results in a browser as a well-organized web page, with links to visualization images, ASCII formatted spreadsheets, and the data analysis reports generated by UltraScan.

UltraScan grid control

A separate grid control system, running on the Bioinformatics Core Facility server at the University of Texas Health Science Center at San Antonio, manages all submissions and queuing of analysis requests to remote clusters using the Globus (http://www.globus.org) and TIGRE (http://www.hipcat.net/Projects/tigre) toolkits, and communicates with all major queuing systems, such as PBS (http://www.openpbs.org) and Sun Grid Engine (http://gridengine.sunsource.net). Analysis results are committed back to the LIMS, where they can be retrieved for further processing and visualized with UltraScan. The grid control module is written in Perl and communicates with selected remote clusters to run MPI jobs (http://www.open-mpi.org/) and to coordinate PBS job queuing (Brookes and Demeler 2008).
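The production grid control module is written in Perl; purely as an illustration of the kind of PBS submission it coordinates, the following Python sketch renders a batch script for an MPI job. The `#PBS` directives are standard PBS syntax, but the job name, resource values, and command are placeholders, not actual UltraScan job parameters.

```python
def pbs_script(job_name, nodes, ppn, walltime, command):
    """Render a PBS batch script for an MPI analysis job (values hypothetical)."""
    return "\n".join([
        "#!/bin/bash",
        "#PBS -N " + job_name,
        "#PBS -l nodes=%d:ppn=%d" % (nodes, ppn),
        "#PBS -l walltime=" + walltime,
        "#PBS -j oe",         # merge stdout and stderr into one log file
        "cd $PBS_O_WORKDIR",  # run from the directory the job was submitted in
        "mpirun -np %d %s" % (nodes * ppn, command),
    ]) + "\n"
```

The grid control system would write such a script to the remote cluster, hand it to the scheduler with `qsub`, and record the returned job identifier in the LIMS tables so that status queries, resubmission, and accounting remain possible.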
Network data server

The network data server (NDS) acts as a data broker, providing platform-independent access to all experimental and LIMS data by translating all data structures into clearly defined XML structures for maximum flexibility and ease of use by third parties. The XML design permits encapsulation of experimental information, including binary encoded experimental data, results, and experimental details, thereby minimizing misinterpretation. XML's extensible nature and the design of these data structures allow new information to be added without breaking existing client programs. The NDS acts as a secured gateway between the AU-AOS and the LIMS component, and will also provide a documented, open interface through which third party data analysis software can acquire data and store results. Any developer is also free to connect to the LIMS directly using standard SQL communications, bypassing the NDS.

Data flow

The investigator initiates a project by creating an account on the LIMS portal (if none exists), entering a project description in the laboratory notebook, and submitting all ancillary data, such as buffer composition, analyte properties, gel images, and absorbance scans, to the database. Subsequently, an experimental design is added to the notebook. Next, the operator sets up the experiment by directly linking all acquired AUC data from within the data acquisition software (AU-AOS) with the investigator's database entries for each sample, and commences the data acquisition. Communication between the AU-AOS and the US LIMS database is brokered by the NDS module. At this point, each experiment has a unique identification and an associated investigator identity, which carry forward through all data processing, analysis, and presentation of results. During data collection, all experimental data are forwarded to the US LIMS for storage in an internal, compressed binary format for maximum efficiency.
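The XML structures brokered by the NDS are not reproduced here; the sketch below merely illustrates the encapsulation idea described in the network data server section, wrapping binary scan data in a well-formed XML envelope via base64 encoding. All element names are hypothetical, not the documented NDS schema.

```python
import base64
import xml.etree.ElementTree as ET

def experiment_to_xml(experiment_id, investigator, raw_scans):
    """Wrap one experiment in an XML envelope (hypothetical tag names).

    The binary scan payload is base64-encoded so the document remains
    well-formed text regardless of the byte values it carries.
    """
    root = ET.Element("aucExperiment", id=experiment_id)
    ET.SubElement(root, "investigator").text = investigator
    scans = ET.SubElement(root, "scanData", encoding="base64")
    scans.text = base64.b64encode(raw_scans).decode("ascii")
    return ET.tostring(root, encoding="unicode")
```

Because the envelope is self-describing, a client can add or ignore elements it does not know, which is what allows new information to be added without breaking existing client programs.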
A separate data conversion module assists with the import of data in legacy formats into the US LIMS, and an export module permits export of US LIMS data to legacy formats for import into existing third party analysis software. In the next step, the experimental data are edited by storing all necessary transformations (meniscus position, minimum and maximum of the data range, baseline corrections, etc.) in the LIMS database as a separate dataset for each channel. Third party software will be able to import experimental data, store analysis results, and query the LIMS via the NDS. Edited data can be further processed either by a supercomputer-based analysis or by a local UltraScan analysis, using a control file to store the experiment, cell, and channel information relevant to the analysis. Each channel produces a number of result files, which are uniquely named by the analysis software according to a predefined standard. All analyses belonging to a channel are then grouped into a compressed archive and stored in the results database tables. During off-line operation, the data can be processed on the user's computer without requiring an Internet connection; when Internet connectivity is again available, the results can be synchronized with the contents of the database. The proposed database structure is shown in the electronic supplementary material (Figure SI 1).

Expandability

The UltraScan C++ library API is documented with the open source Doxygen documentation system (http://www.doxygen.org). The library is organized into multiple modules designed to address GUI functions, and this modular design aids in the integration of new routines and expansions of the software. Translation classes support internationalization for different languages, and the Qt framework provides all programming features needed for a modern software product. The modular, object-oriented design permits transparent maintenance and extension of UltraScan modules.
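The off-line workflow described in the data flow section (analyze locally, then synchronize results with the database once Internet connectivity returns) could be realized as a simple pending-results queue. The Python sketch below is purely illustrative; the class, its methods, and the per-channel archive structure are all assumptions, not the UltraScan implementation.

```python
class ResultQueue:
    """Holds per-channel analysis archives produced offline until the
    LIMS becomes reachable again (hypothetical sketch)."""

    def __init__(self):
        self.pending = []  # list of (channel_id, compressed_archive) pairs

    def store(self, channel_id, archive):
        """Record a result archive produced while offline."""
        self.pending.append((channel_id, archive))

    def synchronize(self, upload):
        """Try to push each pending archive via upload(channel_id, archive);
        keep the ones whose upload fails. Returns True when fully synced."""
        still_pending = []
        for channel_id, archive in self.pending:
            if not upload(channel_id, archive):
                still_pending.append((channel_id, archive))
        self.pending = still_pending
        return len(still_pending) == 0
```

Failed uploads simply remain queued, so synchronization can be retried safely the next time a connection is available.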