Implementation

Environment and technologies
R is a software environment that runs on a variety of UNIX platforms as well as Windows and Mac OS. This package uses R to analyze input files and writes the results in HyperText Markup Language (HTML) format. In the HTML files, PPI networks are displayed as force-directed graphs using the JavaScript library D3. To maintain consistency across browsers, the HTML files follow the HTML 4.01 Strict standard, the experimental HTML 5 standard, and the CSS version 3 standard. As a result, browsers such as Chrome, Firefox, Safari, Opera and IE10 display the output consistently.
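To make the data flow concrete, the following minimal sketch shows one way an edge list could be turned into the node/link structure that force-directed layouts in D3 commonly consume. It is an illustration only, not the package's own code: the function name to_d3_graph, the JSON field names, and the two example edges are assumptions made for this sketch.

```python
import json


def to_d3_graph(edges):
    """Convert (protein_a, protein_b) pairs into a node/link dictionary."""
    names = sorted({protein for edge in edges for protein in edge})
    index = {name: i for i, name in enumerate(names)}
    return {
        "nodes": [{"name": name} for name in names],
        # Each link refers to its endpoints by position in the "nodes" list.
        "links": [{"source": index[a], "target": index[b]} for a, b in edges],
    }


# Hypothetical edges, chosen only to illustrate the structure.
graph = to_d3_graph([("TP53", "TP53BP2"), ("TP53BP2", "MAGI1")])
graph_json = json.dumps(graph, indent=2)
```

The resulting JSON string could then be embedded in an HTML page and handed to a D3 layout on the browser side.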
The cisPath package is available through the Bioconductor Project [10]. For users of cloud servers, RStudio is recommended, as it provides a browser-based interface to a version of R running on a remote Linux server. Louis Aslett has provided various Amazon Machine Images (AMIs) that make deploying an RStudio Server fast and easy [11]. These AMIs are highly recommended, especially for users of free micro instances. The Bioconductor team has also developed AMIs optimized for running Bioconductor packages on the Amazon Elastic Compute Cloud (EC2) [12]. Users without any experience of cloud services can launch the AMI by following the instructions on the Bioconductor website. An introduction to using this package, written for R beginners, is also available on our website [9].

Data collection and integration
Several protein interaction databases, such as PINA, STRING, and iRefIndex [13,14], allow PPI information to be downloaded free of charge for academic purposes, but the files downloaded from different databases do not share a common format. The cisPath package provides functions that convert files downloaded from the PINA, STRING and iRefIndex databases into a standard working format. To remove redundant interactions, UniProt Knowledgebase (UniProtKB) accession numbers are used as unique protein identifiers. UniProtKB is part of the UniProt database and serves as a central hub for the collection of functional information on proteins with accurate annotation [15]. UniProtKB consists of two sections: UniProtKB/Swiss-Prot (reviewed and manually annotated) and UniProtKB/TrEMBL (unreviewed and automatically annotated). Proteins whose names cannot be mapped to UniProtKB accession numbers are discarded.
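The redundancy-removal step described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the package's implementation: the function normalize_interactions, the in-memory name-to-accession dictionary, and the unmapped name UNKNOWN1 are hypothetical.

```python
def normalize_interactions(raw_pairs, name_to_accession):
    """Map protein names to UniProtKB accessions and drop redundant pairs."""
    unique = set()
    for name_a, name_b in raw_pairs:
        acc_a = name_to_accession.get(name_a)
        acc_b = name_to_accession.get(name_b)
        if acc_a is None or acc_b is None:
            continue  # names that cannot be mapped are discarded
        # Sort the pair so that A-B and B-A collapse to the same key.
        unique.add(tuple(sorted((acc_a, acc_b))))
    return sorted(unique)


# Two names for the same protein map to one accession number.
mapping = {"TP53": "P04637", "p53": "P04637", "PTEN": "P60484"}
pairs = [("TP53", "PTEN"), ("PTEN", "p53"), ("TP53", "UNKNOWN1")]
result = normalize_interactions(pairs, mapping)
```

Here the first two input pairs collapse into a single interaction because both gene names resolve to the same accession number, and the third pair is dropped because one name is unmapped.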
The PINA database includes unified PPI data integrated from six manually curated databases: IntAct [16], MINT [17], BioGRID [18], DIP [19], HPRD [20] and MIPS MPact [21]. Like PINA, the iRefIndex database provides an index of protein interactions integrated from primary interaction databases. PPI data downloaded from the PINA and iRefIndex databases contain the PubMed IDs of the papers that support each PPI. The STRING database contains not only known PPIs but also predicted protein associations with confidence scores. The latest version of STRING at the time of writing (v9.1) covers 5,214,234 proteins from 1,133 organisms. Although the PINA and iRefIndex databases are both integrated from manually curated databases, each contains many interactions that are absent from the other. Several functions have therefore been included in this package to format PPI data downloaded from different databases, allowing users to edit the downloaded information or merge it with privately collected data to construct more comprehensive PPI networks.
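Merging interactions from several sources while keeping track of the supporting evidence might look like the sketch below. The function merge_sources, the tuple layout, and the evidence values (a placeholder PubMed ID and a placeholder STRING score) are assumptions made for illustration; they do not describe the package's file format.

```python
from collections import defaultdict


def merge_sources(*sources):
    """Merge edge lists from several databases, keeping per-edge evidence.

    Each source is a (db_name, [(acc_a, acc_b, evidence), ...]) pair.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for db_name, edges in sources:
        for acc_a, acc_b, evidence in edges:
            key = tuple(sorted((acc_a, acc_b)))  # undirected interaction
            merged[key][db_name].append(evidence)
    return {key: dict(dbs) for key, dbs in merged.items()}


# Placeholder evidence values, not real records.
pina = ("PINA", [("P04637", "P60484", "PMID:0000000")])
string_db = ("STRING", [("P60484", "P04637", 850)])
network = merge_sources(pina, string_db)
```

After merging, a single undirected edge carries the evidence from both sources, which is the information needed to list the supporting databases for each interaction.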

Functions for visualization
After obtaining PPI data, either downloaded from our website or privately generated, users can look up all the interactions recorded for a given protein using the networkView function. Given a gene name or the Swiss-Prot accession number of a protein, the proteins that can interact with it are presented as shown in Figure 1A. The given protein is represented by a larger blue node, while interacting proteins are presented as smaller green nodes (Figure 1A). Clicking a node opens a link to the UniProt database, where more information can be found. Protein interactions are also presented in a table (Figure 1B), which lists the names of the databases that support each interaction. PubMed IDs of the corresponding publications and/or STRING scores are also displayed, giving users direct links for verifying the sources. When a given protein has more than 100 interactions, the viewer displays a random selection of 100 of them (Figure 1A), but all interactions and their details are presented in the table (Figure 1B).
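The lookup and the 100-interaction display limit can be illustrated with the following sketch. It is not the networkView implementation: the helper names, the seed parameter, and the synthetic partner names are assumptions made for this example.

```python
import random


def neighbours(edges, protein):
    """Return every protein that shares an edge with the given protein."""
    partners = set()
    for a, b in edges:
        if a == protein:
            partners.add(b)
        elif b == protein:
            partners.add(a)
    return sorted(partners)


def split_for_display(partners, limit=100, seed=None):
    """Return (partners drawn in the graph, partners listed in the table)."""
    if len(partners) <= limit:
        return list(partners), list(partners)
    rng = random.Random(seed)
    shown = sorted(rng.sample(partners, limit))  # random subset for the graph
    return shown, list(partners)  # the table always keeps every partner


# Synthetic network: one hub protein with 150 hypothetical partners.
edges = [("HUB", "PARTNER%03d" % i) for i in range(150)]
shown, table = split_for_display(neighbours(edges, "HUB"), seed=1)
```

With 150 partners, the graph view receives 100 of them while the table still lists all 150.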
Figure 1  Screenshots of the networkView function outputs. (A) Visualization of the protein TP53 and interacting proteins. (B) Evidence supporting the specific interactions among these proteins.
The networkView function can also be used to visualize the PPI network among a given list of proteins, together with the evidence for the specific interactions among them. A sample output is shown in Figure 2A, where the selected proteins are TP53, TP53BP2, MAGI1, and PTEN. The first three proteins are designated as main nodes and PTEN is designated as a leaf node. Since the interaction between two proteins is often mediated by scaffolding proteins rather than being direct, the viewer also displays proteins that can interact with at least two of the main proteins, such as the green leaf nodes in Figure 2A. Additionally, users are free to choose the visualization style (color and node size) with which the proteins are displayed in the network. The mainNode parameter of this function designates each selected protein as a main or leaf node. In this example, since PTEN is manually designated as a leaf node, unlike the other three input proteins, the only interactions presented for PTEN are those with the main nodes. Views are thus generated according to user preference. A protein often has multiple gene names, some of which may not be included in the input PPI data file. To avoid supplying invalid protein names, the Swiss-Prot accession number, which is a unique identifier, may be used as input instead. Swiss-Prot accession numbers can be found in the UniProt database.
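The rule for displaying additional proteins (those that interact with at least two main nodes) can be sketched as below. The function connecting_leaves and the proteins X1 and X2 are hypothetical, and the edges are invented for illustration; the sketch does not reproduce the package's internals.

```python
from collections import defaultdict


def connecting_leaves(edges, main_nodes, min_links=2):
    """Find non-main proteins that interact with at least min_links main nodes."""
    main = set(main_nodes)
    links = defaultdict(set)
    for a, b in edges:
        if a in main and b not in main:
            links[b].add(a)
        elif b in main and a not in main:
            links[a].add(b)
    return {
        protein: sorted(partners)
        for protein, partners in links.items()
        if len(partners) >= min_links
    }


# Hypothetical edges: X1 bridges two main nodes, X2 touches only one.
edges = [("TP53", "X1"), ("X1", "MAGI1"), ("TP53", "X2"), ("TP53", "TP53BP2")]
leaves = connecting_leaves(edges, ["TP53", "TP53BP2", "MAGI1"])
```

In this example only X1 qualifies, because X2 interacts with a single main node and edges between two main nodes are ignored by the filter.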
Figure 2  Screenshots of the cisPath function outputs and network graph editor. (A) PPI network visualization of the proteins TP53, TP53BP2, MAGI1, and PTEN. (B) Shortest interaction paths between proteins TP53 and STRAP. (C) Network graph editor.
In some cases, users may want to identify interaction paths that span more than two interaction steps between a pair of given proteins in a PPI network; a second function is provided for this purpose. The cisPath function identifies and outputs the shortest PPI paths between a pair of given proteins connected through multiple interaction steps. Users can obtain the shortest path(s) either by directly requesting the path(s) of minimal cost using the default "cost" values of edges, or by first manually assigning "costs" to specific edges in the PPI network by editing the input file. The "cost" of an edge between two interacting proteins is a numerical value greater than or equal to one that quantifies the extent to which an interaction is unfavorable. The default "cost" of each edge generated from the PINA and iRefIndex databases is 1, and the "cost" of an edge generated from the STRING database is given as $\max(1, \log_{100}(1000 - \mathrm{STRING\_SCORE}))$, where STRING_SCORE is the confidence score given by the STRING database. An example of this function is shown in Figure 2B. Evidence, in the form of the STRING score or the PubMed IDs of relevant publications, is shown for all interaction paths. As with the networkView function, other proteins that can interact with at least two of the proteins that lie on the shortest PPI path are also displayed, giving a full range of possibilities even though the paths through them may be suboptimal. All of the shortest paths are listed in a table under the network view and can be shown graphically when selected (Figure 2B). To identify the paths with the fewest steps regardless of the associated "costs", the parameter byStep may be set to TRUE. In this case, all edge "costs" are set to 1 and the PPI paths with the minimum number of steps between the pair of given proteins are produced.
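The cost rule and the search for all minimum-cost paths can be illustrated with the sketch below. It uses Dijkstra's algorithm with predecessor lists as a stand-in technique; the source does not state which algorithm cisPath uses internally. The function names, the by_step argument (mirroring the byStep parameter), the fallback for scores of 1000 or more, and the proteins A to D are all assumptions for this example.

```python
import heapq
import math
from collections import defaultdict


def string_cost(score):
    """Edge cost for a STRING score: max(1, log base 100 of (1000 - score))."""
    remaining = 1000 - score
    if remaining <= 0:
        return 1.0  # assumption: logarithm undefined, fall back to minimum cost
    return max(1.0, math.log(remaining, 100))


def shortest_paths(edges, source, target, by_step=False):
    """Return (cost, paths) covering every minimum-cost path in an undirected network."""
    graph = defaultdict(dict)
    for a, b, cost in edges:
        weight = 1.0 if by_step else cost
        if weight < graph[a].get(b, math.inf):  # keep the cheapest parallel edge
            graph[a][b] = weight
            graph[b][a] = weight
    dist = {source: 0.0}
    preds = defaultdict(list)
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, math.inf):
            continue  # stale queue entry
        for nxt, weight in graph[node].items():
            new_dist = d + weight
            best = dist.get(nxt, math.inf)
            if new_dist < best - 1e-9:
                dist[nxt] = new_dist
                preds[nxt] = [node]
                heapq.heappush(heap, (new_dist, nxt))
            elif abs(new_dist - best) <= 1e-9 and node not in preds[nxt]:
                preds[nxt].append(node)  # an equally good alternative route
    if target not in dist:
        return math.inf, []

    def walk(node):
        if node == source:
            return [[source]]
        return [path + [node] for p in preds[node] for path in walk(p)]

    return dist[target], sorted(walk(target))


# Hypothetical network: the A-C edge comes from STRING with score 400.
edges = [("A", "B", 1.0), ("B", "D", 1.0),
         ("A", "C", string_cost(400)), ("C", "D", 1.0)]
by_cost = shortest_paths(edges, "A", "D")
by_steps = shortest_paths(edges, "A", "D", by_step=True)
```

Under this formula a STRING score of 900 or higher yields the minimum cost of 1, while a score of 400 yields a cost of about 1.39. In the example, the cost-based search therefore returns only the path through B, whereas the step-based search returns both two-step paths.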
Research groups that focus on a specific protein may need to screen the shortest interaction paths from that single fixed protein to all other proteins in the input database. In this case, only the source protein name should be supplied to the cisPath function. All proteins in the input database are scanned for the shortest interaction paths to the fixed protein, and all of the shortest PPI paths from the fixed protein to each of the relevant proteins are output. Upon finding a new protein of interest, users can query the shortest interaction paths to the fixed protein with a browser, without launching R. Although this mode requires more CPU time and storage space to compute and store the results, the results can easily be placed on a cloud drive or web server for quick access over the Internet. Sample results for the fixed source proteins TP53 and PTEN can be found on our website.
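The idea of precomputing paths from one fixed protein so that they can later be looked up without rerunning the analysis can be sketched as follows. For brevity the sketch assumes unit edge costs (the byStep case) and uses breadth-first search. The function all_shortest_from, the JSON lookup table, and the proteins X1 to X3 are hypothetical, and the actual file layout produced by cisPath is not described in the source.

```python
import json
from collections import defaultdict, deque


def all_shortest_from(edges, source):
    """Breadth-first search that records every fewest-step path from source."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    steps = {source: 0}
    preds = defaultdict(list)
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in sorted(graph[node]):
            if nxt not in steps:
                steps[nxt] = steps[node] + 1
                queue.append(nxt)
            if steps[nxt] == steps[node] + 1:
                preds[nxt].append(node)  # node lies on a shortest route to nxt

    def walk(node):
        if node == source:
            return [[source]]
        return [path + [node] for p in preds[node] for path in walk(p)]

    return {target: walk(target) for target in steps if target != source}


# Hypothetical network around a fixed source protein.
edges = [("TP53", "X1"), ("TP53", "X2"), ("X1", "X3"), ("X2", "X3")]
table = all_shortest_from(edges, "TP53")
lookup_json = json.dumps(table, sort_keys=True)  # could be served as a static file
```

Because every target is resolved in one pass, the resulting table can be stored once and queried per target protein afterwards, which is the trade of extra computation and storage for fast lookups described above.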
The networkView and cisPath functions described above allow users to set the color and size of the nodes in the network view before running. An additional editor allows network graphs to be modified easily after running; Figure 2C shows a screenshot of this tool. The editor is accessed via an "Edit graph" button on the output webpage, and allows users to change the output graph as well as draw new directed or undirected network graphs using different edge and arrow styles. The editor is compatible with a range of browsers. Since most commonly used browsers support HTML5 Web Storage, users can store the network graph view and reopen it later in the same browser. The editor can also convert the graph view into a string of text. Because this text can be converted back into an editable graph view, output graphs can be shared easily via email or online messenger. The editor can be used independently and is included in the source package. It is also available on our website for online access or for download and offline use.
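A reversible conversion between a graph and a shareable text string could work along the lines of the sketch below. The encoding shown here (JSON, compressed and then Base64-encoded) is an assumption chosen for illustration; the source does not specify the format the editor actually uses, and the function names and graph fields are hypothetical.

```python
import base64
import json
import zlib


def graph_to_text(graph):
    """Serialize a graph dictionary into a compact, copy-pasteable string."""
    raw = json.dumps(graph, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(zlib.compress(raw)).decode("ascii")


def text_to_graph(text):
    """Restore the graph dictionary from a string made by graph_to_text."""
    raw = zlib.decompress(base64.urlsafe_b64decode(text.encode("ascii")))
    return json.loads(raw.decode("utf-8"))


# Hypothetical graph with one directed edge and display attributes.
graph = {
    "directed": True,
    "nodes": [{"name": "TP53", "color": "#1F77B4"}, {"name": "PTEN", "color": "#2CA02C"}],
    "edges": [{"from": "TP53", "to": "PTEN", "style": "solid"}],
}
shared = graph_to_text(graph)
restored = text_to_graph(shared)
```

The round trip returns a graph equal to the original, which is the property that makes sharing by email or messenger possible.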