PMC:4979052 / 6692-16509
Annotations
{"target":"https://pubannotation.org/docs/sourcedb/PMC/sourceid/4979052","sourcedb":"PMC","sourceid":"4979052","source_url":"https://www.ncbi.nlm.nih.gov/pmc/4979052","text":"2. Methods\n\n2.1. Human Expression Data\nWe have downloaded the HumanExpressionAtlas data set (E-MTAB-62 on Array Express) compiled by Lukk et al. [7] consisting of 5372 (“qc-included”) samples hybridized to Affymetrix HG-U133a microarrays. This data set comprises in total 206 different studies from 163 different laboratories. Using text mining and manual curation, each sample was assigned one of 369 biological groups representing distinct human cell and tissue types, disease states and cell lines. The resulting expression space, the combined and processed gene expression data from this diverse collection of human samples, can be queried using the dedicated database ArrayExpress Atlas [8].\nThe 5372 samples have been selected from a larger data set of 8268 samples after application of strict quality control (qc). This was based on the quality measures scaling factor, average background, percentage of present calls, RNA degradation from whole array, Normalized Unscaled Standard Errors, and Relative Log Expression computed from the array data using Bioconductor [9]. Quality thresholds were selected based on the recommendations given in [10] adjusted to the distribution of the quality measures within this data set. Most of these metrics are typically the first choice for judging whether the samples have or have not sufficient quality relative to the complete set, while however being quite unspecific for any particular technical effect.\nWe obtained a full list of the 8268 samples from the authors and downloaded the remaining 2896 (“qc-excluded”) samples from public databases. From these, 137 samples could however not be retained as they were removed from the databases, leaving in total 8268 − 137 = 8131 samples. 
The full set of 8268 unique samples represents virtually all HG-U133a data publicly available in the two major public databases GEO and ArrayExpress in 2006 with no restrictions on the type of samples. This HumanArraysSet therefore is a representative sample of available human microarray experiments.\n\n2.2. Chip Characteristics Obtained from Physico-Chemical Modeling of Surface Hybridization\nThe key process in any microarray experiment is the hybridization of nucleic acids on a solid surface. Modeling this process based on physico-chemical principles has helped to improve the understanding of microarray surface hybridization and has been applied many times to precisely describe the response function of microarray probes [11,12,13,14,15]. Particularly, our previous work on microarray data analysis provided a comprehensive treatment of the physico-chemical processes involved in the hybridization and washing of microarrays, including the effects of non-specific binding, bulk hybridization and probe and target molecule interactions to develop practical algorithms for inferring target concentrations as expression measures (see [16] for an overview).\nIn the first step of our approach we disentangled the complex nature of microarray hybridization process by addressing a series of effects in separate studies in order to understand their nature, to establish causal relations between experimental factors and microarray measurements, to judge the effect size, and finally to develop models and algorithms that allow suitable calibration of the raw data. Particularly, we studied the character of the hybridization isotherm [17,18,19,20], the relation between the levels of non-specific background, specific hybridization and saturation [21], the effect of washing [13], the effect of target depletion due to surface hybridization [22], the effect of probe sequence including special sequence motifs [23,24,25] and of RNA-quality [26]. 
We developed physical models of surface hybridization and applied them to selected data sets, which specifically and systematically vary only one of the experimental factors (e.g., amount of RNA, its quality or the extent of post-hybridization washing) leaving all the other factors unchanged. This way we established causal relations and could describe each of the effects quantitatively in terms of well-defined measures using appropriate mathematical frameworks.\nIn the next step we developed practical algorithms such as the “Hook Curve” formalism [5,27] to estimate specific target abundances based on Langmuir-like models using only the information available in a given microarray experiment. It combines hybridization data of pairs of probes that bind the same transcript with different affinities such as the perfect match/mismatch (PM/MM) probe pairs on microarrays of the GeneChip-type. The plot of the log–intensity difference versus the logged mean intensity of these probe pairs provides curves of characteristic hook-like shape, the dimensions of which enable parameterization of the Langmuir isotherm in a chip- and probe-specific fashion and finally allow to extract expression measures from the data.\nMoreover, the dimensions of the hook-curve, e.g., its starting point, its width and height, provide quality measures characterizing the hybridization of each particular array in terms of, e.g., the amount of RNA and the fractions of specific and non-specific hybridization in large scale experimental series. The hook approach was also adapted to assess RNA-quality using microarray data [26]. Here, probe pairs with different distances to the 3'-end of the transcripts were used for mutual referencing. It was demonstrated that decomposition of the probe signals into contributions due to specific and non-specific hybridization and consideration of saturation behavior might be essential for proper quality control of the RNA. 
The hook-approach also allows distinguishing between different hybridization mechanisms such as local and global depletion of targets in supernatant solution [22] and to identify different effects causing chip-to-chip intensity variance such as scanner settings or non-specific background levels (see [16] for details).\nWe applied the hook approach for quality control and the identification of batch effects and of outlier samples in large scale cancer data sets [28,29]. For inspection of a detailed quality report based on our hook parameters we refer to the supplement of [29]. In our previous work we also compared the performance of different generations of GeneChip arrays with respect to sequence effects and hybridization parameters [5,25] and of different preprocessing algorithms and with respect to the obtained expression measures [5,25] and derived biological interpretations [29].\nAs mentioned above, the hook-quality measures make use of internal standards available for arrays of the GeneChip-type such as the PM/MM pairs and sets of several probes interrogating different positions of the gene of interest. Due to the special design of these arrays we were able to monitor the hybridization characteristics of large experimental series. In this publication we focus on arrays of the Affymetrix HG-133 design (including, e.g., HG-U133A and B, HG-U133plus 2 and the respective plate arrays). For comparisons between GeneChip arrays of different generations we refer to our previous work [5,25]. It has been shown that the hybridization effects in general apply to all array types studied but their amplitude can vary considerably.\nImportantly, the study of hybridization artifacts of “non-Affymetrix” microarrays, e.g., of the Illumina or Agilent products, would require first the development of appropriate methods of quality control which estimate analogous effects as discussed here. 
In general, we expect gradual differences in effect size but no principal difference due to the common working principle of microarrays based on sequence-specific surface hybridization [30]. Also the particular preprocessing method used for calibration can affect the amplitude of technical artifacts since different methods remove parasitic effects such as the non-specific background intensity or sequence effects of the intensity with different efficiency [5,25].\n\n2.3. Positional-Dependent Sequence Model and Sequence Effect Size\nThe intensity of a microarray probe is modeled in dependency of specific/non-specific target concentration, saturation intensity and a sequence effect δA as described in [27]. We employ a positional-dependent nearest neighbor model describing the sequence effect as the sum of sensitivity terms over all 16 dinucleotide subsequences ξk,k+1 ∈ {A, C, G, T}2 and all positions k = 1...24 of the 25-meric probe sequence ξ [17] (1) δA(ξ)=∑K=124σk(ξk,k+1) The sensitivity profiles were calculated using the non-specific hybridization signal of all PM probes of the arrays (see Supplemental Figure S2 for an example).\nWe define the maximum sensitivity amplitude as the maximum difference between all pairs of NN-sensitivity profiles (2) log(Kdiff)≡δAmax all ξ−δAmin all ξ of the non-specific hybridization mode. It determines how much (in units of log intensity contributions) a probe could shine brighter than another one given that both probes target the same transcript and thus it estimates the sequence effect size.\n\n2.4. Metrics of RNA Quantity\nIn [27] we define the relative hybridization degree, or S/N ratio, (3) Rp≡SpNp as the level of specific hybridization Sp of a probe set p relative to the baseline of non-specific binding Np of this probe set. 
Averaging these R values over all probe sets of a microarray exceeding a given expression threshold provides the relative specific transcript level, or mean log S/N ratio, (4) λ=〈log(R+1)〉R\u003e0.5;chip⋅ \nWe further defined β as the mean negative decadic logarithm of the non-specific signal of all probe sets of the array (5) β≈−〈logNp〉chip⋅ \nIt was previously shown that β has a geometric interpretation as the width of the hook curve and describes the measuring range of specific hybridization [5].\n","divisions":[{"label":"Title","span":{"begin":0,"end":10}},{"label":"Section","span":{"begin":12,"end":2036}},{"label":"Title","span":{"begin":12,"end":38}},{"label":"Section","span":{"begin":2038,"end":7998}},{"label":"Title","span":{"begin":2038,"end":2128}},{"label":"Section","span":{"begin":8000,"end":9079}},{"label":"Title","span":{"begin":8000,"end":8065}},{"label":"Section","span":{"begin":9081,"end":9816}},{"label":"Title","span":{"begin":9081,"end":9109}}],"tracks":[{"project":"2_test","denotations":[{"id":"27600351-20379172-69477552","span":{"begin":146,"end":147},"obj":"20379172"},{"id":"27600351-19015125-69477553","span":{"begin":693,"end":694},"obj":"19015125"},{"id":"27600351-12538238-69477554","span":{"begin":1150,"end":1152},"obj":"12538238"},{"id":"27600351-12808153-69477555","span":{"begin":2465,"end":2467},"obj":"12808153"},{"id":"27600351-12655013-69477556","span":{"begin":2468,"end":2470},"obj":"12655013"},{"id":"27600351-23307556-69477557","span":{"begin":2875,"end":2877},"obj":"23307556"},{"id":"27600351-15834006-69477558","span":{"begin":3374,"end":3376},"obj":"15834006"},{"id":"27600351-16171364-69477559","span":{"begin":3377,"end":3379},"obj":"16171364"},{"id":"27600351-19924253-69477560","span":{"begin":3650,"end":3652},"obj":"19924253"},{"id":"27600351-23307556-69477561","span":{"begin":5931,"end":5933},"obj":"23307556"},{"id":"27600351-24833231-69477562","span":{"begin":6094,"end":6096},"obj":"24833231"}],"attributes":[{"subj":"27600351-20
379172-69477552","pred":"source","obj":"2_test"},{"subj":"27600351-19015125-69477553","pred":"source","obj":"2_test"},{"subj":"27600351-12538238-69477554","pred":"source","obj":"2_test"},{"subj":"27600351-12808153-69477555","pred":"source","obj":"2_test"},{"subj":"27600351-12655013-69477556","pred":"source","obj":"2_test"},{"subj":"27600351-23307556-69477557","pred":"source","obj":"2_test"},{"subj":"27600351-15834006-69477558","pred":"source","obj":"2_test"},{"subj":"27600351-16171364-69477559","pred":"source","obj":"2_test"},{"subj":"27600351-19924253-69477560","pred":"source","obj":"2_test"},{"subj":"27600351-23307556-69477561","pred":"source","obj":"2_test"},{"subj":"27600351-24833231-69477562","pred":"source","obj":"2_test"}]}],"config":{"attribute types":[{"pred":"source","value type":"selection","values":[{"id":"2_test","color":"#93e5ec","default":true}]}]}}