Methods

Retrieval and processing of raw hybridization data
The 24 raw hybridization intensity data files (.CEL files) for Arabidopsis seed development were retrieved from The Arabidopsis Information Resource (TAIR) gene expression data repository (http://www.arabidopsis.org/servlets/TairObject?type=hyb_descr_collection&id=1006710873) [71]. Microarray gene expression data analyses were performed using Bioconductor packages [82] in the open-source statistical R environment [83]. The raw data files were imported into Bioconductor using the Simpleaffy package [84]. The hybridization and RNA sample qualities were assessed using a number of quality control metrics (data not shown), and the raw data were background corrected, normalized and transformed to the log2 values using the GCRMA package [85]. This normalization method is developed on another normalization approach robust multi-array average (RMA; [86]), and uses probe sequence information (G-C content) for estimating hybridization affinity. The number of genes expressed in seeds was filtered using a log2 value of 6.0 as the cutoff for the binary 'present' or 'absent' calls, and any gene with 'present' calls in less than three samples (corresponding to one seed development stage) was considered as "unexpressed" in these seed samples. After filtering, 12,353 genes expressed in at least one of the eight development stages in developing Arabidopsis seeds were used for subsequent high-level analyses. Custom Perl scripts were written to find the annotation of each gene in the latest CSV file ATH1-121501.na30.annot.csv (November 15 2009) released by Affymetrix for the ATH1 Genome Array and revised in some cases through sequence analysis using BLAST [87]. For example, the TF gene WRINKLED1 (AT3G54320) was incorrectly annotated in the Affymetrix file as an aintegumaenta-like protein or ovule development protein aintegumenta (Additional File 1).

Principal component analysis and association test of global gene expression with seed development
The normalized, log2-transformed gene expression data were used for principal component analysis (PCA) using the R prcomp function [83]. For this analysis, expression values of the three replicates for each seed development stage were not combined in order to assess the reproducibility of biological replication. Global testing of the transcriptome with a particular variable (e.g., seed development stage) was carried out using the Globaltest package [51]. This package tests the overall gene expression in group(s) of genes for significant association with a given variable. The test gives one P-value for the whole group instead of one P-value for each gene to avoid the issue of multiple testing corrections.

Gene expression correlation analysis and construction of coexpression networks
For the inference of gene coexpression networks in the transcriptome of developing Arabidopsis seeds, we used the 12,353 genes expressed at moderate or high levels and used the Pearson-based correlation coefficient to measure their expression coherence. We first used the median expression data of the genes in the eight samples to compute pairwise correlation coefficients in the R statistical environment, resulting in a correlation matrix of 12353 × 12353. Then we removed self-pairing and duplication, and applied a correlation cutoff of 0.90, which retained over 1.7 million gene pairs representing 11,698 distinct genes for construction of the coexpression network for the Arabidopsis seed genes. This stringent correlation threshold was chosen to eliminate potential spurious correlations in a coexpression network. Network properties were determined using custom scripts. Coexpression networks are visualized using Cytoscape [88]. For time-course clustering analysis, the gene expression values were standardized to have a mean value of zero and a standard deviation of one for each gene profile. This standardization of data ensures that genes with similar temporal profiles are close in Euclidean space during clustering, regardless of their absolute expression levels. The transformed expressions were then clustered using the fuzzy c-means (FCM) clustering algorithm in the Bioconductor Mfuzz package [89]. We determined six clusters can well separate the expression patterns inherent in the dataset, and another FCM parameter m = 1.75, which allows for investigation of the clustering robustness. FCM assigns a membership value in the range of 0-1 for each gene as an indicator of how representative a gene profile is for a specific cluster, and profiles with different membership values were differently coloured.

Computational analyses of transcription factor binding sites
The genomic sequences 1000 bp upstream plus 200 bp 5' untranslated regions (UTR) for the genes involved in storage reserve biosynthesis were retrieved from the RSAT server [73]. If the intergenic region with the upstream neighbouring gene is <1000 bp long, we only retrieved upstream sequence available in order to prevent using the 3'-end sequence of the adjacent gene in the upstream. Putative TF binding sites on both strands were identified with two software tools, TFBS [76] and fdrMotif [77]. Briefly, the 118 TF binding profiles (position-specific weight matrix, or PWM) were compiled from the literature [27,74] and the JASPAR database [78], and converted into a format suitable for each software tool (Additional File 4). In the TFBS search, an 80% similarity cutoff was adopted. In fdrMotif search, for each input sequence 10 background sequences were generated from a 4th-order Markov model and an upper boundary of false discovery rate (FDR) of 0.15 as suggested by fdrMotif was adopted to control FDR. Only putative binding sites predicted by both tools were retained for subsequent analysis. To ascertain the predictive performance, detected motifs were compared with curated motifs in AtcisDB and AGRIS databases [90,91]. Sequence logos for the predicted motifs for a TF binding profile were created with WebLogo [92].