2 SPECIFYING IDENTIFICATION PROBLEMS AS OPTIMIZATION PROBLEMS

We now outline how we define identification problems mathematically. Our first and most important concern has been that an identification problem must be defined unambiguously as an optimization problem. This ensures that a problem has a well-defined solution independently of any method for solving it, and that the task of modelling identification problems is separated from the algorithmic task of solving given problems. If we fail to specify the entire problem explicitly, the missing parts have to be supplied by the user or are implicitly defined by the algorithm, resulting in confusion and lack of reproducibility. Our second concern has been to find a reasonably simple standard form that can still represent a wide range of identification problems. We have therefore chosen the following way to structure and specify our problems.

Data. Time series data for one or several experiments. In the case of several experiments, they may differ with respect to the initial values of the variables and/or the input functions. For each data-point, the standard deviation is also given in the problem specification.

Model space. The model space determines the allowed form of the right-hand side of the ODEs. For models based on traditional chemical rate equations, each ODE in the model is assumed to be a sum of a number of reactions. The possible reactions must belong to a subset of predefined reaction types, where each allowed reaction type is specified by its name, a subset of possible input variables and ranges of allowed parameter values. The allowed reaction types can be specified individually for each state variable. As an example, in Figure 1B, the allowed reaction types are a unimolecular mass action reaction, a Michaelis–Menten reaction, and a simplified Hill equation. For reactions having multiple input variables, e.g.
a bimolecular mass action reaction with equation $k_1 X_i X_j$, it is implicitly assumed that $i$ and $j$ are not equal (to allow equality, as in problems osc1 and osc2, we define an additional reaction type $k_1 X_i^2$).

The model space of an S-system is simply defined by lower and upper bounds for each element in the parameter vectors ($\alpha$ and $\beta$) and matrices ($g$ and $h$). Sometimes, additional constraints are required. For example, three of the benchmark problems include an additional constraint of type $\{g_{i,j} \in [-3, 3],\ g_{i,j} \neq 0\}$ (there is an interaction between variables $i$ and $j$, but the direction is unknown). Finally, we also define lower and upper bounds for the initial data-point in each time series. For noisy data, these bounds were set to ±2 SDs. Hence, for noisy data there is one additional parameter for each time series, but these parameters are typically more tightly bounded than the model parameters.

Initial model. It is convenient to allow the definition of an initial model, corresponding to prior knowledge of the system. The initial model is described as known reactions (terms) on the right-hand side of the ODEs. Reactions from outside the model space can also be included in the initial model. In principle, prior information could also take the form of starting points for iterative algorithms; such information is not technically part of the defined problem. No such information is assumed known in our current problems.

Error function. We have chosen to minimize

$$-L(Y \mid k) + \lambda K \tag{2}$$

The first term is the negative log-likelihood of the experimental data, and the second term penalizes the structural complexity of the model. This kind of error function is common, and is related to several different proposed methods for handling model complexity (Crampin et al., 2004). In detail, $L$ is the log-likelihood, $Y$ denotes the experimental data, $k$ is a vector of parameters, $\lambda$ is a constant and $K$ is the number of parameters.
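As a minimal sketch of how this error function could be evaluated, the Python fragment below assumes independent Gaussian measurement errors, so that the log-likelihood reduces to a weighted sum of squared residuals (additive constants dropped, with the conventional factor of 1/2). The function names `gaussian_log_likelihood` and `penalized_error`, and the representation of each time series as a (simulated, observed, SD) triple, are illustrative assumptions and are not part of the problem definition or of the file format.

```python
from typing import Sequence, Tuple

# One time series: (simulated values, observed values, standard deviations).
# This layout is an assumption made for illustration only.
Series = Tuple[Sequence[float], Sequence[float], Sequence[float]]


def gaussian_log_likelihood(simulated: Sequence[float],
                            observed: Sequence[float],
                            sd: Sequence[float]) -> float:
    """Log-likelihood of one time series under independent Gaussian errors,
    with additive constant terms dropped."""
    return -0.5 * sum(((s - o) / e) ** 2
                      for s, o, e in zip(simulated, observed, sd))


def penalized_error(series: Sequence[Series], n_params: int, lam: float) -> float:
    """Negative total log-likelihood plus the complexity penalty lam * K.

    The total log-likelihood is the sum over all time series
    (all variables in all experiments)."""
    total_ll = sum(gaussian_log_likelihood(sim, obs, sd)
                   for sim, obs, sd in series)
    return -total_ll + lam * n_params


# Hypothetical data: one variable, two measurement points, SD 0.5 each.
example = [([1.0, 2.0], [1.0, 1.0], [0.5, 0.5])]
error = penalized_error(example, n_params=3, lam=1.0)
```

In this sketch, a larger `lam` makes each additional parameter more costly, so a model with more reactions is preferred only if it improves the fit by more than `lam` per parameter.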
By assuming independent and normally distributed measurement errors and disregarding constant terms, we can express the log-likelihood for one time series as

$$L = -\frac{1}{2} \sum_{i} \left( \frac{\hat{X}_j(t_i) - X_j(t_i)}{\sigma_j(t_i)} \right)^2 \tag{3}$$

where $i$ indexes the measurement points $t_i$, and where $\hat{X}_j$, $X_j$ and $\sigma_j$ denote simulated data, experimental data and SD for variable $j$, respectively. The total log-likelihood is defined by summing over all variables and all experiments. For models based on chemical rate equations, $K$ is simply the total number of parameters on the right-hand side of the ODEs. For S-systems, it is natural to define $K$ as the total number of non-zero elements in $g$ and $h$ plus the number of parameters in $\alpha$ and $\beta$.

To establish the relationship with standard optimization terminology, the model space and the initial model define the feasible set, and the data together with the error function define the objective function.

2.1 File format for identification problems

In order to work with identification problems and to provide them as input to identification algorithms, we also need to represent such problems in files, and it is highly desirable that an entire problem can be represented in a single file. Since an identification problem is an optimization problem and not a model, common model formats such as SBML are not applicable, and no existing format known to us can be used for this purpose. However, the output of an identification algorithm is a model and can be represented in SBML. Also, if a partially known initial model is available in SBML, an identification algorithm can read that part in SBML and the rest of the identification problem in our format.

The file format we have designed for the identification problems can be seen as a special case of a more general format that is currently being developed as a separate project. The format attempts to be self-documenting and easy to read by both humans and computers. Compared with a typical XML-equivalent, more structure is explicit and it is easier to read.
It bears some resemblance to how data structures are built in a computer program, but without the explicit use of pointers. An extract of the format is given below:

This particular extract first describes some variables and then a possible reaction in the model space. Finally, a sample from the first experiment is given.

The format can also be used to specify a parameter estimation problem only. Method-specific parameters (such as a random seed) can be defined in the same file or separately. Finally, we can easily extend the format to describe new classes of problems, for example with compartment modelling and system modifications such as gene deletions. We anticipate that the exact mathematical form of the identification problems, as well as the file format, will be extended over time. Up-to-date information and detailed documentation are therefore available on the web site, as well as a simple parser for the file format.