Skip Navigation


Briefings in Functional Genomics and Proteomics Advance Access originally published online on February 10, 2006
Briefings in Functional Genomics and Proteomics 2006 4(4):331-342; doi:10.1093/bfgp/eli004
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
4/4/331    most recent
eli004v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Kauffman, K. J.
Right arrow Articles by Edwards, J. S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kauffman, K. J.
Right arrow Articles by Edwards, J. S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For permissions please email: journals.permissions@oxfordjournals.org

Designing experiments that aid in the identification of regulatory networks

Kenneth J. Kauffman, Babatunde A. Ogunnaike and Jeremy S. Edwards

Corresponding author. Jeremy S. Edwards, Department of Molecular Genetics and Microbiology, Cancer Research and Treatment Center, University of New Mexico Health Sciences Center; and Department of Chemical and Nuclear Engineering, University of New Mexico, Albuquerque, NM 87313, USA. Tel: +1 50 5272 5465; Fax: +1 50 5272 6029; E-mail: jsedwards{at}salud.unm.edu


    ABSTRACT
 TOP
 ABSTRACT
 BACKGROUND
 EXPERIMENTAL DESIGN...
 PERSISTENT EXCITATION
 SAMPLING FREQUENCY
 MULTIVARIATE INPUTS
 MODEL SYSTEM
 RESULTS AND DISCUSSION
 CONCLUSIONS
 Acknowledgements
 References
 
Predictive mathematical models of the interactions of a genetic network can provide insight into the mechanisms of gene regulation, the role of various genes within a network and how multiple genes interact leading to complex traits. However, identification of the parameters and interactions is currently a limiting step in the development of such models. This work reviews the state of the art for design of experiments in biological systems and demonstrates the need for improved design of experiments through the use of a model system. Appropriate design of experiments has a profound impact on the ability to identify a model and on the quality of resulting identified model. Key issues include the selection of appropriate input sequences (e.g. random, independent multivariate inputs) and the selection of the sampling frequencies. This work demonstrates that these issues are especially important in the identification of biochemical networks and that the traditional biochemical approach is incapable of truly identifying the behavior present in such networks.

Keywords: sampling frequency, nonlinear dynamics, experimental design, biological regulatory networks


    BACKGROUND
 TOP
 ABSTRACT
 BACKGROUND
 EXPERIMENTAL DESIGN...
 PERSISTENT EXCITATION
 SAMPLING FREQUENCY
 MULTIVARIATE INPUTS
 MODEL SYSTEM
 RESULTS AND DISCUSSION
 CONCLUSIONS
 Acknowledgements
 References
 
One of the key challenges facing biology today is model identification. Model identification involves finding a mathematical model of a biological network given a class of tentative model structures and experimental data (Figure 1) [1]. Many of the data currently available in the literature are insufficient for biologists to propose candidate models and to identify the parameters to predict cellular physiology [2]. To overcome the current limitations, experimental design must be addressed; namely, experiments must be designed to elucidate as much information as possible about the biological system at reasonable costs. Historically, biologists have applied a reductionist approach where components of cellular systems are studied one species and/or one variable at a time. The entire network was then treated as a sum of the parts. However, the advent of high-throughput technologies has led to the development of more holistic approaches. As high-throughput technologies become more commonplace and more accurate, it is essential that methods and tools be developed to take advantage of these valuable data.


Figure 1
View larger version (23K):
[in this window]
[in a new window]
 
Figure 1: Iterative nature of system identification. Initially, a genome sequence provides a list of the component parts that will be modelled. Preliminary experiments are required that characterize the interactions between these parts. Following the preliminary characterization, refined experiments designed on the preliminary interaction estimates are used to identify more reliable parameter estimates. Refining of the experiments continues until the identified model is capable of accomplishing the desired objective.

 
High-throughput technologies have been developed for the measurement of all RNA species in a cell [3–5], certain metabolites [5, 6] and the protein levels within a cell [4, 7, 8]. Of these various profiling technologies, microarrays are the most extensively used. Microarrays can be used to assay relative mRNA levels between two experimental conditions [9]. In many experimental studies, mRNA profiles provide an estimate of the steady-state changes in mRNA levels for all of the species surveyed. By examining mRNA species that change from one condition to the next, attempts have been made to identify genes that are co-regulated or inversely regulated. For example, Jenkins and coworkers explored the effect of calcium on mammalian cell cycle regulation in this manner [10]. Additionally, Tavazoie and coworkers used K-mean clustering algorithms to determine which genes appear to move in a concerted manner under a variety of conditions [11]. These clusters of genes are assumed to be co-regulated and the upstream regions were then searched for potential cis-regulatory sites whose existence could then be verified experimentally [11].

These experimental approaches have been beneficial in generating hypotheses that have led to increased understanding of genetic regulation. However, previous experience in systems engineering reveals several potential pitfalls to avoid in applying such approaches to generating dynamic models of regulation. First, in order to understand the dynamic features of interest, the underlying system must be persistently excited to excite all the dynamics. Inappropriate excitation will not provide insight into desired dynamics. Second, sampling times for inferring co-regulation must be carefully chosen. Sampling at too high or too low frequency will mask dynamic features. Finally, single variable perturbations will not provide information on multivariate interactions, and hence multivariate perturbations should be utilized if the effects of such interactions are to be captured [12].

These points are well accepted in the systems engineering community. As biological systems are far more complex than the systems typically explored in the systems engineering community, these experimental design issues are even more critical for biological systems. However, as indicated in a recent review on yeast genomic expression studies, these issues are frequently overlooked in the design of biological experiments [13]. This is further evidenced by the complete lack of or cursory discussion of experimental design in many recent books on experimental procedures in transcriptomics [14–19], metabolomics and proteomics [17, 20]. Some books even advocate experimental designs that cannot accurately identify the types of networks the experiments seek to explore [21] or refer to quality control issues as design of experiments while overlooking essential factors [22, 23]. Therefore, there is a pressing need to explore experimental design issues with respect to identifying biological networks. Each of the experimental design issues raised above will be discussed in greater detail below and explored through a model system.


    EXPERIMENTAL DESIGN CONSIDERATIONS
 TOP
 ABSTRACT
 BACKGROUND
 EXPERIMENTAL DESIGN...
 PERSISTENT EXCITATION
 SAMPLING FREQUENCY
 MULTIVARIATE INPUTS
 MODEL SYSTEM
 RESULTS AND DISCUSSION
 CONCLUSIONS
 Acknowledgements
 References
 
Draghici provides a detailed set of guidelines for experimental design, summarized in Table 1 [23]. Each of the steps outlined is essential—however, in step 7 (design your experiment), several other factors need to be taken into account. The challenges and issues that need to be addressed arise because biological networks are transient in nature. To truly understand a biological regulatory network, some understanding of the transient interactions is necessary. These types of experiments cannot be obtained from steady-state data, but instead require time courses. Thus, the introductory design of experiments typically presented in most textbooks [24–26] is insufficient and a more in-depth analysis is required [27, 28].


View this table:
[in this window]
[in a new window]
 
Table 1: Draghici's steps for experimental design [23]

 

    PERSISTENT EXCITATION
 TOP
 ABSTRACT
 BACKGROUND
 EXPERIMENTAL DESIGN...
 PERSISTENT EXCITATION
 SAMPLING FREQUENCY
 MULTIVARIATE INPUTS
 MODEL SYSTEM
 RESULTS AND DISCUSSION
 CONCLUSIONS
 Acknowledgements
 References
 
Persistent excitation refers to the frequent stimulation of a system to prevent near steady-state behaviors from dominating the observations [12]. Often in biological experiments, single step changes are made in one variable and the system response is followed as the system moves from an initial steady-state to the final steady-state. Such changes lead to results that are easily interpreted from a biochemical standpoint and are easily implemented experimentally. However, step changes do not persistently excite the network. Therefore, if the data is evenly sampled over the entire dynamic window, most of the data will be biased towards the approach to steady-state. As a result, many of the dynamic features cannot be practically identified, even with extensive prior knowledge [2].

Consider a first order, linear differential equation model of the mRNA levels within a cell controlled by an inducer. The response is completely characterized by its gain and time constant. The gain determines the final endpoint that results from the step change in inducer concentration. The time constant, {tau}, is defined by how long it will take for a system to reach 63% of the steady-state. Consider the case where eight data points spaced by {tau} min are taken starting at t = 0, when the step change is implemented (Figure 2A). The largest movement occurs between the first three measurements while the remaining data essentially describe the new steady-state. For more complex systems, much of the interesting behavior is lost by only looking at the approach to a single steady-state as most of the data emphasize the gain over the time-constant.


Figure 2
View larger version (6K):
[in this window]
[in a new window]
 
Figure 2: Key dynamic features. (A) Changes in mRNA levels for a first order response to a step change in inducer concentration. (B) Inverse response of a biological system: the human red blood cell model. The ADP levels initially increase following an ATP load. However, eventually the levels decrease. This behavior is due to two opposing mechanisms occurring at different time scales and with different gains. It would be missed without adequate sampling.

 
Persistent excitation overcomes this issue by making repeated changes to the system inputs at close intervals (<5{tau}). The system, therefore, never completely approaches a new steady-state but instead is always in transit. This ensures that, in addition to the steady-state gain, a better description of the dynamics can be obtained and that complex effects (resulting from interactions and feedback common in biological systems) can be better described.


    SAMPLING FREQUENCY
 TOP
 ABSTRACT
 BACKGROUND
 EXPERIMENTAL DESIGN...
 PERSISTENT EXCITATION
 SAMPLING FREQUENCY
 MULTIVARIATE INPUTS
 MODEL SYSTEM
 RESULTS AND DISCUSSION
 CONCLUSIONS
 Acknowledgements
 References
 
Even if persistent excitation is used, data sampling frequencies must be appropriate to capture the inherent system dynamics of interest. Dynamic regulatory effects can be missed by considering only steady-state data or data collected at a very low frequency [29]. For example, a commonly observed biological phenomenon is inverse response. Inverse response occurs when the initial direction of the system's response to a step change in the input is opposite to the direction of the final steady-state [12]. A biological example of this can be found in glycolysis when an adenosine triphosphate (ATP) load is introduced to the cell [30]. Adenosine diphosphate (ADP) levels initially increase rapidly in response to the load (Figure 2B). Eventually, the ATP load leads to decreased adenosine phosphates as the cell compensates. These two competing effects occurring on distinctly different time scales give the inverse response effect. The sampling frequencies that only capture the approach to steady-state would miss this crucial dynamic feature.

As the sampling frequency increases, the material costs of the experiments increase. Further, above a certain sampling frequency additional information will not help (and may actually hurt). This effect arises due to noise in the measurements. The upper limit on the sampling frequency can be estimated from prior knowledge of the inherent dynamics of the biological system and/or the end use of the desired model.


    MULTIVARIATE INPUTS
 TOP
 ABSTRACT
 BACKGROUND
 EXPERIMENTAL DESIGN...
 PERSISTENT EXCITATION
 SAMPLING FREQUENCY
 MULTIVARIATE INPUTS
 MODEL SYSTEM
 RESULTS AND DISCUSSION
 CONCLUSIONS
 Acknowledgements
 References
 
Despite persistent excitation and appropriate sampling frequency, the combined effects of multiple, interacting inputs can be missed if only single inputs are perturbed one at a time. In complex systems, multivariate interaction effects are impossible to predict from single variable perturbations [12, 31]. Consider the case of:


Formula 1

Changing u1 alone allows for estimation of the lumped parameter {alpha}1 + {alpha}12u2. Similarly, changing u2 alone allows for estimation of the lumped parameter {alpha}2 + {alpha}12u1. However, an estimate of {alpha}12 cannot be obtained unless both u1 and u2 are varied simultaneously. These types of interactions are common in biological systems and can only be fully characterized by perturbing multiple inputs simultaneously.

Ljung raised two pivotal questions a modeller must address when attempting to apply identification methods; ‘How should I design the identification experiment?’ and ‘Within what set of models should I look for a suitable description of the system?’ [32]. The answers to these questions depend in part on what the model is to be used for and the prior knowledge available about the system [1]. These represent constraints on the modelling exercise that are necessary to ensure that an appropriate solution can be found. The answer to the question of whether or not the resulting model is good enough will depend highly on the end use of the model. Questions such as how much error can be tolerated, which states need to be most accurately estimated, and what types of bias can be accepted are all functions of the model's use and need to be carefully explored prior to, during, and after, model identification. These issues will limit the applicability of the resulting model and will determine how many data are required to identify the model parameters [33].

Predictive mathematical models of the interactions and regulations of genetic networks can provide insight into mechanisms of gene regulation, the role of various genes within a network, and how several genes interact to produce complex traits. For these broad classes of problems, having a mathematical model can be useful for interpreting existing experimental data, designing new experiments, and proposing new hypotheses. To realize the full potential of such models, an understanding of the underlying experimental challenges of system identification is essential. In particular, the potential pitfalls of inappropriate excitation, sampling and multivariate interactions necessitate guidance regarding how and what types of data should be collected to identify biological models. This manuscript explores these three experimental design issues in greater detail. To provide perspective, both of Ljung's questions are explored in the context of a benchmark model system discussed below, with the greatest detail provided on the design of appropriate experiments. Issues such as input sequence design, frequency of perturbation and measurement, and multivariate interactions are explored.


    MODEL SYSTEM
 TOP
 ABSTRACT
 BACKGROUND
 EXPERIMENTAL DESIGN...
 PERSISTENT EXCITATION
 SAMPLING FREQUENCY
 MULTIVARIATE INPUTS
 MODEL SYSTEM
 RESULTS AND DISCUSSION
 CONCLUSIONS
 Acknowledgements
 References
 
Figure 3 schematically shows a hypothetical three-state network with feedback. By using a system with a known mathematical description, a more complete analysis can be done than could be carried out using an experimental system. The benchmark network models the response of three arbitrary transcription factors to changes in three inducers. Each of the proteins influences the rate of synthesis of at least one other protein. The full nonlinear equations describing this network are:


Formula 2

All inputs were nominally set to 0.5. The model was linearized around the nominal stable steady-state [0.9905 0.9894 0.3490]T of using a first order Taylor series expansion. The resulting linear system is given as:


Formula 3

This linear approximation of the full nonlinear model demonstrates good agreement to the behavior of the full model over a reasonable window of time (Figure 4). A series of gradually changing random inputs (environmental changes in the inducer levels) were introduced over a period of 50 min. The linear model shows strong agreement with the nonlinear model (Figure 4B) as demonstrated by the small (<1) percentage error between the full nonlinear model and the linear model that resulted from the input sequence (Figure 4C). For the purposes of this work, Equation (3) represents the desired, theoretical values of the linear model.


Figure 3
View larger version (10K):
[in this window]
[in a new window]
 
Figure 3: Model biochemical network used to demonstrate key issues in experimental design. The three inputs correspond to inducers that can be added to the media to affect protein expression. The three outputs are the levels of the proteins of interest, each of which regulates the synthesis of at least one other protein in the network by influencing how much of that protein is produced in the presence of a given level of inducer. The model also includes some ‘leaking’ promoter effects where changes in the inducer levels u1 and u2 influence the production of y3.

 

Figure 4
View larger version (28K):
[in this window]
[in a new window]
 
Figure 4: Comparison of the full nonlinear model to the linear model identified using a Taylor series expansion. (A) Input profiles for each of the three inputs used to excite the dynamics for the true nonlinear model and the candidate linear model (*u1, {bigtriangledown}u2, {Delta}u3). Comparison of linear model predictions (symbols) versus nonlinear model values (solid lines) (B) {bigcirc}y1,lin, +y2,lin, (C) *y3,lin.

 

    RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 BACKGROUND
 EXPERIMENTAL DESIGN...
 PERSISTENT EXCITATION
 SAMPLING FREQUENCY
 MULTIVARIATE INPUTS
 MODEL SYSTEM
 RESULTS AND DISCUSSION
 CONCLUSIONS
 Acknowledgements
 References
 
In this work, the emphasis was on the identification of the parameters of the linear model. It should be noted that the emphasis could have been placed on the development of candidate model structures [1]. To identify model structure, physical principles, empirical relations and other a priori information play a key role. Typically, these model structures require nonlinear identification routines, which exacerbate the identification issues raised below.

In each case explored, identification was carried out as described in the methods below. The default state-space structure for the MATLABTM System Identification Toolbox assumes the form:


Formula 4

The full state-space model described in Equation (4) could not be used because nearly all of the datasets were insufficient to identify the full parameter space (only the random multivariate inputs were able to identify the full structure). However, as is often the case, knowledge about the specific system can be used to specify additional structure. In general, any known information about the model structure should be included prior to parameter estimation. This reduces the number of parameters to be estimated and, therefore, the amount of data required for the identification. For a given data set, reducing the number of parameters to be identified allows for better estimation of the remaining parameters. In this particular case, the matrices C and D were specified as identity and zero matrices, respectively. This corresponds to the assumptions that measured variables directly correspond to the states of the cell and that the inputs do not instantaneously affect the outputs.

Two issues arise that necessitate validation studies. For completeness, these issues are briefly discussed. First, the parameters identified are those that best fit the data set using a specified model structure. To test the assumptions and prior knowledge that were used to achieve the desired model structure, validation against additional datasets is essential. Second, conclusions drawn from analysis of a dataset, whether qualitative (such as interactions) or quantitative (such as parameter estimates) are only as good as the dataset that was used to arrive at the knowledge. Improper experimental design can lead to invalid conclusions that can only be found by careful validation of the model against additional well designed experiments.

Random, independent inputs allow better identification
A common practice in experimental biology has been to vary one input to the biological system at a time and see how the system responds; e.g. heat shock [34], acetate stress [35], galactose addition [36], or diauxic shift [37]. Often, these changes are introduced as a step change in environmental conditions. Figure 5A (panels A and B) demonstrates the effectiveness of step response data in generating linear models.


Figure 5
View larger version (13K):
[in this window]
[in a new window]
 
Figure 5: Model performance for the identified models. All datasets contained 16% Gaussian noise. (A) Random perturbations to each of the three inputs spaced 15 min apart with sampling every 7.5 min identified a model (+) performing more like a Taylor series expansion of the nonlinear model (best case, —) than various combinations of step changes (single step change experiments {bigcirc}, single steps plus simultaneous step change in all three variables *, single steps plus offset step changes in each of the three variables x). (B) Changing sampling frequency only slightly impacted the quality of the model prediction from step response data (single step change experiments +, twice the sampling frequency {bigcirc}, three times the sampling frequency *). (C) An optimal sampling frequency exists for models identified from random input data (random perturbation experiment +, 1.5 times the sampling frequency {bigcirc}, 2.5 times the sampling frequency *).

 
Three linear models were identified from step response data (Figure 5A). The first data set consisted of three experiments. In each experiment, only one of the three system inputs was changed. Measurements were taken every fifteen minutes starting at t = 0 and continuing until t = 60 min with a step change introduced at t = 15 min. The second data set added to this an additional experiment in which all three inputs underwent a simultaneous step change at t = 15 min. Sampling was done once every 15 min for 60 min. This explored the effect of simultaneously varying all three inputs. The third data set explored the effect of sequentially varying each of the inputs. This data set consisted of the same three experiments in the first data set plus a fourth experiment in which a single step change was sequentially introduced to each of the three system inputs, spaced 15 min apart. Sampling was again done once every 15 min for 60 min.

The model performance for each of the models identified by each of the three step change datasets described above was assessed. The results demonstrate that step changes were ineffective at capturing the inherent dynamics of the underlying system (Figure 5A). For each of the three step change models identified, substantial errors were observed compared with the linear Taylor series model. Whereas the Taylor series model demonstrated less than 1% error for the validation sequence (Figure 4A), the step response models demonstrated as much as 5% error.

The model performance based on data from a series of random inputs to each of the three inputs was compared with the model performance derived from the step changes (Figure 5A). The random inputs were introduced once every 15 min and data were sampled once every 7.5 min. To make the comparison fair, the random input profile data were collected over 105 min. This corresponded to a total of 15 data points for each concentration species, the same as in the first step inputs experimental case. The performance of the linear Taylor series model was more closely approximated by the random inputs model than by the step change model. In particular, the random inputs model demonstrated less error and a tighter distribution around the true predicted values than the models identified from step change data. For the random input model, less than 2% error was observed between the identified linear model and the full nonlinear model.

In this study, it was assumed that the model structure was known (i.e. a linear state-space model) but that reliable estimates of the model parameters were not available. In such a case, where the degree of interaction is not fully characterized and the inherent dynamic properties of the system are not available, random input sequences such as those described above will perturb the system in most directions (i.e. relative linear combinations of inputs), thus increasing the probability of exciting the system along (or near) its eigenvectors. These random sequences will thus provide estimates of the key parameters within the network. Further, performing a series of step experiments to identify the model parameters will not provide reliable estimates [38, 39]. Therefore, a random input profile will provide the greatest insight.

Sampling frequency impacts model quality for random inputs
It could be argued that the observed differences between the step experiments and the random input profiles may be due to the different sampling frequencies between the two cases (15 min for the step changes and 7.5 min for the random input changes). To address this concern, the effect of increasing the sampling frequency for the step case on the quality of the predictions was determined (Figure 5B). Using the sampling frequency twice (once every 7.5 min) did not significantly improve the quality of the resulting identified model. Increasing to once every 5 min gave similar results. This indicated that the improved identification was due to dynamic excitation and not to measurement frequency.

However, with the random inputs, sampling frequency had a profound effect on the model identification (Figure 5C). Increasing the sampling frequency from once every 7.5 min to once every 5 min led to the identification of a model that more closely approximated the behavior of the linear Taylor series model. Further increasing the sampling frequency to once every 3 min actually led to decreased prediction quality (Figure 5C). At higher frequencies, noise in the dataset masks true dynamic features. As a consequence, erroneous parameter estimates are obtained that capture the dynamics of the noise. This demonstrated that there exists an optimal sampling frequency based on the perturbation frequency, the inherent system dynamics, and the noise in the dataset. Efficient and effective experimental design requires a priori identification of these optimal frequencies.

Identification from random perturbations more robust to noise
All of the models in Figure 5 were identified by adding random Gaussian noise with a standard deviation of 16% to the exact predictions of the nonlinear model. To explore the sensitivity of model quality to the level of noise, the standard deviation of the noise was varied from 2% up to 16% (Figure 6). For the step responses, no significant effect was observed as the noise level changed. The identified model was equally poor in each case. As the noise level decreased in the random input datasets, the identified model approached the linear Taylor series approximation. Thus, random perturbations were able to better extract underlying dynamic features from noisy data. Again, the number of data points was conserved across comparisons: each model was identified with a total of 15 data points.


Figure 6
View larger version (21K):
[in this window]
[in a new window]
 
Figure 6: Effect of noise on identified model quality. Data sets contained random Gaussian noise of the indicated level. (A) Performance of models identified from a series of three experiments, each consisting of a step change in one input. (B) Performance of models identified from random, independent input perturbations.

 
Optimal frequency of perturbation
The effect of the perturbation frequency on model quality was investigated. In this case, the data again contained random Gaussian noise with a standard deviation of 16%. In addition to the optimal sampling frequency identified above, an optimal perturbation frequency exists (Figure 7). Perturbing once every 10 min and measuring once every 5 min identified a linear model similar to the linear Taylor series model. Analysis of the linear Taylor series model provided insight into why this was the case. The eigenvalues of the system presented in Equation (3) were [–0.0503, –0.1006, –0.1502], indicating a stable system with characteristic time constants of approximately 19.9, 9.9 and 6.7 min. With a sampling time of 5 min and perturbations every 10 min, the first dynamic mode achieved approximately 40% of its change, the second mode achieved approximately 64% of its change, and the third mode achieved approximately 78% of its change before the next set of perturbations were introduced. Increasing the perturbation and measurement frequencies masked the dynamics of the slowest mode by limiting it to a much smaller fraction of change. Lower perturbation and measurement frequencies missed dynamic information by allowing the fastest mode to approach steady-state. Thus, a tradeoff between the competing modes existed. For complex biological systems, multiple perturbation frequencies may be required to explore all of the modes of interest.


Figure 7
View larger version (11K):
[in this window]
[in a new window]
 
Figure 7: An optimal frequency of perturbation exists for identification of the inherent dynamics. Perturbation frequencies below or above the optimum identified models with significant error. Increasing the length of the perturbation run improved model quality, but had a less pronounced effect than identification of the optimal perturbation and measurement frequencies. (best case —; random inputs with perturbations every 15 min, measurements every 7.5 min +; perturbations every 10 min, measurements every 5 min {bigcirc}; perturbations every 7.5 min, measurements every 3.75 min *; perturbations every 10 min, measurements every 5 min for twice as long x).

 
Of the three perturbation frequency cases explored, perturbing once every 10 min and measuring once every 5 min identified the best model (Figure 7). Additional data should improve model quality. To test this, the same experiment was carried out for twice as long, as shown in Figure 7. The resulting model more closely approximated the linear Taylor series model. However, this effect was significantly less profound than the identification of the appropriate perturbation and measurement frequencies. This indicated that the identification of the appropriate time constant of perturbation is an essential component of the experimental design.

This work highlights the issues that need to be addressed with respect to the design of the perturbation frequencies. However, much additional work is needed. Namely, this work demonstrated that information needs to be present for all of the dynamic modes in order to obtain the most accurate model possible. Perturbation sequences that were too fast missed the slower dynamics in the system whereas slow perturbation sequences missed the fast dynamic features. Although it was not discussed here, significant work has been done in the systems engineering field that treats this very issue, typically by exploring the frequency domain of the process of interest [12, 31].

For most cases, the perturbation frequency can be identified based on the end use of the model. Knowledge of what the model will be used for can often identify the key dynamic time scales of interest. For example, if a model is to be used to predict responses on the scale of 1 h, dynamic events occurring on time scales faster than 1 min or slower than 10 h are likely to be of little importance. As models are typically developed for some end use, specification of the timescales for which the model is to be used is essential. Perturbations can then be designed around these time scales. If the resulting model is of inadequate quality, additional consideration of the design of the perturbation sequence and/or of the model structure may be required. The inadequacies may be due to missed interactions and/or nonlinear effects. To overcome these issues, the system may need to be perturbed at additional frequencies, the underlying model structure may need to be updated, or additional data may be required to allow for better estimates of the parameters.

Generalizing the approach
For a given model structure with a given set of parameters, there exists an optimal input sequence to obtain the best estimates of those parameters with the least amount of experimental work. However, that optimal input sequence design is dependent on both the model structure and on the actual parameter values. As the parameter values are not known a priori, this optimal input sequence design cannot be known a priori. Thus, every experiment is likely to be suboptimal. The goal then becomes to pick an experiment or a series of experiments that will be likely to provide useful information.

As indicated in Figure 1, identification is an iterative procedure. Once better estimates of the experimental parameters are known, a more refined input sequence can be used to excite the dynamics. The design of these refined input sequences is described elsewhere [39–42]. The point here is that, while any given input sequence is incapable of providing network identification regardless of system properties, certain inputs are better able to discern between candidate interactions. Further, other inputs sequences are fundamentally incapable of accomplishing this task.


    CONCLUSIONS
 TOP
 ABSTRACT
 BACKGROUND
 EXPERIMENTAL DESIGN...
 PERSISTENT EXCITATION
 SAMPLING FREQUENCY
 MULTIVARIATE INPUTS
 MODEL SYSTEM
 RESULTS AND DISCUSSION
 CONCLUSIONS
 Acknowledgements
 References
 
The appropriate design of experiments has a profound impact on the ability to identify a model and on the quality of resulting identified model. The typical approach for elucidating dynamic features of biological regulatory networks has been to make step changes in one variable at a time, observe the resulting dynamic response, and then attempt to back out qualitative and/or quantitative models to describe the observed dynamics. This work demonstrates that such models do not identify key dynamic features. Capturing the multivariate nature inherent to biological regulatory networks requires multivariate random perturbations. This is especially true when the underlying data contain high levels of noise, as in the case of most biological data.

Designing these multivariate random perturbations presents a challenge that needs to be addressed. This work demonstrates that the key issues are the identification of the perturbation and measurement frequencies and the number of measurements to take. Because of the effect of noise, the maximum measurement frequency should be no faster than approximately ten times the frequency of the fastest dynamics of interest. From the Shannon frequency, the measurement frequency should also be no less than twice the perturbation frequency. The frequency of perturbation is strongly dependent on the time scale of interest for the end use of the model and on the inherent dynamics of the underlying system. Using these guidelines as a starting point, more informative experiments can be designed and implemented to enhance the modelling and identification of regulation in biological networks.

Methods
The identification problem was formulated as: Given the datasets y(k), {k = 0, 1, 2, ...} generated by implementing the input sequence u(k), {k = 0, 1, 2, ...} on the true but unknown systems in Equation (2), obtain estimate values for the parameters A and B in the approximate linear model


Formula 5

that best describes the observed data. Formula and Formula , the identified estimates of A and B, were then compared against the ‘theoretical’ values indicated in Equation (3).

Seventeen virtual experiments were carried out using the full nonlinear model as a data generator. These experiments are summarized in Table 2. The virtual experiments can be classified into three categories: step responses in one input at a time, step responses in multiple inputs, and independent, random variation in each input. For each experiment, u(k) was specified a priori and introduced to the full nonlinear model. The impact of noise on the quality of the resulting model was explored by adding 2–16% noise to the nonlinear model predictions, resulting in the datasets y(k).


View this table:
[in this window]
[in a new window]
 
Table 2: Virtual experiments carried out on full nonlinear model. [t0:ts:tf] denotes [initial time: sampling interval:ending time]. Fs, and Fp denote sampling and perturbation frequency, respectively.

 
The MATLABTM System Identification toolbox was used to identify continuous, linear state-space models for each of several combinations of virtual experiments. Each of the identified state-space models was then compared with the full nonlinear model in order to test model performance. The same series of inputs shown in Figure 4A were used to excite each of the models. The percentage error was defined for each state as:


Formula 6

where yi denotes a concentration value for each protein as given by the nonlinear (nl) or linear (linear) candidate model. This calculation was done for each of the three outputs for each time point in a time course study. All of the error calculations were then combined into a distribution plot. The distributions of these errors provided insight into the accuracy of the resulting identified models compared with the full nonlinear model. The specific criterion used to evaluate the individual models was similar to the error distribution of the theoretical linear model.


Key Points

  • Appropriate design of input sequences is essential for accurate model identification.
  • Pseudorandom variation of each input simultaneously is better than step changes.
  • For biological networks, sampling frequency and perturbation frequency affect the quality of results.

 


    Acknowledgements
 TOP
 ABSTRACT
 BACKGROUND
 EXPERIMENTAL DESIGN...
 PERSISTENT EXCITATION
 SAMPLING FREQUENCY
 MULTIVARIATE INPUTS
 MODEL SYSTEM
 RESULTS AND DISCUSSION
 CONCLUSIONS
 Acknowledgements
 References
 
This work was supported by NASA GSRP Fellowship NGT5 50367 (KJK), NIH Grant 5P20 RR15588-02, and the US Department of Energy Division of Biological Sciences Microbial Cell Project.


    FOOTNOTES
 
Kenneth J. Kauffman completed his Ph.D. at the University of Delaware under the guidance of Jeremy S. Edwards in 2003.

Babatunde A. Oguinanaike is the William L. Friend Professor of Chemical Engineering at the University of Delaware.

Jeremy S. Edwards is an assistant professor in the Department of Molecular Genetics and Microbiology, University of New Mexico Health Sciences Center and also in Department of Chemical and Nuclear Engineering, University of New Mexico, Albuquerque, NM 87313, USA.


    References
 TOP
 ABSTRACT
 BACKGROUND
 EXPERIMENTAL DESIGN...
 PERSISTENT EXCITATION
 SAMPLING FREQUENCY
 MULTIVARIATE INPUTS
 MODEL SYSTEM
 RESULTS AND DISCUSSION
 CONCLUSIONS
 Acknowledgements
 References
 

  1. Bohlin T. Interactive System Identification: Prospects and Pitfalls. New York: Springer-Verlag 1991.
  2. Zak DE, Gonye GE, Schwaber JS, et al. Importance of input perturbations and stochastic gene expression in the reverse engineering of genetic regulatory networks: insights from an identifiability analysis of an in silico network. Genome Res 2003; 13:2396–405.[Abstract/Free Full Text]
  3. Napoli C, de Nigris F, Sica V. New advances in microarrays: finding the genes causally involved in disease. Methods Mol Med 2005; 108:215–33.[Medline]
  4. Merrick BA, Madenspacher JH. Complementary gene and protein expression studies and integrative approaches in toxicogenomics. Toxicol Appl Pharmacol 2005; 207:Suppl 2, 189–94.[CrossRef][Medline]
  5. Lelliott CJ, Lopez M, Curtis RK, et al. Transcript and metabolite analysis of the effects of tamoxifen in rat liver reveals inhibition of fatty acid synthesis in the presence of hepatic steatosis. Faseb J 2005; 19:1108–19.[Abstract/Free Full Text]
  6. Sriram G, Fulton DB, Iyer VV, et al. Quantification of compartmented metabolic fluxes in developing soybean embryos by employing biosynthetically directed fractional (13)C labeling, two-dimensional [(13)C, (1)H] nuclear magnetic resonance, and comprehensive isotopomer balancing. Plant Physiol 2004; 136:3043–57.[Abstract/Free Full Text]
  7. Streitz JM Jr., Madden MT, Marimanikkuppam SS, et al. Analysis of protein expression patterns in Barrett's esophagus using MALDI mass spectrometry, in search of malignancy biomarkers. Dis Esophagus 2005; 18:170–6.[CrossRef][Medline]
  8. Dieguez-Acuna FJ, Gerber SA, Kodama S, et al. Characterization of mouse spleen cells by subtractive proteomics. Mol Cell Proteomics 2005; 4:1459–70.[Abstract/Free Full Text]
  9. Kal AJ, van Zonneveld AJ, Benes V, et al. Dynamics of gene expression revealed by comparison of serial analysis of gene expression transcript profiles from yeast grown on two different carbon sources. Mol Biol Cell 1999; 10:1859–72.[Abstract/Free Full Text]
  10. Jenkins RE, Hawley SR, Promwikorn W, et al. Regulation of growth factor induced gene expression by calcium signalling: integrated mRNA and protein expression analysis. Proteomics 2001; 1:1092–104.[CrossRef][Medline]
  11. Tavazoie S, Hughes JD, Campbell MJ, et al. Systematic determination of genetic network architecture. Nat Genet 1999; 22:281–5.[CrossRef][ISI][Medline]
  12. Ogunnaike BA, Ray WH. Process Dynamics, Modeling and Control New York: Oxford University Press 1994.
  13. Gasch AP. Yeast genomic expression studies using DNA microarrays. In: Guthrie C, Fink GR (eds). Methods in Enzymology. New York: Academic Press, 2002 393–414.
  14. Matson R. Applying Genomic and Proteomic Microarray Technology in Drug Discovery. New York: CRC Press 2004.
  15. Shoemaker J, Lin S. Methods of Microarray Data Analysis IV. New York: Springer-Verlag 2005.
  16. Kamberova G, Shah S. DNA Array Image Analysis Nuts and Bolts. Eagleville, PA: DNA Press 2002.
  17. Hardiman G. Microarrays Methods and Applications. Eagleville, PA: DNA Press 2003.
  18. Johnson K, Lin S. Methods of Microarray Data Analysis III. New York: Kluwer Academic Publishers 2003.
  19. Simon R, Korn E, McShane L, et al. Design and Analysis of DNA Microarray Investigations. Eagleville, PA: Springer-Verlag 2003.
  20. Hamdan M, Righetti P. Proteomics Today: Protein Assessment and Biomarkers using Mass Spectrometry, 2D Electrophoresis, and Microarray Technology. New York: John Wiley and Sons 2005.
  21. Baldi P, Hatfield G. DNA Microarrays and Gene Expression: From experiments to data analysis and modeling. Cambridge, UK: Cambridge University Press 2002.
  22. Speed T. Statistical Analysis of Gene Expression Microarray Data. New York: Chapman & Hall/CRC 2003.
  23. Draghici S. Data Analysis Tools for DNA Microarrays. New York: Chapman & Hall/CRC 2003.
  24. Anderson M, Whitcomb P. DOE Simplified: Practical Tools for Effective Experimentation. Portland, OR: Productivity, Inc 2000.
  25. Bevington P, Robinson D. Data Reduction and Error Analysis for the Physical Sciences. 3rd edn New York: McGraw Hill 2003.
  26. Montgomery D. Design and Analysis of Experiments. 6th edn New York: John Wiley and Sons 2005.
  27. Box G, Box J, Hunter W. Statistics for Experimenters: Design, Innovation, and Discovery. 2nd edn New York: John Wiley and Sons 2005.
  28. Box GEP, Jenkins GM, Reinsel GC. Time Series Analysis: Forecasting and Control. 3rd edn Upper Saddle River, NJ: Prentice Hall 1994.
  29. Zak DE, Doyle FJ III, Gonye GE, et al. Simulation studies for the identification of genetic networks from cDNA array and regulatory activity data. Proc 2nd Intn Conf Systems Biol. 2001 pp. 231–8.
  30. Kauffman KJ, Pajerowski JD, Jamshidi N, et al. Description and analysis of metabolic connectivity and dynamics in the human red blood cell. Biophy J 2002; 83:646–62.[Abstract/Free Full Text]
  31. Skogestad S, Postlethwaite I. Multivariate Feedback Control: Analysis and Design. New York: John Wiley & Sons, Inc. 1996.
  32. Ljung L. Aspects on the system identification problem. Signal Processing 1982; 4:445–56.
  33. Selinger DW, Wright MW, Church GM. On the complete determination of biological systems. Trends Biotechnol (in press).
  34. Richmond CS, Glasner JD, Mau R, et al. Genome-wide expression profiling in Escherichia coli K-12. Nucleic Acids Res 1999; 27:3821–35.[Abstract/Free Full Text]
  35. Lasko DR, Schwerdel C, Bailey JE, et al. Acetate- specific stress response in acetate-resistant bacteria: an analysis of protein patterns. Biotechnol Prog 1997; 13:519–23.[CrossRef][Medline]
  36. Dudley A, Aach J, Steffen M, et al. Measuring absolute expression with microarrays with a calibrated reference sample and an extended signal intensity range. Proc Nat Acad Sci USA 2002; 99:(11)7554–9.[Abstract/Free Full Text]
  37. Joseph E, Danchin A, Ullmann A. Regulation of galactose operon expression: glucose effects and role of cyclic adenosine 3',5'-monophosphate. J Bacteriol 1981; 146:149–54.[Abstract/Free Full Text]
  38. Bendat JS. Nonlinear System Analysis and Identification from Random Data. New York: John Wiley & Sons 1990.
  39. Pearson RK. Discrete-Time Dynamic Models. New York: Oxford University Press 1999.
  40. Eykhoff P. System Identification: Parameter and State Estimation. New York: John Wiley & Sons 1974.
  41. Ljung L. System Identification: Theory for the User. Upper Saddle River, NJ: Prentice Hall PTR 1999.
  42. Pearson RK, Pottmann M. Gray-box identification of block-oriented nonlinear models. Journal of Process Control 2000; 10:301–15.

Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
4/4/331    most recent
eli004v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Kauffman, K. J.
Right arrow Articles by Edwards, J. S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kauffman, K. J.
Right arrow Articles by Edwards, J. S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?