Abstract

Motivation: Proteins play a crucial role in biological activity, so much can be learned from measuring protein expression and post-translational modification quantitatively. The reverse-phase protein lysate arrays allow us to quantify the relative expression levels of a protein in many different cellular samples simultaneously. Existing approaches to quantify protein arrays use parametric response curves fit to dilution series data. The results can be biased when the parametric function does not fit the data.

Results: We propose a non-parametric approach which adapts to any monotone response curve. The non-parametric approach is shown to be promising via both simulation and real data studies; it reduces the bias due to model misspecification and protects against outliers in the data. The non-parametric approach enables more reliable quantification of protein lysate arrays.

Availability: Code to implement the proposed method in the statistical package R is available at: http://odin.mdacc.tmc.edu/jhu/lysatearray-analysis/

Contact:jhu@mdanderson.org

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Although DNA microarray technologies are able to simultaneously monitor the mRNA levels of thousands of genes, the post-translational modifications that result in the complexity of the proteome cannot be detected at the genomic level. It is often necessary to measure proteins directly in biological and medical research, because most of the known biological effector molecules and diagnostic markers are proteins. Current technologies used for detecting protein biomarkers include 2D polyacrylamide gel electrophoresis (Bernard et al., 2004; Hojlund et al., 2003), mass spectrometry (Ong et al., 2003) and enzyme-linked immunosorbent assays (Kakizawa et al., 2004; Toney et al., 2003). These technologies require large amounts of input material, and need good ways to identify markers. To overcome those shortcomings, various protein microarray technologies have been developed (Cahill et al., 2003; Ivanov et al., 2004; MacBeath and Schreiber, 2000).

Protein microarrays use forward-phase or reverse-phase approaches. The forward-phase approach allows for the simultaneous measurement of many different proteins across single samples. In contrast, the reverse-phase approach aims to measure a common protein across multiple samples. A commonly used reverse-phase assay is the tissue microarray (Camp et al., 2000; Kononen et al., 1998; Liu et al., 2002; Packeisen et al., 2003; Torhorst et al., 2001). However, its major drawback is that protein levels are typically assessed manually by a pathologist, which can potentially produce subjective bias and is only semi-quantitative.

In this article, we focus on quantification of another type of reverse-phase array which uses homogenized samples. These arrays are called tissue lysate or protein lysate arrays, as described in detail by Paweletz et al. (2001) (also see MacBeath, 2002). Mechanically, lysed tissue samples are first robotically spotted onto nitrocellulose-coated slides in a dilution series. The samples are then hybridized with a primary validated antibody tailored to recognize the protein of interest. In order to attach measurable labels, the arrays are developed by any number of techniques most frequently incubated with a biotinylated secondary antibody which recognizes the primary antibody. Finally, streptavadin-linked labels (which can be dyes, or precipitates, or others) are introduced and the streptavadin binds to the biotin. The array can then be scanned to produce an image file, and spots in the image are quantified using any of a variety of software packages.

Lysate arrays can be used to address a wide variety of questions by printing multiple samples on the same slide. For example, by printing samples from treated and untreated cell lines, the response to therapeutic agents can be assessed. Lysate arrays have been used to study signal pathway profiling related to prostate cancer (Herrmann et al., 2003; Pawe-letz et al., 2001). Nishizuka et al. (2003a) applied lysate arrays to profile 60 human cancer cell lines (NCI-60) and found that there was a high correlation between mRNA and protein levels for structure-related proteins. Jiang et al. (2006) used glioma protein lysate arrays to identify proteins that are involved in signaling pathways of cell proliferation, cell survival, apoptosis, angiogenesis and cell invasion. Ramaswamy et al. (2005) discussed verification of presence and quantification of human tissue samples for protein markers and compared the results with those obtained with ELISAs. They concluded that protein lysate arrays could show high sensitivity, reproducibility and high-throughput fashion. Some other cancer and clinical-related studies using this technology and its variations include Amit (2006); Chan et al. (2004); Cheng et al. (2005); Espina et al. (2003); Espina et al. (2004); Lin et al. (2003); Murph et al. (2006); Nishi-zuka et al. (2003b); Sheehan et al. (2005); Tibes et al. (2006); Wulfkuhle et al. (2003); Yan (2003); Zha et al. (2004).

Regardless of the biological questions to be addressed, the relative amounts of protein in the arrayed samples need to be quantified as accurately as possible. The current commercial analysis package MicroVigene (http://www.vigenetech.com/product.htm) extracts the foreground and background intensity levels from the raw array image, and then estimates the protein expression level by fitting a 4-parameter logistic model to each dilution series, with the replicates being averaged or combined in the analysis. Another sample-by-sample approach (Mircean et al., 2005) models the background-corrected spot values as linear when the log intensity is plotted against the dilution step number.

The quantification methods mentioned earlier work on the data one sample at a time. With reverse-phase tissue lysate arrays, however, the same antibody binding is taking place at each spot, so it is reasonable to assume that protein expression levels of different samples corresponding to the same antibody have similar chemistry, and can thus be fit using a common response curve increasing the confidence of the curve fit. The joint estimation approach assuming a parametric form of the mean response curve has been discussed extensively in Tabus et al. (2006). In particular, the logistic model is chosen due to its sigmoidal shape, which is believed to be consistent with the expected relationship between the observed intensity of a spot and the true protein concentration, due to quenching at high levels and background noise at low levels. This also forms the core of the newly developed R package called ‘SuperCurve’, publicly available at http://bioinformatics.mdanderson.org

The logistic model used in SuperCurve is of the form:  

(1)
formula
where yij is the observed expression level at the ith dilution level of the jth sample, i = 1,…, I and xi defines the corresponding dilution level index at ith step, typically, xi = i−(1 + I)/2 is used. The median effective concentration (denoted by EC50), which is the 50th percentile of the series, is the single quantity per sample or per dilution series to be estimated. The random errors ɛij are assumed to be independently normally distributed with mean 0 and variance σ2. The global model parameters α, β and γ and the EC50s can be obtained via a non-linear least squares procedure.

The joint logistic modeling procedure was shown to improve the accuracy and dynamic range of the estimated protein expression levels over sample-by-sample estimation based on simulation studies and a real experiment conducted at the M.D. Anderson Cancer Center (Tabus et al., 2006). However, the rigid assumption of the logistic curve may lead to estimation bias for some samples.

We propose a more flexible non-parametric approach by assuming that the median of yij is equal to g(xi + EC50j) for some monotonically increasing function g, without specifying a parametric form for g. The non-parametric model can be highly adaptive to the data and avoid bias due to parameterization. Both simulation studies and real data examples show that the non-parametric approach is highly promising, as it produces consistent and robust results regardless of whether a logistic curve fits the data.

2 METHODS

2.1 Monotone smoothing

To avoid a rigid parametric form on the protein response curve g, we choose a monotone smoothing technique to estimate the curve given a set of trial values on EC50j. The literature on monotone smoothing is rich, but we choose to use the R package cobs based on constrained b-spline approximations of He and Shi (1998); He and Ng (1999). We shall briefiy describe this methodology in this section, but refer to He and Ng (1999) for the algorithmic details.

Given n pairs of realizations forumla with a = x0<x1 …<xn<xn + 1 = b, we wish to model the relationship as yi = g(xi) + errori for some smooth and monotone, but otherwise unknown, function g. The idea of cobs is to approximate g by a quadratic spline that minimizes the following objective function  

(2)
formula
where  
formula
measures infidelity to data, P (g) is a roughness penalty on the function g and λ is a smoothing parameter that controls the trade-off between infidelity to the data and roughness of the fit. When the roughness penalty P (g) = maxx|g ′′ (x)| is chosen, the solution to (2) is a quadratic natural smoothing spline. To add a monotonicity constraint on g, all we need is to solve (2) subject to g(xi) ≤g(xi + 1). As shown in He and Ng (1999), this constrained optimization problem can be easily re-expressed as a linear program. As a result, the monotone function g can be estimated with very efficient algorithms.

The cobs package used a BIC-type criterion for selecting λ. To further save computing time, cobs limits the number of potential knots to a prespecified value. In all our empirical work with lysate arrays, we choose 40 potential knots in the construction of splines. The effective dimensions of the splines, as determined by the λ selection, are typically much smaller than 10.

2.2 Non-parametric modeling

We propose the non-parametric model of the form:  

(3)
formula
where yij is the observed expression level at the ith dilution step of the jth sample, and xi defines the corresponding dilution level index at ith step; we use xi = i−(1 + I)/2 in our procedure. The median effective concentration (denoted by EC50) is the single quantity to be estimated for each dilution series or each sample. The median of the random errors, ɛ, is assumed to be 0, while g is an unknown but monotonically increasing function.

We propose the following quantification algorithm to obtain the EC50's, where g is estimated iteratively together with the EC50's.

  1. Obtain initial estimates of EC50 for each dilution series assuming that the protein expression intensities are linearly related to the true concentration level, i.e. yij = α + β (xi + EC50j) + noise. The estimate forumla is the minimum of the observed protein expression intensities of all samples on an array. The estimate of β is the median slope over all the dilution series. Then an estimate of EC50j is obtained as the median of forumla.

  2. Based on the initial estimates of EC50, the R package

    cobs
    is used to obtain the monotonically increasing function g by regressing yij on xi + EC50j with all (i, j) included.

  3. Conditional on the estimated curve g, update the estimate of EC50j by minimizing ∑j|yijg(xi + EC50j)|, the total distance for the jth dilution series. This step can be carried out using the function ‘

    nlrq
    ’ (Koenker and Park, 1994) implemented in the R package ‘quantreg’.

  4. To refine the estimate of EC50, we iterate Steps 2 and 3 until a convergence criterion is met. For example, an intuitive criterion is to stop the procedure if the difference of the deviance of the estimated curve to the observed data in two consecutive iterations is smaller than 0.1. However, we found in empirical studies that a small number of iterations (say, five) is sufficient to obtain reasonably good estimates of EC50s.

We shall discuss further some of the issues that are worth exploring in the proposed algorithm.

  1. There are a number of ways to obtain reasonable initial values in Step 1. For example, Tabus et al. (2006) used a linear relationship between certain order statistics in computing the slopes. The SuperCurve package used a logistic curve instead of a linear relationship. Both approaches worked well in our experience.

  2. The cobs algorithm in Step 2 uses |yijg(xi + EC50j)| in its objective function (for a given spline space for g). In Step 3, the same objective function is used to find EC50 given g. The use of the L1-norm in the objective functions has some important properties.

    • We assume that the median protein expression as a function of the true protein concentration is non-decreasing and second-order differentiable, but we need to make no assumption of symmetry or additivity of the error terms. In contrast, the least squares approach minimizing ∑i(yijg(xi + EC50j))2 could lead to biased estimates if the noise is not additive. To make this point clearer, consider the simple model of yi = eig(xi + EC50) for some multiplicative error ei with a log-normal distribution. The true value of EC50 does minimize the expected value of ∑i|yijg(xi + c)| over c, but it does not minimize the expected value of ∑i|yig(xi + c)|2.

    • When g and EC50's are estimated iteratively, the objective function ∑ij |yijg(xi + EC50j)| decreases with every step of the iteration to ensure convergence. Although reasonable initial values are still needed to ensure convergence to the global minimum of the objective function, the monotonicity of the objectives along the iteration is a very desirable property.

    • The use of the L1-norm allows cobs to use efficient linear programming algorithms to solve for g in Step 2 with the monotonicity constraint.

  3. A parametric model interacts with the scale of y. Sometimes the logistic curve fits well for protein intensity, but sometimes it fits better on the log-scale. With the non-parametric approach, the results are much less sensitive to monotone transformations on y.

3 SIMULATION STUDIES

As a function of the protein concentration level, the protein response curve has been modeled by a logistic curve in the existing literature. The shape of the logistic curve makes good sense, but careful thoughts about the logistic curve reveal some limitations of this approach. In this section, the logistic fitting approach and the proposed non-parametric approach are compared through a series of simulation studies.

We carried out several simulation studies with data generated from various parametric forms of the response curve. We intended to mimic the patterns of real lysate array data. Throughout the study, we used a common set of true EC50s of 40 samples, generated from the normal distribution with mean 1 and variance 3.52. We generated protein expression data of each sample in duplicate in most cases, unless otherwise specified.

The response curve in the first case is similar to a logistic curve but allows different scale parameters at the two tails. This is a realistic situation, because the lower end of background intensities and the upper end of saturation intensities are not necessarily symmetric. In this case, data was generated based on Model (1) with α = 3000, β = 10 000. But we took γ = 0.7 for the positive concentration measures (on the log2 scale, as always) and γ = 2.1 otherwise. We also used different noise distributions for different concentration regions. For positive concentration values, the data were generated subject to random noise from the t-distribution with three degrees of freedom and a scale parameter σ = 6000(true concentration + 5)−1; otherwise, the random noise was generated from the normal distribution with mean 0 and variance 6002. The random noise is heterogeneous and decreases with the increasing positive true concentrations.

The scatter plot of one data set (against the true concentration levels) is exhibited in the upper left panel of Figure 1. The logistic curve (dashed curve) failed to fit the data on either tail due to its built-in symmetry on the two tails. However, the non-parametric curve we obtained (dotted curve) matches the true curve well.

Fig. 1.

Plots for simulation study 1. The upper left panel shows the logistic fit (dashed curve) and a non-parametric fit (dotted curve in red), as compared to the true response curve (solid curve). The upper right panel shows the EC50 estimates using logistic fitting versus the true EC50 values. The lower left panel shows the EC50 estimates using the non-parametric approach. The solid line (black) is y = x. The lower right panel shows the box-and-whisker plot of the distribution of the measurements Dtot using the two approaches. Colour version of this figure is available as Supplementary material online.

Fig. 1.

Plots for simulation study 1. The upper left panel shows the logistic fit (dashed curve) and a non-parametric fit (dotted curve in red), as compared to the true response curve (solid curve). The upper right panel shows the EC50 estimates using logistic fitting versus the true EC50 values. The lower left panel shows the EC50 estimates using the non-parametric approach. The solid line (black) is y = x. The lower right panel shows the box-and-whisker plot of the distribution of the measurements Dtot using the two approaches. Colour version of this figure is available as Supplementary material online.

The performance of the two approaches can be directly assessed by comparing the estimated EC50s against the true EC50s. The straight lines in the other three panels of Figure 1 correspond to the 45 degree line y = x. The estimates obtained by the logistic fit (upper right panel) deviate substantially from the true EC50s at both the lower and the upper tails; several samples showing fairly large differences in true concentrations were given similar estimates around five. This implies that some interesting differential expression patterns among the biological samples are missed by the use of logistic curves in the quantification method. In contrast, the non-parametric approach (lower left panel) shows much better agreement with the true EC50s, except for one sample at the lower end of the intensity scale. Variations at the lower tail where the data are dominated by background noise are of less biological importance in the studies.

An important application of protein lysate arrays is to compare the protein content in different biological samples/agents on one slide. So it is important to examine if a quantification method can preserve the relative differences of protein content among different samples. We conducted 50 simulation trials, in each of which the sample with the median EC50 was used as the baseline. Then for any other sample, we computed the difference forumla between its estimated EC50 and the estimated EC50 value for the reference sample to see how close it is to the true relative difference D. The sum of |forumlaD| over all samples, denoted by Dtot is used as an overall measurement for the quantification method. The smaller Dtot is, the better. The box-and-whisker plot of the distribution of the measurements Dtot over 50 trials in this case were shown in the lower right panel of Figure 1, for both the logistic approach and the non-parametric approach. It turned out that the non-parametric approach did better in all 50 trials, with an average reduction of nearly 50% in Dtot.

In the second case, we generated the expression data from the model:  

formula
A multiplicative random noise ɛ with a log-normal eN (0, σ2) distribution was incorporated in this model instead of an additive normal random noise in Model 1). We used α = 1000, β = 4000, γ = 0.8 and σ = 0.2 in the simulation. In the upper left panel of Figure 2, the dots in gray show the intensity data versus the true concentration in one data set. The estimated response curve by the non-parametric approach (dotted line) matched the true curve (solid line) well, while the logistic curve (dashed line) showed poor fit, especially at the upper tail. It fails to reflect the curvature shape of the true model by placing all the data in the linear range of the logistic curve. It is partly because the rigid parametric structure of the logistic model led to a sacrifice in the upper tail in order to obtain relatively good fit of the lower tail. This can be seen more clearly from the comparison between the estimated EC50s and the true EC50s (the middle left panel). The bias gets larger as the true concentration increases in the upper tail. In fact, the logistic approach yielded similar estimates for several samples with large relative differences in true concentration.

Fig. 2.

Results for simulation study 2. The left columns correspond to 2 replicates per sample and the right columns 6 replicates. In each column, the top panel shows the fitted logistic curve (dashed curve) and non-parametric approach (dotted curve in red). The solid curve is the true response curve. The middle panel shows the EC50 estimates from the logistic approach versus the true EC50 values. The bottom panel shows the EC50 estimates from the non-parametric approach. The solid line (black) is y = x. Colour version of this figure is available as Supplementary material online.

Fig. 2.

Results for simulation study 2. The left columns correspond to 2 replicates per sample and the right columns 6 replicates. In each column, the top panel shows the fitted logistic curve (dashed curve) and non-parametric approach (dotted curve in red). The solid curve is the true response curve. The middle panel shows the EC50 estimates from the logistic approach versus the true EC50 values. The bottom panel shows the EC50 estimates from the non-parametric approach. The solid line (black) is y = x. Colour version of this figure is available as Supplementary material online.

When the number of replicates increases, the variability of the estimates drops, but the bias due to mid-specified models would not diminish, so the advantage in bias for our proposed method becomes more evident. The second column of Figure 2 corresponding to the case with 6 replicates for each sample illustrated this point.

In the third case, we considered a piece-wise linear model, consisting of a constant value a for the response curve at negative concentration levels (on the log2 scale) and a straight line with the intercept a and slope b for positive concentrations. This type of curve is often observed in applications where saturation is not attained. An additive random noise from N(0, σ2) was introduced. We used a = 3000, b = 500 and σ = 1000 in the simulation. It was shown in the top panel of Figure 3 that logistic approach produced artificial saturation at the upper tail, while the non-parametric approach provides a good fit to the data.

Fig. 3.

Results for simulation studies 3 (upper panel) and 4 (lower panel). Both show the data points at the estimated concentration levels with fitted logistic curve (dashed) and non-parametric curve (dotted in red). The solid curve is the true response curve. Colour version of this figure is available as Supplementary material online.

Fig. 3.

Results for simulation studies 3 (upper panel) and 4 (lower panel). Both show the data points at the estimated concentration levels with fitted logistic curve (dashed) and non-parametric curve (dotted in red). The solid curve is the true response curve. Colour version of this figure is available as Supplementary material online.

In the fourth case, we are interested in assessing the performance of non-parametric approach when the true response curve is logistic. In the simulation, we used Model (1) with α = 1000, β = 4000, γ = 0.8 and σ = 500. The result as shown in the bottom panel of Figure 3 indicates a competitive performance of the non-parametric approach relative to the logistic approach when the data are in favor of the latter.

The simulation studies demonstrated that the non-parametric quantification approach produces consistent and robust relative protein concentration estimates, while the success of the logistic curves depends on the underlying relationship. When a rigid parametric model is mis-specified, bias becomes a serious concern with the parametric model. We now consider whether this concern is practically important.

4 ANALYSIS OF TWO LYSATE ARRAYS

We used two lysate arrays produced at M.D. Anderson for demonstration in this article. In this experiment, cell lysates were diluted at eight levels and spotted on an array in duplicate. The layout of the array consisted of a 4 × 10 grid with each grid composed of a 4 × 4 subgrid. Each subgrid contained the duplicate dilution series for a single sample. Each of the identical top two rows of a subgrid contained the four most concentrated dilution steps (in decreasing concentrations from left to right), and the identical bottom two rows contained the four least concentrated dilution steps. The upper panel in Figure 4 shows the spot intensity array image for protein AKT in 40 cell lines. The zoned-in subgrids for two arbitrarily selected samples are shown in the lower panel.

Fig. 4.

Spot intensity image for protein AKT.

Fig. 4.

Spot intensity image for protein AKT.

AKT has emerged as a critical enzyme in signal transduction pathways involved in cell proliferation, apoptosis, angiogenesis and diabetes (Dudek et al., 1997). For example, 40% of the patients with ovarian cancer show high levels of active AKT in tumor. KRT17 is involved in nail bed, hair follicle, sebaceous glands and other epidermal appendages. Abnormal activities of KRT17 lead to Jackson–Lawler type pachyonychia congenita and steatocystoma multiplex (Schweizer et al., 2006).

We applied both the logistic (using the package ‘SuperCurve’) and the non-parametric approaches to these two lysate arrays. For the AKT array (shown in the left column of Figure 5, both curves seemed to fit the data reasonably well, and there is high concordance between these two approaches (left bottom panel).

Fig. 5.

The left column corresponds to AKT and the right column corresponds to KRT17. In each column, the top panel shows the logistic fits (curve in black); The middle panel shows the non-parametric fits (curve in black); The data points are represented by the circled points, based on the estimated concentrations; The bottom panel shows the EC50 estimates obtained from the non-parametric approach versus those from the logistic fit. The solid line (black) is y = x.

Fig. 5.

The left column corresponds to AKT and the right column corresponds to KRT17. In each column, the top panel shows the logistic fits (curve in black); The middle panel shows the non-parametric fits (curve in black); The data points are represented by the circled points, based on the estimated concentrations; The bottom panel shows the EC50 estimates obtained from the non-parametric approach versus those from the logistic fit. The solid line (black) is y = x.

The lysate array of protein KRT17 tells a different story, and the estimates obtained from the two approaches are quite different; see the right column of Figure 5. The estimates of protein concentrations in the lower tail is not critical because they fall into the region dominated by background intensities. However, in the upper tail, the logistic curve did not fit the intensity data well (right top panel), and as a result, the estimated EC50 for those samples were higher from the logistic approach than from the non-parametric approach.

To look into the differences, we now take a dilution series with high EC50 estimate. Table 1 lists the intensity measures for sample ref and sample ext with 2 replicates each. The quantity of interest is the relative difference between these two dilution series. The estimate based on SuperCurve is higher than that from the non-parametric approach by almost 1 unit on the log2 scale. From Table 1, we noticed that there is a single outlier in one dilution series (i.e. value 9723 in sampleext,rep2) in the expression data. This single outlier lifted the EC50 estimate of the series when Super-Curve was used. If this outlier was dropped, the EC50 estimate from SuperCurve was dropped by about 0.5, whereas the non-parametric estimate was changed only by about 0.15. It demonstrates that the non-parametric approach based on the median is much more robust to outliers in the intensity measures.

Table 1.

Observed intensities of two samples under discussion

 
Sampleref,rep1 2449 2771 2895 3295 2152 2979 4336 6460 
Sampleref,rep2 2745 2916 2954 3165 2901 3434 4975 7225 
Sampleext,rep1 1529 1446 2083 2678 3801 6546 11036 17127 
Sampleext,rep2 1812 2151 2567 3417 9723 4865 12780 18119 
 
Sampleref,rep1 2449 2771 2895 3295 2152 2979 4336 6460 
Sampleref,rep2 2745 2916 2954 3165 2901 3434 4975 7225 
Sampleext,rep1 1529 1446 2083 2678 3801 6546 11036 17127 
Sampleext,rep2 1812 2151 2567 3417 9723 4865 12780 18119 

When we chose the log-transformation on the expression data before using either the logistic or non-parametric models, we obtained a new set of estimates for EC50 for all samples. The correlation coefficient between the two sets of EC50 estimates (with y as intensity or log-intensity) is 0.86 for the logistic approach but 0.94 for the non-parametric approach, confirming better stability of the non-parametric approach against data transformations. Our experience shows that the use of neither the intensity nor the log intensity as y is better for all samples in lysate array studies.

From these two lysate array examples, we have noticed that the rigid logistic curves do not always fit the data well, and the non-parametric approach is valuable for robust quantification.

5 DISCUSSION

In this article, we proposed a non-parametric approach to quantify the protein expression intensities. This procedure aims at removing estimation bias in a parametric approach. Biased protein quantification results put scientists at the risk of drawing false biological conclusions. For example, in comparing the responses of treated and untreated cell lines to a certain therapeutic agent, the logistic approach may give similar protein concentration estimates due to its rigid and global parametric form, even when the difference of responses between the treated and untreated cell lines can be clearly captured by the non-parametric approach. Therefore, the proposed approach can improve the quality of protein expression quantification that is an essential step in making lysate arrays a useful tool in biological and medical research.

Note that the proposed non-parametric approach also differs from the SuperCurve in the objective functions, as the former uses the L1 distance (absolute difference between the estimated and observed protein intensities) while the logistic modeling uses the L2 distance (squared difference). As described in Section 2.2, the L1 distance works better for nonadditive errors, whereas the L2 distance leads to the most efficient estimates for Gaussian errors.

We recommend the non-parametric approach given in the article as a general quantification method for protein lysate arrays to reduce model bias and to protect against outliers. If the number of replicates and the length of the dilution series are small, e.g. the total number of measurements that can be used to estimate EC50 for each sample is no more than 10, we suggest using the least squares method to estimate EC50 after the proposed L1 distance has been used to obtain the estimate of g.

A possible improvement to the proposed estimation method is to use weights in the objective function to reflect heteroscedasticity in the data so that the L1 norm is weighted according to variability of the intensities around the median curve with the purpose of obtaining EC50 estimates with smaller variances. Specific implementation of this approach requires further work.

The accuracy of the estimated EC50 varies with multiple factors, including the quality of the dilution series, the number of replicates and the estimation method used. In order to make inference on the relative difference in protein content across samples, the SEs need to be estimated. With either the parametric approach (e.g. logistic curve) or the non-parametric approach, the analytic derivations of the SEs are difficult, especially given the possible dependence of measurements within dilution series and across replicates. In the SuperCurve package, a bootstrap method is used to obtain confidence intervals. The same approach can be used for the non-parametric approach proposed in this article. The need to accommodate heteroscedasticity of the data in inference will be investigated in future work.

ACKNOWLEDGEMENTS

This research work is partially supported by the Kleberg Center for Molecular Markers, P50CA83639 and U24CA126477 to G.B.M. and by the National Science Foundation Grant DMS-0604229 and DMS-0706818.

Conflict of Interest: none declared.

REFERENCES

Amit
I
, et al.  . 
A module of negative feedback regulators defines growth factor signaling
Nat. Genet.
 , 
2006
, vol. 
39
 (pg. 
503
-
512
)
Beran
R
Simulated power functions
Ann. Stat.
 , 
1986
, vol. 
14
 (pg. 
151
-
173
)
Bernard
KR
, et al.  . 
Methods in functional proteomics: two-dimensional polyacrylamide gel electrophoresis with immobilized pH gradients, in-gel digestion and identification of proteins by mass spectrometry
Method Mol. Biol.
 , 
2004
, vol. 
250
 (pg. 
263
-
282
)
Cahill
DJ
Nordhoff
E
Protein arrays and their role in proteomics
Adv. Biochem. Eng. Biotechnol.
 , 
2003
, vol. 
83
 (pg. 
177
-
187
)
Camp
RL
, et al.  . 
Validation of tissue microarray technology in breast carcinoma
Lab. Invest.
 , 
2000
, vol. 
80
 (pg. 
1943
-
1949
)
Chan
SM
, et al.  . 
Protein microarrays for multiplex analysis of signal transduction pathways
Nat. Med.
 , 
2004
, vol. 
10
 (pg. 
1390
-
1396
)
Cheng
KW
, et al.  . 
Assay of Rab25 function in ovarian and breast cancers
Method Enzymol.
 , 
2005
, vol. 
403
 (pg. 
202
-
215
)
Dudek
H
, et al.  . 
Regulation of neuronal survival by the serine-threonine protein kinase Akt
Science
 , 
1997
, vol. 
275
 (pg. 
661
-
663
)
Espina
V
, et al.  . 
Protein microarrays: molecular profiling technologies for clinical specimens
Proteomics
 , 
2003
, vol. 
3
 (pg. 
2091
-
2100
)
Espina
V
, et al.  . 
Protein microarray detection strategies: focus on direct detection technologies
J. Immunol. Methods
 , 
2004
, vol. 
290
 (pg. 
121
-
133
)
He
X
Ng
P
COBS: qualitatively constrained smoothing via linear program
Comput. Stat.
 , 
1999
, vol. 
14
 (pg. 
315
-
337
)
He
X
Shi
PD
Monotone B-spline smoothing
J. Am. Stat. Assoc.
 , 
1998
, vol. 
93
 (pg. 
643
-
650
)
Herrmann
PC
, et al.  . 
Mitochondrial proteome: altered cytochrome c oxidase subunit levels in prostate cancer
Proteomics
 , 
2003
, vol. 
3
 (pg. 
1801
-
1810
)
Hojlund
K
, et al.  . 
Proteome analysis reveals phosphorylation of ATP synthase beta-subunit in human skeletal muscle and proteins with potential roles in type 2 diabetes
J. Biol. Chem.
 , 
2003
, vol. 
278
 (pg. 
10436
-
10442
)
Ivanov
SS
, et al.  . 
Antibodies immobilized as arrays to profile protein post-translational modifications in mammalian cells
Mol. Cell Proteomics
 , 
2004
, vol. 
3
 (pg. 
788
-
795
)
Jiang
R
, et al.  . 
Pathway alterations during glioma progression revealed by reverse phase protein lysate arrays
Proteomics
 , 
2006
, vol. 
6
 (pg. 
2964
-
2971
)
Kakizawa
H
, et al.  . 
The relationship between glycemic control and plasma vascular endothelial growth factor and endothelin-1 concentration in diabetic patients
Metabolism
 , 
2004
, vol. 
53
 (pg. 
550
-
555
)
Koenker
R
Park
BJ
An interior point algorithm for nonlinear quantile regression
J. Econom.
 , 
1994
, vol. 
71
 (pg. 
265
-
283
)
Kononen
J
, et al.  . 
Tissue microarray for high-throughput molecular profiling of tumor specimens
Nat. Med.
 , 
1998
, vol. 
4
 (pg. 
844
-
847
)
Lin
Y
, et al.  . 
Detection of multiple cytokines by protein arrays from cell lysate and tissue lysate
Clin. Chem. Lab. Med.
 , 
2003
, vol. 
41
 (pg. 
139
-
145
)
Liu
C
, et al.  . 
Software tools for high-throughput analysis and archiving of immunohi-stochemistry staining data obtained with tissue microarrays
Am. J. Pathol.
 , 
2002
, vol. 
161
 (pg. 
1557
-
1565
)
MacBeath
G
Protein microarrays and proteomics
Nat. Genet.
 , 
2002
, vol. 
32
 (pg. 
526
-
532
)
MacBeath
G
Schreiber
SL
Printing proteins as microarrays for high-throughput function determination
Science
 , 
2000
, vol. 
289
 (pg. 
1760
-
1763
)
Mircean
C
, et al.  . 
Robust estimation of protein expression ratios with lysate microarray technology
Bioinformatics
 , 
2005
, vol. 
21
 (pg. 
1935
-
1942
)
Murph
MM
, et al.  . 
Individualized molecular medicine: linking functional proteomics to select therapeutics targeting the PI3K pathway for specific patients.
Ovarian Cancer: State of the Art and Future Directions in Translational Research
 , 
2006
Coukos,G. Springer Verlag
Nishizuka
S
, et al.  . 
Proteomic profiling of the NCI-60 cancer cell lines using new high-density reverse-phase lysate microarrays
Proc. Natl Acad. Sci. USA
 , 
2003
, vol. 
100
 (pg. 
14229
-
14234
)
Nishizuka
S
, et al.  . 
Diagnostic markers that distinguish colon and ovarian adenocarcinomas: identification by genomic proteomic and tissue array profiling
Cancer Res.
 , 
2003
, vol. 
63
 (pg. 
5243
-
5250
)
Ong
SE
, et al.  . 
Mass spectrometric-based approaches in quantitative proteomics
Methods
 , 
2003
, vol. 
29
 (pg. 
124
-
130
)
Packeisen
J
, et al.  . 
Demystified … tissue microarray technology
Mol. Pathol.
 , 
2003
, vol. 
56
 (pg. 
198
-
204
)
Paweletz
CP
, et al.  . 
Reverse phase protein microarrays which capture disease progression show activation of pro-survival pathways at the cancer invasion front
Oncogene
 , 
2001
, vol. 
20
 (pg. 
1981
-
1989
)
Ramaswamy
A
, et al.  . 
Application of protein lysate microarrays to molecular marker verification and quantification
Proteome Sci.
 , 
2005
, vol. 
3
 pg. 
9
 
Schweizer
J
, et al.  . 
New consensus nomenclature for mammalian keratins
J. Cell. Biol.
 , 
2006
, vol. 
174
 (pg. 
169
-
174
)
Sheehan
KM
, et al.  . 
Use of reverse phase protein microarrays and reference standard development for molecular network analysis of metastatic ovarian carcinoma
Mol. Cell. Proteomics
 , 
2005
, vol. 
4
 (pg. 
346
-
355
)
Tabus
I
, et al.  . 
Nonlinear modeling of protein expressions in protein arrays
IEEE Trans. Signal Processing
 , 
2006
, vol. 
54
 (pg. 
2394
-
2407
)
Tibes
R
, et al.  . 
Reverse phase protein array (RPPA): validation of a novel proteomic technology and utility for analysis of primary leukemia specimens and hematopoetic stem cells (HSC)
Mol. Cancer. Ther.
 , 
2006
, vol. 
5
 (pg. 
2512
-
2521
)
Toney
JH
, et al.  . 
A ‘mix and read’ assay for insulin using fluorometric microvolume assay technology
Assay Drug Dev. Technol.
 , 
2003
, vol. 
1
 (pg. 
521
-
525
)
Torhorst
J
, et al.  . 
Tissue microarrays for rapid linking of molecular changes to clinical endpoints
Am. J. Pathol.
 , 
2001
, vol. 
159
 (pg. 
2249
-
2256
)
Wulfkuhle
JD
, et al.  . 
Signal pathway profiling of ovarian cancer from human tissue specimens using reverse-phase protein microarrays
Proteomics
 , 
2003
, vol. 
3
 (pg. 
2085
-
2090
)
Yan
F
, et al.  . 
Protein microarrays using liquid phase fractionation of cell lysates
Proteomics
 , 
2003
, vol. 
3
 (pg. 
1228
-
1235
)
Zha
H
, et al.  . 
Similarities of prosurvival signals in Bcl-2-positive and Bcl-2-negative follicular lymphomas identified by reverse phase protein microarray
Lab. Invest.
 , 
2004
, vol. 
84
 (pg. 
235
-
244
)

Author notes

Associate Editor: John Quackenbush

Comments

0 Comments