In other words, the activities of signature genes could be used to predict drug sensitivity. One may extend this hypothesis further, such that a prediction of pharmacological levels in one cell type could be extrapolated to other cell types. Applications of these hypotheses have been developed in many studies. One of the most notable works is the Connectivity Map (cMap) project, in which 4 human cell lines were treated with 1,309 chemical compounds at different dosages and their expression profiles were generated. A prediction algorithm based on gene set enrichment analysis was also developed to rank compounds against input signature genes obtained from a tumor comparison. This project has been widely adopted and developed in the drug discovery area.
Several treatment candidates have been discovered for cancer cell lines by directly applying the cMap approach. With the idea of searching for a signature inverse to the phenotype of interest, this approach has been extended to predict the treatment potential of compounds not included in the cMap project. In addition to the original cMap approach, multiple other methods have been developed based on cMap data, either as new drug repositioning approaches or to improve the performance of the existing cMap. Although cMap has been widely applied, problems remain to be resolved before its predictions are reliable. First, cMap does not differentiate between cell lines in its prediction. Oftentimes, the top-ranked drugs are from cell lines different from the query cell line.
However, our investigation suggested that the drug effect is cell line dependent, and that the higher ranks of drugs from other cell lines reflect cell line effects rather than drug effects. As a result, considering drug samples from other cell lines introduces only noise into drug prediction. Second, the quality of the data samples in cMap is inconsistent. Some samples from the same drug treatment behave considerably differently from the rest. These samples will inevitably produce erroneous predictions. Third, the query signature gene set in cMap is chosen to include the top up- and down-regulated genes. However, the size of the gene set is determined in a rather ad hoc manner. As a result, one might miss important signature genes by choosing a smaller gene set or, conversely, bring in unrelated genes that only serve to degrade the prediction.
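To make the third point concrete, the following is a minimal sketch of how a query signature might be built by taking the n most up-regulated and n most down-regulated genes. The function name `select_signature`, the gene labels, and the fold-change values are hypothetical and serve only as an illustration; this is not the cMap implementation.

```python
from typing import Dict, List, Tuple


def select_signature(
    log_fold_changes: Dict[str, float], n: int
) -> Tuple[List[str], List[str]]:
    """Return (up_genes, down_genes): the n largest and n smallest fold changes."""
    if n < 1 or 2 * n > len(log_fold_changes):
        raise ValueError("n must satisfy n >= 1 and 2 * n <= number of genes")
    # Order gene names from most up-regulated to most down-regulated.
    ranked = sorted(log_fold_changes, key=log_fold_changes.get, reverse=True)
    up_genes = ranked[:n]
    # Reverse the tail so the most down-regulated gene comes first.
    down_genes = ranked[-n:][::-1]
    return up_genes, down_genes


# Toy log fold changes for eight hypothetical genes (illustrative values only).
toy = {
    "G1": 3.2, "G2": 2.1, "G3": 0.4, "G4": 0.1,
    "G5": -0.2, "G6": -0.3, "G7": -1.9, "G8": -2.8,
}

small_up, small_down = select_signature(toy, 2)
large_up, large_down = select_signature(toy, 4)
```

With n = 2 the signature keeps only the strongly changed genes (G1, G2, G7, G8). With n = 4 it also admits G3 to G6, whose fold changes are close to zero. This is the trade-off described above: a small n may drop relevant genes, while a large n may add uninformative ones.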
As an example, we used the expression data for the estradiol (E2) treated MCF7 cell line as a query to cMap, and the genes corresponding to the highest 100 and lowest 100 fold changes were used as the query gene set. Naturally, we would expect E2 to rank high in the predicted list of drugs. However, E2 was only ranked 828 among over 1,200 drugs. The reason for this low ranking is that the result is a summary of the rankings of E2 samples across all cell lines, which are mixed.
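The dilution effect in this example can be illustrated with a small sketch. It assumes, purely for illustration, that each drug's summary score is the mean of its per-sample scores; cMap's actual summary statistic may differ. The function `rank_drugs`, the drug and cell line labels other than estradiol and MCF7, and all scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional, Tuple

# One treated sample: (drug, cell_line, connectivity-like score).
Instance = Tuple[str, str, float]


def rank_drugs(instances: List[Instance], cell_line: Optional[str] = None) -> List[str]:
    """Rank drugs by mean score, optionally using samples from one cell line only."""
    scores: Dict[str, List[float]] = defaultdict(list)
    for drug, line, score in instances:
        if cell_line is None or line == cell_line:
            scores[drug].append(score)
    # Highest mean score first.
    return sorted(scores, key=lambda d: mean(scores[d]), reverse=True)


# Made-up scores: estradiol matches the query strongly in MCF7 only.
instances = [
    ("estradiol", "MCF7", 0.9), ("estradiol", "lineB", -0.6), ("estradiol", "lineC", -0.5),
    ("drugA", "MCF7", 0.3), ("drugA", "lineB", 0.2), ("drugA", "lineC", 0.3),
    ("drugB", "MCF7", 0.1), ("drugB", "lineB", 0.1), ("drugB", "lineC", 0.0),
]

pooled_ranking = rank_drugs(instances)
mcf7_ranking = rank_drugs(instances, cell_line="MCF7")
```

In this toy data, estradiol ranks last when all cell lines are pooled and first when only MCF7 samples are considered. This mirrors the observation that mixing samples across cell lines can bury a drug that matches the query well in the query's own cell line.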