The Pitfall of “Cleaned” Data — Why FPKM and TPM Are Not Enough: Insights from GSE159751

Case Study of GSE159751


Why You Should Use Gene Counts Instead of TPM/FPKM in RNA-Seq Analysis

Understanding the Difference Between TPM and FPKM Is Not Enough for RNA-Seq Analysis

TPM and FPKM are commonly seen as expression values in RNA-Seq data. For this reason, many people search for terms such as “TPM FPKM,” “difference between FPKM and TPM,” or “RNA-Seq analysis using TPM.”

Understanding the difference between TPM and FPKM is worthwhile. However, if you frame RNA-Seq data analysis only as “which should I use, TPM or FPKM?”, you may miss a more important issue. Here, let us shift the perspective and consider RNA-Seq analysis as a series of decisions covering data preparation, normalization, visualization, differential expression analysis, and result checking.

For Differential Expression Analysis, Use Gene Counts Instead of TPM or FPKM

Widely used tools for RNA-Seq differential expression analysis, such as DESeq2, edgeR, and limma-voom, are designed to start from Gene Counts. Using TPM or FPKM as the main input data for differential expression analysis does not fit the underlying concept of these standard methods.

This warning is well known among people experienced in RNA-Seq analysis. However, TPM and FPKM tables are often provided in public databases or supplementary files of published papers. As a result, beginners may easily assume that “if TPM or FPKM values are available, they can be used directly for analysis.”

In other words, if you are performing differential expression analysis, the starting point is not to choose between TPM and FPKM, but to prepare Gene Counts. Then, depending on the analysis method, normalization or transformation should be applied, and the results need to be checked.

For PCA and Clustering, There Is No Need to Switch to TPM or FPKM

Here, “using Gene Counts” does not mean applying raw Gene Counts directly to PCA or clustering. It means starting from Gene Counts and using appropriately normalized or transformed values according to the purpose of the analysis, while checking library size, data distribution, and relationships among samples. Appropriately normalized and transformed Gene Counts are, of course, also useful for PCA and clustering.
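As a rough illustration of “starting from Gene Counts and transforming them before PCA,” the following Python sketch applies a simple log-CPM transformation and then computes PCA scores with a plain SVD. It is only a minimal sketch under stated assumptions: the function names `log_cpm` and `pca_scores`, the `prior_count` parameter, and the simulated toy counts are hypothetical, and the simple log-CPM used here is a stand-in for whichever normalization or transformation (for example TMM or VST) you actually choose.

```python
import numpy as np

def log_cpm(counts, prior_count=1.0):
    """Simple log2 counts-per-million (genes x samples).

    This is a plain library-size scaling followed by a log transform,
    not the exact formula of any particular package.
    """
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=0, keepdims=True)   # total reads per sample
    cpm = counts / lib_size * 1e6
    return np.log2(cpm + prior_count)

def pca_scores(values, n_components=2):
    """PCA on samples (columns of `values`) via SVD of the centered matrix."""
    x = np.asarray(values, dtype=float).T          # samples as rows
    x = x - x.mean(axis=0, keepdims=True)          # center each gene
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)          # variance fraction per PC
    return scores, explained[:n_components]

# Hypothetical toy data: 200 genes, 6 samples with different sequencing depths.
rng = np.random.default_rng(0)
base_level = rng.gamma(shape=2.0, scale=50.0, size=200)
depth = np.array([1.0, 2.0, 1.0, 3.0, 1.0, 2.0])
toy_counts = rng.poisson(base_level[:, None] * depth[None, :])

library_sizes = toy_counts.sum(axis=0)             # check this before anything else
scores, explained = pca_scores(log_cpm(toy_counts))
print("library sizes:", library_sizes)
print("PC score matrix shape:", scores.shape)
print("variance explained by PC1, PC2:", explained)
```

The point of the sketch is the order of operations: inspect library sizes, transform the counts, and only then look at the relationships among samples.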

FPKM and TPM are sometimes said to be usable for visualization, such as PCA or clustering. However, in many cases, this simply means that they are better than using unnormalized raw Gene Counts directly. It does not mean that FPKM or TPM is more suitable than appropriately normalized and transformed Gene Counts for sample comparison or for checking analysis results.

If differential expression analysis is performed from Gene Counts, then when checking the results by PCA or clustering, it is more consistent to use appropriately normalized and transformed Gene Counts.

Using Gene Counts for differential expression analysis, but FPKM or TPM for visualization and result checking, does not help interpretation. Rather, it makes the relationship between the analysis results and their visual confirmation harder to understand.

TPM and FPKM are useful for looking at relative expression levels among genes within the same sample. However, for sample comparison, differential expression analysis, and checking the results of that analysis, there is now little reason to use TPM or FPKM.

Even “Normalized” TPM or FPKM Values Cannot Always Be Trusted As They Are

Another problem is that TPM and FPKM are often treated as “already normalized” values. It is true that TPM and FPKM are calculated by considering gene length and library size. However, that alone does not necessarily make the data suitable for comparison among samples. Even supposedly normalized TPM or FPKM values may still be affected by differences in data distribution, data size, RNA quality, compositional bias, and other factors, making them unreliable for direct comparison.
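To make concrete what “considering gene length and library size” means, here is a minimal Python sketch of the standard FPKM and TPM definitions. The function names and the toy numbers are hypothetical, and real pipelines usually work with effective lengths and fragment counts rather than the simplified inputs assumed here.

```python
import numpy as np

def fpkm(counts, lengths_bp):
    """FPKM = count / (gene length in kb) / (library size in millions)."""
    counts = np.asarray(counts, dtype=float)                  # genes x samples
    lengths_kb = np.asarray(lengths_bp, dtype=float)[:, None] / 1e3
    per_million = counts.sum(axis=0, keepdims=True) / 1e6
    return counts / lengths_kb / per_million

def tpm(counts, lengths_bp):
    """TPM = length-normalized rate, rescaled so each sample sums to 1e6."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float)[:, None] / 1e3
    rate = counts / lengths_kb
    return rate / rate.sum(axis=0, keepdims=True) * 1e6

# Hypothetical toy data: 3 genes x 2 samples, gene lengths in bases.
toy_counts = np.array([[100, 200],
                       [300, 600],
                       [600, 200]])
toy_lengths = np.array([1000, 2000, 3000])

toy_fpkm = fpkm(toy_counts, toy_lengths)
toy_tpm = tpm(toy_counts, toy_lengths)
print(toy_fpkm)
print(toy_tpm.sum(axis=0))   # every sample sums to one million by construction
```

Both quantities only divide each sample by per-gene and per-sample constants. Nothing in these formulas looks at RNA quality, composition, or the shape of the distribution.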

In other words, even in the sense of being “normalized,” TPM and FPKM are not always sufficient for sample comparison or for checking analysis results. Additional normalization or correction may often be required. At that point, there is little reason to use TPM or FPKM as the main data for analysis.

In this article, using real data (GSE159751), we show the risk of taking FPKM or TPM values at face value and proceeding with analysis. Once the data are visualized, you may feel that something is wrong and ask, “Can this really be called normalized data?” This is also a risk of following command-line-based textbooks or workflows without visually checking the state of the data.

In RNA-Seq analysis, you should not feel safe simply because a command finished without errors, or because a procedure is called “normalization.” It is important to proceed with analysis while checking the data from multiple angles using various visualization tools.

Nonlinear Distortions Can Remain Even After Converting to TPM or FPKM

TPM and FPKM correct expression values by considering factors such as read count and gene length. However, adjusting for differences in read count or data size is a linear normalization. It can partially correct for differences in data amount, but it cannot remove all complex distortions contained in RNA-Seq data.

RNA-Seq data are affected by multiple intertwined factors, including RNA quality, sample composition, variation among low-expression genes, and read bias toward specific gene groups. Therefore, differences among samples may appear not only as simple scaling differences, but also as differences in the shape of the entire distribution. Such nonlinear distortions are not resolved simply by converting the data to TPM or FPKM.
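The following Python sketch illustrates, on simulated values only, why a scaling correction cannot repair a difference in distribution shape. On a log scale, multiplying a sample by a constant merely shifts the whole distribution. All names (`log_shape`, `rescale_to_median`, the three simulated samples) are hypothetical and chosen for illustration; the data are not from GSE159751.

```python
import numpy as np

rng = np.random.default_rng(0)

# sample_b is sample_a at 3x the scale: a purely linear difference.
sample_a = rng.lognormal(mean=2.0, sigma=1.5, size=5000)
sample_b = 3.0 * sample_a

# sample_c has a different (bimodal) shape, not just a different scale.
sample_c = np.concatenate([
    rng.lognormal(mean=0.5, sigma=0.5, size=2500),
    rng.lognormal(mean=4.0, sigma=0.5, size=2500),
])

def log_shape(values):
    """Percentiles of log2 values after centering on the median.

    Centering removes any pure scaling factor, so what remains
    describes only the shape of the distribution.
    """
    logged = np.log2(values)
    return np.percentile(logged - np.median(logged), [10, 25, 75, 90])

def rescale_to_median(values, target_median):
    """Linear correction: multiply by one constant to match a target median."""
    return values * (target_median / np.median(values))

sample_c_rescaled = rescale_to_median(sample_c, np.median(sample_a))

print("shape of a:          ", log_shape(sample_a))
print("shape of b (3x a):   ", log_shape(sample_b))           # same as a
print("shape of c, rescaled:", log_shape(sample_c_rescaled))  # still different
```

The linear difference between `sample_a` and `sample_b` disappears completely after centering, while the rescaled `sample_c` keeps its own shape no matter which constant is used.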

In this dataset, the shapes of the FPKM distributions differ substantially among samples. Some samples show distributions close to unimodal, indicating nonlinear distortion that cannot be handled by simple scaling correction.

[Figure: Subio Platform displaying histograms of the per-sample FPKM distributions]

Such distributional differences are easily overlooked if you only see that TPM or FPKM values have been output. However, if the data are properly visualized and checked, it is not difficult to notice samples with questionable quality or samples whose distributions differ greatly from the others.
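If no visualization tool is at hand, even a crude numerical summary of per-sample histograms can point to samples worth a closer look. The Python sketch below is one possible way to do this, not the procedure used in this article: the function names, the bin settings, and the “distance to the median profile” score are all assumptions made for illustration, and the data are simulated. It supplements, and does not replace, actually looking at the histograms.

```python
import numpy as np

def log_histograms(expr, bins=20, value_range=(0.0, 20.0)):
    """Per-sample histograms of log2(expr + 1) for a genes x samples matrix.

    Returns the bin edges and, for each sample (column), the fraction of
    genes falling in each bin.
    """
    logged = np.log2(np.asarray(expr, dtype=float) + 1.0)
    edges = np.linspace(value_range[0], value_range[1], bins + 1)
    fractions = np.stack(
        [np.histogram(logged[:, j], bins=edges)[0] / logged.shape[0]
         for j in range(logged.shape[1])],
        axis=1,
    )
    return edges, fractions

def distance_to_median_profile(fractions):
    """Sum of absolute differences between each sample's histogram and the
    bin-wise median histogram. Larger values mean a more unusual shape."""
    median_profile = np.median(fractions, axis=1, keepdims=True)
    return np.abs(fractions - median_profile).sum(axis=0)

# Hypothetical toy data: four similar samples and one with a different shape.
rng = np.random.default_rng(1)
similar = rng.lognormal(mean=2.0, sigma=1.5, size=(3000, 4))
unusual = rng.lognormal(mean=4.0, sigma=0.4, size=(3000, 1))
toy_expr = np.hstack([similar, unusual])

edges, fractions = log_histograms(toy_expr)
distances = distance_to_median_profile(fractions)
print("distance per sample:", np.round(distances, 3))
print("most unusual sample index:", int(np.argmax(distances)))
```

A score like this only says that a sample looks different. Deciding whether to keep, exclude, or separately interpret that sample remains the analyst's judgment.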

What matters is that the analyst understands that questionable samples are included, decides how to handle them, and interprets the analysis results based on that decision. In analyses that neglect visualization, this kind of evidence can easily be missed.

Adding More Normalization or Correction Does Not Always Make the Data Clean

Then, should we simply apply another normalization or correction method to the distortions remaining in TPM or FPKM? In reality, it is not that simple. Even methods such as TMM, VST, ComBat, and Quantile Normalization do not always make the data clean in the expected way.

Normalization and correction are not magic procedures that mechanically restore data to a “correct” state. Depending on the type and magnitude of distortion in the original data, sample quality, batch structure, and how these factors overlap with group differences, interpretation problems may remain even after correction. Also, even if the corrected data look cleaner, there is no guarantee that the change reflects the original biological state.

For more details on the limitations of applying TMM, VST, ComBat, Quantile Normalization, and other methods to RNA-Seq data, see the follow-up article “Limitations of Batch Effect Correction and Normalization in RNA-Seq.”

Quantile Normalization Is Not a Cure-All

In this video, we also try Quantile Normalization to forcibly align the FPKM distributions. As a result, the distribution shapes among samples appear more similar at first glance. However, clustering shows that aligning the distributions alone does not solve the problem. In other words, even if the distribution looks cleaner, that does not mean the data have become suitable for biological comparison.
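For readers who have not seen what Quantile Normalization does mechanically, here is a minimal Python sketch on simulated data. It is a textbook-style version with assumptions stated in the comments (ties are not averaged, and the names `quantile_normalize` and `within_sample_ranks` are hypothetical), not the implementation used in the video.

```python
import numpy as np

def quantile_normalize(expr):
    """Force every sample (column) to share the same distribution.

    Each column is sorted, the sorted columns are averaged row by row to
    build a reference distribution, and each value is replaced by the
    reference value at its within-sample rank. Ties are broken arbitrarily
    here rather than averaged.
    """
    expr = np.asarray(expr, dtype=float)
    order = np.argsort(expr, axis=0)                    # rank order per sample
    sorted_vals = np.take_along_axis(expr, order, axis=0)
    reference = sorted_vals.mean(axis=1)                # shared distribution
    normalized = np.empty_like(expr)
    np.put_along_axis(normalized, order, reference[:, None], axis=0)
    return normalized

def within_sample_ranks(expr):
    """Rank of every gene within its own sample (0 = lowest)."""
    return np.argsort(np.argsort(expr, axis=0), axis=0)

# Hypothetical toy data: 500 genes, 3 samples with different distributions.
rng = np.random.default_rng(2)
toy_expr = rng.lognormal(mean=np.array([1.0, 2.0, 3.0]), sigma=1.0, size=(500, 3))

toy_qn = quantile_normalize(toy_expr)
sorted_qn = np.sort(toy_qn, axis=0)
print("identical distributions:", np.allclose(sorted_qn, sorted_qn[:, [0]]))
print("ranks unchanged:",
      np.array_equal(within_sample_ranks(toy_qn), within_sample_ranks(toy_expr)))
```

After the procedure every sample has exactly the same set of values, yet the ordering of genes within each sample is untouched. This is one way to see why aligned histograms do not by themselves guarantee that the relationships among samples have become biologically meaningful.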

Assume Distortion Exists, and Find an Explainable Analysis Strategy

In RNA-Seq data analysis, nonlinear distortions and sample-to-sample differences are frequently present. Therefore, it is necessary to proceed with the mindset that such issues exist, and to decide how to handle them during the analysis.

What matters is neither to ignore distortion nor to expect algorithms to remove it completely. You need to understand precisely what changed after applying a given method, and then consider which samples to use, which normalization or correction method to adopt, and how far the results can be interpreted as biological differences. RNA-Seq data analysis may be described as the process of finding an analysis strategy that can be explained to others, while working through this kind of ambiguity.

What Experimental Biologists Should Do Before Relying on Algorithms

Thanks to the efforts of bioinformaticians, many algorithms are now available for addressing highly complex problems. However, wise experimental biologists should not forget that they must verify for themselves whether an algorithm truly works for their own data.

At present, it is safer to rely on careful experimental design and planning that yield high-quality raw data, together with visualization tools that accurately show the state of the data, than to rely blindly on “advanced” algorithms.

Once strong systematic errors are introduced into the data, removing their effects and performing a reliable analysis may become difficult or even impossible. To avoid such situations, Subio considers two things important: assessing the measurement technology in advance, and planning the experiment to reduce risk.

Rather than regretting it after the data are generated, let Subio assess your plan from a professional perspective. Based on our experience trying to rescue thousands of “failed datasets,” we can propose experimental plans designed to avoid failure. [Contact us]

What you need to learn is not command operation or how to use tools, but “data analysis.”
