Patterns in TCGA Data [Work in Progress]

Notes on patterns in TCGA data, accumulated over time.

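## TCGA barcode information
A [TCGA barcode](https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/) is composed of a set of identifiers, each of which identifies one specific TCGA data element. The figures below show how the metadata identifiers make up a barcode: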

![](/media/202301/2023-01-10_111222&hn4B57yZFz8wgs6pxDo1.png)

![](/media/202301/2023-01-10_111244&dYfbOiMeLsgIn0UhS57W.png)

### Extracting the group information hidden in TCGA IDs with R

![](/media/202301/2023-01-10_103713&OBb1Hv7p9UDx6YeQLTIw.png)

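The figure shows TCGA sample IDs. The group information sits in characters 14-15 of the ID (the sample type code): 01-09 is tumor and 10-19 is normal, while the GDC barcode documentation reserves 20-29 for control samples. I prepared an example dataset; reply 0129 to the 生信星球 WeChat official account to get it.
From these IDs we build the group information as a vector that looks like this:

![](/media/202301/2023-01-10_103721&5kREKYZp4Pbwn7I8HM3L.png)

If characters 14-15 of the ID fall in 1-9, the sample is labeled tumor; every other code is labeled normal. That is fine as long as the data contain no control samples (20-29), which would otherwise be lumped in with normal. The task involves substring extraction, type conversion, the %in% operator, and the ifelse function (a vectorized form of if-else, not a loop).
Put the example data in your working directory: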

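```R
load(file="id.Rdata") # expected to provide the sample ID vector `id`
table(substring(id,14,15)) # tabulate how often each sample type code occurs
num <- as.numeric(substring(id,14,15)) # the substring is a character string, so convert it to a number
# the ever-reliable ifelse: codes 1-9 are tumor, everything else is treated as normal
group_list=ifelse(num %in% 1:9,"Tumor","Normal")
```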
To try this on your own data, store the sample IDs in a variable named id and run the code. This group vector will be used later for plotting.
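If your data might contain control samples, a stricter version can keep the three ranges apart. The following is only a sketch: `classify_tcga_sample` is a hypothetical helper name, and the three barcodes are illustrative strings in the TCGA format, not taken from the example dataset.

```R
# Hypothetical helper: map the sample type code (characters 14-15) to a group label.
classify_tcga_sample <- function(barcodes) {
  code <- suppressWarnings(as.integer(substring(barcodes, 14, 15)))
  group <- rep(NA_character_, length(code))  # anything unrecognised stays NA
  group[code %in% 1:9]   <- "Tumor"
  group[code %in% 10:19] <- "Normal"
  group[code %in% 20:29] <- "Control"
  group
}

# Illustrative barcodes only (assumed, not real study data)
example_ids <- c(
  "TCGA-02-0001-01C-01D-0182-01",
  "TCGA-02-0001-11A-01R-0177-01",
  "TCGA-02-0001-20A-01R-0177-01"
)
groups <- classify_tcga_sample(example_ids)
print(table(groups, useNA = "ifany"))
```

Leaving unrecognised codes as NA, rather than silently calling them normal, makes unexpected IDs easy to spot before plotting.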

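## RNA-Seq data format update
> In April 2022 I noticed that TCGA's RNA-Seq data had been quietly updated. Pick a project, for example TCGA-LUSC, restrict it to open-access data and select RNA-Seq: workflow.type now offers only "STAR - Counts", and HTSeq - Counts is gone.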

![](/media/202301/2023-01-12_143304&yXgGlTn4Hbxi8dpUeZEs.png)

![](/media/202301/2023-01-12_143351&OqmUA1LlCFDaXQ5KwWMz.png)

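Under STAR - Counts there are 1106 files. Of these, 553 are Gene Expression Quantification files (.rna_seq.augmented_star_gene_counts.tsv), which are the ones we need to assemble the expression matrix. The other 553 are Splice Junction Quantification files (*.rna_seq.star_splice_junctions.tsv.gz), used mainly to detect novel transcripts or fusions; all of them are controlled-access and cannot be viewed without authorization.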

![](/media/202301/2023-01-12_143951&L3YEzs21DMWe4gXNPjKi.png)

Download these 553 Gene Expression Quantification files:
```R
library(TCGAbiolinks) # provides GDCquery() and GDCdownload()
query_TCGA = GDCquery(project="TCGA-LUSC", data.category="Transcriptome Profiling", experimental.strategy="RNA-Seq", workflow.type="STAR - Counts", access="open")
GDCdownload(query = query_TCGA, files.per.chunk = 10) # download in chunks of 10 files
```

## About the STAR - Counts files

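Each downloaded folder contains one star_gene_counts.tsv file. Open any of them and you will see that the content is completely different from the old format and carries more information. It even includes the RNA type, which makes it easy to tell mRNA from lncRNA, and it also includes the gene name and gene ID.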

![](/media/202301/2023-01-12_144744&yc8Ki7k2pjLvd0XqaSlQ.png)

Besides the STAR counts, the file also contains TPM, FPKM and FPKM-UQ values. For how each is calculated, see the [official TCGA (GDC) documentation](https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/).
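Reading one of these files in base R could look roughly like the sketch below. To keep it self-contained it writes a tiny mock file first. The column names (`gene_id`, `gene_name`, `gene_type`, `unstranded`, `tpm_unstranded` and so on), the leading `#` comment line and the `N_*` summary rows are my assumptions about the layout, and every value is made up, so check them against a real file before relying on this.

```R
# Mock file imitating the assumed layout; all names and numbers are invented.
cols <- c("gene_id", "gene_name", "gene_type", "unstranded", "stranded_first",
          "stranded_second", "tpm_unstranded", "fpkm_unstranded", "fpkm_uq_unstranded")
mock_lines <- c(
  "# mock header comment",
  paste(cols, collapse = "\t"),
  "N_unmapped\t\t\t100\t100\t100\t\t\t",
  "N_multimapping\t\t\t200\t200\t200\t\t\t",
  "MOCK_ID_1\tGENE_A\tprotein_coding\t1000\t500\t500\t37.04\t6.67\t8.76",
  "MOCK_ID_2\tGENE_B\tlncRNA\t50\t25\t25\t1.85\t0.33\t0.44",
  "MOCK_ID_3\tGENE_C\tprotein_coding\t0\t0\t0\t0\t0\t0"
)
path <- tempfile(fileext = ".tsv")
writeLines(mock_lines, path)

# Skip '#' comment lines, then drop the N_* alignment summary rows
raw  <- read.delim(path, comment.char = "#", stringsAsFactors = FALSE)
expr <- raw[!grepl("^N_", raw$gene_id), ]

# Use the RNA type column to keep only mRNA (protein-coding) genes
is_mrna <- expr$gene_type == "protein_coding"
mrna_counts <- setNames(expr$unstranded[is_mrna], expr$gene_name[is_mrna])
print(mrna_counts)
```

Repeating this per sample and binding the resulting vectors by gene would give the merged expression matrix.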

> The STAR counts are the simplest: they give the number of reads aligned to each gene:

![](/media/202301/2023-01-12_145015&Zq5Ng0QdlCoTfMObx8WU.png)

> FPKM
The fragments per kilobase of transcript per million mapped reads (FPKM) calculation aims to control for transcript length and overall sequencing quantity.

> FPKM-UQ (Upper Quartile FPKM)
The upper quartile FPKM (FPKM-UQ) is a modified FPKM calculation in which the protein coding gene in the 75th percentile position is substituted for the sequencing quantity. This is thought to provide a more stable value than including the noisier genes at the extremes.

> TPM
The transcripts per million calculation is similar to FPKM, but the difference is that all transcripts are normalized for length first. Then, instead of using the total overall read count as a normalization for size, the sum of the length-normalized transcript values are used as an indicator of size.

![](/media/202301/2023-01-12_145243&n2GuINidRz5LvCHEsYoF.png)

Worked example:
```
Examples 
Sample 1: Gene A

Gene length: 3,000 bp
1,000 reads mapped to Gene A
50,000,000 reads mapped to all protein-coding regions
Read count in Sample 1 for 75th percentile gene: 2,000
Number of protein coding genes on autosomes: 19,029
Sum of length-normalized transcript counts: 9,000,000

FPKM for Gene A = 1,000 * 10^9 / (3,000 * 50,000,000) = 6.67

FPKM-UQ for Gene A = 1,000 * 10^9 / (3,000 * 2,000 * 19,029) = 8.76

TPM for Gene A = (1,000 * 1,000 / 3,000) * 1,000,000 / (9,000,000) = 37.04
```
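The same arithmetic can be checked in R. This is just a sketch with variable names of my own choosing. For the total number of protein-coding reads it uses 50,000,000, the figure that appears in the FPKM formula line and the one that reproduces 6.67.

```R
# Inputs from the worked example (variable names are my own)
gene_length_bp        <- 3000    # length of Gene A
reads_gene            <- 1000    # reads mapped to Gene A
reads_protein_coding  <- 50e6    # total used in the FPKM formula line
uq_count              <- 2000    # read count of the 75th percentile gene
n_protein_coding      <- 19029   # protein-coding genes on autosomes
sum_length_normalized <- 9e6     # sum of length-normalized transcript counts

fpkm    <- reads_gene * 1e9 / (gene_length_bp * reads_protein_coding)
fpkm_uq <- reads_gene * 1e9 / (gene_length_bp * uq_count * n_protein_coding)
tpm     <- (reads_gene * 1000 / gene_length_bp) * 1e6 / sum_length_normalized

result <- round(c(FPKM = fpkm, FPKM_UQ = fpkm_uq, TPM = tpm), 2)
print(result)  # FPKM 6.67, FPKM_UQ 8.76, TPM 37.04
```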