Review

The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge

Katarzyna Tomczak 1,2, Patrycja Czerwińska 1,2, Maciej Wiznerowicz 2,3

1 Postgraduate School of Molecular Medicine, Medical University of Warsaw, Warsaw, Poland
2 Laboratory of Gene Therapy, Department of Cancer Immunology, The Greater Poland Cancer Centre, Poznan, Poland
3 Department of Cancer Immunology and Diagnostics, Chair of Medical Biotechnology, Poznan University of Medical Sciences, Poznan, Poland

The Cancer Genome Atlas (TCGA) is a publicly funded project that aims to catalogue and discover major cancer-causing genomic alterations to create a comprehensive “atlas” of cancer genomic profiles. So far, TCGA researchers have analysed large cohorts covering over 30 human tumour types through large-scale genome sequencing and integrated multi-dimensional analyses. Studies of individual cancer types, as well as comprehensive pan-cancer analyses, have extended current knowledge of tumorigenesis. A major goal of the project was to provide publicly available datasets to help improve diagnostic methods and treatment standards and, ultimately, to prevent cancer. This review discusses the current structure, purpose, and achievements of the TCGA Research Network.

Key words: The Cancer Genome Atlas (TCGA), cancer genomics, big data analysis.

Contemp Oncol (Pozn) 2015; 19 (1A): A68–A77
DOI: 10.5114/wo.2014.47136

New roads to conquer cancer

Cancer is considered the most complex disease that mankind has to face. More than 200 forms of cancer have been described, and each type can be characterised by different molecular profiles requiring unique therapeutic strategies. Cancer involves dynamic changes in the genome [1]. The architecture of the genetic aberrations that occur, such as somatic mutations, copy number variations, changed gene expression profiles, and different epigenetic alterations, is unique for each type of cancer. Meeting the demand for better diagnosis, treatment, and prevention of cancer depends strongly on a better understanding of genetic changes in the tumour. The latest
progress in the technological development of genome-wide sequencing and bioinformatics has shed new light on the cancer genome [2–4]. The Cancer Genome Atlas (TCGA), launched in 2005, and the International Cancer Genome Consortium (ICGC), launched in 2008, are the two main projects accelerating the comprehensive understanding of the genetics of cancer using innovative genome analysis technologies, helping to generate new cancer therapies, diagnostic methods, and preventive strategies [5, 6].

The National Institutes of Health (NIH) launched the TCGA Pilot Project to create a comprehensive “atlas” of cancer genomic profiles. TCGA is a publicly funded project that aims to catalogue and discover major cancer-causing genome alterations in large cohorts covering over 30 human tumour types through large-scale genome sequencing and integrated multi-dimensional analyses. Providing publicly available cancer genomic datasets will allow the improvement of diagnostic methods, treatment standards, and, finally, cancer prevention. Phase I of the project (a 3-year pilot study) aimed to develop and test the research infrastructure based on the characterisation of selected tumours with poor prognosis: brain, lung, and ovarian cancers. Since 2009 (phase II), analyses have expanded to additional types, reaching 30 different tumour types analysed by 2014. The TCGA project engaged scientists and managers from the NIH's National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), funded by the US government, and cooperated with institutions across the USA and Europe. To run the project, the NCI and the NHGRI each invested $50 million for the 3-year pilot study. Additional funding was also provided from different sources, such as the American Recovery and Reinvestment Act (ARRA),
to help stimulate the US economy in the context of biomedicine [5–7].

In this review, we provide a short description of the TCGA structure and the major goals of the project. Furthermore, we outline current knowledge of the platforms, analytical tools, and visualisation methods that were applied for TCGA data generation. As it would be overwhelming to discuss all the new discoveries in cancer profiling, we have focused on updates for the main tumour types with poor overall prognosis. We hope
that an understanding of some of the fundamentals, recent updates of cancer genomic profiles, and new discoveries utilising open-access TCGA data will allow researchers to extend their current knowledge in this area and thereby help to find new roads for cancer treatment and prevention.

The Cancer Genome Atlas Research Network

The structure of TCGA is well organised and involves several cooperating centres responsible for sample collection and processing, followed by high-throughput sequencing and sophisticated bioinformatics data analyses (Table 1). First, different Tissue Source Sites (TSSs) collect the required biospecimens (blood, tissue) from eligible cancer patients and deliver them to the Biospecimen Core Resource (BCR). Next, the BCR catalogues, processes, and verifies the quality and quantity of samples, then submits clinical data and metadata to the Data Coordinating Center (DCC) and provides molecular analytes to the Genome Characterization Centers (GCCs) and Genome Sequencing Centers (GSCs) for further genomic characterisation and high-throughput sequencing. Then, sequence-related data are deposited in the DCC. The Genome Characterization Centers also submit
trace files, sequences, and alignment mappings to the NCI's Cancer Genomics Hub (CGHub) secure repository. The generated genomic data are made available to the research community and Genome Data Analysis Centers (GDACs). The GDACs provide new information-processing, analysis, and visualisation tools to the entire research community to facilitate broader use of TCGA data. Furthermore, the information generated by the TCGA Research Network is centrally managed at the DCC and entered into public free-access databases (TCGA Portal, NCBI's Trace Archive, CGHub), allowing scientists to continually access the cancer datasets and to speed advancements in cancer biology and linked technologies (Fig. 1) [8].

Platforms and data types

To provide comprehensive analysis of cancer genome profiles, TCGA applied high-throughput technologies based on microarrays (to test nucleic acids and proteins) and next-generation sequencing methods (for global analysis of nucleic acids). The research network structure includes many centres utilising different platforms to provide global information on cancer genomics. Some of the applied methods are briefly described below.

RNA sequencing (RNAseq) is a high-throughput technology for transcriptome (total RNA) profiling, deriving strand information with very high precision. RNAseq is able to rapidly identify and quantify rare and common transcripts, isoforms, novel transcripts, gene fusions, and non-coding RNAs across a wide range of samples, including low-quality samples [9]. For transcriptome analysis, TCGA uses a platform based on the Illumina system. The deposited TCGA data contain information about both nucleotide sequence and gene expression. RNA sequence alignment provides different levels of information, such as RNA sequence coverage, sequence variants (e.g. fusion genes), and expression of genes, exons, or junctions. The NCBI dbGaP database is the official repository for the actual sequence data [10].

MicroRNA sequencing (miRNAseq) is a type of RNA-Seq, utilising material enriched in small RNAs, allowing the detection of specific sets of
short, noncoding RNAs (miRNAs) that have the capacity to regulate hundreds of genes within and across diverse signalling pathways. Moreover, miRNA sequencing can define tissue-specific miRNA expression profiles and their isoforms, reveal connections with diseases, and uncover previously unreported miRNAs [11–15].

DNA sequencing (DNAseq) is a high-throughput method for determining the nucleotide sequence of a DNA molecule, providing information about DNA alterations such as insertions, deletions, and polymorphisms, as well as copy number variation, mutation frequencies, or viral infection events. To catalogue the genomic diversity across cancer types, TCGA Genome Sequencing Centers utilise DNA sequencing systems based on Sanger Sequencing [16–18].

SNP-based platforms are used to analyse genome-wide structural variation across multiple cancer genomes. TCGA researchers selected powerful array-based genotyping platforms able to define single nucleotide polymorphisms (SNPs), copy number variation (CNV), and loss of heterozygosity (LOH) across multiple samples [19, 20].

Array-based DNA methylation analysis is a high-throughput, genome-wide method for profiling DNA methylation, providing information
on epigenetic changes in the genome. An abnormal DNA methylation profile at CpG sites is among the earliest and most frequent alterations in cancer [21, 22]. TCGA utilises DNA methylation assays mainly based on the Illumina platform, assuring single-base-pair resolution, high accuracy, easy workflows, and low input DNA requirements. Methylation profiling technologies are based on highly multiplexed genotyping of bisulphite-converted genomic DNA. The TCGA DNA methylation data files contain information on signal intensities (raw and normalised), detection confidence, and calculated beta values for methylated (M) and unmethylated (U) probes [23].

Reverse-phase protein array (RPPA) is a highly sensitive (detecting nanograms of proteins), reproducible, high-throughput, functional and quantitative proteomic method for large-scale protein expression profiling, biomarker discovery,
and cancer diagnostics. Reverse-phase protein array is an antibody-based technique allowing for the analysis of 1000 samples with up to 500 different antibodies at a time. Protein arrays contain data on protein expression and concentration. The data archives are deposited to the TCGA DCC and include original images of protein arrays, calculated raw signals, relative concentrations of proteins, and normalised protein signals [24–28].

Table 1. The Cancer Genome Atlas (TCGA) organisation centres. Based on [7]

| Centre Name | Centre Description | Localisation |
|---|---|---|
| Tissue Source Sites (TSSs) | Collection of the samples (blood and tissue from tumour and normal controls) and clinical metadata from patients (donors). Shipment of the annotated biospecimens to the Biospecimen Core Resource (BCR) | https://wiki.nci.nih.gov/display/TCGA/Tissue+Source+Site; https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?codeTable=tissue%20source%20site |
| Biospecimen Core Resource (BCR) | Coordination of sample delivery and data collection; cataloguing, processing, and verifying the quality and quantity of samples. Isolation and distribution of RNA and DNA from biospecimens to other institutions for genomic characterisation and high-throughput sequencing | http://cancergenome.nih.gov/abouttcga/overview/howitworks/bcr; http://www.nationwidechildrens.org/biospecimen-core-resource-about-us; Research Institute at Nationwide Children's Hospital in Columbus, Ohio |
| Genome Sequencing Centers (GSCs) | High-throughput sequencing (data are available in the TCGA Data Portal or at the NIH's database of Genotype and Phenotype). Identification of the DNA alterations | http://cancergenome.nih.gov/abouttcga/overview/howitworks/sequencingcenters; Broad Institute Sequencing Platform in Cambridge; Human Genome Sequencing Center, Baylor College of Medicine in Houston; The Genome Institute at Washington University |
| Cancer Genome Characterization Centers (GCCs) | Utilisation of novel technologies and multiple platforms. Comprehensive description of the genomic changes: alterations in miRNA and gene expression, SNP, CNV, and others | http://cancergenome.nih.gov/abouttcga/overview/howitworks/characterizationcenters; Copy Number Alteration (Brigham and Women's Hospital and Harvard Medical School in Boston, The Broad Institute in Cambridge); Epigenomics (University of Southern California in Los Angeles, Johns Hopkins University in Baltimore); Gene (mRNA) Expression (University of North Carolina at Chapel Hill); miRNA Analysis (British Columbia Cancer Agency in Vancouver); Targeted Sequencing Center (Baylor College of Medicine in Houston); Functional Proteomics (MD Anderson Cancer Center) |
| Proteome Characterization Centers (PCCs) | Identification of cancer-specific proteins | http://cancergenome.nih.gov/abouttcga/overview/howitworks/proteomecharacterization; Cancer Proteomic Center; Center for Application of Advanced Clinical Proteomic Technologies for Cancer; Proteo-Genomic Discovery Prioritization and Verification of Cancer Biomarkers; Proteome Characterization Center and Vanderbilt Proteome Characterization Center |
| Data Coordinating Center (DCC) | Management of all generated data and their transfer to public databases (TCGA Data Portal and Cancer Genomics Hub) | http://cancergenome.nih.gov/abouttcga/overview/howitworks/datasharingmanagement |
| Cancer Genomics Hub (CGHub) | Storage, cataloguing, and access to lower levels of cancer genome sequences and alignments | http://cancergenome.nih.gov/abouttcga/overview/howitworks/SharingAndManagingLowerLevelSeqData; University of California Santa Cruz |
| Genome Data Analysis Centers (GDACs) | Development of novel informatics tools to facilitate processing and integrating data analyses across the entire genome | http://cancergenome.nih.gov/abouttcga/overview/howitworks/dataanalysiscenters; Broad Institute, Cambridge, Massachusetts; Institute for Systems Biology, Seattle, Washington, University of Texas MD Anderson Cancer Center, Houston, Texas; Memorial Sloan-Kettering Cancer Center, New York, New York; Oregon Health and Science University, Portland, Oregon; University of California, Santa Cruz, California; Buck Institute for Research on Aging, Novato, California; University of North Carolina at Chapel Hill, Chapel Hill, North Carolina; University of Texas MD Anderson Cancer Center, Houston, Texas |

Each platform can potentially produce many kinds of data (data types), such as the following: gene expression, exon expression, miRNA expression, copy number variation (CNV), single nucleotide polymorphism (SNP), loss of heterozygosity (LOH), mutations, DNA methylation, and protein expression. Generated data are categorised not only by data type but also by data level. Raw, non-normalised data (Level I), processed data (Level II), and segmented/interpreted data (Level III) apply to individual samples, while summarised data (Level IV) refer to analyses across sample sets. Importantly, Level III and IV data are freely available from the publicly accessible