Data Preparation - RNA

RNA transcriptomics data can come from mRNA (messenger RNA), microRNA, lncRNA (long non-coding RNA) and other RNA types. Microarray and RNA-Seq gene expression platforms are used to run the experiments that generate this data.
 
DepMap Portal
The goal of the Dependency Map (DepMap) portal is to empower the research community to make discoveries related to cancer vulnerabilities by providing open access to key cancer dependencies, together with analytical and visualization tools.
 
DepMap Expression (mRNA) data
To process DepMap Expression data, we need to download the following datasets from the DepMap website.

 
Cell Line Sample Info
  1. DepMap_ID: Static primary key assigned by DepMap to each cell line
  2. cell_line_name
  3. stripped_cell_line_name: Cell line name with alphanumeric characters only
  4. CCLE_Name: Previous naming system that used the stripped cell line name followed by the lineage; no longer assigned to new cell lines
  5. alias: Additional cell line identifiers (not a comprehensive list)
  6. COSMIC_ID: Cell line ID used in the COSMIC cancer database
  7. sex: Sex of tissue donor if known
  8. source: Source of cell line vial used by DepMap
  9. Achilles_n_replicates: Number of replicates used in Achilles CRISPR screen passing QC
  10. cell_line_NNMD: Difference in the means of positive and negative controls normalized by the standard deviation of the negative control distribution
  11. culture_type: Growth pattern of cell line (Adherent, Suspension, Mixed adherent and suspension, 3D, or Adherent (requires laminin coating))
  12. culture_medium: Medium used to grow cell line
  13. cas9_activity: Percentage of cells remaining GFP negative on days 12-14 of the Cas9 activity assay, as measured by FACS
  14. RRID: Cellosaurus research resource identifier
  15. WTSI_Master_Cell_ID
  16. sample_collection_site: Tissue collection site
  17. primary_or_metastasis: Indicates whether tissue sample is from primary or metastatic site
  18. primary_disease: General cancer lineage category
  19. Subtype: Subtype of disease; specific disease name
  20. age: If known, age of tissue donor at time of sample collection
  21. Sanger_model_ID: Sanger Institute Cell Model Passport ID
  22. depmap_public_comments
  23. lineage: Cancer type classifications in a standardized form
  24. lineage_subtype
  25. lineage_sub_subtype
  26. lineage_molecular_subtype
 
Expression 
RNA-Seq TPM gene expression data (log2 transformed) for protein-coding genes only, quantified with RSEM (RNA-Seq by Expectation Maximization).
  • Rows: cell lines (Broad IDs)
  • Columns: genes (HGNC symbol and Entrez ID)
  • 19177 Genes
  • 1389 Cell Lines
  • 33 Primary Diseases
  • 37 Lineages
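
The counts above can be checked once both files are loaded. The following is a minimal sketch, assuming the expression table is indexed by DepMap_ID and that the sample info table carries the DepMap_ID, primary_disease and lineage columns listed earlier. The function name and the toy values are made up for illustration; they are not real DepMap records.

```python
import pandas as pd


def summarize_expression(expression: pd.DataFrame, sample_info: pd.DataFrame) -> dict:
    """Count genes, cell lines, primary diseases and lineages in an expression table.

    Assumes `expression` has one row per cell line (index = DepMap_ID) and
    one column per gene.
    """
    # Keep only the annotations of cell lines that actually have expression data.
    annotated = sample_info[sample_info["DepMap_ID"].isin(expression.index)]
    return {
        "genes": expression.shape[1],
        "cell_lines": expression.shape[0],
        "primary_diseases": annotated["primary_disease"].nunique(),
        "lineages": annotated["lineage"].nunique(),
    }


# Toy stand-ins for sample_info.csv and CCLE_expression.csv (illustrative values only).
toy_sample_info = pd.DataFrame(
    {
        "DepMap_ID": ["ACH-000001", "ACH-000002", "ACH-000003"],
        "primary_disease": ["Disease A", "Disease B", "Disease C"],
        "lineage": ["lineage_a", "lineage_b", "lineage_c"],
    }
)
toy_expression = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    index=["ACH-000001", "ACH-000003"],
    columns=["TSPAN6 (7105)", "TP53 (7157)"],
)

summary = summarize_expression(toy_expression, toy_sample_info)
print(summary)
```

With the real files, the tables would presumably be loaded with `pd.read_csv("sample_info.csv")` and `pd.read_csv("CCLE_expression.csv", index_col=0)`; treating the first column of the expression file as the cell line ID is an assumption about the file layout.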
 
Data Processing 
Not every DepMap_ID in the "sample_info.csv" file is present in the "CCLE_expression.csv" file, so the two files have to be matched on DepMap_ID. It is also better to keep the features (genes/probes) in a separate file, based on the following data model. You can download a file by clicking on its file name.
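
The two processing steps described here can be sketched in pandas as follows. This is a minimal sketch under stated assumptions, not the exact procedure behind the downloadable files: it assumes the expression columns are labelled "SYMBOL (EntrezID)", and the function names, output column names and toy values are hypothetical.

```python
import re

import pandas as pd

# Assumed column label pattern: HGNC symbol followed by the Entrez ID in parentheses.
GENE_LABEL = re.compile(r"^(?P<symbol>.+?) \((?P<entrez_id>\d+)\)$")


def align_samples(sample_info: pd.DataFrame, expression: pd.DataFrame):
    """Keep only cell lines that appear in both tables, in the same row order."""
    mask = expression.index.isin(sample_info["DepMap_ID"])
    aligned_expression = expression.loc[mask]
    aligned_samples = sample_info.set_index("DepMap_ID").loc[aligned_expression.index]
    return aligned_samples, aligned_expression


def build_gene_table(expression: pd.DataFrame) -> pd.DataFrame:
    """Build a separate feature table with one row per expression column."""
    rows = []
    for label in expression.columns:
        match = GENE_LABEL.match(label)
        if match is None:
            raise ValueError(f"Unexpected gene label: {label!r}")
        rows.append(
            {
                "label": label,
                "symbol": match["symbol"],
                "entrez_id": int(match["entrez_id"]),
            }
        )
    return pd.DataFrame(rows, columns=["label", "symbol", "entrez_id"])


# Toy stand-ins for the two downloads (illustrative values only).
sample_info = pd.DataFrame(
    {
        "DepMap_ID": ["ACH-000001", "ACH-000002", "ACH-000003"],
        "lineage": ["lineage_a", "lineage_b", "lineage_c"],
    }
)
expression = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    index=["ACH-000001", "ACH-000003"],
    columns=["TSPAN6 (7105)", "TP53 (7157)"],
)

samples, expr = align_samples(sample_info, expression)
genes = build_gene_table(expr)
print(samples)
print(genes)
```

Keeping the gene table separate means the expression matrix can use a single short key per column, while symbols and Entrez IDs stay available for lookups and joins.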

 
Bioada SmartArray 
This video shows how you can upload the CCLE_gene_cn files to Bioada SmartArray and then explore, analyze, visualize and build predictive models significantly faster and easier.