10x Genomics: 3' scRNA Data
Last Update: February 05, 2024
Introduction:
This page shows the analysis of 10x 3’ single-cell expression data for a PBMC library. The pipeline converts an Ultima CRAM file into two simulated paired-end fastq files. The process is described here for a 10x 3’ library, but it can be easily adapted to other similar read structures.
Data and code location:
File | Size | Location |
---|---|---|
Input cram | 156.86 GB | s3://ultimagen-feb-2024-scrna/10x/035843-scRNA_3-Z0007-CGATTCATGCTCGAT_rq1.cram |
Output fastq R1 | 15.78 GB | s3://ultimagen-feb-2024-scrna/10x/035843-scRNA_3-Z0007-CGATTCATGCTCGAT_rq1_S1_L001_R1_001.fastq.gz |
Output fastq R2 | 50.37 GB | s3://ultimagen-feb-2024-scrna/10x/035843-scRNA_3-Z0007-CGATTCATGCTCGAT_rq1_S1_L001_R2_001.fastq.gz |
Statistics csv | 4.47 KB | s3://ultimagen-feb-2024-scrna/10x/035843-scRNA_3-Z0007-CGATTCATGCTCGAT_rq1_combined_statistics.csv |
WDL:
Path to wdl: https://github.com/Ultimagen/UltimaGenomicsApplications/tree/main/single_cell
Input template for wdl: [single_cell/Input_templates/single_cell_general_template.10x-atac.json]
Prerequisite Files and Skills:
- Cram file generated on Ultima Genomics UG 100™ tool
- Familiarity with the Linux command line
- Familiarity with the SAM/BAM/CRAM format
Software and Packages Used:
- Capability to execute Workflow Description Language (WDL)
- Trimmer, which runs the main step of the pipeline: https://github.com/Ultimagen/UltimaGenomicsApplications/tree/main/trimmer
Objectives:
- Produce simulated paired-end reads that can be processed using other software packages (e.g., STARsolo and Cellranger)
- Generate statistics (CSV file) which contain metrics regarding trimmed sequences, aligned sequences, data quality, etc.
Running Analysis pipelines:
Overview of processing steps
The analysis pipeline processes the single-ended reads to produce simulated paired-end reads that can be processed using other software packages (e.g., STARsolo and Cellranger). The following steps are run:
- The relevant json for the pipeline is taken as input (e.g., 10x 5’ GEX, Parse Biosciences WT, Fluent pipseq v3; for this example, 10x 3’ GEX). This json includes a Trimmer input json, which describes the relevant read structure, and specifies which sequence segments make up the barcode read. Optionally, downstream analyses can be configured to run once the paired reads are created. These include STARsolo, STAR, sorting of aligned output, and FastQC.
- Using Trimmer, adapters are trimmed, and cell barcode sequences are matched to a reference whitelist and trimmed. The sequence that remains after trimming is the insert. The barcode read is stored as a tag within the trimmed cram and can be built by combining the matched cell barcodes, UMIs, and expected linkers.
- The trimmed cram is used to create the two fastq files. The trimmed cram is first converted to fastq format using Demux, with the barcode read sequence saved in the fastq header. Next, the barcode read sequence in the header is written out as the barcode read (since this is a synthetic read, the quality is set to “I” for all barcode read bases), and the insert is retained as the insert read.
- Optionally, STARsolo, STAR, sorting of alignment output, and/or FastQC are run.
- The Trimmer statistics and any other statistics (STARsolo/STAR, if run) are gathered into a single csv.
Running locally
The steps for generating the simulated paired-end reads can be run using two docker images:
1. Running Trimmer
- The latest trimmer docker: us-central1-docker.pkg.dev/ganymede-331016/ultimagen/trimmer:master_b679323
- The following are the inputs:
- {trimmer_input_json} : The 3’ 10x GEX json input format is contained within the Trimmer docker. The path in the Trimmer docker is /trimmer/dev-formats/single_cell_trimmer_formats_10x_3p_v3_gex.json
- {trimmer_input_format} : The format to use within {trimmer_input_json}. For 3’ 10x GEX, the input format used is "10x V3 3' 9bp UMI", which takes the first 9bp of the 12bp UMI sequence.
- {read_structure_for_barcode_read} : For 10x 3’ v3, this should be "br:Z:%1%2TTT". This creates a br tag in the output cram, composed of the cell barcode (defined as token 1 in the Trimmer json), the UMI (defined as token 2 in the Trimmer json), and the sequence TTT, which effectively results in a 12bp UMI that has the last 3bp masked as T.
- {barcode_whitelist_path} : The path with the cell barcode list files. The 10x 3’ v3 whitelist files can be copied from: gs://concordanz/single_cell/10x-3M-february-2018.csv
- {input_bam} : A cram or bam as input. If a cram is provided, it can be passed with the Trimmer input argument: --input. If a bam is provided, the input can be piped with samtools.
- {n_threads} : The number of threads to use
- File names for Trimmer outputs:
- {trimmer_stats_csv} : Statistics regarding the trimming of each component in the read.
- {trimmer_failure_code_csv} : Detailed statistics with a breakdown of the reasons each component failed Trimming.
- {trimmed_ucram} : An unaligned, trimmed cram. The reads contain the reverse complement of the cDNA portion of the read, and the matched cell barcode and UMI (9bp + TTT) are stored within the br tag.
- The command for running Trimmer:
    samtools view -h {input_bam} -@ 32 | \
      /trimmer/trimmer \
        --description={trimmer_input_json} \
        --format={trimmer_input_format} \
        --statistics={trimmer_stats_csv} \
        --directory={barcode_whitelist_path} \
        --skip-unused-pattern-lists=true \
        --discard \
        --output-field {read_structure_for_barcode_read} \
        --failure-code-file {trimmer_failure_code_csv} \
        --progress \
        --nthreads={n_threads} \
        --cram true \
        --output {trimmed_ucram}
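To make the --output-field template concrete, here is a minimal Python sketch of how "br:Z:%1%2TTT" expands into the barcode read. It assumes that %N stands for the matched token N and that every other character is copied literally, which is how the template is described above; it is not Trimmer's implementation. The function name expand_output_field and the example sequences are hypothetical, and the 16bp cell barcode length is the standard 10x 3’ v3 value rather than something stated on this page.

```python
import re


def expand_output_field(template: str, tokens: dict) -> str:
    """Replace each %N with token N; all other characters are kept as-is.

    Assumption: this mirrors the template behaviour described in the text,
    not Trimmer's actual code.
    """
    return re.sub(r"%(\d+)", lambda m: tokens[int(m.group(1))], template)


# Illustrative sequences (hypothetical, not taken from the dataset).
cell_barcode = "AAACCCAAGAAACACT"  # token 1: matched 16bp cell barcode
umi_12bp = "ACGTACGTACGT"          # full 12bp UMI as sequenced
tokens = {1: cell_barcode, 2: umi_12bp[:9]}  # token 2: first 9bp of the UMI

# The SAM tag is written as NAME:TYPE:VALUE, so split off the value template.
tag_name, tag_type, value_template = "br:Z:%1%2TTT".split(":", 2)
barcode_read = expand_output_field(value_template, tokens)

print(tag_name, tag_type, barcode_read, len(barcode_read))
```

The resulting barcode read is 16bp of cell barcode followed by a 12bp UMI whose last 3bp are masked as T, which is the layout that downstream tools expecting a 12bp UMI can consume.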
2. Run demux to create a fastq file with the cell barcode and UMI in the fastq header. Next, save the reads in the fastq as R2 and the CBC+UMI as R1.
- The latest sorter docker (which contains the demux software): us-central1-docker.pkg.dev/ganymede-331016/ultimagen/sorter:master_4ebb634
- The following are the inputs:
- {trimmed_ucram} : The trimmed ucram, created in the previous step
- File names for demux outputs:
- {output_path} : The output path for creating a fastq
- {basename} : The prefix to use for the fastq output file
- File names for the command which splits the fastq into two reads
- {fastq_output_from_demux} : The output filename created by demux
- {r1_fastq} : The r1 output filename (CBC + 9bp UMI + TTT, the quality is set to “I” for all bases, since the CBC has been matched to the whitelist)
- {r2_fastq} : The r2 output filename (cDNA, revcom)
- Demux is run on {trimmed_ucram} and writes {fastq_output_from_demux} to {output_path}, using {basename} as the file prefix. The barcode read sequence is taken from the br tag in the trimmed cram and saved in the fastq header; in the wdl, these options are set through demux_extra_args.
- To create the paired-end reads, tee is used to pipe the demux-generated fastq into two commands: 1) one that generates r2 (identical to the input file, except that the CBC+UMI field and its leading colon are removed from the header) and 2) one that generates r1 (which contains the CBC+UMI, taken from field 15 of the colon-separated header):
    zcat {fastq_output_from_demux} | \
      tee >( awk '{ if (NR % 4 == 1) { print substr($0, 1, length($0) - length($15) - 1) } else { print } }' FS=: | \
             pigz > {r2_fastq} ) | \
      awk 'NR % 4 == 1 { print substr($0, 1, length($0) - length($15) - 1) "\n" $15 "\n+"; for (i = 0; i < length($15); i++) { printf "I" }; printf "\n" }' FS=: | \
      pigz > {r1_fastq}
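The split performed by the two awk commands can also be read as the following Python sketch, which may be easier to adapt to a different header layout. It assumes, as the awk code does, that the header has exactly 15 colon-separated fields and that the last one holds the CBC+UMI. The function name split_record and the example header and sequences are hypothetical; the real demux header fields are not reproduced here.

```python
def split_record(header: str, seq: str, qual: str, barcode_field: int = 15):
    """Split one demux fastq record into an (r1, r2) pair of 4-line records.

    Assumption: the barcode read is the last colon-separated header field,
    at position `barcode_field` (1-based), as in the awk commands.
    """
    barcode_read = header.split(":")[barcode_field - 1]
    # Drop the barcode field and the colon in front of it from the header.
    clean_header = header[: len(header) - len(barcode_read) - 1]
    # R1 is synthetic: barcode read with a constant "I" quality string.
    r1 = (clean_header, barcode_read, "+", "I" * len(barcode_read))
    # R2 keeps the insert sequence and its original qualities.
    r2 = (clean_header, seq, "+", qual)
    return r1, r2


# Illustrative record: 14 placeholder header fields followed by the CBC+UMI.
barcode = "AAACCCAAGAAACACTACGTACGTATTT"
header = ":".join(["@read1"] + ["_"] * 13 + [barcode])
insert_seq = "GATTACAGATTACA"
insert_qual = "FFFFFFFFFFFFFF"

r1, r2 = split_record(header, insert_seq, insert_qual)
print("\n".join(r1))
print("\n".join(r2))
```

Both reads share the same cleaned header, so the two output files stay in sync record by record, which is what paired-end consumers such as STARsolo expect.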
Running the wdl:
The wdl input json fields:
Field | Input | Comments |
---|---|---|
input_file | Input cram file | Needs to be edited |
base_file_name | The base name to be used in output files | Needs to be edited |
demux_extra_args | Arguments to demux | Set to add underscores for missing fields in the fastq header, and to take the barcode read sequence from the br tag in the trimmed cram |
fastqc_limits | Input parameter to FastQC | FastQC is run on the insert. |
barcode_fastq_file_suffix | The suffix to add to the file name of the barcode read. | The default addition is needed in order to run Cellranger |
insert_fastq_file_suffix | The suffix to add to the file name of the insert read. | The default addition is needed in order to run Cellranger |
Trimmer parameters: The parameters to pass to Trimmer. If using a different read structure, the trimmer parameters (including the formats_description) must be changed.
Field | Input | Comments |
---|---|---|
local_formats_description | The path, within the Trimmer software, to the Trimmer json | |
formats_description | If there is not a local format, then a format can be passed with this parameter | |
untrimmed_reads_action | Must be left as "discard" | |
format | Which format to use (by label) with the Trimmer json | |
extra_args | Additional arguments to pass to Trimmer | "--output-field br:Z:%1%2TTT" describes the structure of the barcode read. In this case, the barcode read is constructed by taking token 1 (%1; the matched cell barcode) + token 2 (%2; the first 9bp of the UMI sequence) + "TTT". This creates a sequence with the cell barcode and a UMI sequence with the last 3bp masked as T. |
pattern_files | The cell barcode whitelist which is used by Trimmer | |
downstream_analysis | Set as star_solo | If you do not want to run star_solo, set this as empty. |
star_solo_params | Has params to pass to STARsolo, including a reference genome (zipped). | For explanations regarding the additional params, see the STARsolo documentation. |
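Before launching the wdl, it can help to sanity-check the Trimmer parameters against the constraints listed in the table above. The following Python sketch does that for a plain dictionary. The field names come from the table, but the flat dictionary layout, the value types, and the helper name check_trimmer_params are assumptions for illustration; the actual nesting in the input template may differ, so treat this as a starting point rather than a validator for the real json.

```python
def check_trimmer_params(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if params.get("untrimmed_reads_action") != "discard":
        problems.append('untrimmed_reads_action must be left as "discard"')
    if not (params.get("local_formats_description") or params.get("formats_description")):
        problems.append("set local_formats_description or formats_description")
    if not params.get("format"):
        problems.append("format (the label within the Trimmer json) is missing")
    if "--output-field" not in params.get("extra_args", ""):
        problems.append("extra_args should define the barcode read with --output-field")
    if not params.get("pattern_files"):
        problems.append("pattern_files (the cell barcode whitelist) is missing")
    return problems


# Example values for 10x 3' v3 GEX, taken from the local-run section of this page.
# The list type for pattern_files is an assumption.
trimmer_params = {
    "local_formats_description": "/trimmer/dev-formats/single_cell_trimmer_formats_10x_3p_v3_gex.json",
    "untrimmed_reads_action": "discard",
    "format": "10x V3 3' 9bp UMI",
    "extra_args": "--output-field br:Z:%1%2TTT",
    "pattern_files": ["gs://concordanz/single_cell/10x-3M-february-2018.csv"],
}

problems = check_trimmer_params(trimmer_params)
print(problems if problems else "Trimmer parameters look consistent")
```

When adapting the pipeline to a different read structure, the same checks apply, but format, extra_args, and pattern_files all need to be updated together so that the tokens referenced in --output-field exist in the chosen format.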