Ouro-Tools - long-read scRNA-seq toolkit

Ouro-Tools is a novel, comprehensive computational pipeline for long-read scRNA-seq with the following key features. Ouro-Tools (1) normalizes mRNA size distributions and (2) detects mRNA 7-methylguanosine caps to integrate multiple single-cell long-read RNA-sequencing experiments across modalities and characterize full-length transcripts, respectively.

Table of Contents
Introduction
- What is long-read scRNA-seq?
Installation
Before starting the tutorial
- Download our toy long-read scRNA-seq datasets
- Basic settings for running the entire pipeline
step 1) Raw long-read pre-processing module
step 2) Spliced alignment
step 3) Barcode extraction module
step 4) Biological full-length molecule identification module
step 5) Size distribution normalization module
step 6) Single-cell count module
step 7) Visualization
wrap-up) Running the entire pipeline using a wrapper function
Pre-built indices of unwanted genomic sequences for pre-processing
An Ouro-Tools count module index
SAM Tags
Bitwise flags

Introduction

The Ouro-Tools pipeline comprises five main modules, allowing seamless integration with existing bulk and single-cell long-read RNA-seq pipelines and tools. Every main module of Ouro-Tools utilizes efficient parallelization for compute-intensive tasks to facilitate the processing of large datasets. Additionally, each Ouro-Tools module employs filesystem-based locks for parallel processing of a large number of samples across multiple machines for scalability.

What is long-read scRNA-seq?

(Figure adapted from Volden & Vollmers, Genome Biol. 23:47 (2022), and made available under Creative Commons license 4.0 by Oxford Nanopore Technologies plc.)

In 2013, 2019, and 2022, “single-cell sequencing,” “single-cell multimodal omics,” and “long-read sequencing” were chosen as “Method of the Year” by Nature Methods journal, respectively, highlighting the urgent need to understand biology at the resolution of individual cells and individual biological molecules. Long-read scRNA-seq is a method that combines the single-cell RNA sequencing and long-read sequencing (Nanopore and PacBio) methods.

Installation

The latest stable version of Ouro-Tools is available in PyPI, BioConda, and BioContainers.

PyPI Installation (as a Python package)

pip install ourotools

Bioconda Installation (as an Anaconda package)

conda install bioconda::ourotools

BioContainers Installation (as a Docker container)

# Download the latest Ouro-Tools Docker image from BioContainers
docker pull quay.io/biocontainers/ourotools:0.2.8--pyhdfd78af_0 

# Run the Ouro-Tools Docker image
# please change '/your/local/folder' to the local path containing the input files
docker run -v /your/local/folder:/data -it quay.io/biocontainers/ourotools:0.2.8--pyhdfd78af_0

Install the latest (but possibly unstable) version via GitHub

git clone https://github.com/ahs2202/ouro-tools.git
cd ouro-tools
pip install .

Test the installation

Ouro-Tools can be used in command line, in a Python script, or in an interactive Python interpreter (e.g., Jupyter Notebook).

To print the command line usage example of each module from the bash shell, please type the following command.

Bash shell

ourotools LongFilterNSplit -h

IPython environment (Jupyter Notebook)

ourotools.LongFilterNSplit?

Before starting the tutorial

Download our toy long-read scRNA-seq datasets

Each toy dataset contains a subsampled long-read sequencing (ONT R10.4.1) result of an Ouro-Seq library (please check our BioRxiv pre-print for more details). 3 cell types (100 cells are sampled for each cell type) and 3 chromosomes are selected for subsampling. Additionally, the artifact molecules were subsampled and included in the toy dataset.

# download toy datasets from mouse ovary and testis
wget https://ouro-tools.s3.amazonaws.com/tutorial/mOvary.subsampled.fastq.gz
wget https://ouro-tools.s3.amazonaws.com/tutorial/mTestis2.subsampled.fastq.gz

Alternatively, you can download directly using your browser using the following links: mOvary and mTestis

Basic settings for running the entire pipeline

import ourotools


# global multiprocessing settings
ourotools.bk.int_max_num_batches_in_a_queue_for_each_worker = 1 # [NOTE] For WSL, changing this variable to 1 is necessary to prevent deadlock(s) during IPC. 
n_workers = 2 # employ 2 workers (since there are two samples, 2 workers are sufficient)
n_threads_for_each_worker = 8 # use 8 CPU cores for each worker


# datasets-specific setting
path_folder_data = '/home/project/Single_Cell_Full_Length_Atlas/data/pipeline/20220331_Ouroboros_Project/pipeline/20230208_Mouse_Long_Read_Single_Cell_Atlas/pipeline/20230811_mouse_long_read_single_cell_atlas_v202308/tutorial_data/20240728_ovary_testis_tutorial/'
l_name_sample = [
    'mOvary.subsampled',
    'mTestis2.subsampled',
]


# scRNA-seq technology-specific settings
path_file_valid_barcode_list = '/home/project/Single_Cell_Full_Length_Atlas/data/pipeline/20210728_development_ouroboros_qc/example/3M-february-2018.txt.gz' # GEX v3 CB


# species-specific settings
path_file_minimap_index_genome = '/home/shared/ensembl/Mus_musculus/index/minimap2/Mus_musculus.GRCm38.dna.primary_assembly.k_14.idx'
path_file_minimap_splice_junction = '/home/shared/ensembl/Mus_musculus/Mus_musculus.GRCm38.102.paftools.bed'
path_file_minimap_unwanted = '/home/project/Single_Cell_Full_Length_Atlas/data/accessory_data/cDNA_depletion/index/minimap2/MT_and_rRNA_GRCm38.fa.ont.mmi'
path_folder_count_module_index = '/home/project/Single_Cell_Full_Length_Atlas/data/pipeline/20211116_ouroboros_short_read_public_data_mining/scarab_annotations/Mus_musculus.GRCm38.102.v0.2.4/' # path to the Ouro-Tools count module index

To find the barcode whitelist specific to your scRNA-seq experiment, please refer to the official 10x Genomics article. Pre-built Ouro-Tools count module index can be downloaded here. Pre-built indices of unwanted sequences (ribosomal DNA repeats and mitochondrial DNAs) can be downloaded here.

step 1) Raw long-read pre-processing module (QC module)

# run LongFilterNSplit
ourotools.LongFilterNSplit(
    path_file_minimap_index_genome = path_file_minimap_index_genome,
    l_path_file_minimap_index_unwanted = [ path_file_minimap_unwanted ],
    l_path_file_fastq_input = list( f"{path_folder_data}{name_sample}.fastq.gz" for name_sample in l_name_sample ),
    l_path_folder_output = list( f"{path_folder_data}LongFilterNSplit_out/{name_sample}/" for name_sample in l_name_sample ),
    int_num_samples_analyzed_concurrently = n_workers,
    n_threads = n_workers * n_threads_for_each_worker,
)

As the first module of the Ouro-Tools pipeline, the raw long-read pre-processing module LongFilterNSplit has a dual function for (1) providing comprehensive quality control metrics of a long-read scRNA-seq experiment and (2) pre-processing of raw long-read sequencing data for the downstream analysis.

According to the classification results, cDNA molecules are organized into separate output FASTQ files. For the cDNA molecules that contains a single (external or internal) poly(A) tail, the read is re-oriented so that it has the same orientation as its original mRNA transcript, with the poly(A) tail at its 3’ end; the resulting long-reads of cDNAs can be utilized for strand-specific long-read RNA-seq analysis.

step 2) Spliced alignment

# align using minimap2 (require that minimap2 executable can be found in PATH)
# below is a wrapper function for minimap2
ourotools.Workers(
    ourotools.ONT.Minimap2_Align, # function to deploy
    int_num_workers_for_Workers = n_workers, # create 'n_workers' number of workers
    # below are arguments for the function 'ourotools.ONT.Minimap2_Align'
    path_file_fastq = list( f"{path_folder_data}LongFilterNSplit_out/{name_sample}/aligned_to_genome__non_chimeric__poly_A__plus_strand.fastq.gz" for name_sample in l_name_sample ), 
    path_folder_minimap2_output = list( f"{path_folder_data}minimap2_bam_genome/{name_sample}/" for name_sample in l_name_sample ), 
    path_file_junc_bed = path_file_minimap_splice_junction, 
    path_file_minimap2_index = path_file_minimap_index_genome,
    n_threads = n_threads_for_each_worker,
)

Minimap2 can be used for annotation-guided alignment based on the transcript annotations prepared by the researcher. Here, the reference annotation from Ensembl (Ensembl release 102) was utilized.

step 3) Barcode extraction module

# run LongExtractBarcodeFromBAM
l_path_folder_barcodedbam = list( f"{path_folder_data}LongExtractBarcodeFromBAM_out/{name_sample}/" for name_sample in l_name_sample )
ourotools.LongExtractBarcodeFromBAM(
    path_file_valid_cb = path_file_valid_barcode_list,
    l_path_file_bam_input = list( f"{path_folder_data}minimap2_bam_genome/{name_sample}/aligned_to_genome__non_chimeric__poly_A__plus_strand.fastq.gz.minimap2_aligned.bam" for name_sample in l_name_sample ), 
    l_path_folder_output = l_path_folder_barcodedbam,
    int_num_samples_analyzed_concurrently = n_workers, 
    n_threads = n_workers * n_threads_for_each_worker,
)

The barcode extraction module LongExtractBarcodeFromBAM identifies cell barcode (CB) and unique molecular identifier (UMI) sequences for each read and exports the results as a “barcoded” BAM file, a BAM file containing corrected CB and UMI sequences for each read using the predefined SAM tags.

step 4) Biological full-length molecule identification module

# run full-length ID module
# survey 5' sites for each sample
ourotools.LongSurvey5pSiteFromBAM(
    l_path_folder_input = l_path_folder_barcodedbam,
    int_num_samples_analyzed_concurrently = n_workers, 
    n_threads = n_workers * n_threads_for_each_worker,
)
# combine 5' site profiles across samples and classify each 5' profile
ourotools.LongClassify5pSiteProfiles( 
    l_path_folder_input = l_path_folder_barcodedbam,
    path_folder_output = f"{path_folder_data}LongClassify5pSiteProfiles_out/",
    n_threads = n_threads_for_each_worker,
)
# append 5' site classification results to each BAM file
ourotools.LongAdd5pSiteClassificationResultToBAM(
    path_folder_input_5p_sites = f'{path_folder_data}LongClassify5pSiteProfiles_out/',
    l_path_folder_input_barcodedbam = l_path_folder_barcodedbam,
    int_num_samples_analyzed_concurrently = n_workers, 
    n_threads = n_workers * n_threads_for_each_worker,
)
# filter artifact reads from each BAM file
ourotools.Workers(
    ourotools.FilterArtifactReadFromBAM, # function to deploy
    int_num_workers_for_Workers = n_workers, # create 'n_workers' number of workers
    # below are arguments for the function 'ourotools.FilterArtifactReadFromBAM'
    path_file_bam_input = list( f'{path_folder_data}LongExtractBarcodeFromBAM_out/{name_sample}/5pSiteTagAdded/barcoded.bam' for name_sample in l_name_sample ), 
    path_folder_output = list( f'{path_folder_data}LongExtractBarcodeFromBAM_out/{name_sample}/5pSiteTagAdded/FilterArtifactReadFromBAM_out/' for name_sample in l_name_sample ), 
)

The biological full-length identification module collects the lengths of guanosine homopolymers at the 5’ ends of cDNAs to identify genuine TSSs that produce capped mRNAs, depleting truncated cDNA molecules in silico. The module is implemented as a workflow consisting of LongSurvey5pSiteFromBAM, LongClassify5pSiteProfiles, LongAdd5pSiteClassificationResultToBAM, and FilterArtifactReadFromBAM.

step 5) Size distribution normalization module

# run mRNA size distribution normalization module
# survey the size distribution of full-length mRNAs for each sample
l_full_length_bam = list( f'{path_folder_data}LongExtractBarcodeFromBAM_out/{name_sample}/5pSiteTagAdded/FilterArtifactReadFromBAM_out/valid_3p_valid_5p.bam' for name_sample in l_name_sample )
ourotools.Workers( 
    ourotools.LongSummarizeSizeDistributions,
    int_num_workers_for_Workers = n_workers, # create 'n_workers' number of workers
    path_file_bam_input = l_full_length_bam,
    path_folder_output =  list( f'{path_folder_data}LongExtractBarcodeFromBAM_out/{name_sample}/5pSiteTagAdded/FilterArtifactReadFromBAM_out/valid_3p_valid_5p.LongSummarizeSizeDistributions_out/' for name_sample in l_name_sample ),
)
# normalize size distributions
path_folder_size_norm = f"{path_folder_data}LongCreateReferenceSizeDistribution_out/"
ourotools.LongCreateReferenceSizeDistribution(
    l_path_file_distributions = list( f'{path_folder_data}LongExtractBarcodeFromBAM_out/{name_sample}/5pSiteTagAdded/FilterArtifactReadFromBAM_out/valid_3p_valid_5p.LongSummarizeSizeDistributions_out/dict_arr_dist.pkl' for name_sample in l_name_sample ),
    l_name_file_distributions = l_name_sample,
    path_folder_output = path_folder_size_norm,
    float_max_ratio_to_arr_dist_guassian_filter_min_sigma_for_dynamic_gaussian_filter_selection = 2,
    float_sigma_gaussian_filter_min = 8,
    int_min_total_read_count_for_a_peak = 30 ,
)
# based on the output, set the confident size range
str_confident_size_range = ourotools.get_confident_size_range( path_folder_size_norm )

The size distribution normalization module is implemented using the LongSummarizeSizeDistributions and LongCreateReferenceSizeDistribution workflows. First, using the LongSummarizeSizeDistributionsworkflow, a full-length, UMI-deduplicated cDNA size distribution is obtained from the valid_3p_valid_5p barcoded BAM file (representing in vivo full-length mRNAs) for each sample. Next, the reference mRNA size distribution is constructed for all the samples using the LongCreateReferenceSizeDistribution workflow.

step 6) Single-cell count module

# run the single-cell count module 
ourotools.LongExportNormalizedCountMatrix( 
    path_folder_ref = path_folder_count_module_index, 
    l_path_file_bam_input = l_full_length_bam,
    l_path_folder_output = list( f'{path_folder_data}LongExportNormalizedCountMatrix_out/{name_sample}/' for name_sample in l_name_sample ),
    l_name_distribution = l_name_sample,
    path_folder_reference_distribution = path_folder_size_norm,
    l_str_l_t_distribution_range_of_interest = [ ','.join( [ "raw", str_confident_size_range ] ) ],
    flag_enforce_transcript_start_site_matching_for_long_read_during_realignment = True, 
    flag_enforce_transcript_end_site_matching_for_long_read_during_realignment = True,  
)

The single-cell long-read count module LongExportNormalizedCountMatrix is largely composed of three parts: constructing an index (only required once for each set of genes, transcripts, repeat elements, regulatory elements and the reference genome), assigning each read to various buckets (each bucket represent one of the genes, transcripts, exons, splice junctions, TEs, tCREs, and individual genomic tiles), and exporting a size distribution-normalized count matrix for each bucket (later these count matrixes are combined into a single size distribution-normalized count matrix as an output).

step 7) Visualization

# TBD

wrap-up) Running the entire pipeline using a wrapper function

# version 2025-01-25 by Hyunsu An @ GIST-FGL
import ourotools

ourotools.bk.int_max_num_batches_in_a_queue_for_each_worker = 1 # [NOTE] For WSL, changing this variable to 1 is necessary to prevent deadlock(s) during IPC. 

ourotools.run_pipeline(
    # dataset setting
    path_folder_data = '/home/project/Single_Cell_Full_Length_Atlas/data/pipeline/20220331_Ouroboros_Project/pipeline/20230208_Mouse_Long_Read_Single_Cell_Atlas/pipeline/20230811_mouse_long_read_single_cell_atlas_v202308/tutorial_data/20240813_ovary_testis_tutorial2/',
    l_name_sample = [
        'mOvary.subsampled',
        'mTestis2.subsampled',
    ],
    # scRNA-seq technology-specific 
    path_file_valid_barcode_list = '/home/project/Single_Cell_Full_Length_Atlas/data/pipeline/20210728_development_ouroboros_qc/example/3M-february-2018.txt.gz', # GEX v3 CB
    # species-specific settings
    path_file_minimap_index_genome = '/home/shared/ensembl/Mus_musculus/index/minimap2/Mus_musculus.GRCm38.dna.primary_assembly.k_14.idx',
    path_file_minimap_splice_junction = '/home/shared/ensembl/Mus_musculus/Mus_musculus.GRCm38.102.paftools.bed',
    path_file_minimap_unwanted = '/home/project/Single_Cell_Full_Length_Atlas/data/accessory_data/cDNA_depletion/index/minimap2/MT_and_rRNA_GRCm38.fa.ont.mmi',
    path_folder_count_module_index = '/home/project/Single_Cell_Full_Length_Atlas/data/pipeline/20211116_ouroboros_short_read_public_data_mining/scarab_annotations/Mus_musculus.GRCm38.102.v0.2.4/', # path to the Ouro-Tools reference
    # run setting
    n_workers = 2, # employ 2 workers (since there are two samples, 2 workers are sufficient)
    n_threads_for_each_worker = 8, # use 8 CPU cores for each worker
    # additional settings
    args = dict(
        LongCreateReferenceSizeDistribution = dict(
            float_max_ratio_to_arr_dist_guassian_filter_min_sigma_for_dynamic_gaussian_filter_selection = 2,
            float_sigma_gaussian_filter_min = 8,
            int_min_total_read_count_for_a_peak = 30 ,
        ),
        LongExportNormalizedCountMatrix = dict(
            flag_enforce_transcript_start_site_matching_for_long_read_during_realignment = True, 
            flag_enforce_transcript_end_site_matching_for_long_read_during_realignment = True,  
        ),
    ),
)

Pre-built indices of unwanted genomic sequences for pre-processing

The pre-built indices of unwanted sequences can be downloaded using the following links:

Human (GRCh38) : Minimap2-index-file, FASTA-file, GTF-file

Mouse (GRCm38) : Minimap2-index-file, FASTA-file, GTF-file

An Ouro-Tools count module index

The single-cell count module of Ouro-Tools utilizes genome, transcriptome, and gene annotations to assign reads to genes, isoforms, and genomic bins (tiles across the genome). The index building process is automatic; there is no needs to run a separate command in order to build the index. Once Ouro-Tools processes these information before analyzing an input BAM file(s), the program saves an index in order to load the information much faster next time.

We recommends using Ensembl reference genome, transcriptome, and gene annotations of the same version (release number).

Pre-built index

pre-built index can be downloaded using the following links (should be extracted to a folder using tar -xf command):

Human (GRCh38, Ensembl version 105)

Mouse (GRCm39, Ensembl version 105)

Mouse (GRCm38, Ensembl version 102)

Zebrafish (GRCz11, Ensembl version 104)

Thale cress (TAIR10, Ensembl Plant version 56)

Building index from scratch

An Ouro-Tools index can be built on-the-fly from the input genome, transcriptome, and gene annotation files. For example, below are the list of files that were used for the pre-built Ouro-Tools index "Human (GRCh38, Ensembl version 105)".

required annotations (Ensemble version 105):

path_file_fa_genome : https://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
- A genome FASTA file. Either gzipped or plain FASTA file can be accepted.
path_file_gtf_genome : https://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.gtf.gz
- A GTF file. Either gzipped or plain GTF file can be accepted. Currently GFF3 format files are not supported.
- Following arguments can be used to set attribute names for identifying gene and transcript annotations in its attributes column.
  - str_name_gtf_attr_for_id_gene : (default: 'gene_id')
  - str_name_gtf_attr_for_name_gene : (default: 'gene_name')
  - str_name_gtf_attr_for_id_transcript : (default: 'transcript_id')
  - str_name_gtf_attr_for_name_transcript : (default: 'transcript_name')
- An example of GTF annotation file for gene annotations:

1	ensembl_havana	gene	1211340	1214153	.	-	.	gene_id "ENSG00000186827"; gene_version "11"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
1	ensembl_havana	transcript	1211340	1214153	.	-	.	gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000379236"; transcript_version "4"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS11"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
1	ensembl_havana	exon	1213983	1214153	.	-	.	gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000379236"; transcript_version "4"; exon_number "1"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS11"; exon_id "ENSE00001832731"; exon_version "2"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";

path_file_fa_transcriptome : https://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
- A transcriptome FASTA file. Either gzipped or plain FASTA file can be accepted.

optional input annotations

path_file_tsv_repeatmasker_ucsc : Table Browser (ucsc.edu) [click "get output" to download the annotation]
- repeat masker annotations from the UCSC Table Browser
path_file_gff_regulatory_element : https://ftp.ensembl.org/pub/current_regulation/homo_sapiens/homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20221007.gff.gz
- The latest regulatory build from Ensembl.
- Annotations from other sources, or custom annotations can be used. Currently only the GFF3 file format is supported (with .gff extension).
- The following argument can be used to set the attribute name for identifying regulatory region
  - str_name_gff_attr_id_regulatory_element : (default: 'ID')
- An example of GFF annotation file for regulatory elements:

18	Regulatory_Build	enhancer	35116801	35120999	.	.	.	ID=enhancer:ENSR00000572865;bound_end=35120999;bound_start=35116801;description=Predicted enhancer region;feature_type=Enhancer
8	Regulatory_Build	TF_binding_site	37967115	37967453	.	.	.	ID=TF_binding_site:ENSR00001137252;bound_end=37967531;bound_start=37966339;description=Transcription factor binding site;feature_typ
6	Regulatory_Build	enhancer	90249202	90257999	.	.	.	ID=enhancer:ENSR00000798348;bound_end=90257999;bound_start=90249202;description=Predicted enhancer region;feature_type=Enhancer
3	Regulatory_Build	CTCF_binding_site	57689401	57689600	.	.	.	ID=CTCF_binding_site:ENSR00000687477;bound_end=57689600;bound_start=57689401;description=CTCF binding site;feature_type=CTCF

SAM Tags

SAM tag name	data type	Description	Module name
CB	Z	the corrected cell barcode sequence	Barcode Extraction
UB	Z	the corrected UMI sequence after the UMI clustering process	Barcode Extraction
UR	Z	the uncorrected UMI sequence before the UMI clustering process	Barcode Extraction
XR	i	the number of errors for identification of R1 adapter (marks the 3’ end of cDNA). -1 indicates that the adapter was not identified	Barcode Extraction
XT	i	the number of errors for identification of TSO adapter (marks the 5’ end of cDNA). -1 indicates that the adapter was not identified	Barcode Extraction
CU	Z	the uncorrected raw CB-UMI sequence	Barcode Extraction
IA	i	the length of detected internal poly(A) tract on the genome	Barcode Extraction
LE	i	the total number of genome-aligned base pairs	Barcode Extraction
AG	i	the number of consecutive G nucleotides, starting from the 5’ site in the aligned region of the read	Full-Length Identification
UG	i	the number of consecutive G nucleotides, starting from the 5’ site in the unaligned region of the read (soft-clipped sequence)	Full-Length Identification
VS	i	“1” if the 5’ site is identified as a valid transcript start site (TSS), “0” if the 5’ site is identified as an invalid TSS, representing 5’ sites of the PCR/RT artifacts (including 5p degradation products of full-length transcripts)	Full-Length Identification
AU	i	the inferred number of unreferenced G nucleotides aligned to the genome	Full-Length Identification
XC	i	the bitwise flags (see the table below for more details)	Single-Cell Count
XR	Z	the repeat element ID	Single-Cell Count
YR	i	the total number of base pairs overlapping with the repeat element to which the read is confidently assigned	Single-Cell Count
XG	Z	the gene ID	Single-Cell Count
YG	i	the total number of base pairs overlapping with the exons of the gene to which the read is confidently assigned	Single-Cell Count
XP	Z	the promoter ID	Single-Cell Count
YX	i	the total number of base pairs overlapping with any exons that overlap with the read	Single-Cell Count
YF	i	the total number of base pairs overlapping with any repeat elements that overlap with the read (considering only filtered repeat elements)	Single-Cell Count
XE	Z	the regulatory element ID	Single-Cell Count
YU	i	the total number of base pairs overlapping with any repeat elements that overlap with the read (considering all repeat elements)	Single-Cell Count
YE	i	the total number of base pairs overlapping with any regulatory elements that overlap with the read	Single-Cell Count
XT	Z	the transcript ID that is uniquely assigned to the read using the re-alignment process	Single-Cell Count
ZF	i	the flag that indicates the read represents a full-length cDNA with valid 3’ and 5’ ends	Single-Cell Count

Bitwise flags

Binary flag	Feature type	Description
0x1	gene	overlaps with gene(s)
0x2	gene	gene assignment is ambiguous
0x4	gene	completely intronic reads (GEX mode specific)
0x8	gene	exonic reads (GEX mode specific)
0x10	promoter	overlaps with promoter region(s) (ATAC mode specific)
0x20	promoter	promoter assignment is ambiguous (ATAC mode specific)
0x40	repeats	overlaps with repeat element(s)
0x80	repeats	ambiguous assignment to two or more number of repeat elements
0x100	repeats	the entire length of a read overlaps with a single repeat element
0x200	regulatory	overlaps with regulatory element(s)
0x400	regulatory	overlaps with both repeat element(s) and regulatory element(s)
0x800	regulatory	overlaps exclusively with regulatory element(s) (no overlaps with repeat element)
0x1000	regulatory	ambiguous assignment to two or more number of regulatory elements
0x2000	regulatory	the entire length of a read overlaps with a single regulatory element

Ouro-Tools was developed by Hyunsu An and Chaemin Lim at Gwangju Institute of Science and Technology under the supervision of Professor Jihwan Park.

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
conda		conda
doc/img		doc/img
docker		docker
ourotools		ourotools
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Ouro-Tools - long-read scRNA-seq toolkit

Table of Contents

Introduction

What is long-read scRNA-seq?

Installation

Before starting the tutorial

Download our toy long-read scRNA-seq datasets

Basic settings for running the entire pipeline

step 1) Raw long-read pre-processing module (QC module)

step 2) Spliced alignment

step 3) Barcode extraction module

step 4) Biological full-length molecule identification module

step 5) Size distribution normalization module

step 6) Single-cell count module

step 7) Visualization

wrap-up) Running the entire pipeline using a wrapper function

Pre-built indices of unwanted genomic sequences for pre-processing

An Ouro-Tools count module index

Pre-built index

Building index from scratch

optional input annotations

SAM Tags

Bitwise flags

About

Releases 3

Packages

Languages

License

ahs2202/ouro-tools

Folders and files

Latest commit

History

Repository files navigation

Ouro-Tools - long-read scRNA-seq toolkit

Table of Contents

Introduction

What is long-read scRNA-seq?

Installation

Before starting the tutorial

Download our toy long-read scRNA-seq datasets

Basic settings for running the entire pipeline

step 1) Raw long-read pre-processing module (QC module)

step 2) Spliced alignment

step 3) Barcode extraction module

step 4) Biological full-length molecule identification module

step 5) Size distribution normalization module

step 6) Single-cell count module

step 7) Visualization

wrap-up) Running the entire pipeline using a wrapper function

Pre-built indices of unwanted genomic sequences for pre-processing

An Ouro-Tools count module index

Pre-built index

Building index from scratch

optional input annotations

SAM Tags

Bitwise flags

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 3

Packages 0

Languages

Packages