How to use SAVNet

Basic commands

SAVNet at least required two arguments; sample configuration file and output-prefix.

savnet ${sample_configuration_file} ${output_prefix}

The sample configuration file is the list of file paths to somatic mutations, splicing junctions and intron retentions. The output prefix is the file path prefix to the output file. For more detail, check the help command.

savnet -h

How to prepare sample configuration files

The sample config file should be tab-delimited and start with headers. Currently, there are 5 columns, Sample_Name, Mutation_File, SJ_File, IR_File and Weight. Splicing junctions in SJ_File furnish evidences for Exon-skipping, Alternative 5’SS (splice site) and Alternative 3’SS, whereas information in IR_File constitute evidences for Intron retention. Sample_Name and Mutation_File are required column. Either SJ_File or IR_File need to be specified (When, e.g., IR_File is not specified, SAVNet does not consider mutations associated with intron retention). Weight is optional.

In the default setting, SAVNet assumes that the coordinate used in the input files are based on hg19 (or GRCh37). If your data is based on hg38 (GRCh38), please add --genome_id hg38 argument (genome_id mm10 for GRCm10).

Currently, SAVNet currently automatically recognize chr-prefixed (UCSC type) and non chr-prefixed (Genome Reference Consortium (GRC) type) chromosome names. So you may use files for both coordinate systems as long as all the input files use the same system. For the earlier version of SAVNet (<= 0.3.1), you may want to add --grc option in execution of savnet command when the chromosome names are not chr-prefix (when using GRC type names such as GRCh37).

Sample_Name

The field for sample names. This is used for the result file to show which samples have the identified SAVs.

Mutation_File

The path to mutation calling data for each sample (on hg19, hg38 or mm10 reference genome). SAVNet accepts VCF format and Annovar input file format. The format will be discerned by extensions; when the extension is ‘.vcf’ or ‘.vcf.gz’, then SAVNet will recognize the format as the VCF format. Otherwise, it will assume the format is Annovar input file.

For the VCF format file, the first 8 mandatory columns (#CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) need to be specified. However, the values of ID, QUAL, FILTER, INFO columns columns are not used and ignored in SAVNet. We recommend to set ID as ‘.’, QUAL as ‘60’, FILTER as ‘PASS’ and INFO as ‘SOMATIC’, respectively.

For the Annovar input file format, only the first five columns (Chr, Start, End, Ref, Alt) are used, and others are ignored. When some of mutation files in the sample configuration file are in Annovar input file format, you need to add the path to the reference genome through --reference argument.

Lines starting with # are skipped (as comments).

Note

Although SAVNet is developed primarily identifying somatic splicing associated variants (somatic SAVs), it can be effectively used for identifying germ-line rare SAVs. However, for germ-line rare SAVs, we recommend to filter out common variants (e.g., allele frequencies >= 0.05) beforehand. The allele frequencies used in this filtering step can be estimated by public data base (e.g., ExAC, 1000 Genomes Project and so on) as well as your in-house germ-line variant database. The reasons we recommend to adopt this filtering step is as follows:

  • The computational cost will be greatly reduced.
  • Important variants linked to diseases are mostly rare.
  • SAVNet is not extensively tested for extracting common SAVs. For this purpose, other splicing QTL framework may work better.

SJ_File

The paths to the splicing junction files (SJ.out.tab) generated by the STAR alignment. The splicing junctions appearing in SJ.out tab files greatly changes by STAR parameters such as --outSJfilter. We currently recommend to use the following parameters (together with other settings such as sorting and thread numbers):

--outSJfilterCountUniqueMin 1 1 1 1 --outSJfilterCountTotalMin 1 1 1 1 \
--outSJfilterOverhangMin 12 12 12 12 --outSJfilterDistToOtherSJmin 0 0 0 0 --alignSJstitchMismatchNmax -1 -1 -1 -1

Currently, we use the 1st (chromosome), 2nd (first base of the intron), 3rd (last base of the intron), 6th (annotated or not) and 7th (number of uniquely mapping reads crossing the junction) columns of SJ.out.tab file. Therefore, when you want to create splicing junction files for SAVNet through other approaches than STAR alignment, please prepare 9-columns tab file specifying the appropriate values for these columns (1st, 2nd, 3rd, 6th and 7th), and values for other columns can be set arbitrary.

Note

In the default setting, annotated splicing junction (whose 6th column value is equal to 1) will be removed at an early step. Although this procedure help reducing computational cost and improving accuracy, there is a risk to remove a small number of genuine cancer specific splicing junctions. (The most striking example is the exon 14 skipping of MET1 gene). To avoid this, please specify the –keep_annotated argument.

Note

Many people use –twopassMode option when using STAR, and it seems that the 6th column will become 1 for most splicing junctions in that case. So when you used –twopassMode option, please specify the –keep_annotated argument!!

IR_File

The paths to the intron retention files generated by intron_retention_utils simple_count command. intron_retention_utils will be installed through SAVNet installation step. intron_retention_utils simple_count command accept a BAM format file (${input_bam} in the example code below), calculate the number of short read spanning both exonic and intronic regions (>= 10 bp for both in the default setting) around each exon-intron boundary, and return the result file (${output_intron_retention_file} in the example code below).

% intron_retention_utils simple_count ${input_bam} ${output_intron_retention}

The size of exonic and intronic bases to be covered by short reads (default: 10) can be tuned by --intron_retention_check_size. In the default setting, the short reads whose pairs are improperly mapped (0x2 bit of SAM flag is equal to zero) are removed. Therefore, you are using single-end RNA-seq, you need to add --keep_improper_pair argument. Otherwise, the output file will become empty.

Weight

This optional field is used for negating the diversity of the numbers of total RNA-seq reads among samples. We currently recommend to set this as the number of uniquely aligned read pairs (derived from the STAR Log.final.out files) divided by 10^7. However, we think reasonable results would be obtained without setting column in most cases.

How to prepare pooled control files

When either one or both of pooled control files for splicing junctions and intron retentions are specified, SAVNet removes the splicing variations registered in these files before associating mutations and splicings. This will greatly help improving the accuracy of SAV calls, focusing on splicing variations that cannot be observed in normal control samples, as well as reducing the computational cost. We recommend to specify these parameters using as many samples as possible (hopefully at least >= 10 control samples).

Splicing junction control files

The pooled control file for splicing junction can be generated by junc_utils merge_control command. We currently recommend to use the following parameters:

% junc_utils merge_control --read_num_thres 2 --keep_annotated --sample_num_thres 1 ${input_list} ${output_file}

The value of --sample_num_thres can be tuned for large number of control samples.

Intron retention control files

The pooled control file for intron retention can be generated by intron_retention_utils merge_control command. We currently recommend to use the following parameters:

% intron_retention_utils merge_control --sample_num_thres 1 ${input_list} ${output_files}

The value of --sample_num_thres can be tuned for large number of control samples.

Workflow for executing SAVNet

Here, we describe an example workflow for executing SAVNet. The directory structure is resembled to QUICK START section for ease of

  1. Make root directory for SAVNet execution workspace.
% mkdir savnet_workspace/resource
% cd savnet_workspace/resource
  1. Collect somatic mutation files (in VCF format) under the mutation directory setting base-names as ${sample}.vcf.
% mkdir vcf
% cp ${mutation_file_for_sample1} vcf/${sample1}.vcf
% cp ${mutation_file_for_sample2} vcf/${sample2}.vcf
  ....
  1. Collect splicing junction files (generated by STAR) under the junction directory setting base-names as ${sample}.SJ.out.tab.
% mkdir junction
% cp ${splicing_junction_for_sample1} junction/${sample1}.SJ.out.tab
% cp ${splicing_junction_for_sample2} junction/${sample2}.SJ.out.tab
  ....
  1. Collect intron retention count files under the intron_retention directory setting base-names as ${sample}.intron_retention.txt.
% mkdir intron_retention
% cp ${intron_retention_for_sample1} intron_retention/${sample1}.intron_retention.txt
% cp ${intron_retention_for_sample2} intron_retention/${sample2}.intron_retention.txt
  ....
  1. Collect quality check files (generated by STAR) under the qc directory setting base-names as ${sample}.Log.final.out.
% mkdir junction
% cp ${quality_check_for_sample1} qc/${sample1}.Log.final.out
% cp ${quality_check_for_sample2} qc/${sample2}.Log.final.out
  ....
  1. Create sample configuration file (e.g., using our in-house script).
# Download the script for creating input list file
% wget https://storage.googleapis.com/friend1ws_package_data/savnet/make_savnet_input.py
# Generate the input list file by running the script
% python make_savnet_input.py --sample_list_file savnet.input.txt --mut_dir vcf --sj_dir junction --ir_dir intron_retention --qc_dir qc
  1. Prepare control files (optional).
#
# Merged control file for splicing junction
#
% mkdir junction_ctrl
% cp ${splicing_junction_for_control1} junction_ctrl/${control1}.SJ.out.tab
% cp ${splicing_junction_for_control2} junction_ctrl/${control2}.SJ.out.tab
  ...
% ls junction_ctrl/*.SJ.out.tab > junction_ctrl_list.txt
% junc_utils merge_control --read_num_thres 2 --keep_annotated --sample_num_thres 1 junction_ctrl_list.txt junction_ctrl.bed.gz

#
# Merged control file for intron retention
#
% mkdir intron_retention_ctrl
% cp ${intron_retention_for_control1} intron_retention_ctrl/${control1}.intron_retention.txt
% cp ${intron_retention_for_control1} intron_retention_ctrl/${control2}.intron_retention.txt
  ...
% ls intron_retention_ctrl/*.intron_retention.txt > intron_retention_ctrl_list.txt
% intron_retention_utils merge_control --sample_num_thres 1 intron_retention_ctrl_list.txt intron_retention_ctrl_list.bed.gz
  1. Execute SAVNet command.
% cd ..
% savnet resource/savnet.input.txt output_dir/output --SJ_pooled_control_file resource/junction_ctrl.bed.gz --IR_pooled_control_file resource/intron_retention_ctrl_list.bed.gz