Tumor
癌症序列变异解释和报告的标准和指南(ASCO和CAP联合推荐)
肺癌分析插件Can_28修复问题
Chom化疗插件及bug修复
Brca插件
FACTERA-fusionGene
SEGF-挖掘NGS中融合基因的新方法
maftools-肿瘤突变数据可视化神器
从数据库中获取免疫相关基因(IRGs)
数据库10KIP—基于ImmPort免疫组学数据挖掘
TCGA数据分析
下载分析TCGA数据库的数据
差异表达分析(limma & edgeR & DESeq2)
TCGA数据的规律【更新中】
生存分析
生存模型构建
突变数据
本文档使用 MrDoc 发布
-
+
up
down
首页
FACTERA-fusionGene
FACTERA: Software and Documentation ## Overview **FACTERA** (Fusion And Chromosomal Translocation Enumeration and Recovery Algorithm) is a tool for detection of genomic fusions in paired-end targeted (or genome-wide) sequencing data. *Command * ``` perl factera.pl [options] tumor.bam exons.bed hg19.2bit [optional: targets.bed] :<<eof USAGE: FACTERA version 1.4.4 by Aaron M. Newman (amnewman@stanford.edu) Purpose: A tool for detection of fusions in paired-end targeted (or genome-wide) sequencing data. Usage: factera.pl [options] <tumor.bam> <exons.bed> <genome.2bit> <optional: targets.bed> <tumor.bam> Paired reads mapped with BWA or another soft-clip capable aligner. NOTE: Must be indexed for factera to function properly. <exons.bed> Genomic coordinates with gene/exon names in fourth column. Data sources include RefSeq, UCSC, and/or GENCODE. <genome.2bit> Two bit reference sequence (e.g., hg19.2bit). NOTE: twoBitToFa application must be in PATH. <targets.bed> Restrict search to specified genome coordinates. Options (defaults in parentheses): -o <dir> output directory (tumor.bam directory) -r <int> minimum number of breakpoint-spanning reads required for output (5) -m <int> minimum number of discordant reads required for each candidate fusion (2) -x <int> maximum number of breakpoints to examine for each pair of genomic regions (5) -s <int> minimum number of reads with the same breakpoint (1) -f <0-1> minimum fraction of read bases required for alignment to fusion template (0.9) -S <0-1> minimum similarity required for read to match fusion template (0.95) -k <int> k-mer size for fragment comparison (12 bases) -c <int> minimum size of soft-clipped region to consider (16 bases) -b <int> number of bases flanking breakpoint for fusion template (500) -p <int> number of threads for blastn search (4; 10 or more recommended) -a <int> number of bases flanking breakpoint to provide in output (50) -e disable grouping of input coordinates by column 4 of exons.bed (off) -v disable verbose output (off) -t disable running time output (off) -C disable addition of 'chr' prefix to chromosome names (off) -F force remake of BLAST database for a particular input (off) For detailed help, see http://factera.stanford.edu eof ``` ## Input 1. **tumor.bam** should consist of paired-end reads aligned by a mapping algorithm capable of soft-clipping, such as BWA. The BAM file does not need to be realigned or deduped, but should be position-sorted and have a corresponding index file (bam.bai created using SAMtools index) in the same directory in order to estimate the total sequencing depth in the neighborhood of each detected fusion. 2. **exons.bed** contains chromosomal coordinates (such as exon boundaries) in 3-column BED format (chr start end). The fourth column contains gene names, exon names, or any arbitrary identifier, and will be used to group corresponding coordinates. This allows the resolution of fusion detection to be restricted to inter-gene or inter-exon fusions, for example. Make sure to use coordinates from the same genome version as the 2bit reference sequence required in the third argument. 3. Users can download this exons.bed file, which combines hg19 RefSeq and Gencodev17 exon coordinates (downloaded from UCSC 02-23-14) with corresponding HUGO gene symbols in column 4. With this file, FACTERA will identify inter-gene fusions. To identify inter-exon fusions in hg19, use this exons.bed file. 4. **hg19.2bit **is a 2 bit encoded human reference genome, used for fast genome subsequence retrieval. Of note, FACTERA is not restricted to human sequences, and any 2bit reference genome can be used as long as coordinates in exons.bed are consistent. To create a 2bit file for a genome of interest, download the FASTA to 2BIT conversion tool from the appropriate system folder ([faToTwoBit](http://hgdownload.cse.ucsc.edu/admin/exe/))(该文最下面有介绍,) and follow these [instructions](http://genome.ucsc.edu/goldenPath/help/twoBit.html). ``` # faToTwoBit /software/BLAT/blatSrc/bin/faToTwoBit /home/lhlv/db/hg19/hg19.fasta /home/lhlv/db/hg19/hg19.2bit ``` 5. **targets.bed** is optional and allows the user to restrict the FACTERA search to genomic regions of interest, such as those targeted by a sequencing capture library. Format is a standard 3-column BED (chr start end). The use of a targets.bed file can greatly improve running time when only a subset of sequenced regions is known to be relevant for fusion detection. ## Output  上图可以看到,运行命令: ``` # factera for sample2 perl factera.pl sample2/sample2.mem.sorted.bam exons.bed /home/lhlv/db/hg19/hg19.2bit ``` sample2.mem.sorted.bam是输入的比对后经过sorted的bam文件;另外生成11个文件,其中有9个主要输出文件,介绍如下: Each FACTERA run produces 9 main output files, each of which is described below: 1. parameters.txt = all input files and parameter values.(设置的各个参数) 2. 如下 ``` >> cat sample2.mem.sorted.factera.parameters.txt Parameter (symbolic argument) Value Input BAM: sample2/sample2.mem.sorted.bam Genomic coordinates: exons.bed Reference genome: /home/lhlv/db/hg19/hg19.2bit Targeted regions: null Output directory (o): sample2 Minimum breakpoint-spanning reads for each fusion (r): 5 Minimum discordant reads for each candidate fusion (m): 2 Maximum breakpoints to examine for each read pair (x): 5 Minimum reads with same breakpoint required (s): 1 Minimum fraction of read bases for fusion template alignment (f): 0.9 Minimum % similarity for read to match fusion template (S): 0.95 K-mer size for fragment comparison (k): 12 Minimum size of soft-clipped read region (c): 16 Number of bases flanking breakpoint for fusion template (b): 500 Number of threads for blastn search (p): 4 Number of bases flanking breakpoint to provide in output (a): 50 Disable grouping of input coordinates by column 4 of input bed (e): 0 Disable verbose output (v): 0 Disable running time output (t): 0 Force remake of blast database (F): 0 ``` 3. discordantpair.depth.txt = ranked list of discordant read clusters. 4. 如下: ``` >>head sample2.mem.sorted.factera.discordantpair.depth.txt FUSION DEPTH CWH43 ZNF33B 416 SRGAP2D UQCRFS1 381 CWH43 SLC9B1P1 283 IGKV2D-40 CWH43 222 AL133216.1 SLC9B1P1 127 LTBP1 RET 116 ZNF33B SLC9B1P1 110 ``` 5. disordantpair.details.txt = discordant read positions. 6. 如下: ``` >>head sample2.mem.sorted.factera.discordantpair.details.txt GENE1 GENE2 FUSION CHR1 CHR2 GENE1_START GENE2_START DIST READ1_POS READ2_POS MT-ND1 MT-CYB MT-ND1-MT-CYB chrM chrM 3307 14747 11440 1 16460 MT-ND1 MT-CYB MT-ND1-MT-CYB chrM chrM 3307 14747 11440 1 16404 MT-ND1 MTRNR2L1 MT-ND1-MTRNR2L1 chrM chr17 3307 22023387 0 1 22020694 MT-ND1 MT-CYB MT-ND1-MT-CYB chrM chrM 3307 14747 11440 1 16370 MT-ND1 MT-CYB MT-ND1-MT-CYB chrM chrM 3307 14747 11440 1 16476 MT-ND1 MT-CYB MT-ND1-MT-CYB chrM chrM 3307 14747 11440 1 16410 ``` 7. fusiontargets.bed = bed coordinates of candidate fusions – used to restrict search space for soft-clipped reads. 8. 如下 ``` >>head sample2.mem.sorted.factera.fusiontargets.bed chr2 42396751 42544692 ALK EML4 chr16 63726852 65688279 LTBP1 CDH11 chr16 72518416 73048813 LTBP1 ZFHX3 chr4 184020644 184136080 WWC2 KHDRBS3 chr8 136165414 137784050 WWC2 KHDRBS3 chr3 14397929 14518850 LTBP1 SLC6A6 ``` 9. blastreads.fa = used to build blast database of soft-clipped, improperly paired, and unmapped reads. 10. 如下: ``` >>head sample2.mem.sorted.factera.blastreads.fa >NB501918:39:HVHH5AFXX:1:11107:5446:7815_0_SC ACTTCAGGGCCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTG >NB501918:39:HVHH5AFXX:1:11108:23884:2756_1_SC ATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGG >NB501918:39:HVHH5AFXX:1:11112:12997:9449_2_SC TTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAG >NB501918:39:HVHH5AFXX:1:11203:21288:8127_3_SC TAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCTATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGT >NB501918:39:HVHH5AFXX:1:11303:3972:6129_4_SC TTCCTACTTCAGGGCCATAAAGCCTAAATAGCCCACACTTTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGAAGTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAG ``` 11. blastquery.fa = file used to search individual candidate fusion sequences (query) for hits in blastreads.fa (target database). 12. 如下: ``` >>head sample2.mem.sorted.factera.blastquery.fa >chr2:150563991-150564491,chr15:22108306-22108806 GAATATTGTTCCCACGTGTTACATTTTTATTTATTTTGTAACTGTGAAGTTATTGTGATGGTTATACTGAAGATTATATAGGAGTATAATCAAAAGCCCTACATTTCTGAATTCTGAATAACTATTTAGAAAATTCAGCCTACATTTTTTTGAACATGTTATCTCTGGTTCTACAAACACAAAATTTTAGTTTTAATTTACATGGTGTAAAATTTCTAAATATATTACTCTAAAGATAAACTTCAGATATAAAAGAATTGGAGAAGTAATTGTTTTTATGTGAGTGTGGACCTATTCTGAGTAGGAAAATATATCAGAACAAAGCAGATGATTTCATGAGTGTTTATGATATACTAGCAAACTAAAACCTCACAGATTCTGAAAGCAAATTTATTTCCTCTGCTTTCCATTCATCTCTAAAATCTTGTGGTTCAGAATCTCCCCATCCAAACCCTTTGTTCTAGCTTGCCTTCTATTCATGCTAGACCTAATATACAATTCTGATCTTCCAAATAAACTGTAATTTCTGTGAGGACAGAGATCTTCTTGTCTTATGCTCGTATGCATTTCCCAAATGGCTTAGCACAGTGCCTTATATGCATTGGAGATTCAATAAGTATTTGTCGACTGTACAGATGAGTGGCTTTCTAATGAACTAGTCTGATCCAGGAAATAATTTGTTTACCCTGCATAGGTCTTGATTTTGCTGCTTTAAATGCCCCCAGCTACTCATGGAGTAAAGAGAAATCCATGAGAAGTGAATTGCTGTGTGGTTTTGAAAGGATATTCTACAATATTCTTAGAATTGCATAATAATGTTTGTGACCCACATTTTCAGTATATGTTTTACTTCTGTCCTAAAAATATTACATATGTGTGTGTATGTGCGTGTAATTGTTTCTTAGTACAAAGTAGTGACATTTATATTCCCTATCCTCATCACACAGTATCCTGTTTCAACCAAGAGTACAGCTCAGTTCTTTTTCTGTCCACGATAGAA ``` 13. fusionseqs.fa = all detected breakpoints with 500bp of additional flanking sequences. 14. 如下 ``` >>cat sample2.mem.sorted.factera.fusionseqs.fa >chr2:41688510-41689010,chr2:29448093-29448593 GTCCTGCTCATGGTCCCCTGAAAAGAAAAGTGTGACAAGGTCTCCAGCGATGGGAAGGAGCAAGTAGTTTTTGTCTCAGTGGGGGAGGTTATTTCTGACTTGGGACCAACTCAAAGGAGACCTGGTGTGGTTGTCAATACCCCAGTGGGACCTGTCTTCCAGTGTCACCCACCCCGGAGCCACACCTGCCACTCTCGCTGATCCTCTCTGTGGTGACCTCTGCCCTCGTGGCCGCCCTGGTCCTGGCTTTCTCCGGCATCATGATTGGTGAGTGCACAGAGCCCCAGGGACTCCCAAGGGGGCAGGAAGGCAGGACTGAATAGTGTCTCAGGCTGTGCCACAGGTGCCAAGGTGTCACTTCGTTATGCTAGTCCCTGGAATTGGGTGGGGTGGTGATTAGGGCAGCCCAGGCCAAGCCAAAACGGAAGCTCCCAACCTTCCCCCCACCAGAGCAGCTGCAGTTCCCTGAGGAGCCCCTGATTCTGCACCTCAGCCCCGTGTGGCTTATGCCTGTAATCCCAGCACTTTGGGAGGCCAAGACGGGCAGATCATGAGGGTCAGGAGTTCGAGACCAGCCTGGCCAACAGGATGAAACCCTGTCTCTACTAAAAATACAAAAATTAGCCAGGTGTGGTGGTGGGCACCTGTAATCCCAGCTACTCGGGACACTGAGGCAGGAGAATCACTTGAACCCAGGAGGTGGAGATTGTAGTGAGCCAACATCGCACCATTGCACTCCAGCCTGGGTGACAGAGTGAGACTCCATCTCAAGAAAAAAAAAAAAAAAAAAAAACAGGTTCGATATATAACACATATTGTAGAATACTTGGGGGTGTTGAAAGTTTCAAGGGGAGTAGTGGAAGTGGGATAGGGAAAGAGCAGTAGAATCCAGAGGGAAGGTTGAAAATAAATTCTCCAAGCATTTGAAATAAAACAAGCTAGTCCTGAATATTTAAAAGGCACAAGTATTTAAGTATTAAGACTGTATGCCAGCAGTCCC >chr2:42506731-42507231,chr2:29447584-29448084 GGACTACAGGTGCCCGCCATCATGCCTGGCTAATTTTTTTTATTTTTTATTTTTTCATATTTGTAGTAGAGACGGGGTTTCACCGTGTTAGCCACGATGGTCTCAATCTCCTGACCTCATGATCTGCCCATCTCAGCCTCCCAAAGTGCTGGAATTACAGGCGTGAGCCACCGCGCCCGGCTGTCTTTTTTTATTATTATTATTATTATTATTAATAGCCATCCGACAGGTATAAGGTAATATTTCATTGTGATTGTGATCTGCATTTCTCTGATGATTAGTGATGTTGAACAGTTTTTTGTATACCTGTTGGCCATTTTTATGTCTTTGGAGAGACTACAAAAGAAAAGATGTCCTTAGTCCATTTTAAAATCTGTTATTTAACTATTGAGTTCCTTATATATTTTTGAGATTAACCCTTTATTAGATATATGGCTTGCAAATATTTTCTCCCGTTCTGTAGGTGGCCTTTTTATTGTTGATTGTTTCCTGGCATAAGACTGGCTGATCAGGGGGTGGGGAGCTCCTTCAGTGTCCATCACGATGGTGAAAGCTCGCCCCCACCCCTAGACGTCACTTCTAGCTCCCACATGCTTCCACCGGCGCAGCTCCTGTTTGGCTCCCACCCTATGTAATGCACTAGCCCACTCTTCCCCAAACCAGCCCTCCACCACCCTCCAGGCAGAGAGATAGGAAAATCGGTTTCTGAGTATATTTCTGTTCAGCCTGTGAGCCAAGGTGAGCTGACCTGCAGGTCACAGAGAACTCAGTGTGGTCCCAACCAGCTCTTACTGCTGGCAGAGACATGCCCAGGACAGATGGGCAGAGGCTTGAAAAGGGCAGAGGGAAAGGCTCTTGAGAGCCCTCGCAGGCCAGGCCCCTGCAGGCAAAGGGATCTGCCGGTAGAAGGGAGATGGCAGCACACACTGTGTCCCCATATGGTGCCATCCCTCAAAGGGACAGGATAATAGGAGCTAACACTTGTTGCATGGTTACTACG ``` 15. fusions.bed = bed output for detected fusions. Useful for comparing runs or somatic vs germline (column 4 is fusion ID). 16. 如下: ``` >> cat sample2.mem.sorted.factera.fusions.bed chr2 29448094 29448094 ALK_C2orf91_chr2:29448094_chr2:41688511 chr2 41688511 41688511 ALK_C2orf91_chr2:29448094_chr2:41688511 chr2 42507231 42507231 EML4_ALK_chr2:42507231_chr2:29448084 chr2 29448084 29448084 EML4_ALK_chr2:42507231_chr2:29448084 ``` 17. fusions.txt = all detected fusion events, including details, described below: |Field | Description| | ---| --- | |Est_Type | Estimated structural variant type: TRA = translocation; INV = inversion; DEL = deletion; '-' = not determined | | Region1| Name of genomic region closest to breakpoint 1 (e.g., gene 1, exon 1, etc.) | | Region2| Name of genomic region closest to breakpoint 2 (e.g., gene 2, exon 2, etc.) | | Break1 Chromosomal breakpoint 1 | | Break2 Chromosomal breakpoint 2 | | Break_support1| Number of reads supporting breakpoint 1 | | Break_support2| Number of reads supporting breakpoint 2 | | Break_Offset Breakpoint adjustment in bases (e.g., owing to microhomology) | | Orientation(新版)v方向,如1+ 2- | | Order1 Orientation of read clipping with respect to breakpoint 1: CN, clipped followed by not clipped; NC, vice versa | | Order2| Same as Order1, but for breakpoint 2 | | Break_depth| Number of breakpoint-spanning reads | | Proper_pair_support| Number of properly paired and previously soft-clipped reads that map to fusion | | Unmapped_support| Number of previously unmapped reads that map to fusion | | Improper_pair_support| Number of previously discordantly paired reads that map to fusion | | Paired_end_depth| Total number of paired-end reads that flank breakpoint | | Total_depth| Mean total depth for regions flanking both breakpoints (+/-500bp by default) | | Fusion_seq| Estimated fusion sequence (50 bases flanking breakpoint by default) | | Non-templated_seq| Non-templated (i.e., non-reference) sequence segment (if any) enclosed in brackets | ``` >>cat sample2.mem.sorted.factera.fusions.txt Est_Type Region1 Region2 Break1 Break2 Break_support1 Break_support2 Break_offset Orientation Order1 Order2 Break_depth Proper_pair_support Unmapped_support Improper_pair_support Paired_end_depth Total_depth Fusion_seq Non-templated_seq INV ALK C2orf91 chr2:29448094 chr2:41688511 27 22 2 1- 2+ CN CN 42 36 0 6 9 456 AGCAGCTGCAGTTCCCTGAGGAGCCCCTGATTCTGCACCTCAGCCCCGTG <> TGGCTTATGCCTGTAATCCCAGCACTTTGGGAGGCCAAGACGGGCAGATC - INV EML4 ALK chr2:42507231 chr2:29448084 5 12 0 1+ 2- NC NC 17 15 0 2 2 456 TCCCGTTCTGTAGGTGGCCTTTTTATTGTTGATTGTTTCCTGGCATAAGA <> CTGGCTGATCAGGGGGTGGGGAGCTCCTTCAGTGTCCATCACGATGGTGA - ``` ## Download Please note that you will need to accept the terms of the license in order to download FACTERA. To download version 1.4.4, click here. Unzip contents (FACTERA Perl code, license, and hg19 exon coordinates) into the same directory. ### Requirements - Unix operating system (Linux, Mac OS X, etc.) - Perl 5, with the following external dependency: Statistics::Descriptive. To install Statistics::Descriptive from CPAN, issue the following command: `sudo cpan Statistics::Descriptive`. Other Perl dependencies are included in the Perl 5 Core Modules and should already be installed: IPC::Open3, List::Util, File::Spec, Symbol, Getopt::Std, File::Basename. - twoBitToFa: Find and download executable from the appropriate system folder, then copy/link/move to PATH (i.e., /usr/bin). - hg19.2bit to run FACTERA on the hg19 human genome. Note that hg38.2bit is now available. To use another reference genome, make sure that input BED coordinates are consistent (the exons.bed file provided here is currently hg19 only). - blast+: After downloading, find blastn and makeblastdb in ncbi-blast-version/bin and copy/link/move to PATH (i.e., /usr/bin). - SAMtools: After downloading, find samtools and copy/link/move to PATH (i.e., /usr/bin). Usage Options (defaults in parentheses): ``` -o Output directory (tumor.bam directory). -r <int> Minimum number of breakpoint-spanning reads needed for output (5). -m <int> Minimum number of discordant reads needed for a candidate fusion (2). -x <int> Maximum number of breakpoints to examine for any given pair of genomic regions (5). -s <int> Minimum number of reads with the same breakpoint (1). -f <0-1> Minimum fraction of read bases required for alignment to fusion template (0.9). -S <0-1> Minimum similarity required for alignment of read to fusion template (0.95). -k <int> k-mer size for fragment comparison (10 bases). -c <int> Minimum size of soft-clipped region to consider (16 bases). -b <int> Number of bases flanking breakpoint for fusion template (500). -p <int> Number of threads for blastn search (4; 10 or more recommended). -a <int> Number of bases flanking breakpoint to provide in output (50). -e Disable grouping of input coordinates by column 4 of exons.bed (off). -v Disable verbose output (off). -t Disable running time output (off). -C Disable addition of 'chr' prefix to chromosome names (off)*** -F Force remake of BLAST database for a particular input (off). ``` ***Required if 'chr' is absent from all input files, including reference.2bit. ## Example Dataset 1. Download HCC78 lung cancer cell line targeted sequencing data (~1.2GB). 2. Unpack into HCC78.bam and HCC78.bam.bai, using command: 3. tar -xvf HCC78.tar 4. Download exons.bed (argument #2 for running FACTERA). 5. This file combines hg19 RefSeq and Gencodev17 exon coordinates (downloaded from UCSC 02-23-14) with corresponding HUGO gene symbols in column 4. 6. Download targets.bed (optional argument #4 for running FACTERA). 7. This file contains BED coordinates for the 125kb +/- 500bp sequencing panel described in the FACTERA manuscript. 8. Create output directory. 9. mkdir test 10. Run FACTERA (make sure all dependencies are installed first). 11. perl factera.pl -o test HCC78.bam exons.bed hg19.2bit targets.bed 12. Expected output ## Available Datasets Targeted sequencing datasets available for download (tars contain bam and bai): 1. Download HCC78 lung cancer cell line sequencing data with SLC34A2-ROS1 fusion (~1.2GB) 2. Download NCI-H3122 lung cancer cell line sequencing data with EML4-ALK fusion (~1.1GB) ## FAQ Which aligners are supported? Answer: FACTERA was developed and optimized using targeted sequencing data aligned by bwa aln, and we currently recommend that users employ bwa aln for best performance. While FACTERA can be applied to data mapped by bwa mem, users should be aware of the following considerations when interpreting results. The most notable difference between bwa aln and mem with respect to fusion detection is the use of hard clipping in addition to soft clipping by bwa mem. Absent from bwa aln, hard clipping enables bwa mem to improve the mapping rate by realigning (rather than truncating) sufficiently long read segments in chimeric sequences. In contrast, bwa aln will truncate such reads without realignment (soft-clipping), and FACTERA leverages soft clipped, but not hard clipped, reads for breakpoint detection. Hard clipped reads will be supported in a future release of FACTERA, and we will notify registered users when this version is available. According to the paper, FACTERA has high specificity. Why does FACTERA report some fusions that appear to be false positives? Answer: False positive calls may arise from mapping artifacts (due to repeat sequences), PCR template switching, and other sequencing errors, and are increasingly difficult to avoid as the sequencing space grows in size and complexity. While we have implemented a variety of post-processing algorithms to reduce the false positive rate compared to previous methods (paper), the elimination of all fusions with repetitive content would risk discarding genuine events. We therefore recommend that users inspect the FACTERA output for possible false positives by using BLAT and the UCSC human genome browser. This is particularly important when using FACTERA to analyze exome or genome-scale datasets. In cases where paired normal datasets are available, we recommend leveraging this information to reduce the FPR. Finally, we would welcome suggestions from users on how to best discriminate real fusions in repeat regions from sequencing artifacts. Please send us your feedback/suggestions along with fusion results that you suspect are not real. This will help us to compile a blacklist of poorly behaving genomic regions that might be useful as a post-processing filter. Reference Aaron M. Newman, Scott V. Bratman, Henning Stehr, Luke J. Lee, Chih Long Liu, Maximilian Diehn* and Ash A. Alizadeh* (2014) FACTERA: a practical method for the discovery of genomic rearrangements at breakpoint resolution, BioinformaticsDOI: 10.1093/bioinformatics/btu549. Funding This work was supported by the Stanford Cancer Institute Genomics Initiative (A.A.A., M.D.); the Doris Duke Charitable Foundation (A.A.A., M.D.); the US Department of Defense (LCRP Promising Clinician Research Award to M.D., A.A.A.; W81XWH-12-1-0498 to A.M.N.); the US National Institutes of Health Director’s New Innovator Award Program (1-DP2-CA186569 to M.D.); the Damon Runyon Cancer Research Foundation (A.A.A.); the Lymphoma Research Foundation (A.A.A.); the Gabrielle’s Angel Foundation (A.A.A.); the Radiological Society of North America (RR1221 to S.V.B.); and the Thomas & Stacey Siebel Foundation (A.M.N.). Questions/Comments Contact us 来源: https://factera.stanford.edu/download.php 《附》 TwoBit Sequence Archives A twoBit file is a highly efficient way to store genomic sequence. The format is defined here. Note that lower-case nucleotides are considered masked in twoBit, which can cause such sequence to be ignored when using the -mask option with gfServer; therefore, you may wish to convert lower-case sequence to upper-case when preparing the FASTA format. To complete the steps below you must first download the faToTwoBit, twoBitInfo, and twoBitToFa utilities. For more information on downloading our command-line utilities, see theseinstructions. To create a twoBit file, follow these steps: Prepare the sequence for your twoBit file in a FASTA-formatted file (i.e. genome.fa). Run the faToTwoBit program on your FASTA file: faToTwoBit genome.fa genome.2bit Use twoBitInfo to verify the sequences in this assembly and create a chrom.sizes file, which is useful to construct the big* files in later processing steps: twoBitInfo genome.2bit stdout | sort -k2rn > genome.chrom.sizes The twoBit commands can function with the .2bit file as a URL: twoBitInfo -udcDir=. http://your-website.edu/~user/genome.2bit | sort -k2nr > genome.chrom.sizes Sequence can be extracted from the .2bit file with the twoBitToFa command, for example: twoBitToFa -seq=chr1 -udcDir=. http://your-website.edu/~user/genome.2bit stdout > genome.chr1.fa 来源: http://genome.ucsc.edu/goldenPath/help/twoBit.html ## perl脚本factera.pl ``` #!/usr/bin/perl # See http://factera.stanford.edu/files/license.txt for Usage Agreement and Licensing Details $|++; use List::Util qw[min max]; use Statistics::Descriptive; use File::Spec; use Symbol qw(gensym); use Getopt::Std; use File::Basename; use IPC::Open3; #DESCRIPTION=================================================================================================== #vars-fusions-factera.pl Fusion And Chromosomal Translocation Enumeration and Recovery Algorithm #A tool for rapid identification of genomic fusions in paired-end targeted sequencing data. #============================================================================================================== #METHOD======================================================================================================== # 1) Identify improperly paired reads and classify into unique gene/exon pairs # 2) Find partially mapped reads (soft-clipped) within a reasonable interval of (1) # 3) Use reads from (2) to determine breakpoint # 4) BLAST all reads against putative fusion sequences to calculate depth statistics #============================================================================================================== #DEPENDENCIES================================================================================================== # Must have the following executables in path: # SAMtools, twoBitToFa <http://hgdownload.cse.ucsc.edu/admin/exe/>, blastn, makeblastdb # Perl dependencies: # Statistics::Descriptive; Perl 5 core modules = List::Util, IPC::Open3, File::Spec, Symbol, Getopt::Std #============================================================================================================== my $version = "1.4.4"; my $size = @ARGV; my %opts = (r=>5, m=>2, x=>5, s=>1, f=>0.9, S=>0.95, p=>4, k=>12, b=>500, a=>50, c=>16); my %opts1; getopts('o:r:m:x:s:f:S:p:veCFk:b:p:a:tc:', \%opts1); die(" FACTERA version $version by Aaron M. Newman (amnewman\@stanford.edu) Purpose: A tool for detection of fusions in paired-end targeted (or genome-wide) sequencing data. Usage: factera.pl [options] <tumor.bam> <exons.bed> <genome.2bit> <optional: targets.bed> <tumor.bam> Paired reads mapped with BWA or another soft-clip capable aligner. NOTE: Must be indexed for factera to function properly. <exons.bed> Genomic coordinates with gene/exon names in fourth column. Data sources include RefSeq, UCSC, and/or GENCODE. <genome.2bit> Two bit reference sequence (e.g., hg19.2bit). NOTE: twoBitToFa application must be in PATH. <targets.bed> Restrict search to specified genome coordinates. Options (defaults in parentheses): -o <dir> output directory (tumor.bam directory) -r <int> minimum number of breakpoint-spanning reads required for output ($opts{r}) -m <int> minimum number of discordant reads required for each candidate fusion ($opts{m}) -x <int> maximum number of breakpoints to examine for each pair of genomic regions ($opts{x}) -s <int> minimum number of reads with the same breakpoint ($opts{s}) -f <0-1> minimum fraction of read bases required for alignment to fusion template ($opts{f}) -S <0-1> minimum similarity required for read to match fusion template ($opts{S}) -k <int> k-mer size for fragment comparison ($opts{k} bases) -c <int> minimum size of soft-clipped region to consider ($opts{c} bases) -b <int> number of bases flanking breakpoint for fusion template ($opts{b}) -p <int> number of threads for blastn search ($opts{p}; 10 or more recommended) -a <int> number of bases flanking breakpoint to provide in output ($opts{a}) -e disable grouping of input coordinates by column 4 of exons.bed (off) -v disable verbose output (off) -t disable running time output (off) -C disable addition of 'chr' prefix to chromosome names (off) -F force remake of BLAST database for a particular input (off) For detailed help, see http://factera.stanford.edu ") if($size == 0); my $options=''; foreach my $opt(keys %opts1){ $options.='|'.$opt.$opts1{$opt}; $opts{$opt}=$opts1{$opt}; if($opt eq "o") { die("\'$opts1{$opt}\' is not a directory. Abort!\n") if(!(-d $opts1{$opt})); print "Output directory: $opts1{$opt}\n"; } if($opt eq "r") { die("Minimum breakpoint-spanning reads needed must be an integer. Abort!\n") if is_integer($opts1{$opt}) == 0; die("Minimum breakpoint-spanning reads needed must be >0. Abort!\n") if $opts1{$opt}<1; print "Minimum breakpoint-spanning reads needed for output set to: $opts1{$opt}\n"; } if($opt eq "m") { die("Minimum discordant pairs for candidate fusion must be an integer. Abort!\n") if is_integer($opts1{$opt}) == 0; die("Minimum discordant pairs for candidate fusion must be >0. Abort!\n") if $opts1{$opt}<1; print "Minimum discordant pairs for candidate fusion set to: $opts1{$opt}\n"; } if($opt eq "x") { die("Minimum breakpoints for each gene pair must be an integer. Abort!\n") if is_integer($opts1{$opt}) == 0; die("Maximum breakpoints for each gene pair must be >0. Abort!\n") if $opts1{$opt}<1; print "Maximum breakpoints for each gene pair set to: $opts1{$opt}\n"; } if($opt eq "s") { die("Minimum number of reads with same breakpoint must be an integer. Abort!\n") if is_integer($opts1{$opt}) == 0; die("Minimum number of reads with same breakpoint must be >0. Abort!\n") if $opts1{$opt}<1; print "Minimum number of reads with same breakpoint set to: $opts1{$opt}\n"; } if($opt eq "f") { die("Minimum fraction of matching bases for fusion template must be a fraction (0-1). Abort!\n") if ($opts1{$opt}<0 || $opts1{$opt}>1); print "Minimum fraction of matching bases for fusion template set to: $opts1{$opt}\n"; } if($opt eq "S") { die("Minimum similarity between reads and fusion template must be a fraction (0-1). Abort!\n") if ($opts1{$opt}<0 || $opts1{$opt}>1); print "Minimum similarity between reads and fusion template set to: $opts1{$opt}\n"; } if($opt eq "k") { die("k-mer size for fragment comparison must be an integer. Abort!\n") if is_integer($opts1{$opt}) == 0; die("k-mer size for fragment comparison must be >0. Abort!\n") if $opts1{$opt}<1; print "k-mer size for fragment comparison set to: $opts1{$opt}\n"; } if($opt eq "c") { die("Minimum size of soft-clipped read must be an integer. Abort!\n") if is_integer($opts1{$opt}) == 0; die("Minimum size of soft-clipped read must be >0. Abort!\n") if $opts1{$opt}<1; print "Minimum size of soft-clipped read set to: $opts1{$opt}\n"; } if($opt eq "b") { die("Number of bases flanking breakpoint of fusion template must be an integer. Abort!\n") if is_integer($opts1{$opt}) == 0; die("Number of bases flanking breakpoint of fusion template must be >0. Abort!\n") if $opts1{$opt}<1; print "Number of bases flanking breakpoint of fusion template set to: $opts1{$opt}\n"; } if($opt eq "p") { die("Number of threads for blastn search must be an integer. Abort!\n") if is_integer($opts1{$opt}) == 0; die("Number of threads for blastn search must be >0. Abort!\n") if $opts1{$opt}<1; print "Number of threads for blastn search set to: $opts1{$opt}\n"; } if($opt eq "a") { die("Number of bases flanking breakpoint in output must be an integer. Abort!\n") if is_integer($opts1{$opt}) == 0; die("Number of bases flanking breakpoint in output must be >0. Abort!\n") if $opts1{$opt}<1; print "Number of bases flanking breakpoint in output set to: $opts1{$opt}\n"; } if($opt eq "e") {print "Disable grouping of input coordinates (e.g., allow fusions between exons)\n";} if($opt eq "v") {print "Suppress verbose output\n";} if($opt eq "t") {print "Suppress running time output\n";} if($opt eq "C") {print "Disable addition of \'chr\' prefix to chromosome names\n";} if($opt eq "F") {print "Force remake of blast database\n";} } #INPUT========================================================================================================= my $bam; #bam file my $exon; #exon coordinate bed file [chr <tab> start <tab> end <tab> genename] my $twobit; #2bit genome file (e.g., hg19.2bit) my $targets = 0; #restrict analysis to targeted regions (.bed) [use 0 if non-targeted] #============================================================================================================== die("Missing input files; Abort!\n") unless (@ARGV > 2); #get input files for (my $itor = 0; $itor < @ARGV; $itor++) { my $arg = $ARGV[$itor]; if($itor == 0) {$bam = $arg;} if($itor == 1) {$exon = $arg;} if($itor == 2) {$twobit = $arg;} if($itor == 3) { $targets = $arg; print "Restrict results to regions specified in: $targets\n"; } } #PARAMETERS============================================================================================ my $MINSPANNINGREADS = $opts{r}; #minimum number of breakpoint-spanning reads needed to output a fusion my $MINIMPROPERREADS = $opts{m}; #minimum number of discordant reads needed to consider putative fusion my $MAXBPS2EXAMINE = $opts{x}; #maximum number of putative breakpoints to consider for each unique gene pair my $MINBPSUPPORT = $opts{s}; #minimum number of reads spanning putative breakpoint to proceed my $MINBLASTFRAC = $opts{f}; #fraction of read length to determine minimum alignment length for matching sequences my $VERBOSE = 1; #if 1, verbose output, including running time; o.w. only results will be printed if(exists($opts{v})) {$VERBOSE = 0;} my $TIME = 1; #if 1, print running time if(exists($opts{t})) {$TIME = 0;} my $USE_ALL_COORS = 0; #if 0, cluster input coordinates (second argument) by 4th column of bed file (e.g., gene symbol); o.w. treat all coordinates separately (i.e., each line in bed file will be distinct; e.g., to find fusions between exons, rather than between genes) if(exists($opts{e})) {$USE_ALL_COORS = 1;} my $makeblastdb = 1; #will set to 0 if a $bam.nhr file is detected my $FORCEREMAKE = 0; #if 1, force remake of blast database if one already exists if(exists($opts{F})) {$FORCEREMAKE = 1;} my $disablechrprefix = 0; #if 1, do not add "chr" prefix to chromosome names if(exists($opts{C})) {$disablechrprefix = 1;} my $k = $opts{k}; #k-mer size my $buffer = $opts{b}; #size of genomic padding around breakpoint my $pad = $opts{a}; #number of bases flanking breakpoint to provide in output my $clipsize = $opts{c}; #only keep clipped regions at least 16 bp long by default (only 1 in 4.3B by random chance) my $minsimilarity = $opts{S}; #minimum %similarity for read to match fusion template my $BLASTTHREADS = $opts{p}; #number of threads to assign to blastn my $OUTPUTDIR = dirname($bam); #directory to write output files my $READLEN = 101; #default read length (will be modified if needed base on input) my $READLENBIT = 0; #if zero, check input read length and if different, modify READLEN if(exists($opts{o})) {$OUTPUTDIR = $opts{o};} die("k-mer size ($k) cannot be greater than minimum size of soft-clipped read ($clipsize). Abort!\n") if $k > $clipsize; #============================================================================================================== #write parameters used and arguments to disk my $params = basename($bam); $params =~ s/bam/factera.parameters.txt/g; $params = "$OUTPUTDIR/$params"; open("output0", ">$params") or die $!; my $v_ = $VERBOSE; if($v_ == 1) {$v_ = 0;} my $t_ = $TIME; if($t_ == 1) {$t_ = 0;} my $Ta_ = $targets; if($Ta_ eq "0") {$Ta_ = "null";} print output0 "Parameter (symbolic argument)\tValue Input BAM:\t$bam Genomic coordinates:\t$exon Reference genome:\t$twobit Targeted regions:\t$Ta_ Output directory (o):\t$OUTPUTDIR Minimum breakpoint-spanning reads for each fusion (r):\t$MINSPANNINGREADS Minimum discordant reads for each candidate fusion (m):\t$MINIMPROPERREADS Maximum breakpoints to examine for each read pair (x):\t$MAXBPS2EXAMINE Minimum reads with same breakpoint required (s):\t$MINBPSUPPORT Minimum fraction of read bases for fusion template alignment (f):\t$MINBLASTFRAC Minimum % similarity for read to match fusion template (S):\t$minsimilarity K-mer size for fragment comparison (k):\t$k Minimum size of soft-clipped read region (c):\t$clipsize Number of bases flanking breakpoint for fusion template (b):\t$buffer Number of threads for blastn search (p):\t$BLASTTHREADS Number of bases flanking breakpoint to provide in output (a):\t$pad Disable grouping of input coordinates by column 4 of input bed (e):\t$USE_ALL_COORS Disable verbose output (v):\t$v_ Disable running time output (t):\t$t_ Force remake of blast database (F):\t$FORCEREMAKE "; close(output0); #if blast database already exists, don't make it again... #write reads in fasta format to create blast database of improperly paired, soft-clipped, and unmapped reads my $blastdb = basename($bam); $blastdb =~ s/bam/factera.blastreads.fa/g; $blastdb = "$OUTPUTDIR/$blastdb"; if(-e $blastdb && $FORCEREMAKE == 0){ $makeblastdb = 0; print "Old blast database detected. Use -F argument to force remake.\n"; } #============================================================================================================== my @exons_start = (); #store exon start coordinates my %coors2gene = (); #store gene symbol for every unique exon [format: chr_start_end] my %chrom = (); #store index for each chromosome my %getothercoor = (); #get end coordinate if input is start and vice versa [format: gene_start or gene_end] my $blastitor = 0; #iterate so every subject has unique identifier my $now = time; #running time open(NULL, ">", File::Spec->devnull); #suppress standard error for mpileup #read in exon coors: store coordinates in sorted arrays for each chromosome open(FILE, $exon) or die $!; while(<FILE>){ my $line = $_; chomp($line); my @tokens = split("\t", $line); my $chr = addprefix($tokens[0]); my $start = $tokens[1]; my $end = $tokens[2]; my $gene = $tokens[3]; if(!exists($chrom{$chr})){ $size = (keys %chrom); $chrom{$chr} = $size; } my $chrindex = $chrom{$chr}; #retrieve chromosome index push(@{$exons_start[$chrindex]}, $start); #add start coordinates if(!exists($coors2gene{"$chr $start $end"})) { my $insert = $gene; if($USE_ALL_COORS == 1) {$insert = "$gene:$start";} $coors2gene{"$chr $start $end"} = $insert; } $getothercoor{"$chr $start"} = $end; $getothercoor{"$chr $end"} = $start; } if($VERBOSE == 1) {print "Done loading genomic coordinates\n";} #sort coordinates for each chromosome for(my $itor = 0; $itor < @exons_start; $itor++){ @{$exons_start[$itor]} = sort { $a <=> $b } @{$exons_start[$itor]}; } if($VERBOSE == 1) {print "Done sorting genomic coordinates\n";} my $impropname = basename($bam); $impropname =~ s/bam/factera.discordantpair.details.txt/g; $impropname = "$OUTPUTDIR/$impropname"; open("output", ">$impropname") or die $!; my $depthname = basename($bam); $depthname =~ s/bam/factera.discordantpair.depth.txt/g; $depthname = "$OUTPUTDIR/$depthname"; open(output2, ">$depthname") or die $!; print output "GENE1\tGENE2\tFUSION\tCHR1\tCHR2\tGENE1_START\tGENE2_START\tDIST\tREAD1_POS\tREAD2_POS\n"; print output2 "FUSION\tDEPTH\n"; if($makeblastdb == 1) {open(blast, ">$blastdb") or die $!;} #collect all ids of mapped improper pairs in targeted region my $insert = "-L $targets"; if($targets eq '0') {$insert = "";} open my $imp_pairs, '-|', "samtools view -F 2 $insert $bam | awk '\$2 ~ /81|161|97|145|65|129|113|177/'" or die; my $count = 0; my $addchrprefix = 1; #if 1, check for "chr" prefix; if not present, add it if($disablechrprefix == 1) {$addchrprefix = 0;} my %prevgene = (); #store genes already seen [format: chr_start] my %prevread = (); #store reads already seen my %fusepairs = (); #store unique read pairs per fusion my %genes2exons = (); #store exon coordinates and chromosome for each candidate fusion gene my %storeids4blast = (); #store read ids of properly paired reads that passed filtration criteria; used to build blast database my @progress = ('|','/','-','\\'); my $progressItor = 0; if($VERBOSE == 1) { my $currtime = time - $now; print "\n Time Current step\n"; print "---------- --------------------------------------\n"; printf("[%02d:%02d:%02d]", int($currtime / 3600), int(($currtime % 3600) / 60),int($currtime % 60)); print " Analyzing improperly paired reads...\n"; } while(<$imp_pairs>){ my $line = $_; my @tokens = split("\t", $line); $count++; if($count % 1000 == 0 && $VERBOSE == 1) { print " $progress[$progressItor] $count reads processed \r"; $progressItor++; if($progressItor > 3) {$progressItor = 0;} } my $id = $tokens[0]; #skip read ids that have already been processed if(!exists($prevread{$id})) {$prevread{$id}=$id;} else { next; } #read1 my $chr = $tokens[2]; if($addchrprefix == 1) {$chr = addprefix($chr);} my $pos = $tokens[3]; my $gene = ""; my $gene_start = ""; my $gene_end = ""; #read 2 my $chrb = $tokens[6]; if($addchrprefix == 1) {$chrb = addprefix($chrb);} if($chrb =~ m/=/) {$chrb = $chr;} my $posb = $tokens[7]; my $geneb = ""; my $gene_startb = ""; my $gene_endb = ""; if(!exists($prevgene{"$chr $pos"})){ #-----------------find closest start coordinate to read1 (and corresponding end)-------------- findGene($pos, \@{$exons_start[$chrom{$chr}]}, $gene_start, $gene_end, $gene, $chr); $prevgene{"$chr $pos"} = "$gene|$gene_start|$gene_end";#store gene }else{ my $tmp = $prevgene{"$chr $pos"}; my @vars= split("\\|", $tmp); $gene = $vars[0]; $gene_start = $vars[1]; $gene_end = $vars[2]; } #----------------find closest start coordinate to read2 (and corresponding end)------------ if(!exists($prevgene{"$chrb $posb"})){ findGene($posb, \@{$exons_start[$chrom{$chrb}]}, $gene_startb, $gene_endb, $geneb, $chrb); $prevgene{"$chrb $posb"} = "$geneb|$gene_startb|$gene_endb"; #store gene }else{ my $tmp = $prevgene{"$chrb $posb"}; my @vars = split("\\|", $tmp); $geneb = $vars[0]; $gene_startb = $vars[1]; $gene_endb = $vars[2]; } #-----------------if candidate translocation------------------------------------------------ if($gene ne $geneb){ #discordant gene pair... next if $gene eq "" || $geneb eq ""; my $dist = 0; if($chr eq $chrb){ #calculate distance if same chromosome $dist = abs($gene_start - $gene_startb); } my $fusion = "$gene-$geneb"; print output "$gene\t$geneb\t$fusion\t$chr\t$chrb\t$gene_start\t$gene_startb\t$dist\t$pos\t$posb\n"; #store chromosome and start position of each exon corresponding to candidate fusion gene my $minstart = max(1,min($gene_start, $pos-300)); my $minstartb = max(1,min($gene_startb, $posb-300)); my $maxend = max($gene_end, $pos+300); my $maxendb = max($gene_endb, $posb+300); if(!exists($genes2exons{$gene}->{"$chr $minstart"})){ $genes2exons{$gene}->{"$chr $minstart"} = $maxend;} elsif($genes2exons{$gene}->{"$chr $minstart"} < $maxend){ $genes2exons{$gene}->{"$chr $minstart"} = $maxend; } if(!exists($genes2exons{$geneb}->{"$chrb $minstartb"})){ $genes2exons{$geneb}->{"$chrb $minstartb"} = $maxendb;} elsif($genes2exons{$geneb}->{"$chrb $minstartb"} < $maxendb){ $genes2exons{$geneb}->{"$chrb $minstartb"} = $maxendb; } #eliminate redundant fragments (same start for both reads) for depth statistics if(!exists($fusepairs{"$gene $geneb"}->{"$pos $posb"})){ $fusepairs{"$gene $geneb"}->{"$pos $posb"} = "$pos $posb"; $storeids4blast{$id} = $id; #store read ids for blast database } #if($makeblastdb == 1) {print blast ">$tokens[0] " . ($blastitor++) . "\n$tokens[9]\n";} #print to blast database } } close($imp_pairs); close(output); if($VERBOSE == 1) {print "\n";} #-------------print out depth statistics for mapped improper pairs----------------------- my %genes = (); #store unique genes present in putative translocations my %pairs = (); #store unique gene pairs present in putative translocations my %depth = (); #for sorting depth my $fusetargets = basename($bam); $fusetargets =~ s/bam/factera.fusiontargets.bed/g; $fusetargets = "$OUTPUTDIR/$fusetargets"; open(fusiontargets, ">$fusetargets") or die $!; foreach my $g (keys %fusepairs){ my $d = (keys %{$fusepairs{$g}}); $depth{$g} = $d; } $count = 0; foreach my $g (sort {$depth{$b} <=> $depth{$a}} keys %depth){ $count++; next if $depth{$g} < $MINIMPROPERREADS; #skip if less than x discordant reads my @gs = split(" ", $g); if(!exists($genes{$gs[0]})) { $genes{$gs[0]} = $gs[0]; #print bed coordinates of candidate fusion gene 1 my $chr = ""; my $min = 9999999999; my $max = 0; foreach my $key (keys %{$genes2exons{$gs[0]}}){ my @vars = split(" ",$key); $chr = $vars[0]; if($addchrprefix == 1) {$chr =~ s/chr//g;} my $start = $vars[1]; if($start < $min) {$min = $start;} my $end = $genes2exons{$gs[0]}->{$key}; if($end > $max){$max = $end;} } print fusiontargets "$chr\t$min\t$max\t$g\n"; } if(!exists($genes{$gs[1]})) { $genes{$gs[1]} = $gs[1]; #print bed coordinates of candidate fusion gene 2 //TODO: extend to maximum read start (after last/first exon) my $chr= ""; my $min= 9999999999; my $max= 0; foreach my $key (keys %{$genes2exons{$gs[1]}}){ my @vars = split(" ",$key); $chr = $vars[0]; if($addchrprefix == 1) {$chr =~ s/chr//g;} my $start = $vars[1]; if($start <$min) {$min = $start;} my $end = $genes2exons{$gs[1]}->{$key}; if($end > $max){$max = $end;} } print fusiontargets "$chr\t$min\t$max\t$g\n"; } print output2 "$g\t$depth{$g}\n"; $pairs{$gs[0]}->{$gs[1]} = $depth{$g}; #store fusion partner of gene 1 $pairs{$gs[1]}->{$gs[0]} = $depth{$g}; #store fusion partner of gene 2 } close(fusiontargets); close(output2); #=============================================================================================== #------------infer break-point by soft-clipping, and increment depth of fusion pairs accordingly $count = 0; #reset counter for progress meter #open my $sclipped, '-|', "samtools view -F 12 -L $fusetargets $bam | awk '\$2 ~ /99|147|83|163|67|131|115|179/' | awk '\$6 ~ /S/'" or die; open my $sclipped, '-|', "samtools view -F 12 -L $fusetargets $bam | awk '\$2 ~ /99|147|83|163|67|131|115|179|81|161|97|145|65|129|113|177/' | awk '\$6 ~ /S/'" or die; my %breakpoints = (); #store potential breakpoints [key1=gene; key2=position; value=count] my %screads = (); #store soft-clipped reads meeting gene and length criteria my %consensus = (); #store consensus sequence for each breakpoint cluster $addchrprefix = 1; #if 1, check for "chr" prefix; if not present, add it if($VERBOSE == 1) { my $currtime = time - $now; printf("[%02d:%02d:%02d]", int($currtime / 3600), int(($currtime % 3600) / 60),int($currtime % 60)); print " Analyzing soft-clipped reads...\n"; } while(<$sclipped>){ my $line = $_; my @tokens = split("\t", $line); my $cigar = $tokens[5]; #next if !($cigar =~ m/S/); next if ($cigar =~ m/D/); next if($cigar =~ m/I/); $count++; if($count % 100 == 0 && $VERBOSE == 1) { print " $progress[$progressItor] $count reads processed \r"; $progressItor++; if($progressItor > 3) {$progressItor = 0;} } my $tmp = $cigar; $tmp =~ s/S//g; next if length($tmp) != length($cigar) - 1; my @parsed = parsecigar($cigar); my @parsed2 = parsecigar($parsed[2]); my $matched = 0; #cigar string 'M' my $skipped = 0; #cigar string 'S' #if forward orientation, properly paired, and with match followed by clip, go to next.. if(($tokens[1] == 99) || ($tokens[1] == 163)){ next if $parsed[1] eq "M"; } #if reverse orientation, properly paired, and with clip followed by match, go to next.. if(($tokens[1] == 147) || ($tokens[1] == 83)){ next if $parsed[1] eq "S"; } #only keep clipped regions at least clipsize (default 16 bases; only 1 in 4.3B by random chance) if($parsed[1] eq "S") {$skipped = $parsed[0]; $matched = $parsed2[0]; next if $parsed[0] < $clipsize;} elsif($parsed2[1] eq "S"){$skipped = $parsed2[0]; $matched = $parsed[0]; next if $parsed2[0] < $clipsize;} my $chr = $tokens[2]; if($addchrprefix == 1) {$chr = addprefix($chr);} my $pos = $tokens[3]; my $gene = ""; my $gene_start = 0; my $gene_end = 0; #find gene findGene($pos, \@{$exons_start[$chrom{$chr}]}, $gene_start, $gene_end, $gene, $chr); next if !exists($genes{$gene}); if($makeblastdb == 1) { #so makeblastdb does not complain about >40% Ns my $Ns = $tokens[9]; $Ns =~ s/N//g; if(length($Ns)/length($tokens[9]) >= .75) { print blast ">$tokens[0]\_" . ($blastitor++) . "\_SC\n$tokens[9]\n"; } } #print to blast database my $bp = $pos; if($parsed[1] eq "M") { $bp += $matched - 1; my $prevLen = @{$screads{$gene}->{$bp}}[0]; my $clipped = substr($tokens[9],$matched); my $notclipped = substr($tokens[9],0,$matched); my $middle = length($tokens[9])/2; if(abs(length($clipped) - $middle) <= abs($prevLen - $middle)){ my @temp = (length($clipped), "$notclipped $clipped", "NC", $tokens[0]); $screads{$gene}->{$bp} = \@temp; my $adjustlen = 500; #store consensus sequence for soft-clipped portion of a given breakpoint my @seqarray = getSeqArray(\%consensus, $gene, $bp, 0, $adjustlen - $clipsize, 0, length($clipped), \%screads, $clipped); $consensus{$gene}->{$bp} = \@seqarray; } }else{ my $prevLen = @{$screads{$gene}->{$bp}}[0]; my $clipped = substr($tokens[9],0,length($tokens[9])-$matched); my $notclipped = substr($tokens[9],length($tokens[9])-$matched); my $middle = length($tokens[9])/2; if(abs(length($clipped) - $middle) <= abs($prevLen - $middle)){ my @temp = (length($clipped), "$clipped $notclipped", "CN", $tokens[0]); $screads{$gene}->{$bp} = \@temp; #store consensus sequence for soft-clipped portion of a given breakpoint my $adjustlen = 500; my @seqarray = getSeqArray(\%consensus, $gene, $bp, $adjustlen - $clipsize - length($clipped), $adjustlen - $clipsize, 0, length($clipped), \%screads, $clipped); $consensus{$gene}->{$bp} = \@seqarray; } } $breakpoints{$gene}->{"$chr:$bp"}++; } if($VERBOSE == 1) {print "\n";} #------------------------------------------------------------------------------ #finish writing reads to disk, then make blast database if not already made if($makeblastdb == 1) {makeBLASTdb();} #------------------------------------------------------------------------------ #sort soft-clipped reads by breakpoint, identify "best" breakpoints and analyze my %usedpair = (); #store gene pairs to output only unique pairs my %bestfusions = (); #only keep fusions with best support for a given gene pair and pair of coordinates (+/-5 bp pad) my $fusions = basename($bam); $fusions =~ s/bam/factera.fusions.txt/g; $fusions = "$OUTPUTDIR/$fusions"; open(output3, ">$fusions") or die $!; my $fusionseqs = basename($bam); $fusionseqs =~ s/bam/factera.fusionseqs.fa/g; $fusionseqs = "$OUTPUTDIR/$fusionseqs"; open(output4, ">$fusionseqs") or die $!; my $fusions_bed = basename($bam); $fusions_bed =~ s/bam/factera.fusions.bed/g; $fusions_bed = "$OUTPUTDIR/$fusions_bed"; open(output5, ">$fusions_bed") or die $!; if($VERBOSE == 1) { my $currtime = time - $now; printf("[%02d:%02d:%02d]", int($currtime / 3600), int(($currtime % 3600) / 60),int($currtime % 60)); print " Validating candidate fusions...\n"; } #================================================================================================== my $total_comp = 0; #count total events to be compared (for progressbar) if($VERBOSE == 10) { foreach my $gene (keys %breakpoints){ my $itor = 0; foreach my $bp (sort {$breakpoints{$gene}->{$b} <=> $breakpoints{$gene}->{$a}} keys %{$breakpoints{$gene}}){ $itor++; last if $itor > $MAXBPS2EXAMINE; #only examine the x most abundant putative breakpoints for gene 1 my $count = $breakpoints{$gene}->{$bp}; next if $count < $MINBPSUPPORT; #skip if <x reads support my $itor2 = 0; foreach my $gene2 (keys %{$pairs{$gene}}) { foreach my $bp2 (sort {$breakpoints{$gene2}->{$b} <=> $breakpoints{$gene2}->{$a}} keys %{$breakpoints{$gene2}}){ $itor2++; last if $itor2 > $MAXBPS2EXAMINE; #examine x most abundant putative breakpoints for gene 2 $total_comp++; } } } } } #================================================================================================== my $progress_itor = 0; $progressItor = 0; my $prev_prog = -1; my $fusioncons = ""; foreach my $gene (keys %breakpoints){ my %rankedlist = (); #Rank all breakpoints for partner genes wrt read depth foreach my $gene2 (keys %{$pairs{$gene}}) { foreach my $bp2 (sort {$breakpoints{$gene2}->{$b} <=> $breakpoints{$gene2}->{$a}} keys %{$breakpoints{$gene2}}){ my $depth = $breakpoints{$gene2}->{$bp2}; $rankedlist{$depth}->{$gene2}->{$bp2} = $bp2; } } my $itor = 0; foreach my $bp (sort {$breakpoints{$gene}->{$b} <=> $breakpoints{$gene}->{$a}} keys %{$breakpoints{$gene}}){ $itor++; last if $itor > $MAXBPS2EXAMINE; #only examine the x most abundant putative breakpoints for gene 1 my $count = $breakpoints{$gene}->{$bp}; next if $count < $MINBPSUPPORT; #skip if <x reads support my $bp_ = $bp; $bp_ =~ s/.*://g; my $chr = $bp; $chr =~ s/:.*//g; my $id = @{$screads{$gene}->{$bp_}}[3]; my $clipOrder = @{$screads{$gene}->{$bp_}}[2]; my $read1 = @{$screads{$gene}->{$bp_}}[1]; my $read1b = $read1; my $readLen = length($read1) - 1; if($READLENBIT == 0) {$READLEN = $readLen; $READLENBIT = 1;} my $read1cut = @{$screads{$gene}->{$bp_}}[0]; if($clipOrder eq "NC"){ #excise non-clipped read using order bit $read1 = substr($read1, 0, (length($read1)-$read1cut)-1); my $read1btemp = substr($read1b, length($read1b)-$read1cut); $read1b = getConsensus(\%consensus, $gene, $bp_); $read1b = substr($read1b, 0, length($read1btemp)); # $read1b = $read1btemp; }else{ $read1 = substr($read1, ($read1cut + 1)); my $read1btemp = substr($read1b, 0, $read1cut); $read1b = getConsensus(\%consensus, $gene, $bp_); $read1b = substr($read1b, 1 + length($read1b) - length($read1btemp)); # $read1b = $read1btemp; } my %kmers1 = getKmers($read1, $k); my %kmers1rc = getKmers(doRevComp($read1), $k); my %kmers1b = getKmers($read1b, $k); my %kmers1rcb = getKmers(doRevComp($read1b), $k); my $itor2 = 0; foreach my $depth (sort {$b <=> $a} keys %rankedlist) { foreach my $gene2 (keys %{$rankedlist{$depth}}) { foreach my $bp2 (keys %{$rankedlist{$depth}->{$gene2}}){ $itor2++; last if $itor2 > $MAXBPS2EXAMINE; #examine x most abundant putative breakpoints paired with gene 1 my $count2 = $breakpoints{$gene2}->{$bp2}; next if $count2 < $MINBPSUPPORT; #skip if <x reads support if($VERBOSE == 10){ $progress_itor++; my $prog = int(100*$progress_itor/$total_comp); if($prog % 4 == 0 && $prog != $prev_prog){ print "*"; $prev_prog = $prog; } } $progress_itor++; if($progress_itor % 100 == 0 && $VERBOSE == 1) { print " $progress[$progressItor] $progress_itor candidates processed \r"; $progressItor++; if($progressItor > 3) {$progressItor = 0;} } my $bp2_ = $bp2; $bp2_ =~ s/.*://g; my $chr2 = $bp2; $chr2 =~ s/:.*//g; my $id2 = @{$screads{$gene2}->{$bp2_}}[3]; my $clipOrder2 = @{$screads{$gene2}->{$bp2_}}[2]; my $read2 = @{$screads{$gene2}->{$bp2_}}[1]; my $read2b = $read2; my $read2cut = @{$screads{$gene2}->{$bp2_}}[0]; if($clipOrder2 eq "NC"){ #excise clipped read using order bit my $read2temp = substr($read2,length($read2)-$read2cut); $read2b = substr($read2b, 0, (length($read2b)-$read2cut)-1); $read2 = getConsensus(\%consensus, $gene2, $bp2_); $read2 = substr($read2, 0, length($read2temp)); }else{ my $read2temp = substr($read2, 0, $read2cut); $read2b = substr($read2b,($read2cut+1)); $read2 = getConsensus(\%consensus, $gene2, $bp2_); $read2 = substr($read2, 1 + length($read2) - length($read2temp)); } my $cmpthresh = 0.5; #compare clipped region of read2 with read1 non-clipped my $threshold = min(25,$cmpthresh*(min(length($read2), (length($read1))) - $k)); my $thresholdb = min(25,$cmpthresh*(min(length($read2b), (length($read1b))) - $k)); my $doOffset = 1; #if 1, allow offset adjustment my @same = compKmers(\%kmers1, \%kmers1rc, $read2, $k, $threshold, $read1, $doOffset, $clipOrder, $clipOrder2); my $clipOrderb = "NC"; my $clipOrder2b = "NC"; if($clipOrder eq "NC") {$clipOrderb = "CN";} if($clipOrder2 eq "NC") {$clipOrder2b = "CN";} my @sameb = compKmers(\%kmers1b, \%kmers1rcb, $read2b, $k, $thresholdb, $read1b, $doOffset, $clipOrderb, $clipOrder2b); if($same[0] != 9999999 && $same[0] != $sameb[0]) {$same[0] = $sameb[0] = 9999999;} #if clipped region of read 2 is sufficiently similar to non-clipped region of read 1... if($same[0] != 9999999 && $sameb[0] != 9999999){ next if exists($usedpair{"$gene\_$gene2\_$bp\_$bp2\_$clipOrder\_$clipOrder2"}) || exists($usedpair{"$gene2\_$gene\_$bp2\_$bp\_$clipOrder2\_$clipOrder"}); next if ($clipOrder eq $clipOrder2) && $same[1] eq 'F'; #'F' = forward read next if($clipOrder ne $clipOrder2) && $same[1] eq 'RC'; #'RC' = reverse complement read $usedpair{"$gene\_$gene2\_$bp\_$bp2\_$clipOrder\_$clipOrder2"} = 1; #store unique gene pair #assign strand orientation to fusion my $orientation1 = "2-"; if($clipOrder eq "NC") {$orientation1 = "1+";} my $orientation2= "2-"; if($clipOrder2 eq "CN") {$orientation2 = "1+";} if(($clipOrder ne $clipOrder2) && $clipOrder eq "CN") {$orientation1 = "2+"; $orientation2 = "1+";} if(($clipOrder ne $clipOrder2) && $clipOrder eq "NC") {$orientation2 = "2+"; $orientation1 = "1+";} #do bp correction by comparison to reference================= my $offsetbp1 = 0; my $offsetbp2 = $same[0]; if($same[0] != 0){ my @tmp = bp_correction(@{$screads{$gene2}->{$bp2_}}[1], \@same, $offsetbp2, $bp2_, $chr2); $offsetbp1 = $tmp[0]; $offsetbp2 = $tmp[1]; } #============================================================= #Find and store insert (non-templated or microhomologous) sequence (if any) if($offsetbp2!=0){ if($offsetbp2>0 && $clipOrder2 eq "NC") {$fusioncons = substr($read2, 0, $offsetbp2);} elsif($offsetbp2<0 && $clipOrder eq "NC") {$fusioncons = doRevComp(substr($read1b, 0, -$offsetbp2));} } #return and print adjusted fragment 2 breakpoint my $fusionSeqFa = getFusionSeq($clipOrder, $clipOrder2, $bp_, $bp2_, $bp, $bp2, $offsetbp2, $buffer, $offsetbp1); my @vals = split("\n", $fusionSeqFa); my $fusionSeq = $vals[1]; my $part1 = substr($fusionSeq, 0, $buffer); my $part2 = substr($fusionSeq, $buffer, $buffer); #print "$part1\t$part2\n"; #print "$fusioncons\t$bp_\t$bp2_$orientation1\t$orientation2\n"; #add non-templated sequence into estimated fusion sequence==== my $inserttype = "-"; if(length($fusioncons) > 0){ my $nt_len = length($fusioncons); if($orientation1 =~ m/1/) { $fusioncons = doRevComp($fusioncons); if(substr($part2, 0, length($fusioncons)) eq $fusioncons) { #$inserttype = "Microhomology"; $fusioncons = ""; } else { $inserttype = "Non-templated"; substr($part2, 0, length($fusioncons)) = $fusioncons; if($orientation2 =~ m/\+/){ $bp2_ += $nt_len; }else {$bp2_ -= $nt_len;} } } else { if(substr($part1, length($part1)-length($fusioncons), length($fusioncons)) eq $fusioncons){ #$inserttype = "Microhomology"; $fusioncons = ""; }else { $inserttype = "Non-templated"; substr($part1, length($part1)-length($fusioncons), length($fusioncons)) = $fusioncons; if($orientation1 =~ m/\+/){ $bp2_ -= $nt_len; }else {$bp2_ += $nt_len;} } } } #print "$fusioncons\t$bp_\t$bp2_\n"; if($fusioncons eq "") {$fusioncons = "-"; $fusioncons_bp = 0;} else {$fusioncons = "[$fusioncons]";} $nonref_insert = ""; $fusionSeqFa = $vals[0] . "\n" . $part1 . "" . $part2; #============================================================= my $data = "$gene\t$gene2\t$chr:" . ($bp_ + $offsetbp1) . "\t$chr2:" . ($bp2_ + $offsetbp2) . "\t$count\t$count2\t$same[0]\t$orientation1 $orientation2\t$clipOrder\t$clipOrder2\t"; #do BLAST search my $datatmp = doBLAST($fusionSeqFa, $gene, $gene2, $buffer, ($MINBLASTFRAC * $READLEN)); $data = $data . "$datatmp"; #print close-up of breakpoint my @vals = split("\n", $fusionSeqFa); my $fusionSeq = $vals[1]; $part1 = substr($fusionSeq, $buffer - $pad, $pad); $part2 = substr($fusionSeq, $buffer, $pad); #add non-templated sequence into estimated fusion sequence $data = $data . $part1 . " <> " . $part2 . " $fusioncons";#\_$inserttype"; $fusioncons = ""; #print "$data\n"; my @vars = split("\t", $datatmp); my $bp_depth = $vars[0]; #retrieve breakpoint depth next if $bp_depth < $MINSPANNINGREADS; #don't save putative translocations with too few reads spanning breakpoint my @tmp = ($data, $fusionSeqFa); my $bp1breaks = $tmp[4]; my $bp2breaks = $tmp[5]; my $code = "$chr:" . ($bp_ + $offsetbp1) . " $chr2:" . ($bp2_ + $offsetbp2); #only keep best fusion with same coordinates and depth (highest read 1 and read 2 bp support) if(!exists($bestfusions{$bp_depth}->{$code})) {$bestfusions{$bp_depth}->{$code} = \@tmp; }else{ my @tmp2 = $bestfusions{$bp_depth}->{$code}; if($bp1breaks > $tmp2[4] && $bp2breaks > $tmp2[5]){ $bestfusions{$bp_depth}->{$code} = \@tmp; } } } } } } } } #if(($progress_itor == $total_comp || $progress_itor == 0) && $VERBOSE == 1) {print "\n";} #print header if(keys %bestfusions > 0){ print output3 "Est_Type\tRegion1\tRegion2\tBreak1\tBreak2\tBreak_support1\tBreak_support2\tBreak_offset\tOrientation\tOrder1\tOrder2\tBreak_depth\tProper_pair_support\tUnmapped_support\tImproper_pair_support\tPaired_end_depth\tTotal_depth\tFusion_seq\tNon-templated_seq\n"; if($VERBOSE == 1) {print "\n-------------------------------------------------\n\nFACTERA results:\nEst_Type\tRegion1\tRegion2\tBreak1\tBreak2\tBreak_support1\tBreak_support2\tBreak_offset\tOrientation\tOrder1\tOrder2\tBreak_depth\tProper_pair_support\tUnmapped_support\tImproper_pair_support\tPaired_end_depth\tTotal_depth\tFusion_seq\tNon-templated_seq\n";} } #=========================================================================================================== #print final fusion predictions, ordered by decreasing breakpoint depth (remove redundancies with 20bp window) my %bestfusions_ = (); #store best fusion events and use to remove redundancies (+/- 20bp padding) my %normdepth = getNormDepth(\%bestfusions, $buffer); #total depth of genomic regions flanking candidate fusion(s) my $transitor = 0; #count translocations/fusions foreach my $bp_depth (sort {$b <=> $a} keys %bestfusions){ foreach my $bps (keys %{$bestfusions{$bp_depth}}){ my $orig_bps = $bps; my @tmp = @{$bestfusions{$bp_depth}->{$bps}}; my $data = $tmp[0]; my @tokens = split(" ", $bps); my $bp1 = $tokens[0]; my $bp2 = $tokens[1]; my $chr1 = $bp1; $chr1 =~ s/:.*//g; $bp1 =~ s/.*://g; my $chr2 = $bp2; $chr2 =~ s/:.*//g; $bp2 =~ s/.*://g; my @vars = split("\t", $data); my $orient1 = $vars[7]; #Estimate rearrangement type my $first = substr($orient1,0,1); my $second = substr($orient1,3,1); my $first_polarity = substr($orient1,1,1); my $sec_polarity = substr($orient1,4,1); my $SR_type = " - "; if($chr1 ne $chr2) {$SR_type = "TRA";} elsif($first_polarity ne $sec_polarity) {$SR_type = "INV";} elsif(($bp1<$bp2 && $first == 1) || ($bp1>$bp2 && $first == 2)){$SR_type = "DEL";} #elsif(abs(1+$bp1-$bp2)>1000000){$SR_type = "TRA";} #Switch order for instances where region 2 is first (5' of fusion) if($first == 2){ my $tmp = $vars[0]; $vars[0] = $vars[1]; $vars[1] = $tmp; $tmp = $vars[2]; $vars[2] = $vars[3]; $vars[3] = $tmp; $tmp = $vars[4]; $vars[4] = $vars[5]; $vars[5] = $tmp; $tmp = $vars[8]; $vars[8] = $vars[9]; $vars[9] = $tmp; $vars[7] =~ s/1/0/g; $vars[7] =~ s/2/1/g; $vars[7] =~ s/0/2/g; $orient1 = $vars[7]; $tmp = $bp1; $bp1 = $bp2; $bp2 = $tmp; $tmp = $chr1; $chr1 = $chr2; $chr2 = $tmp; $bps = "$chr1:$bp1 $chr2:$bp2"; } #remove redundancies within 20bp window foreach my $bp_depth_ (sort {$b <=> $a} keys %bestfusions){ foreach my $bps_ (keys %{$bestfusions{$bp_depth_}}){ my @tokens2 = split(" ", $bps_); my $bp1_ = $tokens2[0]; my $bp2_ = $tokens2[1]; my $chr1_ = $bp1_; $chr1_ =~ s/:.*//g; $bp1_ =~ s/.*://g; my $chr2_ = $bp2_; $chr2_ =~ s/:.*//g; $bp2_ =~ s/.*://g; my @tmp_ = @{$bestfusions{$bp_depth_}->{$bps_}}; my $data_ = $tmp_[0]; my @vars_ = split("\t", $data_); my $orient2 = $vars_[7]; my $first = substr($orient2,0,1); #Switch order for instances where region 2 is first (5' of fusion) if($first == 2){ $vars_[7] =~ s/1/0/g; $vars_[7] =~ s/2/1/g; $vars_[7] =~ s/0/2/g; $orient2 = $vars_[7]; my $tmp = $bp1_; $bp1_ = $bp2_; $bp2_ = $tmp; $tmp = $chr1_; $chr1_ = $chr2_; $chr2_ = $tmp; $bps_ = "$chr1_:$bp1_ $chr2_:$bp2_"; } next if $bps eq $bps_; next if ($chr1 ne $chr1_) || ($chr2 ne $chr2_); next if $orient1 ne $orient2; if(($bp1_ <= $bp1 + 20) && ($bp1_ >= $bp1 - 20) && ($bp2_ <= $bp2 + 20) && ($bp2_ >= $bp2 - 20)){ $bestfusions_{$bps_} = $bps_; } } } if(!exists($bestfusions_{$bps})) { $bestfusions_{$bps} = $bps; my $seq = $vars[(@vars-1)]; my @tokens = split(" ", $seq); $seq = join("\t", @tokens); $vars[(@vars-1)] = $normdepth{$orig_bps}; #retrieve genomic depth $data = join("\t", @vars); $data = $SR_type . "\t" . $data . "\t". $seq; print output3 "$data\n"; if($VERBOSE == 1) {print "$data\n";} my $fusionSeq = $tmp[1]; print output4 "$fusionSeq\n"; my @t1 = split(":", $vars[2]); my @t2 = split(":", $vars[3]); print output5 "$t1[0]\t$t1[1]\t$t1[1]\t$vars[0]\_$vars[1]\_$vars[2]\_$vars[3]\n"; print output5 "$t2[0]\t$t2[1]\t$t2[1]\t$vars[0]\_$vars[1]\_$vars[2]\_$vars[3]\n"; $transitor++; #count translocations } } } if($VERBOSE == 1) { if($transitor > 0) { my $ins = "fusion"; if($transitor>1) {$ins = "fusions";} print "$transitor $ins found!\n"; } else { print "\nNo fusions found\n"; } } close(output3); close(output4); close(output5); $now = time - $now; # Print runtime # if($TIME == 1) {printf("Total running time: %02d:%02d:%02d\n", int($now / 3600), int(($now % 3600) / 60),int($now % 60));} #=========================================================================================================== #FUNCTIONS................ #=========================================================================================================== #retrieve depth of region surrounding fusion(s) sub getNormDepth{ my $buffer = $_[1]; my %coors = (); #store surrounding depth for each translocation foreach my $bp_depth ( keys %{$_[0]}){ foreach my $bps (keys %{$_[0]{$bp_depth}}){ my $data = $_[0]{$bp_depth}->{$bps}; my @tokens = split(" ", $bps); my $bp1 = $tokens[0]; my $bp2 = $tokens[1]; my $chr1 = $bp1; if($addchrprefix == 1) {$chr1 =~ s/chr//g;} $chr1 =~ s/:.*//g; $bp1 =~ s/.*://g; my $chr2 = $bp2; $chr2 =~ s/:.*//g; if($addchrprefix == 1) {$chr2 =~ s/chr//g;} $bp2 =~ s/.*://g; my @vars = split("\t", $data); my $orient1 = $vars[7]; my @cut = split(" ", $orient1); if($cut[0] =~ m/-/){ $bp1 -= $buffer; } $coors{$chr1}->{$bp1} = $tokens[0]; #chr1:bp1 if($cut[1] =~ m/-/){ $bp2 -= $buffer; } $coors{$chr2}->{$bp2} = $tokens[1]; #chr2:bp2 my $stat = Statistics::Descriptive::Full->new(); my $insert = "-l $targets"; if($targets eq '0') {$insert = "";} my $pid = open3(gensym, \*PH, ">&NULL", "samtools mpileup -ABr $chr1:$bp1-" . ($bp1+$buffer) . " -Q20 $insert -d 10000000 $bam"); my @d1 = (); #store bp depth for gene 1 while(<PH>){ my $line = $_; chomp($line); my @vars = split("\t", $line); my $depth = $vars[3]; push(@d1, $depth); } close(PH); $stat->add_data(@d1); my $med_depth1 = 0; $med_depth1 = $stat->median(); my $pid = open3(gensym, \*PH, ">&NULL", "samtools mpileup -ABr $chr2:$bp2-" . ($bp2+$buffer) . " -Q20 $insert -d 10000000 $bam"); my @d2 = (); #store bp depth for gene 2 while(<PH>){ my $line = $_; chomp($line); my @vars = split("\t", $line); my $depth = $vars[3]; push(@d2, $depth); } close(PH); $stat = Statistics::Descriptive::Full->new(); $stat->add_data(@d2); my $med_depth2 = 0; $med_depth2 = $stat->median(); my $final_depth = 0; #if(@d1 > 0 && @d2 > 0){ # $final_depth = ($med_depth1 + $med_depth2)/2; #}else { $final_depth = max($med_depth1, $med_depth2); #} $coors{$bps} = $final_depth; } } return %coors; } #=========================================================================================================== #BLAST all reads that map to genes involved in fusion region (properly paired, improperly paired, and not paired) sub doBLAST{ my $gene = $_[1]; my $gene2 = $_[2]; my $buffer = $_[3]; my $min_len = $_[4]; #write fusion gene to temp file for blast search my $query = basename($bam); $query =~ s/bam/factera.blastquery.fa/g; $query = "$OUTPUTDIR/$query"; open(bquery, ">$query") or die $!; print bquery "$_[0]"; close(bquery); my $bdbname = basename($bam); $bdbname =~ s/bam/factera.blastreads.fa/g; $bdbname = "$OUTPUTDIR/$bdbname"; #TODO: detect blast version: blastx: -max_target_seqs; blastn "-num_alignments 9999999 -num_descriptions 9999999" #open my $blastout, '-|', "blastn -task 'megablast' -query $query -db $bdbname -outfmt 6 -num_threads $BLASTTHREADS -num_alignments 9999999 -num_descriptions 9999999" or die; open my $blastout, '-|', "blastn -task 'megablast' -query $query -db $bdbname -outfmt 6 -num_threads $BLASTTHREADS -max_target_seqs 9999999" or die; my $bp_depth = 0; #breakpoint depth my $bpd_SC = 0; #count soft-clipped properly paired read support my $bpd_UM = 0; #count unpaired read support my $bpd_IP = 0; #count soft-clipped improperly paired read support my $proper_pair_depth = 0; #extra support due to properly paired reads that flank breakpoint my %properpairs = (); #store proper pairs my %storeused = (); #store read ids to ensure each fragment only counted once my $percentsim = 100 * $minsimilarity; #iterate through BLAST results and count reads overlapping breakpoint while(<$blastout>){ my $line = $_; chomp($line); my @tokens = split("\t", $line); my $id = $tokens[1]; $id =~ s/\_.*//g; my $readtype = $tokens[1]; $readtype =~ s/.*\_//g; #store read type code (SC=soft-clipped, UM=unmapped, IP=improperly paired) my $percent = $tokens[2]; my $align_len = $tokens[3]; my $left = $tokens[6]; my $right = $tokens[7]; if($percent >= $percentsim && $align_len >= $min_len){ if (exists($storeused{$id})){ #skip if id already used $properpairs{$id}->{$left} = $right; #group proper pairs next; } if(min($left,$right) < ($buffer - 15) && max($left,$right) > ($buffer + 15)){ $bp_depth++; if($readtype eq "SC") {$bpd_SC++;} if($readtype eq "UM") {$bpd_UM++;} if($readtype eq "IP") {$bpd_IP++;} $storeused{$id} = $id; #store id } $properpairs{$id}->{$left} = $right; #group proper pairs } } #===================================== #count proper pairs that flank breakpoint foreach my $id (keys %properpairs){ my $size = (keys %{$properpairs{$id}}); next if $size != 2; my @coors = (); foreach my $left (keys %{$properpairs{$id}}){ push(@coors, $left); push(@coors, $properpairs{$id}->{$left}); } if((min(@coors) < ($buffer - 85)) && (max(@coors) > ($buffer + 85))){ $proper_pair_depth++; } } #===================================== #print break-point depth my $data = "$bp_depth\t$bpd_SC\t$bpd_UM\t$bpd_IP\t"; #print proper pair depth $data = $data . "$proper_pair_depth\t"; return $data; } #=========================================================================================================== sub makeBLASTdb{ my $currtime = time - $now; printf("[%02d:%02d:%02d]", int($currtime / 3600), int(($currtime % 3600) / 60),int($currtime % 60)); print" Extracting reads for blast database...\n"; #print all unmapped reads to blast database fasta file open my $blastreads, '-|', "samtools view -F 2 -L $fusetargets $bam" or die; $count = 0; while(<$blastreads>){ my $line = $_; chomp($line); my @tokens = split("\t", $line); if ($storeids4blast{$tokens[0]}){ $count++; if($count % 1000 == 0 && $VERBOSE == 1) { print " $progress[$progressItor] $count reads written \r"; $progressItor++; if($progressItor > 3) {$progressItor = 0;} } #so makeblastdb does not complain about >40% Ns my $Ns = $tokens[9]; $Ns =~ s/N//g; if(length($Ns)/length($tokens[9]) < .75) {next;} print blast ">$tokens[0]\_" . ($blastitor++) . "\_IP\n$tokens[9]\n"; }elsif(($tokens[1] == 73) || ($tokens[1] == 133) || ($tokens[1] == 89) || ($tokens[1] == 121) || ($tokens[1] == 165) || ($tokens[1] == 181) || ($tokens[1] == 101) || ($tokens[1] == 117) || ($tokens[1] == 153) || ($tokens[1] == 185) || ($tokens[1] == 69) || ($tokens[1] == 137) || ($tokens[1] == 77) || ($tokens[1] == 141)){ $count++; if($count % 1000 == 0 && $VERBOSE == 1) { print " $progress[$progressItor] $count reads written\r"; $progressItor++; if($progressItor > 3) {$progressItor = 0;} } #so makeblastdb does not complain about >40% Ns my $Ns = $tokens[9]; $Ns =~ s/N//g; if(length($Ns)/length($tokens[9]) < .75) {next;} print blast ">$tokens[0]\_" . ($blastitor++) . "\_UM\n$tokens[9]\n"; } } print "\n"; close($blastreads); close(blast); my $bdbname = basename($bam); $bdbname =~ s/bam/factera.blastreads.fa/g; $bdbname = "$OUTPUTDIR/$bdbname"; $currtime = time - $now; printf("[%02d:%02d:%02d]", int($currtime / 3600), int(($currtime % 3600) / 60),int($currtime % 60)); print " Creating blast database..."; system("makeblastdb -in $bdbname -dbtype 'nucl' >/dev/null"); if($VERBOSE == 1){ print "\n - Done\n"; } } #=========================================================================================================== #calculate fusion coordinates and retrieve genomic sequences surrounding breakpoint using user-definable pad length (default 200) sub getFusionSeq{ my $g1start = 0; my $g1end = 0; my $g2start = 0; my $g2end = 0; my $buffer = $_[7]; #size of genomic padding around breakpoint my $clipOrder = $_[0]; #NC or CN my $clipOrder2 = $_[1]; #NC or CN my $bp_ = $_[2]; #breakpoint fragment 1 my $bp2_ = $_[3]; #breakpoint fragment 2 my $bp = $_[4]; #chr:breakpoint fragment 1 my $bp2 = $_[5];#chr:breakpoint fragment 2 my $offset = $_[6]; #correct for overlap among fragments my $offset1 = $_[8]; #calculate breakpoint coordinates using clipping order (NC=not clipped followed by clipped; CN=clipped followed by non-clipped) and offset #Four orientations possible... #1) NC <-> NC [2nd fragment is reverse complement] #2) CN <-> CN [2nd fragment is reverse complement] #3) NC <-> CN [both fragments forward orientation] #4) CN <-> NC [both fragments forward orientation] if ($clipOrder eq $clipOrder2){ if($clipOrder eq "NC"){ $bp_ += $offset1; $g1start = ($bp_ - $buffer); $g1end = $bp_; $bp2_ += $offset; $g2start = ($bp2_ - $buffer); $g2end = $bp2_; }else{ $bp_ += $offset1; $g1start = $bp_ - 1; $g1end = $bp_ + ($buffer - 1); $bp2_ += $offset; $g2start = $bp2_ - 1; $g2end = $bp2_ + ($buffer - 1); } }else{ if($clipOrder eq "NC"){ $bp_ += $offset1; $g1start = ($bp_ - $buffer); $g1end = $bp_; $bp2_ += $offset; $g2end = $bp2_ + ($buffer - 1); $g2start = $bp2_ - 1; }else{ $bp_ += $offset1; $g1start = $bp_ - 1; $g1end = $bp_ + ($buffer - 1); $bp2_ += $offset; $g2end = $bp2_; $g2start = ($bp2_ - $buffer); } } my $index1 = $bp; $index1 =~ s/:.*//g; $index1 = $index1 . ":" . $g1start . "-" . $g1end; my $index2 = $bp2; $index2 =~ s/:.*//g; $index2 = $index2 . ":" .$g2start . "-" . $g2end; if($g1start < 0 || $g2start < 0) { return ">$index1,$index2\nNNNNNN";} #extract genomic sequence for putative translocation and write to temporary file my $output = basename($bam); $output =~ s/bam/factera.tmp.fa/g; $output = "$OUTPUTDIR/$output"; system("twoBitToFa -noMask $twobit:$index1,$index2 $output"); #open temporary fasta file containing genomic sequences open(FILE, $output) or die $!; my $seqCount = 0; my $seq1 = ""; my $seq2 = ""; while(<FILE>){ my $line = $_; chomp($line); if($line =~ m/>/){ $seqCount++; next; } if($seqCount == 1){ $seq1 = $seq1 . $line; }else {$seq2 = $seq2 . $line;} } close(FILE); #remove temporary file system("rm $output"); #merge both sequences to create putative fusion gene sequence my $fusionseq = $seq1; if($clipOrder eq $clipOrder2) { #need to flip fragment 2 into reverse complemen t(seq2) $seq2 = doRevComp($seq2); } if($clipOrder eq "NC") {$fusionseq = $fusionseq . $seq2;} #order=fragment 1, fragment 2 else {$fusionseq = $seq2 . $fusionseq;} #order=fragment 2, fragment 1 return ">$index1,$index2\n$fusionseq"; } #=========================================================================================================== #compare clipped read to non-clipped read using k-mer hashtable sub compKmers{ my $k = $_[3]; #size of k-mer my $sumk = 0; #count forward hits my $sumkrc = 0; #count reverse complement hits my $read1len = length($_[5]); #read 1 length ($_[5] is read1 seq) my $doOffset = $_[6]; my $clip1 = $_[7]; #clip order, read 1 my $clip2 = $_[8]; #clip order, read 2 my $offset = 0; my $offsetrc = 0; my $firstmatch = 0; my $firstmatchrc = 0; for(my $itor = 0; $itor < length($_[2]) - $k; $itor++){ my $km = substr($_[2], $itor, $k); if(exists($_[0]{$km})) { $sumk++; if($firstmatch == 0) { if($clip1 eq "CN" && $clip2 eq "NC"){ $offset = $itor - $_[0]{$km}; }elsif($clip1 eq "NC" && $clip2 eq "CN") { $offset = ($read1len - $_[0]{$km}) - (length($_[2]) - $itor); } $firstmatch = 1; } } if(exists($_[1]{$km})) { $sumkrc++; if($firstmatchrc == 0) { if($clip1 eq "NC" && $clip2 eq "NC"){ $offsetrc = $itor - $_[1]{$km}; }elsif($clip1 eq "CN" && $clip2 eq "CN") { $offsetrc = ($read1len - $_[1]{$km}) - (length($_[2]) - $itor); #print "$offsetrc\t$read1len\t$itor\t" . $_[1]{$km} . "\t" . length($_[2]) . "\t$km\n"; } $firstmatchrc = 1; } } } if($sumk >= max($_[4],($clipsize-$k))){ my @tmp = ($offset,'F'); return @tmp; }elsif($sumkrc >= max($_[4],($clipsize-$k))) { my @tmp = ($offsetrc,'RC'); return @tmp; } else{return 9999999;} } #=========================================================================================================== #convert read into kmers; store as hashtable sub getKmers{ my $k = $_[1]; #size of k my %kmers = (); #store k-mers for(my $itor = 0; $itor < length($_[0]) - $k; $itor++){ my $km = substr($_[0], $itor, $k); if(!exists($kmers{$km})) {$kmers{$km} = $itor;} } return %kmers; } #=========================================================================================================== #parse cigar string sub parsecigar{ my $cigar = $_[0]; my $cigarItor = 0; my $var = substr($cigar,$cigarItor,1); my $num = ""; #extract number and cigar code while($var =~ /^[+-]?\d+$/ && $cigarItor < length($cigar)){ $cigarItor++; $num = $num . $var; $var = substr($cigar,$cigarItor,1); } $cigar = substr($cigar,($cigarItor+1)); my @data; $data[0] = $num; $data[1] = $var; $data[2] = $cigar; return @data; } #=========================================================================================================== #return reverse complement of read sub doRevComp { my $rc = ""; for(my $itor = length($_[0])-1; $itor >= 0; $itor--){ my $c = substr($_[0], $itor, 1); if($c eq "A") {$c = "T";} elsif($c eq "T") {$c = "A";} elsif($c eq "C") {$c = "G";} elsif($c eq "G") {$c = "C";} $rc = $rc . $c; } return $rc; } #============================================================================================================ #given start coordinate of a read, find and return the closest gene------------------------------------------ sub findGene { my $pos = $_[0]; my $chr = $_[5]; my $start_coor = BinSearch($pos, \&cmpFunc, \@{$_[1]}); if((int $start_coor) - $start_coor != 0){ my $p1 = $start_coor - 0.5; my $p2 = $start_coor + 0.5; my $pos1 = @{$_[1]}[$p1]; #-1 my $pos2 = @{$_[1]}[$p2]; #-1 if(abs($pos1 - $pos) < abs($pos2 - $pos)){ $start_coor = $pos1; }else {$start_coor = $pos2;} }else{ $start_coor = @{$_[1]}[$start_coor]; } my $start_coor_end = $getothercoor{"$chr $start_coor"}; $_[2] = $start_coor; $_[3] = $start_coor_end; $_[4] = $coors2gene{"$chr $start_coor $start_coor_end"}; } sub cmpFunc { my ($index, $arrayRef, $target) = @_; my $item = $$arrayRef[$index]; return $item <=> $target; } #binary search routine sub BinSearch { my ($target, $cmp) = @_; my $posmin = 0; my $posmax = $#{$_[2]}; return -0.5 if &$cmp (0, \@{$_[2]}, $target) > 0; return $#{$_[2]} + 0.5 if &$cmp ($#{$_[2]}, \@{$_[2]}, $target) < 0; while (1) { my $mid = int (($posmin + $posmax) / 2); my $result = &$cmp ($mid, \@{$_[2]}, $target); if ($result < 0) { $posmin = $posmax, next if $mid == $posmin && $posmax != $posmin; return $mid + 0.5 if $mid == $posmin; $posmin = $mid; } elsif ($result > 0) { $posmax = $posmin, next if $mid == $posmax && $posmax != $posmin; return $mid - 0.5 if $mid == $posmax; $posmax = $mid; } else { return $mid; } } } sub is_integer { local $_ = shift; # checks from perlfaq4 return $] < 5.009002 unless defined; return 1 if (/^[+-]?\d+$/); # is a +/- integer 0; } #get consensus sequence for soft-clipped reads for a given breakpoint sub getConsensus { my @seqarray = @{$_[0]{$_[1]}->{$_[2]}}; my $sq = ""; for(my $j = 0; $j < @seqarray; $j++) { my $mx_val = 0; my $mx_itor = 0; my $mx_char = "A"; for(my $r = 0; $r < 4; $r++){ if($seqarray[$j][$r] > $mx_val){ $mx_val = $seqarray[$j][$r]; $mx_itor = $r; } } next if $mx_val == 0; if($mx_itor == 1) {$mx_char = "T";} if($mx_itor == 2) {$mx_char = "C";} if($mx_itor == 3) {$mx_char = "G";} $sq = $sq . $mx_char; } return $sq; } #store consensus sequence for soft-clipped reads for a given breakpoint sub getSeqArray { my $i = $_[3]; my $len = $_[4]; my $j = $_[5]; my $len2 = $_[6]; my @seqarray = (); if(exists($_[7]{$_[1]}->{$_[2]})){ @seqarray = @{$_[0]{$_[1]}->{$_[2]}}; } for(; $i < $len; $i++){ next if $i < 0; my $ch = substr($_[8], $j, 1); if($ch eq "A") {$seqarray[$i][0]++;} if($ch eq "T") {$seqarray[$i][1]++;} if($ch eq "C") {$seqarray[$i][2]++;} if($ch eq "G") {$seqarray[$i][3]++;} $j++; if($j >= $len2) {last;} } return @seqarray; } #perform bp correction by checking subsequence against reference sub bp_correction { my $read2 = $_[0]; my @same = @{$_[1]}; my $offsetbp2 = $_[2]; my $bp2_ = $_[3]; my $chr2 = $_[4]; my $offsetbp1 = 0; my $index = ""; my $r2seq = ""; my @r2 = split(" ",$read2); if($same[0] > 0){ #grab from read 2 downstream of cutpoint $r2seq = substr($r2[1], 0, $same[0]); $index = $chr2 . ":" . ($bp2_-1) . "-" . ($bp2_+$same[0]-1); if($bp2_-1 < 0) {return @data;} }else{ #grab from read 2 upstream of cutpoint $r2seq = substr($r2[0], length($r2[0])+$same[0],-$same[0]); $index = $chr2 . ":" . ($bp2_+$same[0]) . "-" . $bp2_; if($bp2_+$same[0] < 0) {return @data;} } if(length($r2seq) < 3) {return ($offsetbp1, $offsetbp2);} #extract genomic sequence for putative translocation and write to temporary file my $output = basename($bam); $output =~ s/bam/factera.tmp.fa/g; $output = "$OUTPUTDIR/$output"; system("twoBitToFa -noMask $twobit:$index $output"); #open temporary fasta file containing genomic sequences open(FILE, $output) or die $!; my $seq1 = ""; while(<FILE>){ my $line = $_; chomp($line); next if($line =~ m/>/); $seq1 = $seq1 . $line; } close(FILE); #remove temporary file system("rm $output"); if($r2seq eq $seq1){ #adjust breakpoint 1 if($same[1] eq 'F') {$offsetbp1 = -$offsetbp2;} else {$offsetbp1 = $offsetbp2;} $offsetbp2 = 0; } my @data = ($offsetbp1, $offsetbp2); return @data; } #check for 'chr' prefix on chromosome input argument; add if not present sub addprefix { if($disablechrprefix == 1) {return $_[0];} if(!($_[0] =~ m/chr/)){ $_[0] = "chr" . $_[0]; }else {$addchrprefix = 0;} return $_[0]; } ```
laihui126
2023年1月9日 15:01
分享文档
收藏文档
上一篇
下一篇
微信扫一扫
复制链接
手机扫一扫进行分享
复制链接
关于 MrDoc
觅道文档MrDoc
是
州的先生
开发并开源的在线文档系统,其适合作为个人和小型团队的云笔记、文档和知识库管理工具。
如果觅道文档给你或你的团队带来了帮助,欢迎对作者进行一些打赏捐助,这将有力支持作者持续投入精力更新和维护觅道文档,感谢你的捐助!
>>>捐助鸣谢列表
微信
支付宝
QQ
PayPal
下载Markdown文件
分享
链接
类型
密码
更新密码