Converting the UCSC 2bit genome format to FASTA with twoBitToFa


Shared by 靠谱客 blogger 慈祥奇迹: notes collected during recent development work on converting the UCSC 2bit genome format to FASTA with twoBitToFa, posted here for reference.

Overview

twoBitToFa

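When I went to download the mouse mm10 genome from UCSC, I could not find a .fa file. What I found instead was a file called mm10.2bit, which I guessed was the genome sequence stored in a binary format. The accompanying file description says: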
mm10.2bit - contains the complete mouse/mm10 genome sequence in the 2bit file format. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case. The utility program, twoBitToFa (available from the kent src tree), can be used to extract .fa file(s) from this file.

A pre-compiled version of the command line tool can be found at:
http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ (this is the key part: UCSC itself provides the solution)
See also:
http://genome.ucsc.edu/admin/git.html
http://genome.ucsc.edu/admin/jk-install.html
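
The file description above says that repeats are written in lower case (soft-masked) and everything else in upper case. As a rough illustration of what that means for the extracted FASTA, here is a small shell sketch that counts the soft-masked bases. The function name count_soft_masked is made up for this example, and it simply counts lower-case letters on non-header lines.

```shell
# Hypothetical helper: count lower-case (soft-masked) bases in FASTA text
# read from standard input.
count_soft_masked() {
    # Drop the ">" header lines, keep only lower-case letters, count them.
    grep -v '^>' | tr -cd '[:lower:]' | wc -c | tr -d ' '
}

# Tiny in-line FASTA: 4 masked bases in chr1 plus 2 in chr2.
printf '>chr1\nACGTacgtNN\n>chr2\nnnAA\n' | count_soft_masked
```

On a real file this would be run as count_soft_masked < mm10.fa once the FASTA has been extracted. If the FASTA was produced with the -noMask option, which converts everything to upper case, the count should come out as 0.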

Downloading twoBitToFa

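# Download the pre-compiled binary from the UCSC directory above (run this inside /home/xxx/lustre1/software/twoBit2Fa)
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
chmod +x twoBitToFa
# Append the directory to PATH in ~/.bashrc so the setting persists, then reload it
echo 'export PATH=$PATH:/home/xxx/lustre1/software/twoBit2Fa' >> ~/.bashrc
source ~/.bashrc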
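
After changing PATH it is worth confirming that the shell can actually find the binary before going further. A minimal sketch follows; have_tool is a hypothetical helper name that just wraps the standard command -v lookup.

```shell
# Hypothetical helper: succeed if the named program can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool twoBitToFa; then
    echo "twoBitToFa found at: $(command -v twoBitToFa)"
else
    echo "twoBitToFa is not on PATH yet; check the PATH setting"
fi
```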

Running twoBitToFa

============================= twoBitToFa ==================================
twoBitToFa - Convert all or part of .2bit file to fasta
usage:
   twoBitToFa input.2bit output.fa
options:
   -seq=name       Restrict this to just one sequence.
   -start=X        Start at given position in sequence (zero-based).
   -end=X          End at given position in sequence (non-inclusive).
   -seqList=file   File containing list of the desired sequence names
                   in the format seqSpec[:start-end], e.g. chr1 or chr1:0-189
                   where coordinates are half-open zero-based, i.e. [start,end).
   -noMask         Convert sequence to all upper case.
   -bpt=index.bpt  Use bpt index instead of built-in one.
   -bed=input.bed  Grab sequences specified by input.bed. Will exclude introns.
   -bedPos         With -bed, use chrom:start-end as the fasta ID in output.fa.
   -udcDir=/dir/to/cache  Place to put cache for remote bigBed/bigWigs.

Sequence and range may also be specified as part of the input
file name using the syntax:
/path/input.2bit:name
or
/path/input.2bit:name
or
/path/input.2bit:name:start-end
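
The help text states that coordinates are zero-based and half-open, i.e. [start,end), which is easy to get wrong when a region is copied from somewhere that uses 1-based, inclusive positions. The sketch below shows the conversion. Both function names (region_spec and bed_to_seqlist) are hypothetical helpers written for this post, not part of the UCSC tools, and the BED helper assumes whitespace-separated chrom, start and end in the first three columns.

```shell
# Hypothetical helper: turn a 1-based, inclusive region (chrom start end)
# into the zero-based, half-open name:start-end form described above.
region_spec() {
    # Only the start shifts by one; an inclusive 1-based end is numerically
    # the same as an exclusive zero-based end.
    printf '%s:%d-%d\n' "$1" "$(($2 - 1))" "$3"
}

# Hypothetical helper: convert BED lines (already zero-based, half-open)
# read from standard input into seqSpec lines for a -seqList file.
bed_to_seqlist() {
    awk '{ print $1 ":" $2 "-" $3 }'
}

region_spec chr1 1 189                      # prints chr1:0-189
printf 'chr1\t0\t189\n' | bed_to_seqlist    # prints chr1:0-189
```

The resulting spec can be appended to the input file name, for example twoBitToFa mm10.2bit:$(region_spec chr1 1 189) out.fa (out.fa is just a placeholder output name), which should write the first 189 bases of chr1.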