使用Last比对基因组DNA序列

LAST can:

Handle big sequence data, e.g:
Compare two vertebrate genomes
Align billions of DNA reads to a genome
Indicate the reliability of each aligned column.
Use sequence quality data properly.
Compare DNA to proteins, with frameshifts.
Compare PSSMs to sequences
Calculate the likelihood of chance similarities between random sequences.
Do split and spliced alignment.
Train alignment parameters for unusual kinds of sequence (e.g. nanopore).

安装

wget http://last.cbrc.jp/last-1080.zip
unzip last-1080.zip
cd last-1080
make && make install
cd .. && rm last-1080.zip

使用流程：

lastdb -P0 -uMAM4 -R01 darer-MAM4 ../01.proteomics/darer.genome.fasta
#-P 设置线程数，0调用服务器所有线程
#-u 该参数的选择是last比对非常关键的一步，常用参数：
##MAM8：This DNA seeding scheme finds weak similarities with high sensitivity, but low speed and high memory usage (e.g. ~50 GB for mammal genomes).
##MAM4：This DNA seeding scheme is like MAM8, but a bit less sensitive, and uses about half as much memory.
##NEAR：This DNA seeding scheme is good for finding short-and-strong (near-identical) similarities. It is also good for similarities with many gaps (insertions and deletions), because it can find the short matches between the gaps. (Long-and-weak seeding schemes allow for mismatches but not gaps.) 
##YASS：This DNA seeding scheme is good for finding long-and-weak similarities. It is a good compromise for both protein-coding and non protein-coding DNA
#-R01 tells it to mark simple sequences (such as cacacacacacacacaca) by lowercase, but not suppress them.
for i in `cat ../00.initial_data/reflect.txt | cut -f 1`
do
    echo "last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 darer-MAM4 ../01.proteomics/$i.genome.fasta > $i.mat"
done > last_all_pairs_mat.list
ParaFly -c last_all_pairs_mat.list -CPU 25

for i in `cat ../00.initial_data/reflect.txt | cut -f 1`
do
    echo "lastal -P0 -m100 -E0.05 -C2 -p $i.mat darer-MAM4 ../01.proteomics/$i.genome.fasta | last-split -m1 > $i.maf"
done > last_all_pairs_maf.list
ParaFly -c last_all_pairs_maf.list -CPU 25

maf-swap datra.maf | awk '/^s/ {$2 = (++s % 2 ? "datra." : "darer.") $2} 1' | last-split -m1 | maf-swap > datra-2.maf
last-postmask datra-2.maf | maf-convert -n tab | awk -F'=' '$2 <= 1e-5' > datra.tab
#lastdb -P0 -uNEAR -cR11 darer.fa.db ../01.proteomics/darer.genome.fasta
#for i in `cat ../00.initial_data/reflect.txt | cut -f 1`
#do
#    echo "lastal -P20 -m100 -E0.05 darer.fa.db ../01.proteomics/$i.genome.fasta | last-split -m1 > $i.maf"
#done > last_all_pairs_maf.list
#ParaFly -c last_all_pairs_maf.list -CPU 8
#for i in `cat ../00.initial_data/reflect.txt | cut -f 1`
#do
#    echo "maf-swap $i.maf | last-split | maf-swap | last-split | maf-sort > $i.LAST.maf"
#done > last_maf_swap.list
#ParaFly -c last_maf_swap.list -CPU 25
#for i in `cat ../00.initial_data/reflect.txt | cut -f 1`
#do
#    echo "maf-convert psl $i.LAST.maf > $i.psl"
#done > last_maf_convert.list
#ParaFly -c last_maf_convert.list -CPU 25
#for i in `cat ../00.initial_data/reflect.txt | cut -f 1`
#do
#    echo "perl maf.rename.species.S.pl $i.LAST.maf darer $i $i.Final.maf > $i.stat"
#done > last_maf_rename.list
#ParaFly -c last_maf_rename.list -CPU 25

#for i in `cat ../00.initial_data/reflect.txt | cut -f 1`
#do
#    echo "cp $i.Final.maf darer.$i.sing.maf"
#done > cp.lsit
#ParaFly -c lcp.list -CPU 25

roast T=/home/wuchangsong/gc_genome/19.paml/i.LAST/ E=darer "tree topology" ./*Final.maf all.roast.maf
#Then the output file example.roast.maf will contain the orthologous multiple alignment.

参考链接：https://github.com/mcfrith/last-genome-alignments

使用Last比对基因组DNA序列

Comments

Leave a Reply Cancel reply