Author: Wuchangsong

Obtaining conserved DNA sequences from multi-species whole-genome alignment

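From the literature and various tutorials, the main steps for obtaining DNA sequences conserved across multiple species (coding and non-coding regions) are:
1. Repeat masking: done with RepeatMasker and RepeatModeler
2. Pairwise alignment: the main tools are last, lastz, and blastz
3. Chaining: axtChain
4. Netting: chainNet
5. Mafing:
6. Combine multiple pairwise results:
7. PhastCons: PHAST
Detailed steps:
Step 1: Download the repeat-masked genome FASTA file from a database; for a genome you assembled yourself, it can be obtained with Geta.
Step 2: An earlier post covered how to use last, but in the literature lastz appears to be used more often. On last versus lastz (last aligner is considered faster and memory efficient. It creates maf file, which can be converted to psl files. Then the same following processes can be used on psl files. Different from lastz, last aligner starts with…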

August 26, 2020
PSMC analysis workflow

bowtie2-build ../genome.fasta genome
bowtie2 -x genome -p 80 -1 reads.1.fastq -2 reads.2.fastq -S bowtie2.sam
samtools sort -o bowtie2_sort.bam -O BAM -@ 40 -m 4G bowtie2.sam
/opt/biosoft/samtools-0.1.18/samtools mpileup -C50 -uf ../genome.fasta bowtie2_sort.bam > gc_psmc.bcf
/opt/biosoft/samtools-0.1.18/bcftools/bcftools view -c gc_psmc.bcf > Pb_2G.vcf
vcfutils.pl vcf2fq -d 10 -D 100 Pb_2G.vcf | gzip > diploid.fq.gz
/opt/biosoft/psmc-master/utils/fq2psmcfa -q20 diploid.fq.gz > diploid.psmcfa…

August 22, 2020
Computing D-statistics with AdmixTools

Install the software and the missing libraries
git clone https://github.com/DReichLab/AdmixTools.git
cd AdmixTools/src
make clobber
make all
# If the build fails with "/usr/bin/ld: cannot find -lopenblas", the libopenblas library is missing
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make
make PREFIX=/path/to/your/installation install
cd /usr/lib/
ln -s /opt/biosoft/OpenBLAS/lib/libopenblas_nehalemp-r0.3.10.dev.a ./libopenblas.a
ln -s /opt/biosoft/OpenBLAS/lib/libopenblas_nehalemp-r0.3.10.dev.so ./libopenblas.so
cd /opt/biosoft/AdmixTools/src/
make clean
make all && make install

August 11, 2020
Plotting exon and intron length distributions from a GFF file

Download the required scripts and install the Python modules
wget https://github.com/irusri/Extract-intron-from-gff3/archive/master.zip
unzip master.zip
rm master.zip && cd Extract-intron-from-gff3-master/scripts/
sudo chmod 755 *
pip install misopy
pip install gffutils
Generate the exon and intron GFF files and extract the DNA sequences
python /opt/biosoft/Extract-intron-from-gff3-master/scripts/extract_intron_gff3_from_gff3.py out.gff3 out_intron.gff3
## Output file: out_intron.gff3_introns.gff3
awk '/intron\t/{print}' out_intron.gff3_introns.gff3 | sort -k 1,1 -k4,4n > processed_intron.gff3
awk '/exon\t/{print}' out_intron.gff3_introns.gff3 | sort -k 1,1 -k4,4n > processed_exon.gff3
perl /opt/biosoft/Extract-intron-from-gff3-master/scripts/extract_seq_from_gff3.pl -d out.tmp/genome.fasta -- processed_intron.gff3 > output_intron.fa
perl…
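Once the exon and intron records sit in separate GFF3 files, the length distribution itself is a small computation. The following is a minimal sketch, not the script used in the post: the function names, the 100 bp bin size, and the example records are all assumptions made for illustration.

```python
from collections import Counter


def feature_lengths(gff_lines, feature_type):
    """Return the length of every record whose type column equals feature_type."""
    lengths = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        start, end = int(cols[3]), int(cols[4])
        lengths.append(end - start + 1)  # GFF3 coordinates are 1-based, inclusive
    return lengths


def length_histogram(lengths, bin_size=100):
    """Count lengths per bin; each key is the lower edge of its bin."""
    counts = Counter((n // bin_size) * bin_size for n in lengths)
    return dict(sorted(counts.items()))


# Hypothetical records, for illustration only
example = [
    "chr1\tsrc\texon\t1\t150\t.\t+\t.\tID=e1",
    "chr1\tsrc\tintron\t151\t400\t.\t+\t.\tID=i1",
    "chr1\tsrc\texon\t401\t480\t.\t+\t.\tID=e2",
]
exon_hist = length_histogram(feature_lengths(example, "exon"))
intron_hist = length_histogram(feature_lengths(example, "intron"))
```

The binned counts can then be handed to any plotting tool to draw the distribution.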

August 8, 2020
Aligning genomic DNA sequences with LAST

LAST can:
Handle big sequence data, e.g.:
    Compare two vertebrate genomes
    Align billions of DNA reads to a genome
Indicate the reliability of each aligned column.
Use sequence quality data properly.
Compare DNA to proteins, with frameshifts.
Compare PSSMs to sequences
Calculate the likelihood of chance similarities between random sequences.
Do split and spliced alignment.…

August 5, 2020
PHAST installation and usage

PHAST: Phylogenetic Analysis with Space/Time models (PHAST) is a freely available software package consisting of a collection of command-line programs and supporting libraries for comparative and evolutionary genomics. Best known as the search engine behind the Conservation tracks in the University of California, Santa Cruz (UCSC) Genome Browser, PHAST also includes several tools for phylogenetic modeling,…

July 10, 2020
Synteny block analysis with MCScanX and JCVI

First, download the whole-genome file and the gff3 file for the target species from a genome database. Then preprocess the gff3 file to produce gff and bed files in the format each tool expects.
MCScanX: perl -e 'while (<>) { if (m/^(\S+)\t.*\tgene\t(\d+)\t(\d+).*ID=(darer\d+)/) { print "$1\t$4\t$2\t$3\n" } }' darer.genome.gff3 > darer.gff
jcvi: perl -e 'while (<>) { if (m/^(\S+)\t.*\tgene\t(\d+)\t(\d+)\t(\S+)\t(\S+).*ID=(darer\d+)/) { print "$1\t$2\t$3\t$6\t$4\t$5\n" } }' darer.genome.gff3 > zebr.bed
Next, filter the NCBI GFF3 file further: for genes with alternative splicing, keep only the transcript with the longest CDS and obtain its protein sequence. For this step, see other tutorials online.
cat darer.gff gc.gff > all.gff
cat darer.pep.fasta gc.pep.fasta > all.pep.fasta
makeblastdb -in all.pep.fasta -dbtype prot -title all -parse_seqids -out all -logfile all.log…
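The two perl one-liners pull the same gene records out of the gff3 file and differ only in column order. As a rough illustration of what each one emits, here is a Python sketch that reuses the same regular expressions; the function names and the sample line are hypothetical, and the `darer` ID prefix is taken from the post.

```python
import re

# Same patterns as the perl one-liners in the post
MCSCANX_RE = re.compile(r"^(\S+)\t.*\tgene\t(\d+)\t(\d+).*ID=(darer\d+)")
BED_RE = re.compile(r"^(\S+)\t.*\tgene\t(\d+)\t(\d+)\t(\S+)\t(\S+).*ID=(darer\d+)")


def to_mcscanx(line):
    """Return 'chrom<TAB>gene<TAB>start<TAB>end' for a gene record, else None."""
    m = MCSCANX_RE.match(line)
    if m is None:
        return None
    chrom, start, end, gene = m.groups()
    return f"{chrom}\t{gene}\t{start}\t{end}"


def to_bed(line):
    """Return 'chrom, start, end, gene, score, strand' (tab-separated), else None."""
    m = BED_RE.match(line)
    if m is None:
        return None
    chrom, start, end, score, strand, gene = m.groups()
    return f"{chrom}\t{start}\t{end}\t{gene}\t{score}\t{strand}"


# Hypothetical gff3 records, for illustration only
gene_line = "chr1\tensembl\tgene\t100\t500\t.\t+\t.\tID=darer00001;Name=x"
mrna_line = "chr1\tensembl\tmRNA\t100\t500\t.\t+\t.\tID=darer00001.t1;Parent=darer00001"
```

Only lines whose type column is `gene` match, so mRNA, exon, and CDS records are skipped in both outputs.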

July 6, 2020
LACHESIS installation and usage

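In genome assembly, once second- or third-generation sequencing data have been assembled into contigs, the next step is to scaffold the contigs to chromosome level. With Hi-C data, the commonly used tools are currently:
HiRise: GitHub repository not updated since 2015
LACHESIS: published in NBT, not updated since 2017
SALSA: published in BMC Genomics, still being updated
3D-DNA: published in Science, still being updated
ALLHiC: published in Nature Plants, aimed at assembling polyploid plant genomes
For diploid species, 3D-DNA is probably one of the better-performing tools at present.
Installation
Lachesis has two dependencies: samtools (a version below 0.1.19) and the C++ boost library (newer than 1.52.0 but not too new; 1.67.0, for example, does not work)
curl -o LACHESIS.zip https://codeload.github.com/shendurelab/LACHESIS/legacy.zip/master
unzip LACHESIS.zip
mv shendurelab-LACHESIS-2e27abb LACHESIS
cd LACHESIS
export LACHESIS_BOOST_DIR=/opt/biosoft/boost_1_53_0 # boost_1_64_0 caused problems
export LACHESIS_SAMTOOLS_DIR=/opt/biosoft/samtools-0.1.18
./configure --with-samtools=/opt/biosoft/samtools-0.1.18 --with-boost=/opt/biosoft/boost_1_53_0/
make -j 40
mv src/bin .
mv src/Lachesis bin/
If your Perl version is not 5.14.2, open the perl scripts under bin and delete the following block:
/////////////////////////////////////////////////////////////////////////////// // // // This software…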

April 27, 2020
Iso-seq3 installation and usage (full-length transcriptome analysis)

Install Iso-seq3 with conda
conda create -n anaCogent5.2 python=2.7 anaconda
source activate anaCogent5.2
conda install -n anaCogent5.2 biopython
conda install -n anaCogent5.2 -c http://conda.anaconda.org/cgat bx-python
conda install -n anaCogent5.2 -c bioconda isoseq3
conda install -n anaCogent5.2 -c bioconda pbccs
conda install -n anaCogent5.2 -c bioconda lima
# The packages below are optional:
conda install -n anaCogent5.2 -c bioconda pbcoretools…

April 15, 2020
HiC-Pro installation and usage

Installation
# Create the conda environment the software needs
. ~/miniconda3/bin/activate
conda create -y -n hic-pro python=2.7 pysam bx-python numpy scipy samtools bowtie2
conda activate hic-pro
# Install HiC-Pro
cd /opt/biosoft/
wget https://github.com/nservant/HiC-Pro/archive/v2.11.1.tar.gz
tar -zxvf v2.11.1.tar.gz
cd HiC-Pro-2.11.1
make
make install
rm ../v2.11.1.tar.gz
# Usage (generate the three input files the software needs)
/opt/biosoft/HiC-Pro-2.11.1/bin/utils/digest_genome.py -r mboi -o mboi_genome.bed genome.fasta
samtools faidx genome.fasta
awk '{print $1"\t"$2}' genome.fasta.fai > genome.size
bowtie2-build --threads 140 genome.fasta…
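The awk step keeps only the first two columns of the `.fai` index (sequence name and length) to build the chromosome size table. A minimal Python sketch of the same transformation follows; the function name and sample lines are hypothetical, and it assumes the tab-delimited layout that `samtools faidx` writes.

```python
def fai_to_sizes(fai_lines):
    """Keep the sequence name and length columns from .fai index lines."""
    sizes = []
    for line in fai_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue  # skip blank or malformed lines
        sizes.append(f"{cols[0]}\t{cols[1]}")
    return sizes


# Hypothetical .fai content: name, length, offset, line bases, line width
example_fai = [
    "chr1\t1000\t6\t60\t61\n",
    "chr2\t500\t1030\t60\t61\n",
]
genome_size = fai_to_sizes(example_fai)
```

Writing `"\n".join(genome_size)` to a file gives the same two-column table as the awk command.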

April 8, 2020