LACHESIS Installation and Usage

When assembling a genome, once the short-read or long-read data have been assembled into contigs, the next step is to anchor the contigs to chromosome scale. With Hi-C data, the commonly used scaffolding tools are:

HiRise: GitHub repository not updated since 2015
LACHESIS: published in NBT; unmaintained since 2017
SALSA: published in BMC Genomics; still maintained
3D-DNA: published in Science; still maintained
ALLHiC: published in Nature Plants; targets polyploid plant assembly

For diploid species, 3D-DNA currently gives relatively good assembly results.

Installation
Lachesis has two dependencies: samtools (a version below 0.1.19) and the C++ Boost library (newer than 1.52.0 but not too new; 1.67.0, for example, does not work).

curl -o LACHESIS.zip https://codeload.github.com/shendurelab/LACHESIS/legacy.zip/master
unzip LACHESIS.zip
mv shendurelab-LACHESIS-2e27abb LACHESIS
cd LACHESIS
export LACHESIS_BOOST_DIR=/opt/biosoft/boost_1_53_0 # boost_1_64_0 caused problems here
export LACHESIS_SAMTOOLS_DIR=/opt/biosoft/samtools-0.1.18
./configure --with-samtools=/opt/biosoft/samtools-0.1.18 --with-boost=/opt/biosoft/boost_1_53_0/
make -j 40
mv src/bin .
mv src/Lachesis bin/

If your Perl version is not 5.14.2, open the Perl scripts under bin and delete the following block:
///////////////////////////////////////////////////////////////////////////////
// //
// This software and its documentation are copyright (c) 2014-2015 by Joshua //
// N. Burton and the University of Washington. All rights are reserved. //
// //
// THIS SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS //
// OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF //
// MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. //
// IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY //
// CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT //
// OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR //
// THE USE OR OTHER DEALINGS IN THE SOFTWARE. //
// //
///////////////////////////////////////////////////////////////////////////////

Replace the first line with #!/usr/bin/perl -w
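The two edits above (strip the "//" banner, reset the shebang) can be scripted; a minimal sketch, where the fix_perl_script helper is my own and not part of LACHESIS:

```python
def fix_perl_script(text):
    """Drop the leading C-style '//' comment banner and force a
    portable shebang, as described above."""
    lines = text.splitlines()
    # keep only lines that are not part of the '//' banner
    body = [ln for ln in lines if not ln.lstrip().startswith("//")]
    # replace (or insert) the shebang on the first line
    if body and body[0].startswith("#!"):
        body[0] = "#!/usr/bin/perl -w"
    else:
        body.insert(0, "#!/usr/bin/perl -w")
    return "\n".join(body) + "\n"
```

Run it over each script in bin/ and write the result back in place.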

Add Lachesis to your PATH (note: use >> so ~/.bashrc is appended to, not overwritten):

echo 'export PATH=$PATH:/opt/biosoft/LACHESIS/bin/' >> ~/.bashrc

Usage

Lachesis requires at least two kinds of input files:
•Hi-C reads, paired-end FASTQ
•the draft assembly, FASTA

Align the Hi-C reads with BWA to obtain a BAM file:

samtools faidx draft.asm.fasta
bwa index -a bwtsw draft.asm.fasta
bwa aln -t 155 draft.asm.fasta reads_R1.fastq.gz > reads_R1.sai
bwa aln -t 155 draft.asm.fasta reads_R2.fastq.gz > reads_R2.sai
bwa sampe draft.asm.fasta reads_R1.sai reads_R2.sai reads_R1.fastq.gz reads_R2.fastq.gz > sample.bwa_aln.sam
PreprocessSAMs.pl sample.bwa_aln.sam draft.asm.fasta MBOI
filterBAM_forHiC.pl sample.bwa_aln.REduced.paired_only.bam sample.clean.sam
samtools view -bt draft.asm.fasta.fai sample.clean.sam > sample.clean.bam
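With many samples, the alignment commands above can be generated programmatically; a sketch that only builds the command strings (the function name, default thread count, and file-naming convention are my own placeholders):

```python
def hic_align_cmds(asm, r1, r2, sample, threads=8, enzyme="MBOI"):
    """Build the shell commands for the BWA alignment and LACHESIS
    preprocessing steps shown above, for one sample."""
    sai1, sai2 = f"{sample}_R1.sai", f"{sample}_R2.sai"
    sam = f"{sample}.bwa_aln.sam"
    return [
        f"samtools faidx {asm}",
        f"bwa index -a bwtsw {asm}",
        f"bwa aln -t {threads} {asm} {r1} > {sai1}",
        f"bwa aln -t {threads} {asm} {r2} > {sai2}",
        f"bwa sampe {asm} {sai1} {sai2} {r1} {r2} > {sam}",
        f"PreprocessSAMs.pl {sam} {asm} {enzyme}",
    ]
```

Each returned string can then be written to a batch script or passed to subprocess.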

Set up the configuration file:

cp /opt/biosoft/LACHESIS/bin/INIs/test_case.ini sample.ini

SPECIES = test # use your actual species name
DRAFT_ASSEMBLY_FASTA = draft.asm.fasta # path to the draft assembly
SAM_DIR = . # directory searched for SAM/BAM files (here: current directory)
SAM_FILES = sample.clean.bam # BAM file name
RE_SITE_SEQ = AAGCTT # restriction enzyme recognition sequence
USE_REFERENCE = 0 # do not use a reference genome
BLAST_FILE_HEAD = . # prefix of the BLAST output
CLUSTER_N = 16 # final number of clusters (expected chromosome number)

# minimum number of restriction sites per contig
CLUSTER_MIN_RE_SITES=25
# maximum link density per contig: a region linked to many contigs
# may be heterochromatin or repetitive sequence
CLUSTER_MAX_LINK_DENSITY=2
# contigs dropped by CLUSTER_MIN_RE_SITES get another chance to join an
# existing group after the initial clustering, if their signal to that
# group is 3x their signal to any other group
CLUSTER_NONINFORMATIVE_RATIO=3
# minimum number of restriction sites for a contig to be ordered in a trunk
ORDER_MIN_N_RES_IN_TRUNK=15
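To sanity-check the CLUSTER_MIN_RE_SITES threshold before a run, you can count restriction sites per contig yourself; a minimal sketch using the HindIII site from RE_SITE_SEQ above (both helper functions are mine, not part of Lachesis):

```python
def count_re_sites(seq, site="AAGCTT"):
    """Count occurrences of the restriction site on the forward strand
    (the quantity CLUSTER_MIN_RE_SITES filters on)."""
    seq = seq.upper()
    n, i = 0, seq.find(site)
    while i != -1:
        n += 1
        i = seq.find(site, i + 1)
    return n

def passes_filter(seq, min_sites=25):
    """True if the contig would survive CLUSTER_MIN_RE_SITES=25."""
    return count_re_sites(seq) >= min_sites
```

Contigs below the threshold are only placed later via the CLUSTER_NONINFORMATIVE_RATIO rule.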

Run:

ulimit -s 10240
Lachesis sample.ini
CreateScaffoldedFasta.pl draft.asm.fasta out/test_case

Iso-Seq3 Installation and Usage (full-length transcriptome analysis)

Install Iso-Seq3 with conda:

conda create -n anaCogent5.2 python=2.7 anaconda
source activate anaCogent5.2
conda install -n anaCogent5.2 biopython
conda install -n anaCogent5.2 -c http://conda.anaconda.org/cgat bx-python
conda install -n anaCogent5.2 -c bioconda isoseq3
conda install -n anaCogent5.2 -c bioconda pbccs
conda install -n anaCogent5.2 -c bioconda lima
#The packages below are optional:
conda install -n anaCogent5.2 -c bioconda pbcoretools # for manipulating PacBio datasets
conda install -n anaCogent5.2 -c bioconda bamtools # for converting BAM to fasta
conda install -n anaCogent5.2 -c bioconda pysam # for making CSV reports

Running IsoSeq
Typical workflow:
1. Generate consensus sequences from your raw subread data
$ ccs movie.subreads.bam movie.ccs.bam --noPolish --minPasses 1

2. Generate full-length reads by primer removal and demultiplexing
$ cat primers.fasta
>primer_5p
AAGCAGTGGTATCAACGCAGAGTACATGGGG
>primer_3p
AAGCAGTGGTATCAACGCAGAGTAC
$ lima movie.ccs.bam primers.fasta movie.fl.bam --isoseq --no-pbi

3. Remove noise from FL reads
$ isoseq3 refine movie.fl.primer_5p--primer_3p.bam primers.fasta movie.flnc.bam

4. Cluster consensus sequences to generate unpolished transcripts
$ isoseq3 cluster movie.flnc.bam unpolished.bam --verbose

5. Optionally, polish transcripts using subreads
$ isoseq3 polish unpolished.bam movie.subreads.bam polished.bam

6. Map unpolished or polished transcripts to genome and collapse transcripts based on genomic mapping
$ pbmm2 align unpolished.bam reference.fasta aligned.sorted.bam --preset ISOSEQ --sort
$ isoseq3 collapse aligned.sorted.bam out.gff
or $ isoseq3 collapse aligned.sorted.bam movie.ccs.bam out.gff
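Conceptually, step 2 (lima) classifies each CCS read by the primer pair found at its ends; a toy sketch with exact string matching (real lima does scored alignment; the function name and the "P5--P3" label are my own):

```python
def classify_read(read,
                  p5="AAGCAGTGGTATCAACGCAGAGTACATGGGG",
                  p3="AAGCAGTGGTATCAACGCAGAGTAC"):
    """Toy primer classification: a proper full-length read starts with
    the 5' primer and ends with the reverse complement of the 3' primer."""
    comp = str.maketrans("ACGT", "TGCA")
    p3_rc = p3.translate(comp)[::-1]  # reverse complement of the 3' primer
    if read.startswith(p5) and read.endswith(p3_rc):
        return "P5--P3"
    return "unclassified"
```

Reads in other orientations (or missing a primer) are what isoseq3 refine later discards as non-full-length.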

Reference: https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md#step-2—primer-removal-and-demultiplexing

https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki/Tutorial:-Installing-and-Running-Iso-Seq-3-using-Conda

HiC-Pro Installation and Usage

Installation

# set up the conda environment the software needs
. ~/miniconda3/bin/activate
conda create -y -n hic-pro python=2.7 pysam bx-python numpy scipy samtools bowtie2
conda activate hic-pro
# install HiC-Pro
cd /opt/biosoft/
wget https://github.com/nservant/HiC-Pro/archive/v2.11.1.tar.gz
tar -zxvf v2.11.1.tar.gz
cd HiC-Pro-2.11.1
make
make install
rm ../v2.11.1.tar.gz
# Usage: generate the three input files HiC-Pro needs
/opt/biosoft/HiC-Pro-2.11.1/bin/utils/digest_genome.py -r mboi -o mboi_genome.bed genome.fasta
samtools faidx genome.fasta
awk '{print $1"\t"$2}' genome.fasta.fai > genome.size
bowtie2-build --threads 140 genome.fasta grass_genome

# edit the config file following other tutorials; give absolute paths for the input files!

# run HiC-Pro
nohup HiC-Pro --input /home/wuchangsong/gc_genome/10.HIC/raw_data/ --output hic_results --conf /home/wuchangsong/gc_genome/10.HIC/config-hicpro.txt &
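What digest_genome.py produces is essentially a BED file of restriction fragments; a minimal in-silico digestion sketch for MboI (which cuts ^GATC, so the cut offset is 0; the function and its simplifications are mine, not HiC-Pro code):

```python
def digest(seq, site="GATC", offset=0):
    """Cut before every occurrence of the recognition site and return
    BED-style (start, end) fragment intervals; MboI cuts ^GATC, offset=0."""
    s = seq.upper()
    cuts, i = [], s.find(site)
    while i != -1:
        cuts.append(i + offset)
        i = s.find(site, i + 1)
    bounds = [0] + cuts + [len(s)]
    # drop zero-length fragments (e.g. a site at position 0)
    return [(bounds[k], bounds[k + 1])
            for k in range(len(bounds) - 1)
            if bounds[k] < bounds[k + 1]]
```

Prefix each interval with the sequence name to get the three-column BED that HiC-Pro expects.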

clusterProfiler Enrichment Analysis

BiocManager::install('clusterProfiler')
BiocManager::install('org.Hs.eg.db')
BiocManager::install('DOSE')
library(clusterProfiler)
library(org.Hs.eg.db)
library(DOSE)

entrezID <- read.table("11.xls", header = F, sep = "\t")  # one gene ID per line
entrezID <- entrezID$V1
# note: keyType must match the ID type in the input file (ENSEMBL here, despite the variable name)
BP <- enrichGO(entrezID, "org.Hs.eg.db", ont = "BP", keyType = "ENSEMBL", pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.1, readable = T)
dotplot(BP, x = "GeneRatio", color = "p.adjust", showCategory = 20, size = NULL, split = NULL, font.size = 12, title = "Dotplot for Gene Ontology Analysis")
write.table(BP, 'go_tmp.txt', sep = '\t', row.names = FALSE, quote = FALSE)
id_with_fc <- read.table("22.xls", header = T, sep = "\t")  # column 1: gene ID, column 2: fold change
id_with_fc2 <- id_with_fc[,2]
names(id_with_fc2) <- id_with_fc[,1]  # named vector: names = IDs, values = fold changes
geneList <- id_with_fc2
cnetplot(BP, showCategory = 10, foldChange = geneList)
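The p-value enrichGO reports per GO term is a hypergeometric (one-sided Fisher) test; for intuition, a stdlib-only Python sketch (the function name is mine):

```python
from math import comb

def hypergeom_pval(k, n, K, N):
    """P-value of GO over-representation: probability of drawing >= k
    term-annotated genes when n study genes are sampled without
    replacement from a universe of N genes, K of which carry the term."""
    return sum(comb(K, x) * comb(N - K, n - x)
               for x in range(k, min(n, K) + 1)) / comb(N, n)
```

These raw p-values are then multiple-testing adjusted (the pAdjustMethod = "BH" step above).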

Anvi’o Installation

Dependencies

  • DIAMOND or NCBI’s blastp for search.
  • MCL for clustering.
  • muscle for alignment.

easy install through conda:

wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod 777 Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# answer no when asked whether to add conda to your environment variables
cd miniconda3/bin/
chmod 777 activate
. ./activate
# add channels
conda config --env --add channels conda-forge
conda config --env --add channels bioconda

conda create -n anvio-6 python=3.6
conda activate anvio-6
conda install -y anvio=6
conda install -y diamond=0.9.14

For detailed usage, see the official anvi’o tutorial: http://merenlab.org/2016/11/08/pangenomics-v2/

Extracting single-copy orthologous genes

anvi-get-sequences-for-gene-clusters -p AH_all_Pan-PAN.db -g ../PROCHLORO-GENOMES.db -o SCG.fasta --max-num-genes-from-each-genome 1 --min-num-genomes-gene-cluster-occurs 85
grep '>' SCG.fasta | wc -l
# dividing this count by 85 gives the number of single-copy orthologous genes
anvi-get-sequences-for-gene-clusters -p AH_all_Pan-PAN.db -g ../PROCHLORO-GENOMES.db -o Singletons.fasta -C default -b Singletons
# extract a bin of interest

MLST Bacterial Typing

Multilocus sequence typing (MLST) is a bacterial typing method based on nucleotide sequencing: internal fragments of several housekeeping genes are PCR-amplified and sequenced, and strain variation is analyzed from the resulting allele profiles.
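At its core, assigning a sequence type (ST) is a lookup of the 7-locus allele profile in a scheme's profile table; a toy sketch of that step (the helper names are mine; real PubMLST tables may carry extra columns such as clonal_complex, which this ignores):

```python
def load_profiles(text):
    """Parse a PubMLST-style profile table: a tab-separated header
    'ST locus1 ... locus7', then one ST per line."""
    table = {}
    for line in text.strip().splitlines()[1:]:
        fields = line.split("\t")
        # map the 7-allele tuple to its sequence type
        table[tuple(map(int, fields[1:8]))] = int(fields[0])
    return table

def assign_st(alleles, table):
    """Return the ST for a 7-allele profile, or None if the profile is novel."""
    return table.get(tuple(alleles))
```

The mlst tool additionally calls the alleles themselves by BLASTing the assembly against the scheme's allele database.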

Install via conda:

conda install -c conda-forge -c bioconda -c defaults mlst
mlst contigs.fa
contigs.fa  neisseria  11149  abcZ(672) adk(3) aroE(4) fumC(3) gdh(8) pdhC(4) pgm(6)

mlst genomes/*
genomes/6008.fna        saureus         239  arcc(2)   aroe(3)   glpf(1)   gmk_(1)   pta_(4)   tpi_(4)   yqil(3)
genomes/strep.fasta.gz  ssuis             1  aroA(1)   cpn60(1)  dpr(1)    gki(1)    mutS(1)   recA(1)   thrA(1)
genomes/NC_002973.gbk   lmonocytogenes    1  abcZ(3)   bglA(1)   cat(1)    dapE(1)   dat(3)    ldh(1)    lhkA(3)
genomes/L550.gbk.bz2    leptospira      152  glmU(26)  pntA(30)  sucA(28)  tpiA(35)  pfkB(39)  mreA(29)  caiB(29)

mlst --scheme neisseria NM*
NM003.fa   neisseria  4821  abcZ(222)  adk(3)  aroE(58)  fumC(275)  gdh(30)  pdhC(5)  pgm(255)
NM005.gbk  neisseria  177   abcZ(7)    adk(8)  aroE(10)  fumC(38)   gdh(10)  pdhC(1)  pgm(20)
NM011.fa   neisseria  11    abcZ(2)    adk(3)  aroE(4)   fumC(3)    gdh(8)   pdhC(4)  pgm(6)
NMC.gbk.gz neisseria  8     abcZ(2)    adk(3)  aroE(7)   fumC(2)    gdh(8)   pdhC(5)  pgm(2)

mlst --legacy --scheme neisseria *.fa
FILE      SCHEME     ST    abcZ  adk  aroE  fumC  gdh  pdhC  pgm
NM003.fa  neisseria  11    2     3    4     3       8     4    6
NM009.fa  neisseria  11149 672   3    4     3       8     4    6
MN043.fa  neisseria  11    2     3    4     3       8     4    6
NM051.fa  neisseria  11    2     3    4     3       8     4    6
NM099.fa  neisseria  1287  2     3    4    17       8     4    6
NM110.fa  neisseria  11    2     3    4     3       8     4    6

Link: the mlst official website

Filtering hits after BLAST against the ARDB and VFDB databases

python .py -i db.fasta -I blastresult.tab -o selected.txt -O filtered.txt
blastresult.tab is the BLAST tabular output sorted by P value.

from __future__ import division
import sys, getopt
from Bio import SeqIO  # the only Biopython module actually used below

opts, args = getopt.getopt(sys.argv[1:], "hI:i:o:O:")
input_info1= ""
input_info2= ""
out_file1 = ""
out_file2 = ""

# parse command-line options
for op, value in opts:
    if op == "-o":    # hits passing the filter, e.g. ARG_pattern_MIN_7030.fasta
        out_file1 = open(value, "w")
    elif op == "-i":  # database sequences in FASTA, e.g. All_ARGcontig_remove_S_nr.fasta
        input_info1 = open(value, "r")
    elif op == "-I":  # BLAST tabular results
        input_info2 = open(value, "r")
    elif op == "-O":  # hits failing the filter
        out_file2 = open(value, "w")

# map each database sequence id to its length and description
dict1 = {}
dict2 = {}
for record in SeqIO.parse(input_info1, "fasta"):
    dict1[record.id] = len(record.seq)
    dict2[record.id] = record.description

# keep hits with >= 80% identity that cover >= 80% of the database sequence
for line1 in input_info2:
    info1 = line1.strip("\n").split("\t")
    identity = float(info1[2])                    # percent identity
    aln_len = float(info1[3])                     # alignment length
    coverage = aln_len / float(dict1[info1[1]])   # fraction of subject covered
    out = out_file1 if identity >= 80.0 and coverage >= 0.8 else out_file2
    out.write(str(dict1[info1[1]]) + '\t' + str(aln_len) + '\t' + dict2[info1[1]] + '\t' + line1)

input_info1.close()
input_info2.close()
out_file1.close()
out_file2.close()

Annotating antibiotic resistance genes with ARDB

wget -c ftp://ftp.cbcb.umd.edu/pub/data/ARDB/ARDBflatFiles.tar.gz
tar -zxvf ARDBflatFiles.tar.gz
wget -c ftp://ftp.cbcb.umd.edu/pub/data/ARDB/ardbAnno1.0.tar.gz
tar -zxvf ardbAnno1.0.tar.gz
makeblastdb -in resisGenes.pfasta -dbtype prot -out ARDB
vim genomeList.tab # list the paths to the protein sequences to annotate
perl ardbAnno.pl

Prodigal Annotation of Prokaryotic Genomes

Download page: https://github.com/hyattpd/prodigal/releases/
Source tarball: Prodigal-2.6.3.tar.gz

tar -zxvf Prodigal-2.6.3.tar.gz
cd Prodigal-2.6.3
make install
# add to your PATH
prodigal -a UBA705.pep -d UBA705.cds -f gff -g 11  -o UBA705.gff -p single -s UBA705.stat -i UBA705.fasta > prodigal.log

-a output the predicted protein sequences

-c disallow genes running off the edge of a sequence, i.e. require complete ORFs with both start and stop

-d output the nucleotide sequences of predicted genes

-f output format: gbk, gff, or sco

-g translation table; prokaryotes use table 11

-i input file, the genome sequence to be predicted

-m mask runs of N in the genome

-o output file (default: stdout)

-p mode: single genome or metagenome sample

-q suppress error messages on screen

-t specify a training file

-s write all potential genes with their scores to a file
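Downstream scripts often need the gene coordinates from the -o GFF output; a small parser sketch (the function name and returned tuple layout are my own choices):

```python
def parse_prodigal_gff(lines):
    """Extract (seqid, start, end, strand, ID) for each CDS feature
    from Prodigal's GFF output (-f gff); comment lines are skipped."""
    genes = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        if f[2] != "CDS":
            continue
        # column 9 holds 'key=value' attributes separated by ';'
        attrs = dict(kv.split("=", 1)
                     for kv in f[8].rstrip(";").split(";") if "=" in kv)
        genes.append((f[0], int(f[3]), int(f[4]), f[6], attrs.get("ID")))
    return genes
```

Pass it an open file handle (or list of lines) for UBA705.gff from the command above.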

Python 3.7 Installation

wget https://www.python.org/ftp/python/3.7.2/Python-3.7.2.tar.xz
tar -xvf Python-3.7.2.tar.xz
cd Python-3.7.2
./configure --enable-optimizations
make altinstall