LACHESIS installation and usage

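When assembling a genome, once second- or third-generation sequencing data have been assembled into contigs, the next step is to scaffold the contigs to chromosome level. If Hi-C data are used, the commonly used scaffolding tools are: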

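HiRise: the GitHub repository has not been updated since 2015
LACHESIS: published in Nature Biotechnology (NBT); not updated since 2017
SALSA: published in BMC Genomics; still being updated
3D-DNA: published in Science; still being updated
ALLHiC: published in Nature Plants; designed for polyploid plant genome assembly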

For diploid species, 3D-DNA is probably one of the better-performing scaffolders at present.

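Software installation
Lachesis has two dependencies: samtools (a version below 0.1.19) and the C++ Boost library (newer than 1.52.0 but not too new; 1.67.0, for example, does not work).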
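
Before building, it can help to check candidate versions against these ranges. The following is a minimal sketch, not part of LACHESIS: `version_ge`, `boost_ok` and `samtools_ok` are hypothetical helper names, and it assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A >= version B.
# Assumes a sort that supports -V (version sort), e.g. GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper encoding the Boost range quoted above:
# newer than 1.52.0 and older than 1.67.0. The real upper limit is not
# known; the author also reports problems with 1.64.0.
boost_ok() {
    [ "$1" != "1.52.0" ] && version_ge "$1" 1.52.0 && ! version_ge "$1" 1.67.0
}

# Hypothetical helper encoding the samtools range: older than 0.1.19.
samtools_ok() {
    ! version_ge "$1" 0.1.19
}

for v in 1.52.0 1.53.0 1.67.0; do
    if boost_ok "$v"; then echo "boost $v: ok"; else echo "boost $v: outside range"; fi
done
if samtools_ok 0.1.18; then echo "samtools 0.1.18: ok"; fi
```

The check only compares version strings; it does not guarantee that a given combination actually compiles.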

curl -o LACHESIS.zip https://codeload.github.com/shendurelab/LACHESIS/legacy.zip/master
unzip LACHESIS.zip
mv shendurelab-LACHESIS-2e27abb LACHESIS
cd LACHESIS
export LACHESIS_BOOST_DIR=/opt/biosoft/boost_1_53_0 # boost_1_64_0 caused problems
export LACHESIS_SAMTOOLS_DIR=/opt/biosoft/samtools-0.1.18
./configure --with-samtools=/opt/biosoft/samtools-0.1.18 --with-boost=/opt/biosoft/boost_1_53_0/
make -j 40
mv src/bin .
mv src/Lachesis bin/

If your Perl version is not 5.14.2, open the Perl scripts under bin and delete the following block:
///////////////////////////////////////////////////////////////////////////////
// //
// This software and its documentation are copyright (c) 2014-2015 by Joshua //
// N. Burton and the University of Washington. All rights are reserved. //
// //
// THIS SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS //
// OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF //
// MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. //
// IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY //
// CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT //
// OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR //
// THE USE OR OTHER DEALINGS IN THE SOFTWARE. //
// //
///////////////////////////////////////////////////////////////////////////////

Then replace the first line with #!/usr/bin/perl -w

Add the bin directory to PATH

echo 'export PATH=$PATH:/opt/biosoft/LACHESIS/bin/' >> ~/.bashrc
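
Since ~/.bashrc usually holds other settings, the PATH line is best appended rather than written over the file, and only once. A minimal sketch, assuming bash; `append_path_once` is a hypothetical helper, and the demo writes to a temporary file instead of the real ~/.bashrc:

```shell
#!/usr/bin/env bash
# append_path_once DIR RCFILE: append an "export PATH" line for DIR to RCFILE
# unless exactly that line is already present.
append_path_once() {
    local dir="$1" rcfile="$2"
    local line="export PATH=\$PATH:$dir"
    touch "$rcfile"
    grep -qxF -- "$line" "$rcfile" || printf '%s\n' "$line" >> "$rcfile"
}

# Demo on a temporary file; pass "$HOME/.bashrc" to change the real one.
rc="$(mktemp)"
append_path_once /opt/biosoft/LACHESIS/bin "$rc"
append_path_once /opt/biosoft/LACHESIS/bin "$rc"   # second call changes nothing
cat "$rc"
```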

Usage

To run the software you need at least two kinds of input files:
• Hi-C data, paired-end FASTQ
• Draft assembly, FASTA

Align with bwa to obtain a BAM file

samtools faidx draft.asm.fasta
bwa index -a bwtsw draft.asm.fasta
bwa aln -t 155 draft.asm.fasta reads_R1.fastq.gz > reads_R1.sai
bwa aln -t 155 draft.asm.fasta reads_R2.fastq.gz > reads_R2.sai
bwa sampe draft.asm.fasta reads_R1.sai reads_R2.sai reads_R1.fastq.gz reads_R2.fastq.gz > sample.bwa_aln.sam
PreprocessSAMs.pl sample.bwa_aln.sam draft.asm.fasta MBOI
filterBAM_forHiC.pl sample.bwa_aln.REduced.paired_only.bam sample.clean.sam
samtools view -bt draft.asm.fasta.fai sample.clean.sam > sample.clean.bam
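
Scaffolding relies on read pairs whose two mates land on different contigs, so it is worth checking how many such pairs survive the alignment. The sketch below is a rough sanity check, not part of LACHESIS. `count_pair_links` is a hypothetical helper that reads a SAM file with awk and assumes each pair appears as exactly two records (no secondary or supplementary alignments). For a BAM file, the output of `samtools view` can be written to a temporary SAM file first.

```shell
#!/usr/bin/env bash
# count_pair_links SAM: print "<intra> <inter>", the number of read pairs whose
# mates map to the same contig and to different contigs, respectively.
count_pair_links() {
    awk -F'\t' '
        /^@/ { next }                                   # skip header lines
        int($2 / 4) % 2 || int($2 / 8) % 2 { next }     # FLAG 0x4 / 0x8: read or mate unmapped
        $7 == "=" || $7 == $3 { intra++; next }         # RNEXT equals RNAME
        { inter++ }
        END { printf "%d %d\n", intra / 2, inter / 2 }  # each pair contributes two records
    ' "$1"
}

# Toy SAM: one inter-contig pair, one intra-contig pair, one pair with an unmapped mate.
sam="$(mktemp)"
{
    printf '@SQ\tSN:ctgA\tLN:1000\n@SQ\tSN:ctgB\tLN:1000\n'
    printf 'r1\t65\tctgA\t100\t30\t4M\tctgB\t200\t0\tACGT\tIIII\n'
    printf 'r1\t129\tctgB\t200\t30\t4M\tctgA\t100\t0\tACGT\tIIII\n'
    printf 'r2\t99\tctgA\t10\t30\t4M\t=\t50\t44\tACGT\tIIII\n'
    printf 'r2\t147\tctgA\t50\t30\t4M\t=\t10\t-44\tACGT\tIIII\n'
    printf 'r3\t73\tctgA\t10\t30\t4M\t=\t10\t0\tACGT\tIIII\n'
    printf 'r3\t133\tctgA\t10\t0\t*\t=\t10\t0\tACGT\tIIII\n'
} > "$sam"
count_pair_links "$sam"
```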

Set up the configuration file

cp /opt/biosoft/LACHESIS/bin/INIs/test_case.ini sample.ini

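SPECIES = test # use the actual species name
DRAFT_ASSEMBLY_FASTA = draft.asm.fasta # actual path of the draft assembly to scaffold
SAM_DIR = . # look for the alignment files in the current directory
SAM_FILES = sample.clean.bam # BAM file name
RE_SITE_SEQ = AAGCTT # restriction enzyme recognition sequence (AAGCTT is HindIII; for MboI it would be GATC)
USE_REFERENCE = 0 # do not use a reference sequence
BLAST_FILE_HEAD = . # BLAST output
CLUSTER_N = 16 # final number of clusters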

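# minimum number of restriction sites in a contig
CLUSTER_MIN_RE_SITES=25
# maximum link density of a contig, i.e. a region that shows signal with many contigs;
# such regions may be heterochromatin or repetitive sequence
CLUSTER_MAX_LINK_DENSITY=2
# contigs filtered out by CLUSTER_MIN_RE_SITES get another chance to join an existing group after the initial clustering,
# provided their signal to that group is 3 times their signal to any other group
CLUSTER_NONINFORMATIVE_RATIO=3
# minimum number of restriction sites a contig needs to enter the ordering trunk
ORDER_MIN_N_RES_IN_TRUNK=15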
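
To see how thresholds such as CLUSTER_MIN_RE_SITES interact with a given assembly, one can count the restriction sites on each contig. The sketch below is an illustration only, not how LACHESIS counts internally. `count_re_sites` is a hypothetical helper that treats the motif as a plain string of bases and counts non-overlapping matches.

```shell
#!/usr/bin/env bash
# count_re_sites FASTA MOTIF: print "<contig><TAB><count>" for every sequence.
count_re_sites() {
    awk -v motif="$2" '
        function report() {
            # gsub returns the number of replacements; "&" keeps the sequence unchanged
            if (name != "") printf "%s\t%d\n", name, gsub(motif, "&", seq)
        }
        /^>/ { report(); name = substr($1, 2); seq = ""; next }
        { seq = seq toupper($0) }      # join wrapped lines so sites spanning a line break count
        END { report() }
    ' "$1"
}

# Toy FASTA: ctg1 has three AAGCTT sites (one spans a line break), ctg2 has none.
fa="$(mktemp)"
printf '>ctg1 demo\nAAGCTTACGTAAG\nCTTGGAAGCTT\n>ctg2\nacgtacgt\n' > "$fa"
count_re_sites "$fa" AAGCTT

# Contigs that would pass a threshold of 2 sites:
count_re_sites "$fa" AAGCTT | awk -F'\t' -v min=2 '$2 >= min { print $1 }'
```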

Run:

ulimit -s 10240
Lachesis sample.ini
CreateScaffoldedFasta.pl draft.asm.fasta out/test_case

Iso-seq3 installation and usage (full-length transcriptome analysis)

Install Iso-seq3 with conda

conda create -n anaCogent5.2 python=2.7 anaconda
source activate anaCogent5.2
conda install -n anaCogent5.2 biopython
conda install -n anaCogent5.2 -c http://conda.anaconda.org/cgat bx-python
conda install -n anaCogent5.2 -c bioconda isoseq3
conda install -n anaCogent5.2 -c bioconda pbccs
conda install -n anaCogent5.2 -c bioconda lima
#The packages below are optional:
conda install -n anaCogent5.2 -c bioconda pbcoretools # for manipulating PacBio datasets
conda install -n anaCogent5.2 -c bioconda bamtools # for converting BAM to fasta
conda install -n anaCogent5.2 -c bioconda pysam # for making CSV reports

Running IsoSeq
Typical workflow:
1. Generate consensus sequences from your raw subread data
$ ccs movie.subreads.bam movie.ccs.bam --noPolish --minPasses 1

2. Generate full-length reads by primer removal and demultiplexing
$ cat primers.fasta
>primer_5p
AAGCAGTGGTATCAACGCAGAGTACATGGGG
>primer_3p
AAGCAGTGGTATCAACGCAGAGTAC
$ lima movie.ccs.bam primers.fasta movie.fl.bam --isoseq --no-pbi

3. Remove noise from FL reads
$ isoseq3 refine movie.fl.P5--P3.bam primers.fasta movie.flnc.bam

4. Cluster consensus sequences to generate unpolished transcripts
$ isoseq3 cluster movie.flnc.bam unpolished.bam --verbose

5. Optionally, polish transcripts using subreads
$ isoseq3 polish unpolished.bam movie.subreads.bam polished.bam

6. Map unpolished or polished transcripts to genome and collapse transcripts based on genomic mapping
$ pbmm2 align unpolished.bam reference.fasta aligned.sorted.bam --preset ISOSEQ --sort
$ isoseq3 collapse aligned.sorted.bam out.gff
or $ isoseq3 collapse aligned.sorted.bam movie.ccs.bam out.gff
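
Steps 1 to 5 can be chained in a small driver script. The sketch below is one possible arrangement under stated assumptions. `run` and `isoseq_pipeline` are hypothetical helpers, and options are written with ASCII double hyphens. The name of the full-length BAM written by lima is passed in as an argument because it depends on the primer names. By default the script only prints the commands (dry run); set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
# run CMD...: print the command, and execute it only when DRY_RUN is not 1.
run() {
    printf '+ %s\n' "$*"
    [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

# isoseq_pipeline MOVIE PRIMERS FL_BAM
#   MOVIE    prefix of the input, i.e. MOVIE.subreads.bam
#   PRIMERS  primer FASTA
#   FL_BAM   full-length BAM produced by lima (name is an assumption of the caller)
isoseq_pipeline() {
    local movie="$1" primers="$2" fl_bam="$3"
    run ccs "$movie.subreads.bam" "$movie.ccs.bam" --noPolish --minPasses 1 &&
    run lima "$movie.ccs.bam" "$primers" "$movie.fl.bam" --isoseq --no-pbi &&
    run isoseq3 refine "$fl_bam" "$primers" "$movie.flnc.bam" &&
    run isoseq3 cluster "$movie.flnc.bam" unpolished.bam --verbose &&
    run isoseq3 polish unpolished.bam "$movie.subreads.bam" polished.bam
}

# Dry run: prints the five commands without executing anything.
DRY_RUN=1 isoseq_pipeline movie primers.fasta movie.fl.P5--P3.bam
```

Because the steps are joined with `&&`, a failing step stops the chain when the commands are actually executed.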

References: https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md#step-2---primer-removal-and-demultiplexing

https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki/Tutorial:-Installing-and-Running-Iso-Seq-3-using-Conda

HiC-Pro installation and usage

Software installation

# create the environment the software needs with conda
. ~/miniconda3/bin/activate
conda create -y -n hic-pro python=2.7 pysam bx-python numpy scipy samtools bowtie2
conda activate hic-pro
# install HiC-Pro
cd /opt/biosoft/
wget https://github.com/nservant/HiC-Pro/archive/v2.11.1.tar.gz
tar -zxvf v2.11.1.tar.gz
cd HiC-Pro-2.11.1
make
make install
rm ../v2.11.1.tar.gz
# Usage (generate the three input files the software needs)
/opt/biosoft/HiC-Pro-2.11.1/bin/utils/digest_genome.py -r mboi -o mboi_genome.bed genome.fasta
samtools faidx genome.fasta
awk '{print $1"\t"$2}' genome.fasta.fai > genome.size
bowtie2-build --threads 140 genome.fasta grass_genome
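
The awk step keeps only the first two columns of the .fai index, which are the sequence name and its length; the remaining columns are byte offset and line-width information. A small sketch on a made-up index shows the resulting genome.size layout; `fai_to_sizes` is a hypothetical wrapper around the same awk call, and the toy values are not real data:

```shell
#!/usr/bin/env bash
# fai_to_sizes FAI: print "<name><TAB><length>" from a samtools faidx index.
fai_to_sizes() {
    awk -F'\t' -v OFS='\t' '{ print $1, $2 }' "$1"
}

# Toy .fai with columns: name, length, offset, bases per line, bytes per line.
fai="$(mktemp)"
printf 'chr1\t1000\t6\t60\t61\nchr2\t500\t1029\t60\t61\n' > "$fai"
fai_to_sizes "$fai"
```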

# For editing the config file, refer to other tutorials; use absolute paths for the input files!
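
A quick way to follow the absolute-path advice is to scan the config file before launching. The sketch below is not a HiC-Pro feature. `check_abs_paths` is a hypothetical helper, and it assumes the relevant keys are BOWTIE2_IDX_PATH, GENOME_SIZE and GENOME_FRAGMENT written as `KEY = value`, as in the template config shipped with HiC-Pro.

```shell
#!/usr/bin/env bash
# check_abs_paths CONF: report path settings that do not start with "/".
# Returns non-zero if any such setting is found.
check_abs_paths() {
    awk -F'[ \t]*=[ \t]*' '
        BEGIN { bad = 0 }
        $1 == "BOWTIE2_IDX_PATH" || $1 == "GENOME_SIZE" || $1 == "GENOME_FRAGMENT" {
            if ($2 !~ /^\//) {
                printf "%s is not an absolute path: %s\n", $1, $2
                bad = 1
            }
        }
        END { exit bad }
    ' "$1"
}

# Toy config fragment (paths are placeholders, not real files).
conf="$(mktemp)"
printf 'BOWTIE2_IDX_PATH = /data/index\nGENOME_SIZE = genome.size\nGENOME_FRAGMENT = /data/mboi_genome.bed\n' > "$conf"
check_abs_paths "$conf" || echo "fix the settings listed above before running"
```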

# Run HiC-Pro
nohup HiC-Pro --input /home/wuchangsong/gc_genome/10.HIC/raw_data/ --output hic_results --conf /home/wuchangsong/gc_genome/10.HIC/config-hicpro.txt &
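
As far as the HiC-Pro manual describes it, the directory given to --input should contain one sub-directory per sample, with the paired FASTQ files inside it. A minimal sketch for arranging that layout with symlinks follows. `link_sample` is a hypothetical helper, and the _R1/_R2 suffixes are an assumption that should match the PAIR1_EXT and PAIR2_EXT settings in the config file.

```shell
#!/usr/bin/env bash
# abs_path FILE: print the absolute path of an existing file.
abs_path() {
    printf '%s/%s\n' "$(cd "$(dirname "$1")" && pwd)" "$(basename "$1")"
}

# link_sample RAWDIR SAMPLE R1 R2: create RAWDIR/SAMPLE/ and symlink the two
# FASTQ files into it as SAMPLE_R1.fastq.gz and SAMPLE_R2.fastq.gz.
link_sample() {
    local rawdir="$1" sample="$2" r1="$3" r2="$4"
    mkdir -p "$rawdir/$sample" &&
    ln -sf "$(abs_path "$r1")" "$rawdir/$sample/${sample}_R1.fastq.gz" &&
    ln -sf "$(abs_path "$r2")" "$rawdir/$sample/${sample}_R2.fastq.gz"
}

# Demo in a temporary directory with empty placeholder files.
work="$(mktemp -d)"
touch "$work/reads_R1.fastq.gz" "$work/reads_R2.fastq.gz"
link_sample "$work/raw_data" sampleA "$work/reads_R1.fastq.gz" "$work/reads_R2.fastq.gz"
ls "$work/raw_data/sampleA"
```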