Collinearity block analysis with MCScanX and JCVI

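First, download the whole-genome sequence and GFF3 annotation files of the target species from a genome database.
Then preprocess each GFF3 file into the gff format that MCScanX expects and the bed format that JCVI expects.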

MCScanX: perl -e 'while (<>) { if (m/^(\S+)\t.*\tgene\t(\d+)\t(\d+).*ID=(darer\d+)/) { print "$1\t$4\t$2\t$3\n" } }' darer.genome.gff3 > darer.gff
JCVI: perl -e 'while (<>) { if (m/^(\S+)\t.*\tgene\t(\d+)\t(\d+)\t(\S+)\t(\S+).*ID=(darer\d+)/) { print "$1\t$2\t$3\t$6\t$4\t$5\n" } }' darer.genome.gff3 > darer.bed
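
The two Perl one-liners can also be read as a single pass over the GFF3 file. The sketch below is a rough Python equivalent, not the original method: the function name `convert_gff3` is made up for illustration, and it assumes gene IDs of the form `darer` followed by digits, as in the patterns above. It checks the feature-type column directly, so it is slightly stricter than the Perl regular expressions.

```python
import re

# Same ID convention as the Perl patterns: "darer" followed by digits.
GENE_ID = re.compile(r"ID=(darer\d+)")

def convert_gff3(lines):
    """Return (mcscanx_rows, bed_rows) for the gene features in GFF3 lines."""
    mcscanx_rows, bed_rows = [], []
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        match = GENE_ID.search(cols[8])
        if match is None:
            continue
        chrom, start, end, score, strand = cols[0], cols[3], cols[4], cols[5], cols[6]
        gene_id = match.group(1)
        # MCScanX gff: chromosome, gene, start, end
        mcscanx_rows.append("\t".join([chrom, gene_id, start, end]))
        # BED6: coordinates are copied unchanged, exactly as the Perl command does
        bed_rows.append("\t".join([chrom, start, end, gene_id, score, strand]))
    return mcscanx_rows, bed_rows

if __name__ == "__main__":
    example = [
        "##gff-version 3\n",
        "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=darer00001;Name=abc\n",
        "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=darer00001.t1;Parent=darer00001\n",
    ]
    gff_rows, bed_rows = convert_gff3(example)
    print("\n".join(gff_rows))
    print("\n".join(bed_rows))
```

For a second species, only the ID pattern would need to change.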

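Next, filter the NCBI GFF3 file: if a gene has alternatively spliced isoforms, keep only the transcript with the longest CDS and extract its protein sequence. Other online tutorials cover this step.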
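
As a rough illustration of the selection rule (not the procedure from any particular tutorial), the sketch below picks, for each gene, the transcript whose CDS segments sum to the greatest length. It assumes standard GFF3 `ID`/`Parent` attributes linking CDS to mRNA to gene. The function names are hypothetical, and extracting the matching protein sequences is left out.

```python
from collections import defaultdict

def parse_attributes(field):
    """Turn a GFF3 attribute column ('ID=x;Parent=y') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs

def longest_transcripts(lines):
    """Map each gene ID to the transcript ID with the largest total CDS length."""
    gene_of = {}                 # transcript ID -> gene ID
    cds_len = defaultdict(int)   # transcript ID -> summed CDS length
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = parse_attributes(cols[8])
        if cols[2] in ("mRNA", "transcript") and "ID" in attrs and "Parent" in attrs:
            gene_of[attrs["ID"]] = attrs["Parent"]
        elif cols[2] == "CDS" and "Parent" in attrs:
            length = int(cols[4]) - int(cols[3]) + 1   # GFF3 coordinates are inclusive
            for parent in attrs["Parent"].split(","):
                cds_len[parent] += length
    best = {}
    for transcript, gene in gene_of.items():
        current = best.get(gene)
        # On a tie, the transcript seen first in the file is kept.
        if current is None or cds_len[transcript] > cds_len[current]:
            best[gene] = transcript
    return best

if __name__ == "__main__":
    example = [
        "chr1\tsrc\tgene\t1\t500\t.\t+\t.\tID=g1\n",
        "chr1\tsrc\tmRNA\t1\t500\t.\t+\t.\tID=t1;Parent=g1\n",
        "chr1\tsrc\tCDS\t1\t100\t.\t+\t0\tID=c1;Parent=t1\n",
        "chr1\tsrc\tCDS\t201\t300\t.\t+\t0\tID=c1;Parent=t1\n",
        "chr1\tsrc\tmRNA\t1\t500\t.\t+\t.\tID=t2;Parent=g1\n",
        "chr1\tsrc\tCDS\t1\t250\t.\t+\t0\tID=c2;Parent=t2\n",
    ]
    print(longest_transcripts(example))   # {'g1': 't2'}
```

The returned transcript IDs can then be used to select records from the protein FASTA file.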

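# Merge the gene coordinates and protein sequences of both species
cat darer.gff gc.gff > all.gff
cat darer.pep.fasta gc.pep.fasta > all.pep.fasta
# All-vs-all blastp search of the combined proteins against themselves
makeblastdb -in all.pep.fasta -dbtype prot -title all -parse_seqids -out all -logfile all.log
blast.pl blastp all all.pep.fasta 1e-10 160 blast 6
# MCScanX reads all.gff and all.blast from the same directory
mkdir MCScanx && cd MCScanx
mv ../blast.tab all.blast
mv ../all.gff ./
MCScanX all
cd ..
# JCVI reads darer.bed/gc.bed and darer.pep/gc.pep from the working directory
mkdir jcvi && cd jcvi
mv ../*.bed ./
mv ../darer.pep.fasta darer.pep
mv ../gc.pep.fasta gc.pep
python -m jcvi.compara.catalog ortholog --dbtype prot --no_strip_names darer gc
# Filter the anchor blocks with --minspan=30 and write the .simple summary
python -m jcvi.compara.synteny screen --minspan=30 --simple darer.gc.anchors darer.gc.anchors.new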
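
To get a quick feel for the result, the anchors file can be summarized block by block. The sketch below is a minimal reader, assuming the usual anchors layout in which every block starts with a line beginning with `#` and each following line holds two gene IDs and a score. The function name `read_anchor_blocks` is made up for illustration and is not part of JCVI.

```python
def read_anchor_blocks(lines):
    """Group anchor gene pairs into blocks; a line starting with '#' opens a new block."""
    blocks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            blocks.append([])
            continue
        fields = line.split()
        if not blocks:            # tolerate a file without a leading separator
            blocks.append([])
        blocks[-1].append((fields[0], fields[1]))
    return [block for block in blocks if block]

if __name__ == "__main__":
    example = [
        "###",
        "darer00001\tgc00010\t1200",
        "darer00002\tgc00011\t900",
        "###",
        "darer00100\tgc00500\t450",
    ]
    blocks = read_anchor_blocks(example)
    print(len(blocks), [len(block) for block in blocks])   # 2 [2, 1]
```

On real data, pass an open file handle such as `open("darer.gc.anchors.new")` instead of the example list.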

Installing and using Iso-Seq3 (full-length transcriptome analysis)

Install Iso-Seq3 with conda

conda create -n anaCogent5.2 python=2.7 anaconda
source activate anaCogent5.2
conda install -n anaCogent5.2 biopython
conda install -n anaCogent5.2 -c http://conda.anaconda.org/cgat bx-python
conda install -n anaCogent5.2 -c bioconda isoseq3
conda install -n anaCogent5.2 -c bioconda pbccs
conda install -n anaCogent5.2 -c bioconda lima
#The packages below are optional:
conda install -n anaCogent5.2 -c bioconda pbcoretools # for manipulating PacBio datasets
conda install -n anaCogent5.2 -c bioconda bamtools # for converting BAM to fasta
conda install -n anaCogent5.2 -c bioconda pysam # for making CSV reports

Running IsoSeq
Typical workflow:
1. Generate consensus sequences from your raw subread data
$ ccs movie.subreads.bam movie.ccs.bam --noPolish --minPasses 1

2. Generate full-length reads by primer removal and demultiplexing
$ cat primers.fasta
>primer_5p
AAGCAGTGGTATCAACGCAGAGTACATGGGG
>primer_3p
AAGCAGTGGTATCAACGCAGAGTAC
$ lima movie.ccs.bam primers.fasta movie.fl.bam --isoseq --no-pbi

3. Remove noise from FL reads
$ isoseq3 refine movie.fl.primer_5p--primer_3p.bam primers.fasta movie.flnc.bam

4. Cluster consensus sequences to generate unpolished transcripts
$ isoseq3 cluster movie.flnc.bam unpolished.bam --verbose

5. Optionally, polish transcripts using subreads
$ isoseq3 polish unpolished.bam movie.subreads.bam polished.bam

6. Map unpolished or polished transcripts to genome and collapse transcripts based on genomic mapping
$ pbmm2 align unpolished.bam reference.fasta aligned.sorted.bam --preset ISOSEQ --sort
$ isoseq3 collapse aligned.sorted.bam out.gff
or $ isoseq3 collapse aligned.sorted.bam movie.ccs.bam out.gff
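
When the workflow is run repeatedly, it can help to assemble the commands in one place and print them as a dry run before executing anything. The sketch below does only that. The function `isoseq_commands` and its parameters are hypothetical, the flags are copied from the steps above (they may differ between tool versions), and the optional polish step is omitted. The demultiplexed file written by lima is passed in explicitly, because its name depends on the primer names.

```python
import shlex

def isoseq_commands(fl_bam, movie="movie", primers="primers.fasta",
                    reference="reference.fasta"):
    """Return the workflow's commands as argument lists, in execution order.

    fl_bam is the demultiplexed full-length BAM that lima wrote;
    pass the file name that actually exists on disk.
    """
    return [
        ["ccs", f"{movie}.subreads.bam", f"{movie}.ccs.bam",
         "--noPolish", "--minPasses", "1"],
        ["lima", f"{movie}.ccs.bam", primers, f"{movie}.fl.bam",
         "--isoseq", "--no-pbi"],
        ["isoseq3", "refine", fl_bam, primers, f"{movie}.flnc.bam"],
        ["isoseq3", "cluster", f"{movie}.flnc.bam", "unpolished.bam", "--verbose"],
        ["pbmm2", "align", "unpolished.bam", reference, "aligned.sorted.bam",
         "--preset", "ISOSEQ", "--sort"],
        ["isoseq3", "collapse", "aligned.sorted.bam", "out.gff"],
    ]

if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for cmd in isoseq_commands("movie.fl.primer_5p--primer_3p.bam"):
        print(shlex.join(cmd))
```

To execute instead of print, each list could be handed to `subprocess.run(cmd, check=True)`, which stops at the first failing step.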

References: https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md#step-2---primer-removal-and-demultiplexing

https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki/Tutorial:-Installing-and-Running-Iso-Seq-3-using-Conda

Installing anvi’o

Dependencies

  • DIAMOND or NCBI’s blastp for search.
  • MCL for clustering.
  • muscle for alignment.

Easy installation through conda:

wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod 777 Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# When asked whether to add conda to the PATH environment variable, answer no
cd miniconda3/bin/
chmod 777 activate
. ./activate
# Add channels
conda config --env --add channels conda-forge
conda config --env --add channels bioconda

conda create -n anvio-6 python=3.6
conda activate anvio-6
conda install -y anvio=6
conda install -y diamond=0.9.14

For detailed usage, see the official anvi’o tutorial: http://merenlab.org/2016/11/08/pangenomics-v2/

Extracting single-copy orthologous genes

anvi-get-sequences-for-gene-clusters -p AH_all_Pan-PAN.db -g ../PROCHLORO-GENOMES.db -o SCG.fasta --max-num-genes-from-each-genome 1 --min-num-genomes-gene-cluster-occurs 85
grep '>' SCG.fasta | wc -l
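# Divide the result by 85 (one sequence per genome in each gene cluster) to get the number of single-copy orthologous genes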
anvi-get-sequences-for-gene-clusters -p AH_all_Pan-PAN.db -g ../PROCHLORO-GENOMES.db -o Singletons.fasta -C default -b Singletons
# Extract the bin of interest
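
The header count and division for SCG.fasta can also be done in one step with a sanity check. The sketch below is a minimal version, assuming every gene cluster contributes exactly one sequence per genome. That should hold when `--max-num-genes-from-each-genome 1` is combined with a minimum occurrence equal to the total number of genomes. The function name is made up for illustration.

```python
def count_single_copy_clusters(fasta_lines, num_genomes):
    """Count FASTA headers and divide by the number of genomes."""
    headers = sum(1 for line in fasta_lines if line.startswith(">"))
    if headers % num_genomes != 0:
        # A remainder means the one-sequence-per-genome assumption is violated.
        raise ValueError(
            f"{headers} sequences is not a multiple of {num_genomes} genomes"
        )
    return headers // num_genomes

if __name__ == "__main__":
    example = [
        ">seq1\n", "MKT\n",
        ">seq2\n", "MKS\n",
        ">seq3\n", "MAA\n",
        ">seq4\n", "MAG\n",
    ]
    print(count_single_copy_clusters(example, num_genomes=2))   # 2
```

For the run above, call it with `open("SCG.fasta")` and `num_genomes=85`.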