Category: Bioinformatics

多样本bam文件合并及vcf文件生成

进行细菌群体基因组分析时需要将多个样本比对到参考基因组生成vcf文件，常用的软件有snippy，snippy的结果只包含core snp，进行下游分析时往往需要更多的snp信息。 #首先将snippy的比对结果中所有样本的bam文件合并（将所有样本的bam及bai文件复制到一个文件夹） samtools merge merge.bam *bam samtools mpileup -gSDf ref.fasta merge.bam > merge.bcf bcftools call -vm merge.bcf > merge.vcf #过滤indel，保留snp vcftools –remove-indels –recode –recode-INFO-all –vcf merge.vcf –stdout > merge.snp.vcf #过滤掉50%样本中缺失的SNP位点 vcftools –vcf merge.snp.vcf –recode –recode-INFO-all –stdout –max-missing 0.5 > merge.filtered.vcf

2022年12月23日
Pyseer报错

scree_plot_pyseer mash.tsv #ImportError: Cannot load backend ‘TkAgg’ which requires the ‘tk’ interactive framework, as ‘headless’ is currently running 修改scree_plot.py第37行，matplotlib.use(“TkAgg”)改为matplotlib.use(“Agg”)

2022年10月5日
Swiss-Prot注释脊椎动物基因组

wget -c ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/uniprot_sprot_vertebrates.dat.gz wget -c ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/uniprot_trembl_vertebrates.dat.gz zcat uniprot_sprot_vertebrates.dat.gz uniprot_trembl_vertebrates.dat.gz > uniprot_vertebrates.dat awk ‘{if (/^ /) {gsub(/ /, “”); print} else if (/^AC/) print “>” $2}’ uniprot_vertebrates.dat > uniprot_vertebrates.fasta diamond makedb –in uniprot_vertebrates.fasta -d uniprot_vertebrates diamond blastp -d uniprot_vertebrates.dmnd -q grass_carp.pep.fasta –evalue 1e-5 > blastp.outfmt6 python -m jcvi.formats.blast best -n 1 blastp.outfmt6 python add_annotation_from_dat.py blastp.outfmt6.best /data/database/UniProt-Plant/uniprot_plants.dat…

2022年9月24日
twilight处理pangenome结果

github下载脚本（https://github.com/ghoresh11/twilight）需要的R包data.table，ggplot2，optparse Rscript classify_genes.R -p gene_presence_absence.Rtab -g groups.tab 必须参数： -p, –presence_absence：Roary 或 Panaroo输出的gene_presence_absence.Rtab文件 -g, –grouping：一个制表符分隔的分组文件可选参数： -o, –out：输出目录名称（默认=“out”）。 -m, –min_size：忽略少于min_size基因组的组（默认 = 10）。 -c, –core_threshold：用于定义每个组内的核心基因的阈值（默认值 = 0.95）。 -r, –rare_threshold：用于定义每组内稀有基因的阈值（默认值 = 0.15）。 -h, –help：打印帮助信息并退出。输出： 1、classification.tab 基因簇在每个分类中存在的形式（核心、辅助、和稀有） 2、frequencies.csv 基因簇在每个分离中的频率 3、plots 文件夹，包含柱状图和PCA图

2022年9月19日
CheckM安装及使用

CheckM可以用于评估已组装的基因组或者宏基因组序列的质量，包括基因组完整度、污染度、序列分布等信息。 ##通过conda安装 conda create -n checkm conda activate checkm conda install -c bioconda checkm-genome ##OR pip3 install numpy pip3 install matplotlib pip3 install pysam pip3 install checkm-genome conda install hmmer prodigal pplacer ##下载数据库并设置数据路径 wget -c https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz tar -zxvf checkm_data_2015_01_16.tar.gz checkm data setRoot /path/to/checkm_data checkm lineage_wf -t 150 -x fasta -f sag_2554.txt –tab_table ../fasta_file output

2022年6月23日
PopCOGenT安装及使用

从PopCOGenT github下载PopCOGenT.yml后使用conda安装软件运行需要环境 conda config –set restore_free_channel true conda env create -f PopCOGenT.yml conda activate PopCOGenT ##下载mugsy并解压 export MUGSY_INSTALL=/path/to/install/mugsy python get_alignment_and_length_bias.py –genome_dir /home/wuchangsong/Streptococcus_agalactiae/test/ –genome_ext .fasta –alignment_dir ./output –mugsy_path /opt/biosoft/PopCOGenT/mugsy_x86-64-v1r2.3/mugsy –mugsy_env /opt/biosoft/PopCOGenT/mugsy_x86-64-v1r2.3/mugsyenv.sh –base_name test_12 –num_threads 150 ##安装infomap git clone https://github.com/mapequation/infomap.git cd infomap make ##gcc要求4.9以上，不然报错（g++: error: unrecognized command line option ‘-std=c++14’） ##安装gcc-11.2.0（root用户） wget http://ftp.gnu.org/gnu/gcc/gcc-11.2.0/gcc-11.2.0.tar.gz tar -zxvf gcc-11.2.0.tar.gz cd…

2022年6月23日
SRST2:使用短序列鉴定MLST分型

安装： #通过conda安装 conda create -n srst2 srst2 conda activate srst2 #通过pip安装 pip install srst2 下载MLST数据库 getmlst.py –species “Streptococcus agalactiae” 使用 srst2 –input_pe S1_1.fastq.gz S1_2.fastq.gz –output S1 –log –mlst_db Streptococcus_agalactiae.fasta –mlst_definitions profiles_csv –mlst_delimiter _ ##双末端测序文件命名_1和_2 MLST扫描结果字段： Sample：扫描数据的样品名称，比如S1，或者SRR1002045等 ST：扫描比对数据库后获得ST型别 7个具体管家基因：扫描每个基因allele分配的型别 mismatches：7个管家基因中是否有不匹配的碱基 uncertainty：由于序列深度或者覆盖度不够造成的不确定情况 depth：基因比对的覆盖度 maxMAF：扫描的7个管家基因中最高的minor等位基因频率

2021年12月11日
ViewBS:DNA甲基化数据的可视化

官网提供了三种安装方式：conda、docker和正常一步步安装，前两种较为简单，能解决较多的依赖关系，下边介绍一步步安装（https://github.com/xie186/ViewBS） #安装htslib git clone https://github.com/samtools/htslib.git git submodule update –init –recursive autoreconf ./configure make sudo make insatll #安装ViewBS wget -c https://github.com/xie186/ViewBS/archive/refs/tags/v0.1.10.tar.gz tar -zxvf v0.1.10.tar.gz cd ViewBS-0.1.10 /opt/biosoft/ViewBS-0.1.10/ext_tools/cpanm –local-lib=~/perl5 local::lib && eval $(perl -I ~/perl5/lib/perl5/ -Mlocal::lib) /opt/biosoft/ViewBS-0.1.10/ext_tools/cpanm Getopt::Long::Subcommand /opt/biosoft/ViewBS-0.1.10/ext_tools/cpanm Getopt::Long (> 2.50) /opt/biosoft/ViewBS-0.1.10/ext_tools/cpanm –force Bio::DB::HTS::Tabix ViewBS主要以Bismark的bismark_methylation_extractor输出结果为输入，同时根据不同的目的需要准备全基因组DNA序列的fasta文件，所有基因的bed文件，转座子bed文件，差异基因的bed文件等。 #数据准备 samtools faidx genome.fasta bgzip ../A.1_bismark_bt2_pe.deduplicated.CX_report.txt ./ #Bismark结果需要bgzip压缩 tabix -p vcf…

2021年6月15日
基因表达趋势分析（TCseq使用）

TCseq可根据不同的聚类方法将基因按照表达模式分类 BiocManager::install(“TCseq”) library(TCseq) data <- read.delim(‘df_TCseq.xls’, row.names = 1, sep = ‘\t’, check.names = FALSE) data <- as.matrix(data) tca <- timeclust(data, algo = “cm”, k = 6, standardize = TRUE) #character string giving a clustering method. Options are km’ (kmeans), ‘pam’ (partitioning around medoids), ‘hc’ (hierachical clustering), ‘cm’ (cmeans). p <- timeclustplot(tca, value = “z-score(TPM)”,…

2021年4月14日
BS-seq分析流程（二）

DNA甲基化：DNA甲基化为DNA化学修饰的一种形式，能在不改变DNA序列的前提下，改变遗传表现。为表观遗传编码的一部分，是一种外遗传机制。DNA甲基化过程会使甲基添加到DNA分子上，例如在胞嘧啶环的5’碳上：这种5’方向的DNA甲基化方式可见于所有脊椎动物。在人类细胞内，大约有1%的DNA碱基受到了甲基化。 gzip -dc A.1_bismark_bt2_pe.deduplicated.CX_report.txt.gz > A.1_bismark_bt2_pe.deduplicated.CX_report.txt perl BismarkCX2methykit.pl A.1_bismark_bt2_pe.deduplicated.CX_report.txt #准备注释需要的bed文件（格式12） convert2bed -i gtf < out.gtf > out6.bed grep ‘exon’ out6.bed > aa && mv aa out6.bed python3 bed6Tobed12.py out6.bed > out12.bed#bed6Tobed12.py #methylKit安装及使用 if (!requireNamespace(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”) BiocManager::install(“methylKit”) library(methylKit) file.list = list(“A.CG_methykit.txt”,”B.CG_methykit.txt”,”C.CG_methykit.txt”,”D.CG_methykit.txt”,”E.CG_methykit.txt”,”F.CG_methykit.txt”,”G.CG_methykit.txt”,”H.CG_methykit.txt”,”I.CG_methykit.txt”) myobjDB=methRead(file.list, sample.id=list(“Control_1″,”Control_2″,”Control_3″,”Single_1″,”Single_2″,”Single_3″,”Multiple_1″,”Multiple_2″,”Multiple_3″), assembly=”HZGC”, treatment=c(0,0,0,1,1,1,2,2,2), context=”CpG”, mincov = 10, dbtype = “tabix”,…

2021年4月8日