scree_plot_pyseer mash.tsv
#ImportError: Cannot load backend ‘TkAgg’ which requires the ‘tk’ interactive framework, as ‘headless’ is currently running
修改scree_plot.py第37行,matplotlib.use(“TkAgg”)改为matplotlib.use(“Agg”)
Changsong Wu (Changsong111332@163.com)
scree_plot_pyseer mash.tsv
#ImportError: Cannot load backend ‘TkAgg’ which requires the ‘tk’ interactive framework, as ‘headless’ is currently running
修改scree_plot.py第37行,matplotlib.use(“TkAgg”)改为matplotlib.use(“Agg”)
从PopCOGenT github下载PopCOGenT.yml后使用conda安装软件运行需要环境
conda config --set restore_free_channel true conda env create -f PopCOGenT.yml conda activate PopCOGenT ##下载mugsy并解压 export MUGSY_INSTALL=/path/to/install/mugsy python get_alignment_and_length_bias.py --genome_dir /home/wuchangsong/Streptococcus_agalactiae/test/ --genome_ext .fasta --alignment_dir ./output --mugsy_path /opt/biosoft/PopCOGenT/mugsy_x86-64-v1r2.3/mugsy --mugsy_env /opt/biosoft/PopCOGenT/mugsy_x86-64-v1r2.3/mugsyenv.sh --base_name test_12 --num_threads 150 ##安装infomap git clone https://github.com/mapequation/infomap.git cd infomap make ##gcc要求4.9以上,不然报错(g++: error: unrecognized command line option ‘-std=c++14’) ##安装gcc-11.2.0(root用户) wget http://ftp.gnu.org/gnu/gcc/gcc-11.2.0/gcc-11.2.0.tar.gz tar -zxvf gcc-11.2.0.tar.gz cd gcc-11.2.0 ./contrib/download_prerequisites mkdir build cd build/ ../configure -enable-checking=release -enable-languages=c,c++ -disable-multilib make -j 160 make install ##重新打开终端gcc -v ##安装gcc-11.2.0后infomap make成功,然而./Infomap报错 ##./Infomap: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./Infomap) ##./Infomap: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by ./Infomap) ##./Infomap: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by ./Infomap) strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX_##发现确实少了库文件 cp /usr/local/lib64/libstdc++.so.6.0.29 /usr/lib64/ cd /usr/lib64/ && rm libstdc++.so.6 ln -s libstdc++.so.6.0.29 ./libstdc++.so.6 ##进入infomap安装目录 ./Infomap -h ##Infomap没有报错可以运行cluster.py,infomap安装最新版本运行时需要将cluster.py的第107行-i改为-c,因为v2.5.0输入参数为-c,作者推荐的v0.18.3输入参数为-i。infomap github已找不到v0.18.3,如果不想麻烦下载最新版本的infomap,可以进入PopCOGenT的安装目录,在Infomap目录下make即可,这是作者提供的v0.18.3。 python cluster.py --base_name test_12 --length_bias_file test_12.length_bias.txt --output_directory output --infomap_path /opt/biosoft/PopCOGenT-master/Infomap/Infomap --single_cell
python .py -i db.fasta -I blastresult.tab -o selected.txt -O filtered.txt
blastresult.tab是根据P值sort后的文件
from __future__ import division import re import sys, getopt import operator from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_nucleotide opts, args = getopt.getopt(sys.argv[1:], "hI:i:o:O:") input_info1= "" input_info2= "" out_file1 = "" out_file2 = "" for op, value in opts: if op =="-o": #ARG_pattern_MIN_7030.fasta or ARG_pattern_MAX_7030.fasta out_file1 = open(value, "w") filename_pro = str (value) name_item = filename_pro.strip().split(".") filename = name_item[0] elif op == "-i": # All_ARGcontig_remove_S_nr.fasta input_info1 = open(value, "r") elif op == "-I": input_info2 = open(value, "r") elif op == "-O": out_file2 = open(value, "w") filename_pro = str (value) name_item = filename_pro.strip().split(".") filename = name_item[0] dict1 = {} dict2 = {} for record in SeqIO.parse(input_info1,"fasta"): h = len(record.seq) dict1[record.id]=h dict2[record.id]=record.description for line1 in input_info2: info1 = line1.strip("\n").split("\t") a = float(info1[2]) b = float(info1[3]) c = b/float(dict1[info1[1]]) if a >= 80.0 and c >= 0.8: out_file1.write(str(dict1[info1[1]])+'\t'+str(b)+'\t'+dict2[info1[1]]+'\t'+line1) else: out_file2.write(str(dict1[info1[1]])+'\t'+str(b)+'\t'+dict2[info1[1]]+'\t'+line1) input_info1.close() input_info2.close() out_file1.close() out_file2.close()
wget https://www.python.org/ftp/python/3.7.2/Python-3.7.2.tar.xz tar -xvf Python-3.7.2.tar.xz cd Python-3.7.2 ./configure --enable-optimizations make altinstall
Python replace() 方法把字符串中的 old(旧字符串) 替换成 new(新字符串),如果指定第三个参数max,则替换不超过 max 次。
replace()方法语法:
str.replace(old, new[, max])
返回字符串中的 old(旧字符串) 替换成 new(新字符串)后生成的新字符串,如果指定第三个参数max,则替换不超过 max 次。
以下实例展示了replace()函数的使用方法:
#!/usr/bin/python str = "this is string example....wow!!! this is really string"; print str.replace("is", "was"); print str.replace("is", "was", 3);
以上实例输出结果如下:
thwas was string example....wow!!! thwas was really string thwas was string example....wow!!! thwas is really string
#! /usr/bin/python import glob out_file1 = open(r"E:\\Git-2.11.1-64-bit\\Git\\localhost\\lianshou\\4WGCNA\\merge.txt", "w") MyDiction = {} def Readfile(File): file_obj = open(File) try: while True: line = file_obj.readline().strip("\n") if not line: break array = line.split("\t") if array[0] in MyDiction: MyDiction[array[0]].append(array[1]) else: MyDiction[array[0]] = [array[1]] finally: file_obj.close() return MyDiction def main(): list_dirs = glob.glob("./*.txt") for i in list_dirs: Readfile(i) for gene_name in MyDiction: #print ("%s\t%s" %(gene_name, "\t".join(MyDiction[gene_name]))) out_file1.write("%s\t%s%s" %(gene_name, "\t".join(MyDiction[gene_name]),"\n")) if __name__ == '__main__': main()
脚本来自于生信技能树很实用,学习到了glob包的妙用。
前几天实验室一个师兄给我一个质谱结果,让帮忙做下go的功能富集,数据格式大概是这样的:
由于之前做go和kegg时都是跑流程,像这种针对性的go富集还没做过,说到底,还是由于自己手上缺少数据,没有属于自己的项目,很多细节性的问题都没有经历过。但这不妨碍咱一颗求知的心,我们都是在学习中成长。由于没事的时候逛论坛逛的比较频繁,知道数据的第二列是UniPro数据库的accession,然后该怎么办呢?作为生信人,Google是少不了的,看到Google结果,瞬间明了。根据Google的指引我从网上下载了UniProt数据库里的idmapping.tb.gz文件(wget -c -t 10000 ftp://ftp.pir.georgetown.edu/databases/idmapping/idmapping.tb.gz),大概18G左右,数据结构如下:
一共有22列,依次分别是:UniProtKB accession,UniProtKB ID,EntrezGene,RefSeq,NCBI GI number,PDB,Pfam,GO,PIRSF,IPI,UniRef100,UniRef90,UniRef50,UniParc,PIR-PSD accession,NCBI taxonomy,MIM,UniGene,Ensembl,PubMed ID,EMBL/GenBank/DDBJ,EMBL protein_id;这就有意思了,数据的第八列就是我们想要的go信息。更有意思的是,有了这个数据库信息,我们就可以根据不同数据库的注释信息做go富集啦!
下面要做的是写一个脚本,根据师兄给的结果调出对应的go号,对于会编程的人来说,这点自然不在话下,代码如下:
import sys USAGE = "\nusage: python %s idmapping.tb.gz blastout outputfile outputfile2\n" % sys.argv[0] if len(sys.argv) != 5: print USAGE sys.exit() def parseIDmapping(filename): UniProt_GO = {} with open(filename, 'r') as f: for line in f: lsplit = line.rstrip().split("\t") if lsplit[7]: UniProt_GO[lsplit[0]] = lsplit[7] return UniProt_GO def parseBlastOut(filename): tab_res = [] with open(filename, 'r') as f: for line in f: lsplit = line.strip('\n').split('\t') tab_res.append(lsplit[0]) return tab_res UniProtKB_GO = parseIDmapping(sys.argv[1]) BlastOut = parseBlastOut(sys.argv[2]) OUT = open(sys.argv[3], 'w') OUT1 = open(sys.argv[4], 'w') for i in BlastOut: if i in UniProtKB_GO.keys(): print i go = UniProtKB_GO[i] print go OUT.write(i+"\t"+go+"\n") else: OUT1.write(i+"\n") OUT.close() OUT1.close()
得到的结果是这样子:
由于使用软件的关系,这种格式貌似还不能达到要求,再写一脚本转换一下:
import re file1 = open(r"C:\\Users\\wuchangsong\\Desktop\\11.txt") out_file1 = open(r"C:\\Users\\wuchangsong\\Desktop\\12.txt", "w") for line1 in file1: info1 = re.sub('; ','\t',line1) out_file1.write(info1) file1.close() out_file1.close()
结果长这样:
最终结果:
满满的成就感有么有^_^!
版权声明:本文为博主原创文章,未经博主允许不得转载。
大概去年六七月份,那时还不懂什么是生物信息学,什么是编程,Python更是听都没听说过,稀里糊涂的就在老师的安排下跟着别的院一个师兄学习生信, 刚见面师兄就跟我讲解了什么是生物信息学,学生信的种种好处,什么不用做实验就能发文章啊,做的好的话读博能去一个不错的实验室啊之类的,当时我是处于懵逼状态,心想“这是一种怎样的操作?” 在师兄的推荐下,我掏了300大洋买了《DNA和蛋白质序列数据分析工具》《鸟哥的私房菜上下》三本书,用来了解和入门什么是生物信息学。说实话,到现在为止这三本书我都没怎么翻过,在我带入门的人里,我也不会推荐《鸟哥的私房菜》这种,这只会让他们望而却步,只要他们想学,我手里也有他们学不完的资源。好的是,师兄手把手的教了我一段时间。记得师兄说过,要想学生信,就必须学会一门编程语言,否则出门千万别说自己是搞生信的,丢人!所以我就又入了编程的坑,在师兄的强烈推荐下,我选择学习Python,师兄也帮我装上Python.2.7和编辑器,并装上Biopython包,然后扔给我一本全英的《Biopython》和一个脚本,说你要是一周之内不能把这个脚本弄懂,就不要学生信了,不适合。当时我差点就一口老血喷出来,心想“老哥,咱先不谈其他,你好歹也给个中文版本的吧,谁跟你这么强,硕士就到英国留学,毕业论文搞个全英的?”无奈,有总比没有强,还好后来我在网上搜到了这本书的中文版。
因此,在我对Python一无所知的情况下,首先学习了biopython,然后买了《python基础教程》《python核心编程》这两本书更进一步的学习。一但入了此坑就很难回头,尽管现在我主要跟着学习生信的老师和一些小伙伴都是Perl大神,也很难把我从这个坑里拉出来。可喜的是,现在python在生信上的应用越来越广,在机器学习方向,python也是处于领先地位,这更加给了我学下去的理由。当然,主学Python之余,R和Perl也是要懂一点的,用来做图和单行操作还是很必要的!
人生接触的第一个脚本:
import re
import sys, getopt
import operator
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_nucleotide
import re
import sys, getopt
import operator
opts, args = getopt.getopt(sys.argv[1:], “c:i:o:”)
blast_info = “”
out_file = “”
for op, value in opts:
if op ==”-o”: #ARG_pattern_MIN_7030.fasta or ARG_pattern_MAX_7030.fasta
out_file = open(value, “w”)
filename_pro = str (value)
name_item = filename_pro.strip().split(“.”)
filename = name_item[0]
if op == “-i”: # All_ARGcontig_remove_S_nr.fasta
blast_info = open(value, “r”)
if op == “-c”: #Contig_ARG_region_abstract_info_7030.out
location_info = open(value, “r”)
for record in SeqIO.parse(blast_info,”fasta”):
Pattern_star = 1 – 80
Pattern_end = 90
pattern_seq = record.seq[Pattern_star:Pattern_end]
Seq_len = str(len(pattern_seq))
pattern_ID = filename +”:”+Seq_len
New_record = SeqRecord(pattern_seq,id = pattern_ID, description = “”)
SeqIO.write(New_record, out_file,”fasta”)
blast_info.close()
out_file.close()
这个代码现在看来是不完美的,毕竟每次手动改起止位点还是比较麻烦