Anvi’o 安装

Dependencies

  • DIAMOND or NCBI’s blastp for search.
  • MCL for clustering.
  • muscle for alignment.

easy install through conda:

 

wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod 777 Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
#在询问是否将conda加入环境变量的时候选择no
cd miniconda3/bin/
chmod 777 activate
. ./activate
#添加频道
conda config --env --add channels conda-forge
conda config --env --add channels bioconda

conda create -n anvio-6 python=3.6
conda activate anvio-6
conda install -y anvio=6
conda install -y diamond=0.9.14

anvi’o官网教程:http://merenlab.org/2016/11/08/pangenomics-v2/

MLST细菌分型

多位点序列分型(multilocus sequence typing,MLST)是一种基于核酸序列测定的细菌分型方法。这种方法通过PCR扩增多个管家基因内部片段并测定其序列,分析菌株的变异。

依赖conda安装

conda install -c conda-forge -c bioconda -c defaults mlst
mlst contigs.fa
contigs.fa  neisseria  11149  abcZ(672) adk(3) aroE(4) fumC(3) gdh(8) pdhC(4) pgm(6)

mlst genomes/*
genomes/6008.fna        saureus         239  arcc(2)   aroe(3)   glpf(1)   gmk_(1)   pta_(4)   tpi_(4)   yqil(3)
genomes/strep.fasta.gz  ssuis             1  aroA(1)   cpn60(1)  dpr(1)    gki(1)    mutS(1)   recA(1)   thrA(1)
genomes/NC_002973.gbk   lmonocytogenes    1  abcZ(3)   bglA(1)   cat(1)    dapE(1)   dat(3)    ldh(1)    lhkA(3)
genomes/L550.gbk.bz2    leptospira      152  glmU(26)  pntA(30)  sucA(28)  tpiA(35)  pfkB(39)  mreA(29)  caiB(29)

mlst --scheme neisseria NM*
NM003.fa   neisseria  4821  abcZ(222)  adk(3)  aroE(58)  fumC(275)  gdh(30)  pdhC(5)  pgm(255)
NM005.gbk  neisseria  177   abcZ(7)    adk(8)  aroE(10)  fumC(38)   gdh(10)  pdhC(1)  pgm(20)
NM011.fa   neisseria  11    abcZ(2)    adk(3)  aroE(4)   fumC(3)    gdh(8)   pdhC(4)  pgm(6)
NMC.gbk.gz neisseria  8     abcZ(2)    adk(3)  aroE(7)   fumC(2)    gdh(8)   pdhC(5)  pgm(2)

mlst --legacy --scheme neisseria *.fa
FILE      SCHEME     ST    abcZ  adk  aroE  fumC  gdh  pdhC  pgm
NM003.fa  neisseria  11    2     3    4     3       8     4    6
NM009.fa  neisseria  11149 672   3    4     3       8     4    6
MN043.fa  neisseria  11    2     3    4     3       8     4    6
NM051.fa  neisseria  11    2     3    4     3       8     4    6
NM099.fa  neisseria  1287  2     3    4    17       8     4    6
NM110.fa  neisseria  11    2     3    4     3       8     4    6

链接:Mlst官网

ARDB和VFDB数据库比对后筛选

python .py -i db.fasta -I blastresult.tab -o selected.txt -O filtered.txt
blastresult.tab是根据P值sort后的文件

from __future__ import division
import re
import sys, getopt
import operator
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_nucleotide

opts, args = getopt.getopt(sys.argv[1:], "hI:i:o:O:")
input_info1= ""
input_info2= ""
out_file1 = ""
out_file2 = ""

for op, value in opts:
    if op =="-o": #ARG_pattern_MIN_7030.fasta or ARG_pattern_MAX_7030.fasta
        out_file1 = open(value, "w")
        filename_pro = str (value)
        name_item = filename_pro.strip().split(".")
        filename = name_item[0]
    elif op == "-i": # All_ARGcontig_remove_S_nr.fasta
        input_info1 = open(value, "r")
    elif op == "-I":
        input_info2 = open(value, "r")
    elif op == "-O":
        out_file2 = open(value, "w")
        filename_pro = str (value)
        name_item = filename_pro.strip().split(".")
        filename = name_item[0]      

dict1 = {} 
dict2 = {}
for record in SeqIO.parse(input_info1,"fasta"):
    h = len(record.seq)
   
    dict1[record.id]=h
    dict2[record.id]=record.description
    
        
for line1 in input_info2:
    info1 = line1.strip("\n").split("\t")
    a = float(info1[2])
    b = float(info1[3])
    c = b/float(dict1[info1[1]])
    if a >= 80.0 and c >= 0.8:
        out_file1.write(str(dict1[info1[1]])+'\t'+str(b)+'\t'+dict2[info1[1]]+'\t'+line1)
    else:
        out_file2.write(str(dict1[info1[1]])+'\t'+str(b)+'\t'+dict2[info1[1]]+'\t'+line1)
input_info1.close()
input_info2.close()
out_file1.close()
out_file2.close()

ARDB注释抗药基因

wget -c ftp://ftp.cbcb.umd.edu/pub/data/ARDB/ARDBflatFiles.tar.gz
tar -zxvf ARDBflatFiles.tar.gz
wget -c ftp://ftp.cbcb.umd.edu/pub/data/ARDB/ardbAnno1.0.tar.gz
tar -zxvf ardbAnno1.0.tar.gz
makeblastdb -in resisGenes.pfasta -dbtype prot -out ARDB
vim genomeList.tab #目的蛋白序列路径
perl ardbAnno.pl

Prodigal注释原核基因组

下载地址:https://github.com/hyattpd/prodigal/releases/
下载源码包:Prodigal-2.6.3.tar.gz

 tar -zxvf Prodigal-2.6.3.tar.gz
cd Prodigal-2.6.3
make install
#添加环境变量
prodigal -a UBA705.pep -d UBA705.cds -f gff -g 11  -o UBA705.gff -p single -s UBA705.stat -i UBA705.fasta > prodigal.log

-a 是输出氨基酸文件

-c 不允许基因一边断开,也就是要求完整的orf,有起始和终止结构

-d 输出预测基因的序列文件

-f 选择输出文件格式,有gbk,gff,和sco格式可供选择

-g 指定密码子,原核为第11套

-i 输入文件,即需要预测的基因组序列文件

-m 屏蔽基因组中的N碱基

-o 输出文件,默认为屏幕输出

-p 选择方式,是单菌还是meta样品

-q 不输出错误信息到屏幕

-t 指定训练集

-s 输出所有潜在基因以及分值到一个文件中

Cellranger使用教程

建库,人和小鼠的数据库可以直接下载,对于无法直接下载的需要自行下载全基因组序列和gtf文件,根据 cellranger mkref构建参考数据库

wget ftp://ftp.ensembl.org/pub/release-97/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
gunzip Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
wget ftp://ftp.ensembl.org/pub/release-97/gtf/danio_rerio/Danio_rerio.GRCz11.97.gtf.gz
gunzip Danio_rerio.GRCz11.97.gtf.gz
cellranger mkgtf Danio_rerio.GRCz11.97.gtf Danio_rerio.GRCz11.97.filtered.gtf                  
                 --attribute=gene_biotype:protein_coding \
                 --attribute=gene_biotype:lincRNA \
                 --attribute=gene_biotype:antisense \
                 --attribute=gene_biotype:IG_LV_gene \
                 --attribute=gene_biotype:IG_V_gene \
                 --attribute=gene_biotype:IG_V_pseudogene \
                 --attribute=gene_biotype:IG_D_gene \
                 --attribute=gene_biotype:IG_J_gene \
                 --attribute=gene_biotype:IG_J_pseudogene \
                 --attribute=gene_biotype:IG_C_gene \
                 --attribute=gene_biotype:IG_C_pseudogene \
                 --attribute=gene_biotype:TR_V_gene \
                 --attribute=gene_biotype:TR_V_pseudogene \
                 --attribute=gene_biotype:TR_D_gene \
                 --attribute=gene_biotype:TR_J_gene \
                 --attribute=gene_biotype:TR_J_pseudogene \
                 --attribute=gene_biotype:TR_C_gene
cellranger mkref --nthreads=80 --genome=ref_zebr_GRCz11 --fasta=Danio_rerio.GRCz11.dna.primary_assembly.fa --genes=Danio_rerio.GRCz11.97.filtered.gtf --ref-version=3.1.0    #注意gene_id和transcript_id的顺序

计算表达量,下机的原始数据可以通过bcl2fastq拆分,也可通过cellranger自带的mkfastq命令拆分。bcl2fastq拆分后得到两个fastq文件,index信息包含在第一个fastq文件第一行尾部;mkfastq拆分后得到三个fastq文件,index信息包含在一个单独的fastq文件里。两种拆分结果都可作为count命令的输入,但文件的命名一定要严格安装软件的说明,否则会出错。

cellranger count --id=PBLzebr1 --transcriptome=/home/wuchangsong/sc_cell/ref_zebr_GRCz11 --fastqs=/home/wuchangsong/sc_cell/fastq/ --sample=PBLzebr1

得到的表达矩阵在filtered_feature_bc_matrix文件夹中,在analysis文件夹中有PCA和tSNE聚类结果文本形式。cellranger count 计算的结果只能作为错略观测的结果,如果需要进一步分析聚类细胞,还需要进行下游分析,这里使用官方推荐 R 包(Seurat),后边的分析参考Seurat的使用。

参考:https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build#GRCh38mm10_3.1.0

Seurat使用流程

seurat软件安装

Depends R (>= 3.4.0), methods

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Seurat")

CentOS系统安装时要注意gcc的版本

setwd("D:/Experiment_data/zxj/outs")
library(Seurat)
pbl.data <- Read10X(data.dir = "D:/Experiment_data/zxj/outs/filtered_feature_bc_matrix")
dim(pbl.data) #查看行和列
#创建 Seurat 对象与数据过滤。保留在>=3 个细胞中表达的基因;保留能检测到>=200 个基因的细胞。
pbl <- CreateSeuratObject(counts = pbl.data, project = "pbl1907", min.cells = 3, min.features = 200) 
#mt-开头的为线粒体基因,这里将其进行标记并统计其分布频率
pbl[["percent.mt"]] <- PercentageFeatureSet(pbl, pattern = "^mt-")
# 对 pbmc 对象做小提琴图,分别为基因数,细胞数和线粒体占比
VlnPlot(object = pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
#根据图片中基因数和线粒体数,分别设置过滤参数,这里基因数 200-2500,线粒体百分比为小于 5%
pbl <- subset(pbl, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")+ NoLegend()
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")+ NoLegend()
CombinePlots(plots = list(plot1, plot2))
#标准化
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
#鉴定高变基因
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
plot3 <- VariableFeaturePlot(pbmc)+ NoLegend()
plot4 <- LabelPoints(plot = plot3, points = top10, repel = TRUE,xnudge=0,ynudge=0)
CombinePlots(plots = list(plot1, plot2))
#数据归一化,对所有基因进行标准化,默认只是标准化高变基因( 2000 个),速度更快,不影响 PCA 和分群,但影响热图的绘制。
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc,features = all.genes,vars.to.regress = "percent.mt")
#线性降维(PCA),默认用高变基因集,但也可通过 features 参数自己指定
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
#定义可视化细胞和功能的几种有用的方式PCA,包括VizDimReduction,DimPlot,和DimHeatmap
VizDimLoadings(object = pbmc, dims = 1:2, reduction = "pca")
DimPlot(pbmc, reduction = "pca")+ NoLegend()
DimHeatmap(pbmc, dims = 1:2, cells = 500, balanced = TRUE)
#鉴定数据集的可用维度,主成分分析结束后需要确定哪些主成分所代表的基因可以进入下游分析,这里可以使用JackStraw做重抽样分析。可以用JackStrawPlot可视化看看哪些主成分可以进行下游分析。
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
JackStrawPlot(pbmc, dims = 1:15)#虚线以上的为可用维度,你也可以调整 dims 参数,画出所有 pca 查看
#肘部图(碎石图),基于每个主成分对方差解释率的排名。建议尝试选择多个主成分个数做下游分析,对整体影响不大;在选择此参数时,建议选择偏高的数字( “宁滥勿缺”,为了获取更多的稀有分群);有些亚群很罕见,如果没有先验知识,很难将这种大小的数据集与背景噪声区分开来。
ElbowPlot(object = pbmc)
#非线性降维( UMAP/tSNE)
#基于 PCA 空间中的欧氏距离计算 nearest neighbor graph,优化任意两个细胞间的距离权重(输入上一步得到的 PC 维数)
pbmc <- FindNeighbors(pbmc, dims = 1:10)
#resolution 参数决定下游聚类分析得到的分群数,对于 3K 左右的细胞,设为 0.4-1.2 能得到较好的结果(官方说明);如果数据量增大,该参数也应该适当增大;增加的值会导致更多的群集。
pbmc <- FindClusters(pbmc, resolution = 0.5)
#使用 Idents()函数可查看不同细胞的分群;
head(Idents(pbmc), 8)
table(pbmc@active.ident) # 查看每一类有多少个细胞
#Seurat 提供了几种非线性降维的方法进行数据可视化(在低维空间把相似的细胞聚在一起),比如 UMAP 和 t-SNE,运行 UMAP 需要先安装'umap-learn'包,两种方法都可以使用,但不要混用,这样,后面的结算结果会将先前的聚类覆盖掉,只能保留一个
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages = 'umap-learn')
pbmc <- RunTSNE(pbmc, dims = 1:10)
#pbmc <- RunUMAP(object = pbmc, dims = 1:10)
#DimPlot(object = pbmc, reduction = "umap")
#用 DimPlot()函数绘制散点图, reduction = "tsne",指定绘制类型;如果不指定,默认先从搜索 umap, 然后 tsne, 再然后 pca;也可以直接使用这 3 个函数 PCAPlot()、 TSNEPlot()、UMAPPlot(); cols, pt.size 分别调整分组颜色和点的大小;
tsneplot<-TSNEPlot(pbmc,label = TRUE, pt.size = 1.5)+ NoLegend()
#保存数据
saveRDS(pbmc, file = "pbmc_tutorial.rds")
save(pbmc,file="pbmc.RData")
load(file = "pbmc.RData")
#寻找差异表达的特征(聚类生物标志物)
#寻找某个聚类和其他所有聚类显著表达的基因
cluster1.markers <- FindMarkers(object = pbmc, ident.1 = 1, min.pct = 0.25)
head(x = cluster1.markers, n = 5)
#find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(object = pbmc, ident.1 = 5, ident.2 = c(0,3), min.pct = 0.25)
write.table(as.data.frame(cluster5.markers), file="cluster1vs2.xls", sep="\t", quote = F)
# find markers for every cluster compared to all remaining cells, report only the positive ones
#pbmc.markers <- FindAllMarkers(object = pbmc, only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25)
pbmc.markers <- FindAllMarkers(object = pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25, return.thresh = 0.01)
library(dplyr)
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(object = pbmc, features = top10$gene) + NoLegend()
#画图
VlnPlot(object = pbmc, features = c("MS4A1", "CD79A"))
FeaturePlot(object = pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
# you can plot raw UMI counts as well
VlnPlot(object = pbmc, features.plot = c("NKG7", "PF4"), use.raw = TRUE, y.log = TRUE)
FeaturePlot(object = pbmc, features = c("hbaa1","hbaa2","ighv1-4","cd79a","cd79b","cd4-1","cd8a","cd8b","itga2b"), cols = c("grey", "blue"), reduction = "tsne")
#将单元类型标识分配给集群
new.cluster.ids <- c("1", "3", "4", "2", "5", "6", "7", "8")
names(x = new.cluster.ids) <- levels(x = pbmc)
pbmc <- RenameIdents(object = pbmc, new.cluster.ids)
DimPlot(object = pbmc, label = TRUE, pt.size = 1.5) + NoLegend()
#气泡图
markers.to.plot <- c("cd74a", "cd74b")
DotPlot(pbmc, features = markers.to.plot) + RotatedAxis()


未完待续。。。

awk命令将fastq文件转换成fasta

awk '{if(NR%4==1||NR%4==2){print $0}}'  test.fq | sed 's/^@/>/g' > myfile.fasta

 有时候,比如说我们需要做一个clustalw,我们需要将多行的fasta文件(multiple line fasta)格式文件,转换成单行的fasta文件(single line fasta)序列文件,怎么办呢。 巧用awk + printf一行搞定。
awk '/^>/ { print n $0;}  !/^>/ {printf "%s", $0, n="\n"}  END {print ""}'  test.fa

 

WGCNA(my project)

setwd("D:/Ma_transcription/WGCNA/")
database <- read.table(file = "COS_deseq_counts_normalized.txt", sep = "\t", header = T, row.names = 1, stringsAsFactors = F)
library(reshape2)
library('WGCNA')
enableWGCNAThreads()#打开多线程
WGCNA_matrix = t(database[order(apply(database,1,mad), decreasing = T)[1:10000],])
subname=sapply(colnames(database),function(x) strsplit(x,"_")[[1]][1])
datTraits = data.frame(gsm=names(database),
subtype=subname)
rownames(datTraits)=datTraits[,1]
head(datTraits)
# Choose a set of soft-thresholding powers
powers = c(c(1:10), seq(from = 12, to=20, by=2))
datExpr <- WGCNA_matrix
# Call the network topology analysis function
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot the results:
par(mfrow = c(1,2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
     main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red");
# this line corresponds to using an R^2 cut-off of h
abline(h=0.90,col="green")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
sft$powerEstimate
net = blockwiseModules(datExpr, power = sft$powerEstimate,
                       maxBlockSize = 5000,TOMType = "unsigned", 
                       minModuleSize = 30,reassignThreshold = 0, mergeCutHeight = 0.25,
                       numericLabels = TRUE, pamRespectsDendro = FALSE,
                       saveTOMs = TRUE,
                       saveTOMFileBase = "AS-green-FPKM-TOM",
                       verbose = 3)
table(net$colors)

# Convert labels to colors for plotting
mergedColors = labels2colors(net$colors)
table(mergedColors)
# Plot the dendrogram and the module colors underneath
plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],
                    "Module colors",
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)
## assign all of the gene to their corresponding module 
## hclust for the genes.
#明确样本数和基因数
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
#首先针对样本做个系统聚类树
datExpr_tree<-hclust(dist(datExpr), method = "average")
par(mar = c(0,5,2,0))
plot(datExpr_tree, main = "Sample clustering", sub="", xlab="", cex.lab = 2, 
     cex.axis = 1, cex.main = 1,cex.lab=1)
## 如果这个时候样本是有性状,或者临床表型的,可以加进去看看是否聚类合理
#针对前面构造的样品矩阵添加对应颜色
sample_colors <- numbers2colors(as.numeric(factor(datTraits$subtype)), 
                                colors = c("grey","blue","red","green"),signed = FALSE)
## 这个给样品添加对应颜色的代码需要自行修改以适应自己的数据分析项目。
#  sample_colors <- numbers2colors( datTraits ,signed = FALSE)
## 如果样品有多种分类情况,而且 datTraits 里面都是分类信息,那么可以直接用上面代码,
##当然,这样给的颜色不明显,意义不大。
#构造10个样品的系统聚类树及性状热图
par(mar = c(1,4,3,1),cex=0.8)
plotDendroAndColors(datExpr_tree, sample_colors,
                    groupLabels = colnames(sample),
                    cex.dendroLabels = 0.8,
                    marAll = c(1, 4, 3, 1),
                    cex.rowText = 0.01,
design=model.matrix(~0+ datTraits$subtype)
colnames(design)=levels(datTraits$subtype)
moduleColors <- labels2colors(net$colors)
# Recalculate MEs with color labels
MEs0 = moduleEigengenes(datExpr, moduleColors)$eigengenes
MEs = orderMEs(MEs0); ##不同颜色的模块的ME值矩阵(样本vs模块)
moduleTraitCor = cor(MEs, design , use = "p");
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)

sizeGrWindow(10,6)
# Will display correlations and their p-values
textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
                   signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) = dim(moduleTraitCor)
par(mar = c(6, 8.5, 3, 3));
# Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
               xLabels = names(design),
               yLabels = names(MEs),
               ySymbols = names(MEs),
               colorLabels = FALSE,
               colors = greenWhiteRed(50),
               textMatrix = textMatrix,
               setStdMargins = FALSE,
               cex.text = 0.5,
               zlim = c(-1,1),
               main = paste("Module-trait relationships"))                 main = "Sample dendrogram and trait heatmap")
head(design)
# names (colors) of the modules
modNames = substring(names(MEs), 3)
geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"));
## 算出每个模块跟基因的皮尔森相关系数矩阵
## MEs是每个模块在每个样本里面的值
## datExpr是每个基因在每个样本的表达量
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));
names(geneModuleMembership) = paste("MM", modNames, sep="");
names(MMPvalue) = paste("p.MM", modNames, sep="");
## 只有连续型性状才能只有计算
## 这里把是否属于 CB 表型这个变量用0,1进行数值化。
CB = as.data.frame(design[,2]);
names(CB) = "CB"
geneTraitSignificance = as.data.frame(cor(datExpr, CB, use = "p"));
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));
names(geneTraitSignificance) = paste("GS.", names(CB), sep="");
names(GSPvalue) = paste("p.GS.", names(CB), sep="")
module = "turquoise"
column = match(module, modNames);
moduleGenes = moduleColors==module;
sizeGrWindow(7, 7);
par(mfrow = c(1,1));
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
                   abs(geneTraitSignificance[moduleGenes, 1]),
                   xlab = paste("Module Membership in", module, "module"),
                   ylab = "Gene significance for CB",
                   main = paste("Module membership vs. gene significance\n"),
                   cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)

#首先针对所有基因画热图
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
geneTree = net$dendrograms[[1]]; 
#生成全基因不相似TOM矩阵
dissTOM = 1-TOMsimilarityFromExpr(datExpr, power = 6); 
plotTOM = dissTOM^7; 
diag(plotTOM) = NA; 
#TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")

#然后随机选取部分基因作图
nSelect = 400
# For reproducibility, we set the random seed
set.seed(10);
select = sample(nGenes, size = nSelect);
selectTOM = dissTOM[select, select];
# There’s no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
selectTree = hclust(as.dist(selectTOM), method = "average")
selectColors = moduleColors[select];
# Open a graphical window
sizeGrWindow(9,9)
# Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing
# the color palette; setting the diagonal to NA also improves the clarity of the plot
plotDiss = selectTOM^7;
diag(plotDiss) = NA;
TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")

#最后画模块和性状的关系
# Recalculate module eigengenes
MEs = moduleEigengenes(datExpr, moduleColors)$eigengenes
## 只有连续型性状才能只有计算
## 这里把是否属于 Luminal 表型这个变量用0,1进行数值化。
CB = as.data.frame(design[,2]);
names(CB) = "CB"
# Add the weight to existing module eigengenes
MET = orderMEs(cbind(MEs, CB))
# Plot the relationships among the eigengenes and the trait
sizeGrWindow(5,7.5);
par(cex = 0.9)
plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(3,4,1,2), cex.lab = 0.8,                         xLabelsAngle= 90)
# Plot the dendrogram
sizeGrWindow(6,6);
par(cex = 1.0)
## 模块的聚类图
plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro = c(0,4,2,0),
                      plotHeatmaps = FALSE)
# Plot the heatmap matrix (note: this plot will overwrite the dendrogram plot)
par(cex = 1.0)
## 性状与模块热图
plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap = c(3,4,2,2),
                      plotDendrograms = FALSE, xLabelsAngle = 90)
# Recalculate topological overlap
TOM = TOMsimilarityFromExpr(datExpr, power = 6); 
# Select module
module = "blue";
# Select module probes
probes = colnames(datExpr) ## 我们例子里面的probe就是基因名
inModule = (moduleColors==module);
modProbes = probes[inModule]; 
## 也是提取指定模块的基因名
# Select the corresponding Topological Overlap
modTOM = TOM[inModule, inModule];
dimnames(modTOM) = list(modProbes, modProbes)
nTop = 200;
IMConn = softConnectivity(datExpr[, modProbes]);
top = (rank(-IMConn) <= nTop)
vis = exportNetworkToCytoscape(modTOM[top, top], edgeFile = paste("CytoscapeInput-edges-top200", paste(module, collapse="-"), ".txt", sep=""), nodeFile = paste("CytoscapeInput-nodes-top200", paste(module, collapse="-"), ".txt", sep=""), weighted = TRUE, threshold = 0);

参考:lncRNA实战项目-第六步-WGCNA相关性分析

 

DESeq2使用流程

library("DESeq2")
database <- read.table(file = "COS_genes.count_table.xls", sep = "\t", header = T, row.names = 1)
countData <- database[,1:15]
condition <- factor(c(rep("COS0",3),rep("COS1",3),rep("COS3",3),rep("COS6",3),rep("COS12",3)), levels = c("COS0", "COS1","COS3","COS6","COS12"))
countData <- round(as.matrix(countData))
coldata <- data.frame(row.names = colnames(countData), condition)
dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), design= ~ condition )
dds <- dds[ rowSums(counts(dds)) > 1, ]
dds <- DESeq(dds)
res0vs1 <- results(dds, contrast = c("condition","COS1","COS0"))
table(res0vs1$pvalue <0.001)
res0vs1 <- res0vs1[order(res0vs1$pvalue),]
write.table(as.data.frame(res0vs1), file="res0vs1.xls", sep="\t", quote = F)