Category: R

preprocessCore分位数标准化

library(“preprocessCore”) library(“ggplot2”) data <- read.table(“mouse_gc.xls”,sep=”\t”,header = TRUE) rownames(data) <- data$GeneID #rownames(data) <- data[,1] #data_mat <- data.matrix(data[,-1]) quantile_normalisation <- function(df){ df_rank <- apply(df,2,rank,ties.method=”min”) df_sorted <- data.frame(apply(df, 2, sort)) df_mean <- apply(df_sorted, 1, mean) index_to_mean <- function(my_index, my_mean){ return(my_mean[my_index]) } df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean) rownames(df_final) <- rownames(df) return(df_final) } new_data <- quantile_normalisation(data[,2:19]) boxplot(data) boxplot(new_data) write.table(new_data,…

2021年3月15日
数据相关性分析

简介：评价两组数据之间的相关性，有皮尔森（pearson）相关系数，斯皮尔曼（spearman）相关系数和肯德尔（kendall）相关系数。在这三大相关系数中，spearman和kendall属于等级相关系数亦称为“秩相关系数”，是反映等级相关程度的统计分析指标。相关系数的绝对值越大，相关性越强，相关系数越接近于1或-1，相关度越强，相关系数越接近于0，相关度越弱。 Pearson（皮尔逊）相关系数：皮尔逊相关也称为积差相关（或积矩相关）是英国统计学家皮尔逊于20世纪提出的一种计算直线相关的方法。适用于： (1)、两个变量之间是线性关系，都是连续数据。 (2)、两个变量的总体是正态分布，或接近正态的单峰分布。 (3)、两个变量的观测值是成对的，每对观测值之间相互独立。 Spearman Rank（斯皮尔曼等级）相关系数：在统计学中，斯皮尔曼等级相关系数以Charles Spearman命名，并经常用希腊字母ρ（rho）表示其值。又称秩相关系数，是利用两变量的秩次大小作线性相关分析，对原始变量的分布不作要求，属于非参数统计方法，适用范围要广些。斯皮尔曼等级相关是根据等级资料研究两个变量间相关关系的方法。它是依据两列成对等级的各对等级数之差来进行计算的，所以又称为“等级差数法”。斯皮尔曼等级相关对数据条件的要求没有积差相关系数严格，只要两个变量的观测值是成对的等级评定资料，或者是由连续变量观测资料转化得到的等级资料，不论两个变量的总体分布形态、样本容量的大小如何，都可以用斯皮尔曼等级相关来进行研究。对于服从Pearson 相关系数的数据亦可计算Spearman 相关系数，但统计效能要低一些。Pearson 相关系数的计算公式可以完全套用 Spearman 相关系数计算公式，但公式中的x 和y 用相应的秩次代替即可。 Kendall Rank（肯德尔等级）相关系数：在统计学中，肯德尔相关系数是以Maurice Kendall命名的，并经常用希腊字母τ（tau）表示其值。用于反映分类变量相关性的指标，适用于两个分类变量均为有序分类的情况。对相关的有序变量进行非参数相关检验；取值范围在-1-1 之间，此检验适合于正方形表格；肯德尔(Kendall)W 系数又称和谐系数，是表示多列等级变量相关程度的一种方法。适用这种方法的数据资料一般是采用等级评定的方法收集的，即让K 个评委（被试）评定N 件事物，或1 个评委（被试）先后K 次评定N 件事物。等级评定法每个评价者对N 件事物排出一个等级顺序，最小的等级序数为1 ，最大的为N ，若并列等级时，则平分共同应该占据的等级，如，平时所说的两个并列第一名，他们应该占据1 ，2 名，所以它们的等级应是1.5, 又如一个第一名，两个并列第二名，三个并列第三名，则它们对应的等级应该是1,2.5,2.5,5,5,5, 这里2.5 是2,3 的平均，5 是4,5,6 的平均。 R包计算相关性系数： library(ggplot2) library(ggpubr) data <- read.table(“111.xls”,sep=”\t”,header = TRUE)#列是特征（两个特征），行是物种名 ggplot(data=data, aes(x=genome, y=Tes))+geom_point(color=”red”)+stat_smooth(method=”lm”)+stat_cor(data=data, method = “pearson”)+theme_classic() #stat_cor(data=dat, method…

2021年2月21日
MUMmer使用及后续绘图

MUMmer被广泛用于大片段序列的比对，如染色体共线性分析。 nucmer [options] delta-filter -i 80 -l 1000 -r -q out.delta > out.rq.delta #比对率大于80%，对比长度大于1000 #-r: 仅保留每个reference在query上的最佳位置,允许多条reference在query上重叠 #-q: 仅保留每个query在reference上的最佳位置,允许多条query在reference上重叠 show-coords out.rq.delta > out.coords grep -P “zfCh04\s+” out.coords|awk ‘{print $12,$13}’ |sort |uniq -c #查看reference单个染色体zfCh04和query不同染色体的匹配区域的数量 grep -P “zfCh04\s+gcCh04” out.coords|awk ‘{print $12,$1,$2,$13,$4,$5}’ > zfCh04_gcCh04.txt sed -i ‘s/ /\t/g’ zfCh04_gcCh04.txt #R作图使用RIdeogram包，可参考RIdeogram：染色体数据可视化的R包 install.packages(‘RIdeogram’) require(RIdeogram) cc <- read.table(“111.xls”,sep=”\t”,header = TRUE,stringsAsFactors =…

2021年2月20日
Upset plot：进阶版Venn plot实现多元素可视化

if (!requireNamespace(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”) BiocManager::install(“UpSetR”) library(UpSetR) pathway=read.table(“carp_combined_id1.xls”,header=T,sep=”\t”) upset(pathway,nsets = 9, nintersects = 50,mb.ratio = c(0.5, 0.5), order.by = “freq”, decreasing = c(TRUE,FALSE), queries = list(list(query=intersects, params=list(“datra”, “caaur”,”meamb”,”onmac”,”ctide”,”cycar”,”sigra”,”sians”,”sirhi”), color=”red”, active=T))) #详细参数查看官网 #行列转置awk ‘{for(i=0;++i

2020年10月21日
R画折线图

library(ggplot2) library(gcookbook) data

2020年8月29日
clusterProfiler富集分析

BiocManager::install(‘clusterProfiler’) BiocManager::install(‘org.Hs.eg.db’) BiocManager::install(‘DOSE’) library(clusterProfiler) library(org.Hs.eg.db) library(DOSE) entrezID <- read.table(“11.xls”,header=F,sep=”\t”) entrezID <- entrezID$V1 BP <- enrichGO(entrezID,”org.Hs.eg.db”,ont=”BP”,keyType = “ENSEMBL”,pAdjustMethod = “BH”,pvalueCutoff = 0.05,qvalueCutoff = 0.1,readable = T) dotplot(BP, x = “GeneRatio”, color = “p.adjust”, showCategory = 20, size = NULL, split = NULL, font.size = 12, title=”Dotplot for Gene Ontology Analysis”) write.table(BP, ‘go_tmp.txt’, sep = ‘\t’, row.names…

2020年1月14日
Cellranger使用教程

建库，人和小鼠的数据库可以直接下载，对于无法直接下载的需要自行下载全基因组序列和gtf文件，根据 cellranger mkref构建参考数据库 wget ftp://ftp.ensembl.org/pub/release-97/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz gunzip Danio_rerio.GRCz11.dna.primary_assembly.fa.gz wget ftp://ftp.ensembl.org/pub/release-97/gtf/danio_rerio/Danio_rerio.GRCz11.97.gtf.gz gunzip Danio_rerio.GRCz11.97.gtf.gz cellranger mkgtf Danio_rerio.GRCz11.97.gtf Danio_rerio.GRCz11.97.filtered.gtf –attribute=gene_biotype:protein_coding \ –attribute=gene_biotype:lincRNA \ –attribute=gene_biotype:antisense \ –attribute=gene_biotype:IG_LV_gene \ –attribute=gene_biotype:IG_V_gene \ –attribute=gene_biotype:IG_V_pseudogene \ –attribute=gene_biotype:IG_D_gene \ –attribute=gene_biotype:IG_J_gene \ –attribute=gene_biotype:IG_J_pseudogene \ –attribute=gene_biotype:IG_C_gene \ –attribute=gene_biotype:IG_C_pseudogene \ –attribute=gene_biotype:TR_V_gene \ –attribute=gene_biotype:TR_V_pseudogene \ –attribute=gene_biotype:TR_D_gene \ –attribute=gene_biotype:TR_J_gene \ –attribute=gene_biotype:TR_J_pseudogene \ –attribute=gene_biotype:TR_C_gene cellranger mkref –nthreads=80 –genome=ref_zebr_GRCz11 –fasta=Danio_rerio.GRCz11.dna.primary_assembly.fa –genes=Danio_rerio.GRCz11.97.filtered.gtf –ref-version=3.1.0…

2019年8月16日
Seurat使用流程

seurat软件安装 Depends R (>= 3.4.0), methods if (!requireNamespace(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”) BiocManager::install(“Seurat”) CentOS系统安装时要注意gcc的版本 setwd(“D:/Experiment_data/zxj/outs”) library(Seurat) pbl.data <- Read10X(data.dir = “D:/Experiment_data/zxj/outs/filtered_feature_bc_matrix”) dim(pbl.data) #查看行和列 #创建 Seurat 对象与数据过滤。保留在>=3 个细胞中表达的基因；保留能检测到>=200 个基因的细胞。 pbl <- CreateSeuratObject(counts = pbl.data, project = “pbl1907”, min.cells = 3, min.features = 200) #mt-开头的为线粒体基因，这里将其进行标记并统计其分布频率 pbl[[“percent.mt”]] <- PercentageFeatureSet(pbl, pattern = “^mt-“) # 对 pbmc 对象做小提琴图，分别为基因数，细胞数和线粒体占比 VlnPlot(object =…

2019年8月6日
CentOS 6.9 安装R-3.6.1

根据configure报错下载bzip2、curl、PCRE、xz-lzma、zlib对应的版本如果是64位的系统，安装bzip2时修改Makefile文件，如下： CC=gcc -fPIC AR=ar RANLIB=ranlib LDFLAGS= BIGFILES=-D_FILE_OFFSET_BITS=64 CFLAGS=-fPIC -Wall -Winline -O2 -g $(BIGFILES) 安装好上边的模块后设置环境变量 export PATH=/home/wuchangsong/packages/bin:$PATH export LD_LIBRARY_PATH=/home/wuchangsong/packages/lib:$LD_LIBRARY_PATH export CFLAGS=”-I/home/wuchangsong/packages/include” export LDFLAGS=”-L/home/wuchangsong/packages/lib” 根据报错做如下操作 sudo yum install texinfo sudo yum install texlive unzip inconsolata.zip cp -Rfp inconsolata/* /usr/share/texmf/ sudo mktexlsr ./configure –prefix=/opt/sysoft/R-3.6.1 –enable-R-shlib –with-readline=yes –with-libpng=yes –with-x=no make -j 80 make install

2019年8月6日
ggplot2绘制富集分析柱状图

library(“ggplot2”) pathway=read.table(“COS3_kegg_total.xls”,header=T,sep=”\t”) pathbar = ggplot(pathway,aes(x=Pathway,y=-1*log10(Pvalue))) pathbar + geom_bar(stat=”identity”) pathbar + geom_bar(stat=”identity”) + coord_flip() pathbar+geom_bar(stat=”identity”,color=”red”,fill=”blue”)+coord_flip() pathbar+geom_bar(stat=”identity”,aes(fill=-1*log10(Pvalue)))+coord_flip() porder=factor(as.integer(rownames(pathway)),labels=pathway$Pathway) pathbar+geom_bar(stat=”identity”,aes(x=porder,fill=-1*log10(Pvalue)))+coord_flip() pathbar+geom_bar(stat=”identity”,aes(x=rev(porder),fill=-1*log10(Pvalue)))+coord_flip() porder=factor(as.integer(rownames(pathway)),labels=rev(pathway$Pathway)) pathbar+geom_bar(stat=”identity”,aes(x=rev(porder),fill=-1*log10(Pvalue)))+coord_flip() pqbar=pathbar+geom_bar(stat=”identity”,aes(x=rev(porder),fill=-1*log10(Pvalue)))+coord_flip()+theme(legend.position=”none”)+labs(title=”Top20 of pathway enrichment”,y=expression(-log[10](Pvalue))) pqbar pqbar+geom_hline(yintercept=2,color=c(“red”),linetype=4)

2018年4月23日