This function can plot different representations of the gene expression in a specific gene set.
Usage
plot1GS(
expr,
gmt,
Subject_ID,
TimePoint,
geneset.name,
baseline = NULL,
group.var = NULL,
Group_ID_paired = NULL,
ref = NULL,
group_of_interest = NULL,
FUNcluster = NULL,
clustering_metric = "euclidian",
clustering_method = "ward",
B = 500,
max_trends = 4,
aggreg.fun = "median",
na.rm.aggreg = TRUE,
trend.fun = "median",
methodOptiClust = "firstSEmax",
indiv = "genes",
verbose = TRUE,
clustering = TRUE,
showTrend = TRUE,
smooth = TRUE,
precluster = NULL,
time_unit = "",
title = NULL,
y.lab = NULL,
desc = TRUE,
lab.cex = 1,
axis.cex = 1,
main.cex = 1,
y.lab.angle = 90,
x.axis.angle = 45,
margins = 1,
line.size = 1,
y.lim = NULL,
x.lim = NULL,
gg.add = list(theme()),
plot = TRUE
)Arguments
- expr
either a matrix or dataframe of gene expression upon which dynamics are to be calculated, or a list of gene sets estimation of gene expression. In the case of a matrix or dataframe, its dimension are \(n\) x \(p\), with the \(p\) sample in column and the \(n\) genes in row. In the case of a list, its length should correspond to the number of gene sets under scrutiny and each element should be an 3 dimension array of estimated gene expression, such as for the list returned in the
'Estimations'element ofTcGSA.LR. See details.- gmt
a gmt object containing the gene sets definition. See
GSA.read.gmtand https://docs.gsea-msigdb.org/.- Subject_ID
a factor of length \(p\) that is in the same order as the columns of
expr(when it is a dataframe) and that contains the patient identifier of each sample.- TimePoint
a numeric vector or a factor of length \(p\) that is in the same order as
Subject_IDand the columns ofexpr(when it is a dataframe), and that contains the time points at which gene expression was measured.- geneset.name
a character string containing the name of the gene set to be plotted, that must appear in the
"geneset.names"element ofgmt.- baseline
a character string which is the value of
TimePointthat can be used as a baseline. Default isNULL, in which case no time point is used as a baseline value for gene expression. Has to beNULLwhen comparing two treatment groups.- group.var
in the case of several treatment groups, this is a factor of length \(p\) that is in the same order as
Timepoint,Subject_IDand the columns ofexpr. It indicates to which treatment group each sample belongs to. Default isNULL, which means that there is only one treatment group.- Group_ID_paired
a character vector of length \(p\) that is in the same order as
Timepoint,Subject_ID,group.varand the columns ofexpr. This argument must not beNULLin the case of a paired analysis, and must beNULLotherwise. Default isNULL.- ref
the group which is used as reference in the case of several treatment groups. Default is
NULL, which means that reference is the first group in alphabetical order of the labels ofgroup.var. See Details.- group_of_interest
the group of interest, for which dynamics are to be computed in the case of several treatment groups. Default is
NULL, which means that group of interest is the second group in alphabetical order of the labels ofgroup.var.- FUNcluster
a function which accepts as first argument a matrix
xand as second argument the number of clusters desiredk, and which returns a list with a component named'cluster'which is a vector of lengthn = nrow(x)of integers in 1:k, determining the clustering or grouping of the n observations. Default isNULL, in which case a hierarchical clustering is performed via the functionagnes, using the metricclustering_metricand the methodclustering_method. See'FUNcluster'inclusGapand Details.- clustering_metric
character string specifying the metric to be used for calculating dissimilarities between observations in the hierarchical clustering when
FUNclusterisNULL. The currently available options are"euclidean"and"manhattan". Default is"euclidean". Seeagnes. Also, a"sts"option is available in TcGSA. It implements the 'Short Time Series' distance [Moller-Levet et al., Fuzzy Clustering of short time series and unevenly distributed sampling points, Advances in Intelligent Data Analysis V:330-340 Springer, 2003] designed specifically for clustering time series.- clustering_method
character string defining the agglomerative method to be used in the hierarchical clustering when
FUNclusterisNULL. The six methods implemented are"average"([unweighted pair-]group average method, UPGMA),"single"(single linkage),"complete"(complete linkage),"ward"(Ward's method),"weighted"(weighted average linkage). Default is"ward". Seeagnes.- B
integer specifying the number of Monte Carlo ("bootstrap") samples used to compute the gap statistics. Default is
500. SeeclusGap.- max_trends
integer specifying the maximum number of different clusters to be tested. Default is
4.- aggreg.fun
a character string such as
"median"or"mean"or the name of any other defined statistics function that returns a single numeric value. It specifies the function used to aggregate the observations before the clustering. Default is to"mean".- na.rm.aggreg
a logical flag indicating whether
NAshould be remove to prevent propagation throughaggreg.fun. Can be useful to set to TRUE with unbalanced design as those will generate structuralNAs in$Estimations. Default isTRUE.- trend.fun
a character string such as
"mean"or the name of any other function that returns a single numeric value. It specifies the function used to calculate the trends of the identified clustered. Default is to"mean".- methodOptiClust
character string indicating how the "optimal" number of clusters is computed from the gap statistics and their standard deviations. Possible values are
"globalmax","firstmax","Tibs2001SEmax","firstSEmax"and"globalSEmax". Default is"firstSEmax". See'method'inclusGap, Details and Tibshirani et al., 2001 in References.- indiv
a character string indicating by which unit observations are aggregated (through
aggreg.fun) before the clustering. Possible values are"genes"or"patients". Default is"genes". See Details.- verbose
logical flag enabling verbose messages to track the computing status of the function. Default is
TRUE.- clustering
logical flag. If
FALSE, there is no clustering representation; ifTRUE, the lines are colored according to which cluster they belong to. Default isTRUE. See Details.- showTrend
logical flag. If
TRUE, a black line is added for each cluster, representing the correspondingtrend.fun. Default isTRUE.- smooth
logical flag. If
TRUEandshowTrendis alsoTRUE, the representation of each clustertrend.funis smoothed using cubic polynomials (seegeom_smooth. Default isTRUE. At the moment, must accept parameter"na.rm"(which is automatically set toTRUE). This might change in future versions- precluster
a vector of length \(p\) that is in the same order as
Subject_ID,TimePointand the columns ofexpr(when it is a dataframe), and that contains a prior clustering of the subjects. Default isNULL.- time_unit
the time unit to be displayed (such as
"Y","M","W","D","H", etc) next to the values ofTimePointon the x-axis. Default is"", in which case the time scale on the x-axis is proportional to the time values.- title
character specifying the title of the plot. If
NULL, a title is automatically generated, if"", no title appears. Default isNULL.- y.lab
character specifying the annotation of the y axis. If
NULL, an annotation is automatically generated, if"", no annotation appears. Default isNULL.- desc
a logical flag. If
TRUE, a line is added to the title of the plot with the description of the gene set plotted (from the gmt file). Default isTRUE.- lab.cex
a numerical value giving the amount by which lab labels text should be magnified relative to the default
1.- axis.cex
a numerical value giving the amount by which axis annotation text should be magnified relative to the default
1.- main.cex
a numerical value giving the amount by which title text should be magnified relative to the default
1.- y.lab.angle
a numerical value (in [0, 360]) giving the orientation by which y-label text should be turned (anti-clockwise). Default is
90. Seeelement_text.- x.axis.angle
a numerical value (in [0, 360]) giving the orientation by which x-axis annotation text should be turned (anti-clockwise). Default is
45.- margins
a numerical value giving the amount by which the margins should be reduced or increased relative to the default
1.- line.size
a numerical value giving the amount by which the line sizes should be reduced or increased relative to the default
1.- y.lim
a numeric vector of length 2 giving the range of the y-axis. See
plot.default.- x.lim
if numeric, will create a continuous scale, if factor or character, will create a discrete scale. Observations not in this range will be dropped. See
xlim.- gg.add
A list of instructions to add to the
ggplot2instructions. See+.gg. Default islist(theme()), which adds nothing to the plot.- plot
logical flag. If
FALSE, no plot is drawn. Default isTRUE.
Value
A list with 2 elements:
classif: adata.framewith the 2 following variables:ProbeIDwhich contains the IDs of the probes of the plotted gene set, andClustercontaining $ which cluster the probe belongs to. IfclusteringisFALSE, thenClusterisNAfor all the probes.p: aggplotobject containing the plot
Details
If expr is a matrix or a dataframe, then the "original" data are
plotted. On the other hand, if expr is a list returned in the
'Estimations' element of TcGSA.LR, then it is those
"estimations" made by the TcGSA.LR function that are plotted.
If indiv is 'genes', then each line of the plot is the median of a
gene expression over the patients. On the other hand, if indiv is
'patients', then each line of the plot is the median of a patient genes
expression in this gene set.
This function uses the Gap statistics to determine the optimal number of
clusters in the plotted gene set. See
clusGap.
References
Tibshirani, R., Walther, G. and Hastie, T., 2001, Estimating the number of data clusters via the Gap statistic, Journal of the Royal Statistical Society, Series B (Statistical Methodology), 63, 2: 411–423.
Examples
if(interactive()){
data(data_simu_TcGSA)
tcgsa_sim_1grp <- TcGSA.LR(expr=expr_1grp, gmt=gmt_sim, design=design,
subject_name="Patient_ID", time_name="TimePoint",
time_func="linear", crossedRandom=FALSE)
plot1GS(expr=expr_1grp, TimePoint=design$TimePoint,
Subject_ID=design$Patient_ID, gmt=gmt_sim,
geneset.name="Gene set 4",
indiv="genes", clustering=FALSE,
time_unit="H",
lab.cex=0.7)
plot1GS(expr=expr_1grp, TimePoint=design$TimePoint,
Subject_ID=design$Patient_ID, gmt=gmt_sim,
geneset.name="Gene set 5",
indiv="patients", clustering=FALSE, baseline=1,
time_unit="H",
lab.cex=0.7)
}
if(interactive()){
geneclusters <- plot1GS(expr=tcgsa_sim_1grp$Estimations, TimePoint=design$TimePoint,
Subject_ID=design$Patient_ID, gmt=gmt_sim,
geneset.name="Gene set 5",
indiv="genes",
time_unit="H",
lab.cex=0.7
)
geneclusters
}
if(interactive()){
library(grDevices)
library(graphics)
colval <- c(hsv(0.56, 0.9, 1),
hsv(0, 0.27, 1),
hsv(0.52, 1, 0.5),
hsv(0, 0.55, 0.97),
hsv(0.66, 0.15, 1),
hsv(0, 0.81, 0.55),
hsv(0.7, 1, 0.7),
hsv(0.42, 0.33, 1)
)
n <- length(colval); y <- 1:n
op <- par(mar=rep(1.5,4))
plot(y, axes = FALSE, frame.plot = TRUE,
xlab = "", ylab = "", pch = 21, cex = 8,
bg = colval, ylim=c(-1,n+1), xlim=c(-1,n+1),
main = "Color scale"
)
par(op)
plot1GS(expr=expr_1grp, TimePoint=design$TimePoint,
Subject_ID=design$Patient_ID, gmt=gmt_sim,
geneset.name="Gene set 5",
indiv="genes",
time_unit="H",
title="",
gg.add=list(scale_color_manual(values=colval),
guides(colour = guide_legend(reverse=TRUE))),
lab.cex=0.7
)
plot1GS(expr=expr_2grp, TimePoint=design$TimePoint,
Subject_ID=design$Patient_ID, gmt=gmt_sim,
geneset.name="Gene set 3",
indiv="genes",
group.var = design$group.var,
time_unit="H",
gg.add=list(scale_color_manual(values=colval),
guides(colour = guide_legend(reverse=TRUE))),
lab.cex=0.7
)
}