This function plots a gene sets dynamic trends heatmap.
Usage
# S3 method for TcGSA
plot(
x,
threshold = 0.05,
myproc = "BY",
nbsimu_pval = 1e+06,
expr,
Subject_ID,
TimePoint,
baseline = NULL,
only.signif = TRUE,
group.var = NULL,
Group_ID_paired = NULL,
ref = NULL,
group_of_interest = NULL,
ranking = FALSE,
FUNcluster = NULL,
clustering_metric = "euclidian",
clustering_method = "ward",
B = 500,
max_trends = 4,
aggreg.fun = "median",
na.rm.aggreg = TRUE,
methodOptiClust = "firstSEmax",
indiv = "genes",
verbose = TRUE,
clust_trends = NULL,
N_clusters = NULL,
myclusters = NULL,
label.clusters = NULL,
prev_rowCL = NULL,
descript = TRUE,
plot = TRUE,
color.vec = c("darkred", "#D73027", "#FC8D59", "snow", "#91BFDB", "#4575B4",
"darkblue"),
legend.breaks = NULL,
label.column = NULL,
time_unit = "",
cex.label.row = 1,
cex.label.column = 1,
margins = c(5, 25),
heatKey.size = 1,
dendrogram.size = 1,
heatmap.height = 1,
heatmap.width = 1,
cex.clusterKey = 1,
cex.main = 1,
horiz.clusterKey = TRUE,
main = NULL,
subtitle = NULL,
...
)
Arguments
- x
an object of class'
TcGSA
'.- threshold
the threshold at which the FDR or the FWER should be controlled.
- myproc
a vector of character strings containing the names of the multiple testing procedures for which adjusted p-values are to be computed. This vector should include any of the following: "
Bonferroni
", "Holm
", "Hochberg
", "SidakSS
", "SidakSD
", "BH
", "BY
", "ABH
", "TSBH
" or "none
". "none
" indicates no adjustment for multiple testing. Seemt.rawp2adjp
for details. Default is "BY
", the Benjamini & Yekutieli (2001) step-up FDR-controlling procedure (general dependency structures). In order to control the FWER(in case of an analysis that is more a hypothesis confirmation than an exploration of the expression data), we recommend to use "Holm
", the Holm (1979) step-down adjusted p-values for strong control of the FWER.- nbsimu_pval
the number of observations under the null distribution to be generated in order to compute the p-values. Default is
1e+06
.- expr
either a matrix or dataframe of gene expression upon which dynamics are to be calculated, or a list of gene sets estimation of gene expression. In the case of a matrix or dataframe, its dimension are \(n\) x \(p\), with the \(p\) sample in column and the \(n\) genes in row. In the case of a list, its length should correspond to the number of gene sets under scrutiny and each element should be an 3 dimension array of estimated gene expression, such as for the list returned in the
'Estimations'
element ofTcGSA.LR
. See details.- Subject_ID
a factor of length \(p\) that is in the same order as the columns of
expr
(when it is a dataframe) and that contains the patient identifier of each sample. Ignored ifexpr
is a list of estimations.- TimePoint
a numeric vector or a factor of length \(p\) that is in the same order as
Subject_ID
and the columns ofexpr
(when it is a dataframe), and that contains the time points at which gene expression was measured. Ignored ifexpr
is a list of estimations.- baseline
the value of
TimePoint
to be used as baseline. Default isNULL
in which case expression is centered and no baseline is used.- only.signif
logical flag for plotting only the significant gene sets. If
FALSE
, all the gene sets from the gmt object contained inx
are plotted. Default isTRUE
.- group.var
in the case of several treatment` groups, this is a factor of length \(p\) that is in the same order as
Timepoint
,Subject_ID
,sample_name
and the columns ofexpr
. It indicates to which treatment group each sample belongs to. Default isNULL
, which means that there is only one treatment group. See Details.- Group_ID_paired
a character vector of length \(p\) that is in the same order as
Timepoint
,Subject_ID
,sample_name
,group.var
and the columns ofexpr
. This argument must not beNULL
in the case of a paired analysis, and must beNULL
otherwise. Default isNULL
. See Details.- ref
the group which is used as reference in the case of several treatment groups. Default is
NULL
, which means that reference is the first group in alphabetical order of the labels ofgroup.var
. See Details.- group_of_interest
the group of interest, for which dynamics are to be computed in the case of several treatment groups. Default is
NULL
, which means that group of interest is the second group in alphabetical order of the labels ofgroup.var
. See Details.group_of_interest
here~~- ranking
a logical flag. If
TRUE
, the gene set trends are not hierarchically classified, but ordered by decreasing Likelihood ratios. Default isFALSE
.- FUNcluster
the clustering function used to agglomerate genes in trends. Default is
NULL
, in which a hierarchical clustering is performed via the functionagnes
, using the metricclustering_metric
and the methodclustering_method
. SeeclusGap
- clustering_metric
character string specifying the metric to be used for calculating dissimilarities between observations in the hierarchical clustering when
FUNcluster
isNULL
. The currently available options are"euclidean"
and"manhattan"
. Default is"euclidean"
. Seeagnes
. Also, a"sts"
option is available in TcGSA. It implements the 'Short Time Series' distance [Moller-Levet et al., Fuzzy Clustering of short time series and unevenly distributed sampling points, Advances in Intelligent Data Analysis V:330-340 Springer, 2003] designed specifically for clustering time series.- clustering_method
character string defining the agglomerative method to be used in the hierarchical clustering when
FUNcluster
isNULL
. The six methods implemented are"average"
([unweighted pair-]group average method, UPGMA),"single"
(single linkage),"complete"
(complete linkage),"ward"
(Ward's method),"weighted"
(weighted average linkage). Default is"ward"
. Seeagnes
.- B
integer specifying the number of Monte Carlo ("bootstrap") samples used to compute the gap statistics. Default is
500
. SeeclusGap
.- max_trends
integer specifying the maximum number of different clusters to be tested. Default is
4
.- aggreg.fun
a character string such as
"mean"
,"median"
or the name of any other statistics function defined that returns a single numeric value. It specifies the function used to aggregate the observations before the clustering. Default is tomedian
. Default is"median"
.- na.rm.aggreg
a logical flag indicating whether
NA
should be remove to prevent propagation throughaggreg.fun
. Can be useful to set to TRUE with unbalanced design as those will generate structuralNA
s in$Estimations
. Default isTRUE
.- methodOptiClust
character string indicating how the "optimal"" number of clusters is computed from the gap statistics and their standard deviations. Possible values are
"globalmax"
,"firstmax"
,"Tibs2001SEmax"
,"firstSEmax"
and"globalSEmax"
. Default is"firstSEmax"
. See'method'
inclusGap
, Details and Tibshirani et al., 2001 in References.- indiv
a character string indicating by which unit observations are aggregated (through
aggreg.fun
) before the clustering. Possible values are"genes"
or"patients"
. Default is"genes"
.- verbose
logical flag enabling verbose messages to track the computing status of the function. Default is
TRUE
.- clust_trends
object of class ClusteredTrends containing already computed trends for the plotted gene sets. Default is
NULL
.- N_clusters
an integer that is the number of clusters in which the dynamics should be regrouped. The cutoff of the clustering tree is automatically calculated accordingly. Default is
NULL
, in which case the dendrogram is not cut and no clusters are identified.- myclusters
a character vector of colors for predefined clusters of the represented gene sets, with as many levels as the value of
N_clusters
. Default isNULL
, in which case the clusters are automatically identified and colored via thecutree
function and theN_clusters
argument only.- label.clusters
if
N_clusters
is notNULL
, a character vector of lengthN_clusterss
. Default isNULL
, in which case ifN_clusters
is notNULL
, clusters are simply labeled with numbers.- prev_rowCL
a hclust object, such as the one return by the present plotting function (see Value) for instance. If not
NULL
, no clustering is calculated by the present plotting function and this tree is used to represent the gene sets dynamics. Default isNULL
.- descript
logical flag indicating that the description of the gene sets should appear after their name on the right side of the plot if
TRUE
. Default isTRUE
. See Details.- plot
logical flag indicating whether the heatmap should be plotted or not. Default is
TRUE
.- color.vec
a character strings vector used to define the color palette used in the plot. Default is
c("#D73027", "#FC8D59","lightyellow", "#91BFDB", "#4575B4")
.- legend.breaks
a numeric vector indicating the splitting points for coloring. Default is
NULL
, in which case the break points will be spaced equally and symmetrically about 0.- label.column
a vector of character strings with the labels to be displayed for the columns (i.e. the time points). Default is
NULL
.- time_unit
the time unit to be displayed (such as
"Y"
,"M"
,"W"
,"D"
,"H"
, etc) next to the values ofTimePoint
in the columns labels whenlabel.column
isNULL
. Default is""
.- cex.label.row
a numerical value giving the amount by which row labels text should be magnified relative to the default
1
.- cex.label.column
a numerical value giving the amount by which column labels text should be magnified relative to the default
1
.- margins
numeric vector of length 2 containing the margins (see
par(mar= *))
for column and row names, respectively. Default isc(15, 100)
. See Details.- heatKey.size
the size of the color key for the heatmap fill. Default is
1
.- dendrogram.size
the horizontal size of the dendrogram. Default is
1
- heatmap.height
the height of the heatmap. Default is
1
- heatmap.width
the width of the heatmap. Default is
1
- cex.clusterKey
a numerical value giving the amount by which the clusters legend text should be magnified relative to the default
1
, whenN_clusters
is notNULL
.- cex.main
a numerical value giving the amount by which title text should be magnified relative to the default
1
.- horiz.clusterKey
a logical flag; if
TRUE
, set the legend for clusters horizontally rather than vertically. Only used if theN_clusters
argument is notNULL
. Default isTRUE
.- main
a character string for an optional title. Default is
NULL
.- subtitle
a character string for an optional subtitle. Default is
NULL
.- ...
other parameters to be passed through to plotting functions.
Value
An object of class hclust which describes the tree produced by the clustering process. The object is a list with components:
merge
an \(n-1\) by \(2\) matrix. Row \(i\) ofmerge
describes the merging of clusters at step i of the clustering. If an element \(j\) in the row is negative, then observation -\(j\) was merged at this stage. If \(j\) is positive then the merge was with the cluster formed at the (earlier) stage \(j\) of the algorithm. Thus negative entries in merge indicate agglomerations of singletons, and positive entries indicate agglomerations of non-singletons.height
a set of \(n-1\) real values (non-decreasing for ultrametric trees). The clustering height: that is, the value of the criterion associated with the Ward clustering method.order
a vector giving the permutation of the original observations suitable for plotting, in the sense that a cluster plot using this ordering and matrix merge will not have crossings of the branches.labels
the gene set trends name.call
the call which produced the result clustering:hclust(d = dist(map2heat, method = "euclidean"), method = "ward.D2")
method
"ward.D2", as it is the clustering method that has been used for clustering the gene set trends.dist.method
"euclidean", as it is the distance that has been used for clustering the gene set trends.legend.breaks
a numeric vector giving the splitting points used for coloring the heatmap. Ifplot
isFALSE
, then it isNULL
.myclusters
a character vector of colors for the dynamic clusters of the represented gene set trends, with as many levels as the value ofN_clusters
. If no dynamic clusters were represented, than this isNULL
.ddr
a dendrogram object with the reordering used for the heatmap. Seeheatmap.2
function from packagegplots
.gene set.names character vector with the names of the gene sets used in the heatmap.
clust.trends
a ClusteredTrends object.clustersExport
a data frame with 2 variables containing the two following variables :GeneSet
: the gene set trends clustered.Cluster
: the dynamic cluster they belong to.
The data frame is order by the variable
Cluster
.data_plotted
: the data matrix represented by the heatmap
Details
On the heatmap, each line corresponds to a gene set, and each column to a time point.
If expr
is a matrix or a dataframe, then the "original" data are
plotted. On the other hand, if expr
is a list returned in the
'Estimations'
element of TcGSA.LR
, then it is those
"estimations" made by the TcGSA.LR
function that are plotted.
If descript
is FALSE
, the second element of margins
can
be reduced (for instance use margins = c(5, 10)
), as there is not so
much need for space in order to display only the gene set names, without
their description.
If there is a large number of significant gene sets, the hierarchical clustering
step repeated for each of them can take a few minutes. To speed things up
(especially) when playing with the plotting parameters for having a nice plot,
one can run the clustTrend
function beforehand, and plug its results
in the plot.TcGSA
function via the clust_trends
argument.
References
Hejblum BP, Skinner J, Thiebaut R, (2015) Time-Course Gene Set Analysis for Longitudinal Gene Expression Data. PLOS Comput. Biol. 11(6): e1004310. doi: 10.1371/journal.pcbi.1004310
Examples
if(interactive()){
data(data_simu_TcGSA)
tcgsa_sim_1grp <- TcGSA.LR(expr=expr_1grp, gmt=gmt_sim, design=design,
subject_name="Patient_ID", time_name="TimePoint",
time_func="linear", crossedRandom=FALSE)
summary(tcgsa_sim_1grp)
plot(x=tcgsa_sim_1grp, expr=tcgsa_sim_1grp$Estimations,
Subject_ID=design$Patient_ID, TimePoint=design$TimePoint,
baseline=1,
B=100,
time_unit="H",
dendrogram.size=0.4, heatmap.width=0.8, heatmap.height=2, cex.main=0.7
)
tcgsa_sim_2grp <- TcGSA.LR(expr=expr_2grp, gmt=gmt_sim, design=design,
subject_name="Patient_ID", time_name="TimePoint",
time_func="linear", crossedRandom=FALSE,
group_name="group.var")
summary(tcgsa_sim_2grp)
plot(x=tcgsa_sim_2grp, expr=expr_2grp,
Subject_ID=design$Patient_ID, TimePoint=design$TimePoint,
B=100,
time_unit="H",
)
}