This function clusters the genes dynamics of one gene sets into different
dominant trends. The optimal number of clusters is computed thanks to the gap
statistics. See clusGap
.
Usage
clustTrend(
tcgs,
expr,
Subject_ID,
TimePoint,
threshold = 0.05,
myproc = "BY",
nbsimu_pval = 1e+06,
baseline = NULL,
only.signif = TRUE,
group.var = NULL,
Group_ID_paired = NULL,
ref = NULL,
group_of_interest = NULL,
FUNcluster = NULL,
clustering_metric = "euclidian",
clustering_method = "ward",
B = 100,
max_trends = 4,
aggreg.fun = "median",
na.rm.aggreg = TRUE,
trend.fun = "median",
methodOptiClust = "firstSEmax",
indiv = "genes",
verbose = TRUE
)
# S3 method for ClusteredTrends
print(x, ...)
# S3 method for ClusteredTrends
plot(x, ...)
Arguments
- tcgs
a tcgsa object for
clustTrend
, or a ClusteredTrends object forprint.ClusteredTrends
andplot.ClusteredTrends
.- expr
either a matrix or dataframe of gene expression upon which dynamics are to be calculated, or a list of gene sets estimation of gene expression. In the case of a matrix or dataframe, its dimension are \(n\) x \(p\), with the \(p\) sample in column and the \(n\) genes in row. In the case of a list, its length should correspond to the number of gene sets under scrutiny and each element should be an 3 dimension array of estimated gene expression, such as for the list returned in the
'Estimations'
element ofTcGSA.LR
. See details.- Subject_ID
a factor of length \(p\) that is in the same order as the columns of
expr
(when it is a dataframe) and that contains the patient identifier of each sample.- TimePoint
a numeric vector or a factor of length \(p\) that is in the same order as
Subject_ID
and the columns ofexpr
(when it is a dataframe), and that contains the time points at which gene expression was measured.- threshold
the threshold at which the FDR or the FWER should be controlled.
- myproc
a vector of character strings containing the names of the multiple testing procedures for which adjusted p-values are to be computed. This vector should include any of the following: "
Bonferroni
", "Holm
", "Hochberg
", "SidakSS
", "SidakSD
", "BH
", "BY
", "ABH
", "TSBH
". Seemt.rawp2adjp
for details. Default is "BY
", the Benjamini & Yekutieli (2001) step-up FDR-controlling procedure (general dependency structures). In order to control the FWER(in case of an analysis that is more a hypothesis confirmation than an exploration of the expression data), we recommend to use "Holm
", the Holm (1979) step-down adjusted p-values for strong control of the FWER.- nbsimu_pval
the number of observations under the null distribution to be generated in order to compute the p-values. Default is
1e+06
.- baseline
a character string which is the value of
TimePoint
that can be used as a baseline. Default isNULL
, in which case no time point is used as a baseline value for gene expression. Has to beNULL
when comparing several treatment groups.- only.signif
logical flag for analyzing the trends in only the significant gene sets. If
FALSE
, all the gene sets from the gmt object contained inx
are clustered. Default isTRUE
.- group.var
in the case of several treatment groups, this is a factor of length \(p\) that is in the same order as
Timepoint
,Subject_ID
and the columns ofexpr
. It indicates to which treatment group each sample belongs to. Default isNULL
, which means that there is only one treatment group.- Group_ID_paired
a character vector of length \(p\) that is in the same order as
Timepoint
,Subject_ID
,group.var
and the columns ofexpr
. This argument must not beNULL
in the case of a paired analysis, and must beNULL
otherwise. Default isNULL
.- ref
the group which is used as reference in the case of several treatment groups. Default is
NULL
, which means that reference is the first group in alphabetical order of the labels ofgroup.var
.- group_of_interest
the group of interest, for which dynamics are to be computed in the case of several treatment groups. Default is
NULL
, which means that group of interest is the second group in alphabetical order of the labels ofgroup.var
.- FUNcluster
the clustering function used to agglomerate genes in trends. Default is
NULL
, in which a hierarchical clustering is performed via the functionagnes
, using the metricclustering_metric
and the methodclustering_method
. SeeclusGap
- clustering_metric
character string specifying the metric to be used for calculating dissimilarities between observations in the hierarchical clustering when
FUNcluster
isNULL
. The currently available options are"euclidean"
and"manhattan"
. Default is"euclidean"
. Seeagnes
. Also, a"sts"
option is available in TcGSA. It implements the 'Short Time Series' distance [Moller-Levet et al., Fuzzy Clustering of short time series and unevenly distributed sampling points, Advances in Intelligent Data Analysis V:330-340 Springer, 2003] designed specifically for clustering time series.- clustering_method
character string defining the agglomerative method to be used in the hierarchical clustering when
FUNcluster
isNULL
. The six methods implemented are"average"
([unweighted pair-]group average method, UPGMA),"single"
(single linkage),"complete"
(complete linkage),"ward"
(Ward's method),"weighted"
(weighted average linkage). Default is"ward"
. Seeagnes
.- B
integer specifying the number of Monte Carlo ("bootstrap") samples used to compute the gap statistics. Default is
500
. SeeclusGap
.- max_trends
integer specifying the maximum number of different clusters to be tested. Default is
4
.- aggreg.fun
a character string such as
"mean"
,"median"
or the name of any other defined statistics function that returns a single numeric value. It specifies the function used to aggregate the observations before the clustering. Default is tomedian
. Default is"median"
.- na.rm.aggreg
a logical flag indicating whether
NA
should be remove to prevent propagation throughaggreg.fun
. Can be useful to set to TRUE with unbalanced design as those will generate structuralNA
s in$Estimations
. Default isTRUE
.- trend.fun
a character string such as
"mean"
,"median"
or the name of any other function that returns a single numeric value. It specifies the function used to calculate the trends of the identified clustered. Default is tomedian
.- methodOptiClust
character string indicating how the "optimal" number of clusters is computed from the gap statistics and their standard deviations. Possible values are
"globalmax"
,"firstmax"
,"Tibs2001SEmax"
,"firstSEmax"
and"globalSEmax"
. Default is"firstSEmax"
. See'method'
inclusGap
, Details and Tibshirani et al., 2001 in References.- indiv
a character string indicating by which unit observations are aggregated (through
aggreg.fun
) before the clustering. Possible values are"genes"
or"patients"
. Default is"genes"
.- verbose
logical flag enabling verbose messages to track the computing status of the function. Default is
TRUE
.- x
an object of class '
ClusteredTrends
'.- ...
further arguments passed to or from other methods.
Value
An object of class ClusteredTrends which is a list with the 4 following components:
NbClust
a vector that contains the optimal number of clusters for each analyzed gene sets.ClustsMeds
a list of the same length asNsClust
(the number of analyzed gene sets). Each element of the list is a data frame, in which there is as many column as the optimal number of clusters for the corresponding gene sets for each cluster. Each column of the data frame contains the median trend values for the corresponding cluster.GenesPartition
a list of the same length asNsClust
(the number of analyzed gene sets). Each element of the list is a vector which gives the partition of the genes inside the corresponding gene set.MaxNbClust
an integer storing the maximum number of different clusters tested, as given by the argument'max_trends'
.
Details
If expr
is a matrix or a dataframe, then the genes dynamics are
clustered on the "original" data. On the other hand, if expr
is a
list returned in the 'Estimations'
element of TcGSA.LR
,
then the dynamics are computed on the estimations made by the
TcGSA.LR
function.
This function uses the Gap statistics to determine the optimal number of
clusters in the plotted gene set. See
clusGap
.
References
Tibshirani, R., Walther, G. and Hastie, T., 2001, Estimating the number of data clusters via the Gap statistic, Journal of the Royal Statistical Society, Series B (Statistical Methodology), 63, 2: 41--423.
Examples
if(interactive()){
data(data_simu_TcGSA)
tcgsa_sim_1grp <- TcGSA.LR(expr=expr_1grp, gmt=gmt_sim, design=design,
subject_name="Patient_ID", time_name="TimePoint",
time_func="linear", crossedRandom=FALSE)
CT <- clustTrend(tcgsa_sim_1grp,
expr=expr_1grp, Subject_ID=design$Subject_ID, TimePoint=design$TimePoint)
CT
plot(CT)
CT$NbClust
CT$NbClust["Gene set 5"]
CT$ClustMeds[["Gene set 4"]]
CT$ClustMeds[["Gene set 5"]]
}