This function can plot different representations of the gene expression in a list of gene sets.
Usage
plotMultipleGS(
genesets_list,
ncolumns = 1,
labels = NULL,
expr,
gmt,
Subject_ID,
TimePoint,
baseline = NULL,
group.var = NULL,
Group_ID_paired = NULL,
ref = NULL,
group_of_interest = NULL,
FUNcluster = NULL,
clustering_metric = "euclidian",
clustering_method = "ward",
B = 500,
max_trends = 4,
aggreg.fun = "median",
na.rm.aggreg = TRUE,
trend.fun = "median",
methodOptiClust = "firstSEmax",
indiv = "genes",
verbose = TRUE,
clustering = TRUE,
showTrend = TRUE,
smooth = TRUE,
time_unit = "",
y.lab = NULL,
desc = TRUE,
lab.cex = 1,
axis.cex = 1,
main.cex = 1,
y.lab.angle = 90,
x.axis.angle = 45,
margins = 1,
line.size = 1,
y.lim = NULL,
x.lim = NULL,
gg.add = list(),
show_plot = TRUE
)
Arguments
- genesets_list
a list of the character strings giving the names of the gene sets to be plotted as they appear in
gmt
.- ncolumns
the number of columns used to display the multiple plots. Default is
1
.- labels
List of labels to be added to the plots. You can also set
labels="AUTO"
to auto-generate upper-case labels (such asA
,B
, ...) orlabels="auto"
to auto-generate lower-case labels. Default isNULL
- expr
either a matrix or dataframe of gene expression upon which dynamics are to be calculated, or a list of gene sets estimation of gene expression. In the case of a matrix or dataframe, its dimension are \(n\) x \(p\), with the \(p\) sample in column and the \(n\) genes in row. In the case of a list, its length should correspond to the number of gene sets under scrutiny and each element should be an 3 dimension array of estimated gene expression, such as for the list returned in the
'Estimations'
element ofTcGSA.LR
. See details.- gmt
a gmt object containing the gene sets definition. See
GSA.read.gmt
and definition on www.software.broadinstitute.org.- Subject_ID
a factor of length \(p\) that is in the same order as the columns of
expr
(when it is a dataframe) and that contains the patient identifier of each sample.- TimePoint
a numeric vector or a factor of length \(p\) that is in the same order as
Subject_ID
and the columns ofexpr
(when it is a dataframe), and that contains the time points at which gene expression was measured.- baseline
a character string which is the value of
TimePoint
that can be used as a baseline. Default isNULL
, in which case no time point is used as a baseline value for gene expression. Has to beNULL
when comparing two treatment groups.- group.var
in the case of several treatment groups, this is a factor of length \(p\) that is in the same order as
Timepoint
,Subject_ID
and the columns ofexpr
. It indicates to which treatment group each sample belongs to. Default isNULL
, which means that there is only one treatment group.- Group_ID_paired
a character vector of length \(p\) that is in the same order as
Timepoint
,Subject_ID
,group.var
and the columns ofexpr
. This argument must not beNULL
in the case of a paired analysis, and must beNULL
otherwise. Default isNULL
.- ref
the group which is used as reference in the case of several treatment groups. Default is
NULL
, which means that reference is the first group in alphabetical order of the labels ofgroup.var
. See Details.- group_of_interest
the group of interest, for which dynamics are to be computed in the case of several treatment groups. Default is
NULL
, which means that group of interest is the second group in alphabetical order of the labels ofgroup.var
.- FUNcluster
a function which accepts as first argument a matrix
x
and as second argument the number of clusters desiredk
, and which returns a list with a component named'cluster'
which is a vector of lengthn = nrow(x)
of integers in 1:k, determining the clustering or grouping of the n observations. Default isNULL
, in which case a hierarchical clustering is performed via the functionagnes
, using the metricclustering_metric
and the methodclustering_method
. See'FUNcluster'
inclusGap
and Details.- clustering_metric
character string specifying the metric to be used for calculating dissimilarities between observations in the hierarchical clustering when
FUNcluster
isNULL
. The currently available options are"euclidean"
and"manhattan"
. Default is"euclidean"
. Seeagnes
. Also, a"sts"
option is available in TcGSA. It implements the 'Short Time Series' distance [Moller-Levet et al., Fuzzy Clustering of short time series and unevenly distributed sampling points, Advances in Intelligent Data Analysis V:330-340 Springer, 2003] designed specifically for clustering time series.- clustering_method
character string defining the agglomerative method to be used in the hierarchical clustering when
FUNcluster
isNULL
. The six methods implemented are"average"
([unweighted pair-]group average method, UPGMA),"single"
(single linkage),"complete"
(complete linkage),"ward"
(Ward's method),"weighted"
(weighted average linkage). Default is"ward"
. Seeagnes
.- B
integer specifying the number of Monte Carlo ("bootstrap") samples used to compute the gap statistics. Default is
500
. SeeclusGap
.- max_trends
integer specifying the maximum number of different clusters to be tested. Default is
4
.- aggreg.fun
a character string such as
"median"
or"mean"
or the name of any other defined statistics function that returns a single numeric value. It specifies the function used to aggregate the observations before the clustering. Default is to"median"
.- na.rm.aggreg
a logical flag indicating whether
NA
should be remove to prevent propagation throughaggreg.fun
. Can be useful to set to TRUE with unbalanced design as those will generate structuralNA
s in$Estimations
. Default isTRUE
.- trend.fun
a character string such as
"mean"
or the name of any other function that returns a single numeric value. It specifies the function used to calculate the trends of the identified clustered. Default is to"mean"
.- methodOptiClust
character string indicating how the "optimal" number of clusters is computed from the gap statistics and their standard deviations. Possible values are
"globalmax"
,"firstmax"
,"Tibs2001SEmax"
,"firstSEmax"
and"globalSEmax"
. Default is"firstSEmax"
. See'method'
inclusGap
, Details and Tibshirani et al., 2001 in References.- indiv
a character string indicating by which unit observations are aggregated (through
aggreg.fun
) before the clustering. Possible values are"genes"
or"patients"
. Default is"genes"
. See Details.- verbose
logical flag enabling verbose messages to track the computing status of the function. Default is
TRUE
.- clustering
logical flag. If
FALSE
, there is no clustering representation; ifTRUE
, the lines are colored according to which cluster they belong to. Default isTRUE
. See Details.- showTrend
logical flag. If
TRUE
, a black line is added for each cluster, representing the correspondingtrend.fun
. Default isTRUE
.- smooth
logical flag. If
TRUE
andshowTrend
is alsoTRUE
, the representation of each clustertrend.fun
is smoothed using cubic polynomials (seegeom_smooth
. Default isTRUE
. At the moment, must accept parameter"na.rm"
(which is automatically set toTRUE
). This might change in future versions- time_unit
the time unit to be displayed (such as
"Y"
,"M"
,"W"
,"D"
,"H"
, etc) next to the values ofTimePoint
on the x-axis. Default is""
, in which case the time scale on the x-axis is proportional to the time values.- y.lab
character specifying the annotation of the y axis. If
NULL
, an annotation is automatically generated, if""
, no annotation appears. Default isNULL
.- desc
a logical flag. If
TRUE
, a line is added to the title of the plot with the description of the gene set plotted (from the gmt file). Default isTRUE
.- lab.cex
a numerical value giving the amount by which lab labels text should be magnified relative to the default
1
.- axis.cex
a numerical value giving the amount by which axis annotation text should be magnified relative to the default
1
.- main.cex
a numerical value giving the amount by which title text should be magnified relative to the default
1
.- y.lab.angle
a numerical value (in [0, 360]) giving the orientation by which y-label text should be turned (anti-clockwise). Default is
90
. Seeelement_text
.- x.axis.angle
a numerical value (in [0, 360]) giving the orientation by which x-axis annotation text should be turned (anti-clockwise). Default is
45
.- margins
a numerical value giving the amount by which the margins should be reduced or increased relative to the default
1
.- line.size
a numerical value giving the amount by which the line sizes should be reduced or increased relative to the default
1
.- y.lim
a numeric vector of length 2 giving the range of the y-axis. See
plot.default
.- x.lim
if numeric, will create a continuous scale, if factor or character, will create a discrete scale. Observations not in this range will be dropped. See
xlim
.- gg.add
A list of instructions to add to the
ggplot2
instructions. See+.gg
. Default islist(theme())
, which adds nothing to the plot.- show_plot
logical flag. If
FALSE
, no plot is drawn. Default isTRUE
.
Value
A list with 2 elements:
classif
: adata.frame
with the 2 following variables:ProbeID
which contains the IDs of the probes of the plotted gene set, andCluster
containing $ which cluster the probe belongs to. Ifclustering
isFALSE
, thenCluster
isNA
for all the probes.p
: aggplot
object containing the plot
Details
If expr
is a matrix or a dataframe, then the "original" data are
plotted. On the other hand, if expr
is a list returned in the
'Estimations'
element of TcGSA.LR
, then it is those
"estimations" made by the TcGSA.LR
function that are plotted.
If indiv
is 'genes', then each line of the plot is the median of a
gene expression over the patients. On the other hand, if indiv
is
'patients', then each line of the plot is the median of a patient genes
expression in this gene set.
This function uses the Gap statistics to determine the optimal number of
clusters in the plotted gene set. See
clusGap
.