CytOpT package

CytOpT.CytOpt module

CytOpT.CytOpt.CytOpT(xSource, xTarget, labSource, labTarget=None, thetaTrue=None, method=None, eps=0.0001, nIter=4000, power=0.99, stepGrad=10, step=5, lbd=0.0001, nItGrad=10000, nItSto=10, cont=True, monitoring=False, minMaxScaler=True, thresholding=True)[source]

CytOpT algorithm. This method is designed to estimate the proportions of cells in an unclassified cytometry data set denoted xTarget. CytOpT is a supervised method that leverages the classification denoted labSource associated with the flow cytometry data set xSource. The estimation relies on the resolution of an optimization problem. Two procedures are provided: “minmax” and “desasc”. We recommend using the default method, “minmax”.

Parameters

xSource – np.array of shape (n_samples_source, n_biomarkers). The source cytometry data set. A cytometry dataframe. The columns correspond to the different biological markers tracked. Each row corresponds to the cytometry measurements performed on one cell. The classification of this cytometry data set must be provided with the labSource parameter.
xTarget – np.array of shape (n_samples_target, n_biomarkers). The target cytometry data set. A cytometry dataframe. The columns correspond to the different biological markers tracked. Each row corresponds to the cytometry measurements performed on one cell. The CytOpT algorithm targets the cell type proportions in this cytometry data set.
labSource – np.array of shape (n_samples_source,). The classification of the source data set.
labTarget – np.array of shape (n_samples_target,), default=None. The classification of the target data set.
thetaTrue – np.array of shape (K,), default=None. This array stores the true proportions of the K types of cells estimated in the target data set. This parameter is required if the user enables the monitoring option.
method – {“minmax”, “desasc”, “both”}, default="minmax" (the signature above shows method=None). Method chosen to solve the optimization problem involved in CytOpT. It is advised to rely on the default choice, “minmax”.
eps – float, default=0.0001. Regularization parameter of the Wasserstein distance. This parameter must be positive.
nIter – int, default=4000. Number of iterations of the stochastic gradient ascent for the Minmax swapping optimization method.
power – float, default=0.99. Decreasing rate for the step-size policy of the stochastic gradient ascent for the Minmax swapping optimization method. The step-size decreases at a rate of 1/n^power.
stepGrad – float, default=10. Constant step size for the gradient descent of the descent-ascent optimization strategy.
step – float, default=5. Multiplication factor of the stochastic gradient ascent step-size policy for the minmax optimization method.
lbd – float, default=0.0001. Additional regularization parameter of the Minmax swapping optimization method. This parameter lbd should be greater than or equal to eps.
nItGrad – int, default=10000. Number of iterations of the outer loop of the descent-ascent optimization method. This loop corresponds to the descent part of descent-ascent strategy.
nItSto – int, default = 10. Number of iterations of the inner loop of the descent-ascent optimization method. This loop corresponds to the stochastic ascent part of this optimization procedure.
cont – bool, default=True. When set to true, the progress is displayed.
monitoring – bool, default=False. When set to True, the evolution of the Kullback-Leibler divergence between the estimated proportions and the benchmark proportions is tracked and stored.
minMaxScaler – bool, default=True. When set to True, the source and target data sets are scaled to [0,1]^d, where d is the number of biomarkers monitored.
thresholding – bool, default=True. When set to True, all the coefficients of the source and target data sets are replaced by their positive part. This preprocessing is relevant for cytometry data, as the signal acquisition of the cytometer can induce contrived negative values.
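
The thresholding and minMaxScaler options can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the helper names `preprocess` and `min_max` are hypothetical, and scaling each data set independently, column by column, is an assumption.

```python
import numpy as np

def min_max(x):
    """Scale each column of x to [0, 1] (hypothetical helper)."""
    lo = x.min(axis=0)
    span = x.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid dividing by zero for constant columns
    return (x - lo) / span

def preprocess(x_source, x_target, thresholding=True, min_max_scaler=True):
    """Sketch of the documented preprocessing options."""
    x_source = np.asarray(x_source, dtype=float)
    x_target = np.asarray(x_target, dtype=float)
    if thresholding:
        # Replace every coefficient by its positive part.
        x_source = np.maximum(x_source, 0.0)
        x_target = np.maximum(x_target, 0.0)
    if min_max_scaler:
        # Assumption: each data set is scaled on its own.
        x_source = min_max(x_source)
        x_target = min_max(x_target)
    return x_source, x_target
```

After this step every coefficient lies in [0, 1], so squared distances between cells are bounded by the number of biomarkers d.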

Returns

hat_theta : np.array of shape (K,), where K is the number of distinct cell populations in the source data set.
KL_monitoring : np.array of shape (n_out,) or (nIter,) depending on the choice of the optimization method. This array stores the evolution of the Kullback-Leibler divergence between the estimated and benchmark proportions, if monitoring==True.
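
As an illustration of what the monitoring array holds, the following sketch computes a Kullback-Leibler divergence between two proportion vectors and applies it to a sequence of estimates. The function names are hypothetical, and the direction of the divergence (estimate first, benchmark second) is an assumption; the source does not state which convention the library uses.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same K classes."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_k = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def kl_trace(estimates, theta_true):
    """One divergence value per stored estimate (hypothetical helper)."""
    return np.array([kl_divergence(theta, theta_true) for theta in estimates])
```

A trace that decreases towards zero indicates that the successive estimates approach the benchmark proportions.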

Reference: Paul Freulon, Jérémie Bigot, and Boris P. Hejblum. CytOpT: Optimal Transport with Domain Adaptation for Interpreting Flow Cytometry data, arXiv:2006.09003 [stat.AP].

CytOpT.descentAscent module

CytOpT.descentAscent.cytoptDesasc(xSource, xTarget, labSource, eps=1, nItGrad=4000, nItSto=10, stepGrad=0.001, cont=True, thetaTrue=None, monitoring=True, thresholding=True, minMaxScaler=True)[source]

CytOpT algorithm. This method is designed to estimate the proportions of cells in an unclassified cytometry data set denoted xTarget. CytOpT is a supervised method that leverages the classification denoted labSource associated with the flow cytometry data set xSource. The estimation relies on the resolution of an optimization problem. The optimization problem of this function is solved with a descent-ascent optimization procedure.

Parameters

xSource – np.array of shape (n_samples_source, n_biomarkers). The source cytometry data set.
xTarget – np.array of shape (n_samples_target, n_biomarkers). The target cytometry data set.
labSource – np.array of shape (n_samples_source,). The classification of the source data set.
eps – float, default=1. Regularization parameter of the Wasserstein distance. This parameter must be positive.
nItGrad – int, default=4000. Number of iterations of the outer loop of the descent-ascent optimization method. This loop corresponds to the descent part of the descent-ascent strategy.
nItSto – int, default = 10. Number of iterations of the inner loop of the descent-ascent optimization method. This loop corresponds to the stochastic ascent part of this optimization procedure.
stepGrad – float, default=0.001. Constant step size for the gradient descent of the descent-ascent optimization strategy.
cont – bool, default=True. When set to true, the progress is displayed.
thetaTrue – np.array of shape (K,), default=None. This array stores the true proportions of the K types of cells estimated in the target data set. This parameter is required if the user enables the monitoring option.
monitoring – bool, default=True. When set to True, the evolution of the Kullback-Leibler divergence between the estimated proportions and the benchmark proportions is tracked and stored.
minMaxScaler – bool, default=True. When set to True, the source and target data sets are scaled to [0,1]^d, where d is the number of biomarkers monitored.
thresholding – bool, default=True. When set to True, all the coefficients of the source and target data sets are replaced by their positive part. This preprocessing is relevant for cytometry data, as the signal acquisition of the cytometer can induce contrived negative values.

Returns

hat_theta - np.array of shape (K,), where K is the number of distinct cell populations in the source data set.
KLStorage - np.array of shape (n_out,). This array stores the evolution of the Kullback-Leibler divergence between the estimated and benchmark proportions, if monitoring==True.

CytOpT.minmaxSwapping module

CytOpT.minmaxSwapping.cytoptMinmax(xSource, xTarget, labSource, eps=0.0001, lbd=0.0001, nIter=4000, cont=True, step=5, power=0.99, thetaTrue=None, monitoring=False, thresholding=True, minMaxScaler=True)[source]

CytOpT algorithm. This method is designed to estimate the proportions of cells in an unclassified cytometry data set denoted xTarget. CytOpT is a supervised method that leverages the classification denoted labSource associated with the flow cytometry data set xSource. The estimation relies on the resolution of an optimization problem. The optimization problem of this function involves an additional regularization term lambda (the lbd parameter). This regularization allows the application of a simple stochastic gradient ascent to solve the optimization problem. We advocate the use of this method as it is faster than cytoptDesasc.

Parameters

xSource – np.array of shape (n_samples_source, n_biomarkers). The source cytometry data set.
xTarget – np.array of shape (n_samples_target, n_biomarkers). The target cytometry data set.
labSource – np.array of shape (n_samples_source,). The classification of the source data set.
eps – float, default=0.0001. Regularization parameter of the Wasserstein distance. This parameter must be positive.
lbd – float, default=0.0001. Additional regularization parameter of the Minmax swapping optimization method. This parameter lbd should be greater than or equal to eps.
nIter – int, default=4000. Number of iterations of the stochastic gradient ascent.
cont – bool, default=True. When set to true, the progress is displayed.
step – float, default=5. Multiplication factor of the stochastic gradient ascent step-size policy for the minmax optimization method.
power – float, default=0.99. Decreasing rate for the step-size policy of the stochastic gradient ascent for the Minmax swapping optimization method. The step-size decreases at a rate of 1/n^power.
thetaTrue – np.array of shape (K,), default=None. This array stores the true proportions of the K types of cells estimated in the target data set. This parameter is required if the user enables the monitoring option.
monitoring – bool, default=False. When set to True, the evolution of the Kullback-Leibler divergence between the estimated proportions and the benchmark proportions is tracked and stored.
minMaxScaler – bool, default=True. When set to True, the source and target data sets are scaled to [0,1]^d, where d is the number of biomarkers monitored.
thresholding – bool, default=True. When set to True, all the coefficients of the source and target data sets are replaced by their positive part. This preprocessing is relevant for cytometry data, as the signal acquisition of the cytometer can induce contrived negative values.
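
The step and power parameters describe a decreasing step-size policy. A minimal sketch, assuming the step at iteration n is exactly step / n^power (the source only states the multiplication factor and the decay rate, so the precise formula is an assumption, and `step_sizes` is a hypothetical name):

```python
import numpy as np

def step_sizes(n_iter, step=5.0, power=0.99):
    """Step size used at iterations n = 1, ..., n_iter."""
    n = np.arange(1, n_iter + 1)
    return step / n ** power
```

With the defaults, the first step equals 5 and the sequence decays slightly more slowly than 1/n.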

Returns

hat_theta - np.array of shape (K,), where K is the number of distinct cell populations in the source data set.
KL_storage - np.array of shape (nIter,). This array stores the evolution of the Kullback-Leibler divergence between the estimated and benchmark proportions, if monitoring==True and thetaTrue is provided.

Reference: Paul Freulon, Jérémie Bigot, and Boris P. Hejblum. CytOpT: Optimal Transport with Domain Adaptation for Interpreting Flow Cytometry data, arXiv:2006.09003 [stat.AP].

CytOpT.labelPropSto module

CytOpT.labelPropSto.cTransform(f, xSource, xTarget, j, beta, eps=0.1)[source]

Calculate the c-transform of f in the non-regularized case if eps=0. Otherwise, compute the smooth c-transform with respect to the usual entropy.

Parameters

f – np.array of shape (n_obs_source,). The optimal dual vector associated with the source distribution. Here, the Wasserstein distance is computed between the distribution with weights alpha and support xSource and the distribution with weights beta and support xTarget.
xSource – np.array of shape (n_obs_source, dimension). Support of the source distribution.
xTarget – np.array of shape (n_obs_target, dimension). Support of the target distribution.
j –
beta – np.array of shape (n_obs_target,). Weights of the target distribution.
eps – float, default=0.1. Regularization parameter of the Wasserstein distance. This parameter should be greater than 0.

Returns
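
As a hedged illustration of the two cases described above, the sketch below uses the textbook semi-dual formulation: the hard c-transform is min_i (c(x_i, y) - f_i), and the smooth one is -eps * log(sum_i alpha_i * exp((f_i - c(x_i, y)) / eps)). It is not the library's code. In particular, it takes the target point y directly and weights by the source weights alpha, whereas the documented signature takes an index j and the weights beta; both choices here are assumptions.

```python
import numpy as np

def c_transform(f, x_source, y, alpha, eps=0.1):
    """Hard (eps == 0) or entropy-smoothed c-transform of f at the point y."""
    c = np.sum((x_source - y) ** 2, axis=1)  # squared Euclidean costs
    if eps == 0:
        return float(np.min(c - f))
    z = (f - c) / eps
    z_max = z.max()  # log-sum-exp shift for numerical stability
    return float(-eps * (z_max + np.log(np.sum(alpha * np.exp(z - z_max)))))
```

As eps tends to 0 the smooth value approaches the hard minimum, which is why a small eps approximates the unregularized Wasserstein distance.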

CytOpT.labelPropSto.cost(xSource, y)[source]

Squared Euclidean distance between y and the I points of xSource.

Parameters

xSource – np.array of shape (n_obs_source, dimension). Support of the source distribution.
y –

Returns
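
A minimal NumPy sketch of the quantity described above, assuming y is a single point of shape (dimension,); the function name `squared_distances` is hypothetical.

```python
import numpy as np

def squared_distances(x_source, y):
    """Squared Euclidean distance between y and each row of x_source."""
    x_source = np.asarray(x_source, dtype=float)
    y = np.asarray(y, dtype=float)
    # Broadcasting subtracts y from every row; the sum runs over coordinates.
    return np.sum((x_source - y) ** 2, axis=1)
```

The result has one entry per source observation, that is, shape (n_obs_source,).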

CytOpT.labelPropSto.gradH(f, xSource, y, alpha, eps=0.1)[source]

This function calculates the gradient of the function that we aim to maximize. The expectation of this function computed at a maximizer equals the Wasserstein distance, or its regularized counterpart.

Parameters

f – np.array of shape (n_obs_source,). The optimal dual vector associated with the source distribution. Here, the Wasserstein distance is computed between the distribution with weights alpha and support xSource and the distribution with weights beta and the target support.
xSource – np.array of shape (n_obs_source, dimension). Support of the source distribution.
y –
alpha – np.array of shape (n_obs_source,). Weights of the source distribution.
eps – float, default=0.1. Regularization parameter of the Wasserstein distance. This parameter should be greater than 0.

Returns
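
The following sketch shows one standard form of this gradient for eps > 0. It assumes the textbook semi-dual objective h(f, y) = <f, alpha> + f^{c,eps}(y) - eps, whose gradient in f is alpha - pi, where pi is a softmax-type weighting of the source points. Neither the objective nor the helper names are taken from the library's source.

```python
import numpy as np

def h_value(f, x_source, y, alpha, eps=0.1):
    """Assumed semi-dual integrand: <f, alpha> + smooth c-transform - eps."""
    c = np.sum((x_source - y) ** 2, axis=1)
    z = (f - c) / eps
    z_max = z.max()
    smooth_c = -eps * (z_max + np.log(np.sum(alpha * np.exp(z - z_max))))
    return float(f @ alpha + smooth_c - eps)

def grad_h(f, x_source, y, alpha, eps=0.1):
    """Gradient of h_value with respect to f."""
    c = np.sum((x_source - y) ** 2, axis=1)
    z = (f - c) / eps
    w = alpha * np.exp(z - z.max())  # shifted for numerical stability
    pi = w / w.sum()                 # sums to 1
    return alpha - pi
```

Because alpha and pi both sum to 1, the gradient always sums to 0, so the potential f is only determined up to an additive constant.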

CytOpT.labelPropSto.hFunction(f, xSource, xTarget, j, alpha, beta, eps=0.1)[source]

Calculate the function h whose expectation equals the semi-dual loss. Maximizing the semi-dual loss allows us to compute the Wasserstein distance.

Parameters

f – np.array of shape (n_obs_source,). The optimal dual vector associated with the source distribution. Here, the Wasserstein distance is computed between the distribution with weights alpha and support xSource and the distribution with weights beta and support xTarget.
xSource – np.array of shape (n_obs_source, dimension). Support of the source distribution.
xTarget – np.array of shape (n_obs_target, dimension). Support of the target distribution.
j –
beta – np.array of shape (n_obs_target,). Weights of the target distribution.
eps – float, default=0.1. Regularization parameter of the Wasserstein distance. This parameter should be greater than 0.
alpha – np.array of shape (n_obs_source,). Weights of the source distribution.

Returns

CytOpT.labelPropSto.labelPropSto(labSource, f, X, Y, alpha, beta, eps=0.0001, cont=True)[source]

Function that calculates a classification of the target data with an optimal-transport-based soft assignment. For optimal results, the source distribution must be re-weighted using the estimated class proportions in the target data set. This estimation can be produced with the CytOpT function. To compute an optimal dual vector f associated with the source distribution, we advocate the use of the robbinsWass function with a CytOpT re-weighting of the source distribution.

Parameters

labSource – np.array of shape (X.shape[0],). The labels associated with the source data set X.
f – np.array of shape (X.shape[0],). The optimal dual vector associated with the source distribution. Here, the Wasserstein distance is computed between the distribution with weights alpha and support X and the distribution with weights beta and support Y.
X – np.array of shape (n_obs_source, dimension). The support of the source distribution.
Y – np.array of shape (n_obs_target, dimension). The support of the target distribution.
alpha – np.array of shape (n_obs_source,). The weights of the source distribution.
beta – np.array of shape (n_obs_target,). The weights of the target distribution.
eps – float, default=0.0001. The regularization parameter of the Wasserstein distance.
cont – bool, default=True. When set to true, the progress is displayed.

Returns

labTarget - np.array of shape (K, n_obs_target), where K is the number of distinct cell populations in the source data set. The coefficient labTarget[k,j] corresponds to the probability that the observation Y[j] belongs to the class k.
clustarget - np.array of shape (n_obs_target,). This array stores the optimal-transport-based classification of the target data set.
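
The re-weighting and the soft assignment described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `class_weights` and `soft_labels` are hypothetical names, each source point of class k is assumed to receive the weight theta_k / n_k, and the conditional weights are assumed to be proportional to alpha_i * exp((f_i - c(x_i, y_j)) / eps).

```python
import numpy as np

def class_weights(lab_source, theta):
    """Source weights alpha_i = theta_k / n_k for a point i of class k."""
    _, inverse, counts = np.unique(lab_source, return_inverse=True,
                                   return_counts=True)
    return (np.asarray(theta, dtype=float) / counts)[inverse]

def soft_labels(lab_source, f, x_source, x_target, alpha, eps=1e-4):
    """Class probabilities (K, n_target) and hard labels (n_target,)."""
    classes, inverse = np.unique(lab_source, return_inverse=True)
    # Cost matrix of shape (n_source, n_target).
    c = ((x_source[:, None, :] - x_target[None, :, :]) ** 2).sum(axis=2)
    z = (f[:, None] - c) / eps
    z -= z.max(axis=0, keepdims=True)  # stabilise the exponentials
    w = alpha[:, None] * np.exp(z)
    pi = w / w.sum(axis=0, keepdims=True)  # each column sums to 1
    # One-hot class membership of the source points, shape (K, n_source).
    onehot = (inverse[None, :] == np.arange(len(classes))[:, None]).astype(float)
    lab_target = onehot @ pi
    clus_target = classes[np.argmax(lab_target, axis=0)]
    return lab_target, clus_target
```

Each column of the first output is a probability vector over the K classes; the second output keeps the class with the highest probability for each target observation.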

CytOpT.labelPropSto.robbinsWass(xSource, xTarget, alpha, beta, eps=0.1, nIter=10000)[source]

Function that calculates the approximation of the optimal dual vector associated to the source distribution. The regularized optimal-transport problem is computed between a distribution with support xSource and weights alpha, and a distribution with support xTarget and weights beta. This function solves the semi-dual formulation of the regularized OT problem with the stochastic algorithm of Robbins-Monro.

Parameters

xSource – np.array of shape (n_obs_source, dimension). Support of the source distribution.
xTarget – np.array of shape (n_obs_target, dimension). Support of the target distribution.
alpha – np.array of shape (n_obs_source,). Weights of the source distribution.
beta – np.array of shape (n_obs_target,). Weights of the target distribution.
eps – float, default=0.1. Regularization parameter of the Wasserstein distance. This parameter should be greater than 0.
nIter – int, default=10000. Number of iterations of the Robbins-Monro algorithm.

Returns

f - np.array of shape (n_obs_source,). Optimal Kantorovich potential associated with the source distribution.
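
A Robbins-Monro scheme for the semi-dual problem can be sketched as below. Several choices are assumptions made for illustration and are not documented in the source: the zero initialisation, the step size gamma / sqrt(n), the `seed` argument, and the gradient alpha - pi of the textbook semi-dual objective. The function name is hypothetical.

```python
import numpy as np

def robbins_monro_potential(x_source, x_target, alpha, beta,
                            eps=0.1, n_iter=10000, gamma=1.0, seed=0):
    """Stochastic gradient ascent on the regularized semi-dual OT problem."""
    rng = np.random.default_rng(seed)
    f = np.zeros(len(x_source))
    for n in range(1, n_iter + 1):
        # Draw one target observation according to the weights beta.
        j = rng.choice(len(x_target), p=beta)
        c = np.sum((x_source - x_target[j]) ** 2, axis=1)
        z = (f - c) / eps
        w = alpha * np.exp(z - z.max())
        # Ascent step along the stochastic gradient alpha - pi.
        f += gamma / np.sqrt(n) * (alpha - w / w.sum())
    return f
```

Each iteration touches a single target observation, so the cost per step is linear in the number of source points.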

CytOpT.plots module

CytOpT.plots.BlandAltman(proportions, Class=None, Center=None)[source]

Function to display a Bland-Altman plot in order to visually assess the agreement between the CytOpT estimation of the class proportions and the estimate of the class proportions provided through manual gating.

Parameters

proportions – data.frame of true and estimated proportions from CytOpT().
Class – Population classes.
Center – Center of class population.
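
The quantities usually shown on a Bland-Altman plot can be computed as in the sketch below: the per-class mean of the two measurements, their difference, the mean difference (bias), and the 95% limits of agreement. This is the standard construction, not the library's plotting code; the function name and the direction of the difference (estimate minus reference) are assumptions.

```python
import numpy as np

def bland_altman_stats(estimate, reference):
    """Return (mean, diff, bias, (lower, upper)) for two measurement vectors."""
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mean = (estimate + reference) / 2.0  # x-axis of the plot
    diff = estimate - reference          # y-axis of the plot
    bias = diff.mean()
    sd = diff.std(ddof=1)                # sample standard deviation
    return mean, diff, bias, (bias - 1.96 * sd, bias + 1.96 * sd)
```

Good agreement corresponds to a bias close to zero and to differences that stay within the limits of agreement.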

CytOpT.plots.KLPlot(monitoring, n0=10, nStop=10000, title='Kullback-Liebler divergence trace')[source]

Function to display the trace of the Kullback-Leibler divergence stored in monitoring, between iterations n0 and nStop.

Parameters

monitoring – list of monitoring estimates from the CytOpT() output.
n0 – first iteration to plot. Default is 10.
nStop – last iteration to plot. Default is 10000.
title – plot title. Default is Kullback-Liebler divergence trace.

Returns

CytOpT.plots.barPlot(proportions, Class=None, title='CytOpt estimation and Manual estimation')[source]

Function to display a bar plot in order to visually compare the CytOpT estimation of the class proportions with the estimate of the class proportions provided through manual gating.

Parameters

proportions – proportions data.frame of true and proportion estimates from CytOpt() and
Class – Population classes
title – plot title. Default is 'CytOpt estimation and Manual estimation'.

CytOpT.plots.resultPlot(results, Class=None, n0=10, nStop=1000)[source]

Function to display a graph to visually assess the agreement between the CytOpT estimate of the class proportions and the estimate provided by manual gating, together with the monitoring of the CytOpT estimates over the iterations against the manual-gating benchmark.

Parameters

results – a list containing the data.frame of true and estimated proportions from CytOpT() and the data.frame of monitoring estimates from the CytOpT() output.
Class – Population classes
n0 – first iteration to plot. Default is 10.
nStop – last iteration to plot. Default is 1000.