Title: | Stable Specification Search in Structural Equation Models |
---|---|
Description: | An exploratory and heuristic approach for specification search in Structural Equation Modeling. The basic idea is to subsample the original data and then search for optimal models on each subset. Optimality is defined through two objectives: model fit and parsimony. As these objectives are conflicting, we apply a multi-objective optimization methods, specifically NSGA-II, to obtain optimal models for the whole range of model complexities. From these optimal models, we consider only the relevant model specifications (structures), i.e., those that are both stable (occur frequently) and parsimonious and use those to infer a causal model. |
Authors: | Ridho Rahmadi [aut, cre], Perry Groot [aut, ths], Tom Heskes [aut, ths], Christoph Stich [ctb] |
Maintainer: | Ridho Rahmadi <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.0 |
Built: | 2025-02-21 05:00:42 UTC |
Source: | https://github.com/rahmarid/stablespec |
A data set of 500 intances, generated from a network consisting of six continuous variables.
crossdata6V
crossdata6V
A data frame with six continuous variables: x1,...,x6
.
Reshape longitudinal data with t
time slices into a longitudinal
data with two time slices.
dataReshape(theData = NULL, numTime = NULL)
dataReshape(theData = NULL, numTime = NULL)
theData |
a data frame containing longitudinal data to which the model will be fit. |
numTime |
number of time slices. |
A data frame representing longitudinal data with
two time slices, such that the first n
data points contain the
relations that occur in the first two time slices
t_0
and t_1
. The next n
data points contain the
relations that occur in time slices t_1
and t_2
.
The i-th
subset of n
data points contain the relations
in time slices t_i-1
and t_i
. The reshaped data can be used
as data input for function stableSpec
when computing longitudinal data.
Ridho Rahmadi [email protected]
the_data <- longiData4V3T num_time <- 3 reshaped_the_data <- dataReshape(the_data, num_time)
the_data <- longiData4V3T num_time <- 3 reshaped_the_data <- dataReshape(the_data, num_time)
Compute the model chi-square
and model complexity
of the given SEM models.
getModelFitness(theData = NULL, allModelString = NULL, numTime = NULL, longitudinal = NULL, co = NULL, mixture = NULL)
getModelFitness(theData = NULL, allModelString = NULL, numTime = NULL, longitudinal = NULL, co = NULL, mixture = NULL)
theData |
a data frame containing the data to which the model is to
be fit. If parameter |
allModelString |
|
numTime |
number of time slices. If the data is cross-sectional, this argument must be set to 1. |
longitudinal |
|
co |
whether to use |
mixture |
if the data contains both continuous and
categorical (or ordinal) variables, this argument can be set
to |
a matrix
of models including their fitness':
chi-square
and model complexity.
Ridho Rahmadi [email protected]
the_data <- crossdata6V #assummed that variable 5 does not cause variables 1, 2, and 3 models <- modelPop(nPop=5, numVar=6, longitudinal=FALSE, consMatrix = matrix(c(5, 1, 5, 2, 5, 3), 3, 2, byrow=TRUE)) model_fitness <- getModelFitness(theData=the_data, allModelString=models, numTime=1, longitudinal=FALSE, co="covariance", mixture = FALSE)
the_data <- crossdata6V #assummed that variable 5 does not cause variables 1, 2, and 3 models <- modelPop(nPop=5, numVar=6, longitudinal=FALSE, consMatrix = matrix(c(5, 1, 5, 2, 5, 3), 3, 2, byrow=TRUE)) model_fitness <- getModelFitness(theData=the_data, allModelString=models, numTime=1, longitudinal=FALSE, co="covariance", mixture = FALSE)
A data set of 400 intances, that is generated from a network consisting of
four continuous variables and three time slices t_0,...,t_2
.
longiData4V3T
longiData4V3T
A data frame with twelve continuous variables:
x1,...,x4
are for time slice t_0
,
x5,...,x8
are for time slice t_1
, and
x9,...,x12
are for time slice t_2
Generating recursive (acyclic) SEM models represented by binary vectors.
modelPop(nPop = NULL, numVar = NULL, longitudinal = NULL, consMatrix = NULL)
modelPop(nPop = NULL, numVar = NULL, longitudinal = NULL, consMatrix = NULL)
nPop |
number of models to generate or population size. |
numVar |
number of variables. |
longitudinal |
|
consMatrix |
|
This function generates nPop
random SEM models which are
represented by binary vectors; 1 means there is a causal path from,
e.g., variable A
to B
and 0 otherwise. In addition, the generated models
have passed the cyclic test to ensure they are all acyclic. The function
also includes minPop
models which representing models
from each model complexity, i.e., minPop = numVar(numVar-1)/2+1
,
if longitudinal = FALSE
, or
minPop = (numVar(numVar-1)/2+1)+numVar^2
, otherwise.
If nPop <= minPop
then
this function will generate minPop
models.
nPop
or minPop
by m
matrix
,
where m
is the length of the binary vector depending
of the given number of variables
and also whether longitudinal or cross-sectional model.
Ridho Rahmadi [email protected]
#assumming a prior knowledge that variable 1 does not cause variable 2 models <- modelPop(nPop=25, numVar=6, longitudinal=FALSE, consMatrix = matrix(c(1, 2), 1, 2)) models
#assumming a prior knowledge that variable 1 does not cause variable 2 models <- modelPop(nPop=25, numVar=6, longitudinal=FALSE, consMatrix = matrix(c(1, 2), 1, 2)) models
Plot each of the stability of causal path and edge including the threshold of stability and model complexity.
plotStability(listOfFronts = NULL, threshold = NULL, stableCausal = NULL, stableCausal_l1 = NULL, stableEdge = NULL, longitudinal = NULL)
plotStability(listOfFronts = NULL, threshold = NULL, stableCausal = NULL, stableCausal_l1 = NULL, stableEdge = NULL, longitudinal = NULL)
listOfFronts |
|
threshold |
threshold of stability selection. The default is 0.6. |
stableCausal |
|
stableCausal_l1 |
|
stableEdge |
|
longitudinal |
|
Plot of causal path and edge stability for every pair of variables, including plots of all edge stabilites and all cauasl path stabilities.
Ridho Rahmadi [email protected]
the_data <- crossdata6V numSubset <- 1 num_iteration <- 5 num_pop <- 10 mut_rate <- 0.075 cross_rate <- 0.85 longi <- FALSE num_time <- 1 the_co <- "covariance" #assummed that variable 5 does not cause variables 1, 2, and 3 cons_matrix <- matrix(c(5, 1, 5, 2, 5, 3), 3, 2, byrow=TRUE) th <- 0.1 to_plot <- FALSE result <- stableSpec(theData=the_data, nSubset=numSubset, iteration=num_iteration, nPop=num_pop, mutRate=mut_rate, crossRate=cross_rate, longitudinal=longi, numTime=num_time, co=the_co, consMatrix=cons_matrix, threshold=th, toPlot=to_plot) plotStability(listOfFronts=result$listOfFronts, threshold=th, stableCausal=result$causalStab, stableCausal_l1=result$causalStab_l1, stableEdge=result$edgeStab, longitudinal=longi)
the_data <- crossdata6V numSubset <- 1 num_iteration <- 5 num_pop <- 10 mut_rate <- 0.075 cross_rate <- 0.85 longi <- FALSE num_time <- 1 the_co <- "covariance" #assummed that variable 5 does not cause variables 1, 2, and 3 cons_matrix <- matrix(c(5, 1, 5, 2, 5, 3), 3, 2, byrow=TRUE) th <- 0.1 to_plot <- FALSE result <- stableSpec(theData=the_data, nSubset=numSubset, iteration=num_iteration, nPop=num_pop, mutRate=mut_rate, crossRate=cross_rate, longitudinal=longi, numTime=num_time, co=the_co, consMatrix=cons_matrix, threshold=th, toPlot=to_plot) plotStability(listOfFronts=result$listOfFronts, threshold=th, stableCausal=result$causalStab, stableCausal_l1=result$causalStab_l1, stableEdge=result$edgeStab, longitudinal=longi)
Repairing a SEM model that is cyclic.
repairCyclicModel(stringModel = NULL, numVar = NULL, longitudinal = NULL)
repairCyclicModel(stringModel = NULL, numVar = NULL, longitudinal = NULL)
stringModel |
binary vector with length
|
numVar |
number of variables. |
longitudinal |
|
The main idea of this function is to seek cyclic(s) with
any possible length from a given model, and then to cut the cyclic,
so as to make the model acyclic. Moreover, this function is used in
stableSpec
to ensure no cyclic model in the computation.
a binary vector with the same length of input, representing a repaired model (acyclic).
Ridho Rahmadi [email protected]
num_vars <- 6 longi_a <- FALSE longi_b <- TRUE # Assume that the generated model below is cyclic # a cross-sectional model model_a <- round(runif(num_vars * num_vars)) # a longitudinal model model_b <- c(round(runif(num_vars * num_vars)), round(runif(num_vars * (num_vars-1)))) repaired_model_a <- repairCyclicModel(stringModel=model_a, numVar=num_vars, longitudinal=longi_a) repaired_model_b <- repairCyclicModel(stringModel=model_b, numVar=num_vars, longitudinal=longi_b) repaired_model_a repaired_model_b
num_vars <- 6 longi_a <- FALSE longi_b <- TRUE # Assume that the generated model below is cyclic # a cross-sectional model model_a <- round(runif(num_vars * num_vars)) # a longitudinal model model_b <- c(round(runif(num_vars * num_vars)), round(runif(num_vars * (num_vars-1)))) repaired_model_a <- repairCyclicModel(stringModel=model_a, numVar=num_vars, longitudinal=longi_a) repaired_model_b <- repairCyclicModel(stringModel=model_b, numVar=num_vars, longitudinal=longi_b) repaired_model_a repaired_model_b
Search stable specifications (structures) of constrained structural equation models.
stableSpec(theData = NULL, nSubset = NULL, iteration = NULL, nPop = NULL, mutRate = NULL, crossRate = NULL, longitudinal = NULL, numTime = NULL, seed = NULL, co = NULL, consMatrix = NULL, threshold = NULL, toPlot = NULL, mixture = NULL, log = NULL)
stableSpec(theData = NULL, nSubset = NULL, iteration = NULL, nPop = NULL, mutRate = NULL, crossRate = NULL, longitudinal = NULL, numTime = NULL, seed = NULL, co = NULL, consMatrix = NULL, threshold = NULL, toPlot = NULL, mixture = NULL, log = NULL)
theData |
a data frame containing the data to which the model will be
be fit. If argument |
nSubset |
number of subsets to draw. In practice, it is suggested to have at least 25 subsets. The default is 10. |
iteration |
number of iterations/generations for NSGA-II. The default is 20. |
nPop |
population size (number of models) in a generation. The default is 50. |
mutRate |
mutation rate. The default is 0.075. |
crossRate |
crossover rate. The default is 0.85. |
longitudinal |
|
numTime |
number of time slices. If the data is cross-sectional, this argument must be set to 1. |
seed |
integer vector representing seeds that are used to subsample data.
The default is an integer vector with range |
co |
whether to use |
consMatrix |
|
threshold |
threshold of stability selection. The default is 0.6. |
toPlot |
if |
mixture |
if the data contains both continuous and
categorical (or ordinal) variables, this argument can be set
to |
log |
an optional logfile to monitor the progress of the algorithm. |
This function performs exploratory search over recursive (acyclic) SEM models. Models are scored along two objectives: the model fit and the model complexity. Since both objectives are often conflicting we use NSGA-II to search for Pareto optimal models. To handle the instability of small finite data samples, we repeatedly subsample the data and select those substructures that are both stable and parsimonious which are then used to infer a causal model.
a list of the following elements:
listofFronts
is a list
of optimal models for
the whole range of model complexity of all subsets.
causalStab
is a list
of causal path stability
for the whole range of model complexity
causalStab_l1
is a list
of
causal path stability of length 1
for the whole range of model complexity
edgeStab
is a list
of edge stability
for the whole range of mdoel complexity
relCausalPath
is n by n
matrix
of
relevant causal path,
where n
is the number of variables. Each positive element
i,j
represents the stability of causal path
from i
to j
.
relCausalPath_l1
is n by n
matrix
of relevant causal path with length 1, where n
is
the number of variables. Each positive element i,j
represents the stability of causal path
from i
to j
with length 1.
relEdge
is n by n
matrix
of relevant edge,
where n
is the number of variables. Each positive element
i,j
represents the stability of edge
between i
to j
.
If argument toPlot = TRUE
, then a visualization of relevant
model structures is generated. Otherwise an object of graph is returned.
An arc represents a causal path, and an (undirected)
edge represents strong association where the direction is undecidable. The
graph is annotated with reliability scores, which are
the highest selection probability in the top-left region of the edge
stability graph.
allSeed
is an integer vector representing seeds that are used in
subsampling data. This can be used to replicate the result
in next computation.
Ridho Rahmadi [email protected], Perry Groot, Tom Heskes. Christoph Stich is the contributor for parallel support.
Rahmadi, R., Groot, P., Heins, M., Knoop, H., and Heskes, T. (2016) Causality on cross-sectional data: Stable specification search in constrained structural equation modeling. Applied Soft Computing, ISSN 1568-4946, http://www.sciencedirect.com/science/article/pii/S1568494616305130.
Rahmadi, R., Groot, P., Heins, M., Knoop, H., & Heskes, T. (2015). Causality on Longitudinal Data: Stable Specification Search in Constrained Structural Equation Modeling. Proceedings of AALTD 2015, 101.
Fox, J., Nie, Z., and Byrnes, J. (2015). sem: Structural Equation Models. R package version 3.1-6. https://CRAN.R-project.org/package=sem
Ching-Shih Tsou (2013). nsga2R: Elitist Non-dominated Sorting Genetic Algorithm based on R. R package version 1.0. https://CRAN.R-project.org/package=nsga2R
Kalisch, M., Machler, M., Colombo, D., Maathuis, M. H., and Buehlmann, P. (2012). Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11), 1-26.
Meinshausen, N., and Buehlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473.
Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. (2002), A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Transactions on Evolutionary Computation, 6(2), 182-197.
Chickering, D. M. (2002). Learning equivalence classes of Bayesian-network structures. The Journal of Machine Learning Research, 2, 445-498.
# Cross-sectional data example, # with an artificial data set of six continuous variables. # Detail about the data set can be found in the documentation. # As an example, we only run one subset. # Note that stableSpec() uses foreach to support # parallel computation, which could issue a warning # when running sequentially as the following example. However # the warning can be just ignored. the_data <- crossdata6V numSubset <- 1 num_iteration <- 5 num_pop <- 10 mut_rate <- 0.075 cross_rate <- 0.85 longi <- FALSE num_time <- 1 the_seed <- NULL the_co <- "covariance" #assummed that variable 5 does not cause variables 1, 2, and 3 cons_matrix <- matrix(c(5, 1, 5, 2, 5, 3), 3, 2, byrow=TRUE) th <- 0.1 to_plot <- FALSE mix <- FALSE result <- stableSpec(theData=the_data, nSubset=numSubset, iteration=num_iteration, nPop=num_pop, mutRate=mut_rate, crossRate=cross_rate, longitudinal=longi, numTime=num_time, seed=the_seed, co=the_co, consMatrix=cons_matrix, threshold=th, toPlot=to_plot, mixture = mix) ########################################################## ## Parallel computation is possible by ## registering parallel backend, e.g., package doParallel. ## For example, add the following lines on top of ## the example above. # # library(parallel) # library(doParallel) # cl <- makeCluster(detectCores()) # registerDoParallel(cl) # ## Then call stableSpec() as normal. ## ## Note that makeCluster() and detectCores() are ## from package parallel, and registerDoParallel() ## is from package doParallel. For more detail ## check the aforementioned packages' documentations. ###########################################################
# Cross-sectional data example, # with an artificial data set of six continuous variables. # Detail about the data set can be found in the documentation. # As an example, we only run one subset. # Note that stableSpec() uses foreach to support # parallel computation, which could issue a warning # when running sequentially as the following example. However # the warning can be just ignored. the_data <- crossdata6V numSubset <- 1 num_iteration <- 5 num_pop <- 10 mut_rate <- 0.075 cross_rate <- 0.85 longi <- FALSE num_time <- 1 the_seed <- NULL the_co <- "covariance" #assummed that variable 5 does not cause variables 1, 2, and 3 cons_matrix <- matrix(c(5, 1, 5, 2, 5, 3), 3, 2, byrow=TRUE) th <- 0.1 to_plot <- FALSE mix <- FALSE result <- stableSpec(theData=the_data, nSubset=numSubset, iteration=num_iteration, nPop=num_pop, mutRate=mut_rate, crossRate=cross_rate, longitudinal=longi, numTime=num_time, seed=the_seed, co=the_co, consMatrix=cons_matrix, threshold=th, toPlot=to_plot, mixture = mix) ########################################################## ## Parallel computation is possible by ## registering parallel backend, e.g., package doParallel. ## For example, add the following lines on top of ## the example above. # # library(parallel) # library(doParallel) # cl <- makeCluster(detectCores()) # registerDoParallel(cl) # ## Then call stableSpec() as normal. ## ## Note that makeCluster() and detectCores() are ## from package parallel, and registerDoParallel() ## is from package doParallel. For more detail ## check the aforementioned packages' documentations. ###########################################################