JANE User Guide
JANE Package Version 2.0.1
Source:vignettes/JANE-User-Guide.Rmd
JANE-User-Guide.Rmd
Description
JANE is an R package for fitting latent space network cluster models using an expectation-maximization (EM) algorithm. It enables flexible modeling of unweighted or weighted network data, with or without noise edges, and supports both directed and undirected networks, with or without degree and strength heterogeneity. Designed to efficiently handle large networks, JANE allows users to explore latent structure, identify actor-centric communities, and simulate networks with customizable clustering and connectivity patterns.
Details on the methodology underlying the package can be found in Arakkal and Sewell (2025).
Features
- Fine-tuned algorithm for fast and accurate actor-centric clustering using latent space models
- Supports both weighted and unweighted networks, with or without noisy edges
- Allows for modeling of discrete and continuous edge weights
- Multiple initialization strategies for the EM algorithm, including random initialization, a novel graphical neural network initialization, and user-specified values
- Customizable prior hyperparameter specification
- Supports parallel implementation for faster inference using the future and future.apply packages (Bengtsson 2021)
- Supports case-control likelihood approximation for scalable inference (Raftery et al. 2012)
- Built-in visualization and summary tools
Getting Started
Installation
# Current release from CRAN
install.packages("JANE")
# Development version from GitHub
# install.packages("devtools")
devtools::install_github("a1arakkal/JANE")
# Development version with vignettes
devtools::install_github("a1arakkal/JANE", build_vignettes = TRUE)
Fitting JANE on networks without noise edges
When the noise_weights
argument in JANE()
is set to FALSE
(default), a standard latent space cluster
model is fit to the supplied unweighted network.
Available models
model = "NDH"
Applicable for undirected networks and assumes no degree heterogeneity. However, in real-world networks, it is rare to find no degree heterogeneity, as most networks exhibit considerable variation in node connectivity. Below is an example that fits an NDH model, specifying the dimension of the latent space as 2 and the number of clusters as 3.
JANE(A = network_adjacency_matrix,
D = 2,
K = 3,
model = "NDH",
noise_weights = FALSE)
model = "RS"
Applicable for undirected networks and adjusts for degree heterogeneity through the inclusion of actor-specific sociality effects. Below is an example that fits an RS model, where the number of clusters is specified to be 3 and a range of dimensions between 2 and 5 for the latent space is considered.
JANE(A = network_adjacency_matrix,
D = 2:5,
K = 3,
model = "RS",
noise_weights = FALSE)
model = "RSR"
Applicable for directed networks and adjusts for degree heterogeneity through the inclusion of actor-specific sender and receiver effects. Below is an example that fits an RSR model, where a range of cluster numbers between 2 and 10 and a range of dimensions between 2 and 5 for the latent space is considered.
JANE(A = network_adjacency_matrix,
D = 2:5,
K = 2:10,
model = "RSR",
noise_weights = FALSE)
Fitting JANE on networks with noise edges
With JANE()
, weighted networks are particularly
useful in settings where noise edges are present. In such settings, when
the noise_weights
argument in JANE()
is set to
TRUE
, a latent space hurdle model (LSHM) is fit. The LSHM
leverages information from both the propensity to form an edge and the
observed edge weights to probabilistically down-weight noisy edges,
while preserving edges that are structurally meaningful, subsequently
enhancing community detection.
If the supplied network is a weighted network, in the absence of
noise it can be shown that the latent positions, regression parameters
associated with the logistic regression model, finite mixture model
parameters, and actor-specific cluster membership probabilities can all
be estimated separately from the generalized linear model parameters for
the edge weights. Thus, when noise_weights = FALSE
,
JANE()
will simply dichotomize the supplied weighted
network based on a threshold value of 0 and fit a standard latent space
cluster model.
Available models
You can fit the "NDH"
, "RS"
, or
"RSR"
models using any of the supported weight
distributions described below. When working with weighted networks, the
"RS"
and "RSR"
models account not only for
degree heterogeneity but also for strength heterogeneity.
family = "poisson"
Use this option for count-weighted networks. This setting models observed edge weights using a zero-truncated Poisson (ZTP) distribution:
- Signal edge weights are modeled with a ZTP using a log link
- Noise edge weights are modeled with a ZTP using a fixed
user-specified mean via the
guess_noise_weights
argument
JANE(A = network_adjacency_matrix,
D = 2,
K = 5,
model = "RS",
noise_weights = TRUE,
family = "poisson",
guess_noise_weights = 1L)
family = "lognormal"
Use this option when edge weights are positive, continuous values. This setting models observed edge weights using a log-normal distribution:
- Signal edges are modeled using a log-normal distribution with an identity link
- Noise edge weights are modeled using a log-normal distribution with
a user-specified log-scale mean via the
guess_noise_weights
argument
JANE(A = network_adjacency_matrix,
D = 2,
K = 5,
model = "RS",
noise_weights = TRUE,
family = "lognormal",
guess_noise_weights = -3.5) # log-scale mean
family = "bernoulli"
(default)
This setting is used for unweighted (binary) networks. In this case:
- Signal and noise edges share the same observed weight (i.e., 1), so only the edge propensity model is used to identify and down-weight likely noise edges
-
guess_noise_weights
is interpreted as the expected proportion of all edges that are likely to be noise
JANE(A = network_adjacency_matrix,
D = 2,
K = 5,
model = "RSR",
noise_weights = TRUE,
family = "bernoulli",
guess_noise_weights = 0.2) # expected noise edge proportion
guess_noise_weights
If guess_noise_weights is left as NULL (the default),
JANE()
will automatically set this value based on the
family
argument:
- For
family %in% c("lognormal", "poisson")
, the 1st percentile of the non-zero edge weights is used (log-transformed weights forlognormal
) - For
family = "bernoulli"
, a default value of 0.01 (i.e., 1%) is used, representing the assumed proportion of noisy edges
Simulating networks
JANE includes a built-in function,
sim_A()
, for simulating networks with varying clustering
and connectivity patterns. The function returns a list, which includes
the following two components:
-
A
: A sparse adjacency matrix of class ‘dgCMatrix’ representing the “true” underlying unweighted network with no noise edges -
W
: A sparse adjacency matrix of class ‘dgCMatrix’ representing the unweighted or weighted network, with or without noise edges
The relationship between A
and W
depends on
the values specified for family
and
noise_weights_prob
:
- If
family = "bernoulli"
andnoise_weights_prob = 0
, thenA
andW
are identical (i.e., no noise simulated) - If
family = "bernoulli"
andnoise_weights_prob > 0
, thenW
contains added noise edges, whileA
represents the noise-free underlying network - If
family %in% c("poisson", "lognormal")
andnoise_weights_prob = 0
, thenW
contains continuous or count-valued edge weights, but dichotomizingW
at 0 will reproduceA
- If
family %in% c("poisson", "lognormal")
andnoise_weights_prob > 0
, thenW
is a noisy, weighted network with additional spurious edges, andA
remains the true underlying structure
Below is an example to simulate an unweighted, undirected, noise-free network with a 2-dimensional latent space and 3 clusters, assuming no degree heterogeneity.
# Specify the 3 x 2 matrix containing the 1 x 2 mean vectors of the 3 bivariate normals
mus <- matrix(c(-1,-1, # Mean vector 1
1,-1, # Mean vector 2
1, 1), # Mean vector 3
nrow = 3,
ncol = 2,
byrow = TRUE)
# Specify the 2 x 2 x 3 array containing the 2 x 2 precision matrices of the 3 bivariate normals
omegas <- array(c(diag(rep(7,2)), # Precision matrix 1
diag(rep(7,2)), # Precision matrix 2
diag(rep(7,2))), # Precision matrix 3
dim = c(2,2,3))
# Simulate a network
sim_A(N = 100L,
model = "NDH",
mus = mus,
omegas = omegas,
p = rep(1/3, 3),
params_LR = list(beta0 = 1.0),
remove_isolates = TRUE)
The parameter beta0
above can be tuned to achieve a
desired network density:
# Specify the 3 x 2 matrix containing the 1 x 2 mean vectors of the 3 bivariate normals
mus <- matrix(c(-1,-1, # Mean vector 1
1,-1, # Mean vector 2
1, 1), # Mean vector 3
nrow = 3,
ncol = 2,
byrow = TRUE)
# Specify the 2 x 2 x 3 array containing the 2 x 2 precision matrices of the 3 bivariate normals
omegas <- array(c(diag(rep(7,2)), # Precision matrix 1
diag(rep(7,2)), # Precision matrix 2
diag(rep(7,2))), # Precision matrix 3
dim = c(2,2,3))
desired_density <- 0.1 # Target network density
min_density <- desired_density * 0.99 # Lower bound for acceptable density
max_density <- desired_density * 1.01 # Upper bound for acceptable density
n_act <- 100L # Number of actors in the network
density <- Inf # Initialize density to enter while loop
beta0 <- 0.5 # Initial value for intercept parameter
n_while_loop <- 0 # Counter for outer loop iterations
max_its <- 100 # Maximum number of iterations
change_beta0 <- 0.1 # Amount to adjust beta0 by
# Adjust beta0 until simulated network has the approximate desired density
while(! (density >= min_density & density <= max_density) ){
if(n_while_loop>max_its){
break
}
n_retry_isolate <- 0
retry_isolate <- T
# Retry until a network with no isolates is generated (this while loop is optional)
while(retry_isolate){
sim_data <- sim_A(N = n_act,
model = "NDH",
mus = mus,
omegas = omegas,
p = rep(1/3, 3),
params_LR = list(beta0 = beta0),
remove_isolates = TRUE)
n_retry_isolate <- n_retry_isolate + 1
# Accept network if no isolates remain, or if retried more than 10 times at the same beta0
if(nrow(sim_data$A) == n_act | n_retry_isolate>10){
retry_isolate <- F
}
}
# Compute network density
density <- igraph::graph.density(
igraph::graph_from_adjacency_matrix(sim_data$A, mode = "undirected")
)
# Adjust beta0 based on density feedback
if (density > max_density) {
beta0 <- beta0 - change_beta0
}
if (density < min_density) {
beta0 <- beta0 + change_beta0
}
n_while_loop <- n_while_loop + 1
}
A <- sim_data$A # Final simulated adjacency matrix
igraph::graph.density(igraph::graph_from_adjacency_matrix(A, mode = "undirected")) # Verify density
Below is an example that simulates a directed, weighted network with noise, degree and strength heterogeneity, a 2-dimensional latent space, and 3 clusters.
# Specify the 3 x 2 matrix containing the 1 x 2 mean vectors of the 3 bivariate normals
mus <- matrix(c(-1,-1, # Mean vector 1
1,-1, # Mean vector 2
1, 1), # Mean vector 3
nrow = 3,
ncol = 2,
byrow = TRUE)
# Specify the 2 x 2 x 3 array containing the 2 x 2 precision matrices of the 3 bivariate normals
omegas <- array(c(diag(rep(7,2)), # Precision matrix 1
diag(rep(7,2)), # Precision matrix 2
diag(rep(7,2))), # Precision matrix 3
dim = c(2,2,3))
# Simulate a network
sim_A(N = 100L,
model = "RSR",
family = "poisson",
mus = mus,
omegas = omegas,
p = rep(1/3, 3),
params_LR = list(beta0 = 1),
params_weights = list(beta0 = 2),
noise_weights_prob = 0.1,
mean_noise_weights = 1,
remove_isolates = TRUE)
Model selection criteria for choosing the number of clusters
JANE()
allows for the following model selection criteria
to choose the number of clusters and the best initialization of the EM
algorithm (smaller values are favored):
-
"Total_BIC"
(i.e., BIC): Sum of and , where is the BIC computed from logistic regression or Hurdle model component and is the BIC computed from model based clustering component. This is the model selection criterion proposed by Handcock, Raftery, and Tantrum (2007) -
"Total_ICL"
(i.e., BICL): (default) sum of and , this criterion is similar to"Total_BIC"
, but uses the integrated completed likelihood (ICL) for the model based clustering component, which tends to favor more well-separated clusters
Based on simulation studies, Biernacki, Celeux, and Govaert (2000) recommends that when the interest in mixture modeling is cluster analysis, instead of density estimation, the criterion should be favored over the criterion, as the criterion tends to overestimate the number of clusters. The criterion is designed to choose the number of components in a mixture model rather than the number of clusters.
Warning: It is not certain whether it is appropriate to use the model selection criterion above to select the dimension of the latent space .
Below is an example that fits an RSR model, where a range of cluster
numbers between 2 and 10 is considered for a 2-dimensional latent space
and "Total_BIC"
is used to select the optimal number of
clusters and best initialization of the EM algorithm.
JANE(A = network_adjacency_matrix,
D = 2,
K = 2:10,
model = "RSR",
noise_weights = FALSE,
control = list(IC_selection = "Total_BIC"))
Below is an example that fits an RSR model for a 2-dimensional latent
space with 3 clusters and "Total_ICL"
is used to select the
optimal initialization of the EM algorithm from 10 unique starts. Note:
the number of starts for the EM algorithm is controlled through the
control
argument by supplying a value for
n_start
.
Initialization of EM algorithm
JANE()
has three initialization strategies to generate
starting values for the EM algorithm, controlled through the
initialization
argument:
-
"GNN"
: uses a type of graphical neural network approach to generate initial values (default). Details of the graphical neural network approach can be found in Arakkal and Sewell (2025) -
"random"
: uses random initial values - A user-supplied list of S3 class
"JANE.initial_values"
representing initial values for the EM algorithm. See?specify_initial_values
for details on how to specify initial values
Below is an example where starting values are supplied to
JANE()
using specify_initial_values()
.
# Specify starting values
D <- 3
K <- 5
N <- nrow(sim_data$A)
n_interior_knots <- 5L
U <- matrix(stats::rnorm(N*D), nrow = N, ncol = D)
omegas <- stats::rWishart(n = K, df = D+1, Sigma = diag(D))
mus <- matrix(stats::rnorm(K*D), nrow = K, ncol = D)
p <- extraDistr::rdirichlet(n = 1, rep(3,K))[1,]
Z <- extraDistr::rdirichlet(n = N, alpha = rep(1, K))
beta <- stats::rnorm(n = 1 + 2*(1 + n_interior_knots))
my_starting_values <- specify_initial_values(A = network_adjacency_matrix,
D = D,
K = K,
model = "RSR",
n_interior_knots = n_interior_knots,
U = U,
omegas = omegas,
mus = mus,
p = p,
Z = Z,
beta = beta)
# Run JANE using my_starting_values (no need to specify D and K as function will
# determine those values from my_starting_values)
JANE(A = network_adjacency_matrix,
initialization = my_starting_values,
model = "RSR",
noise_weights = FALSE)
Specification of prior hyperparameters
Prior distributions
The prior distributions are specified as follows:
Prior on
For the current implementation, we require that all elements of the
nu
vector be
to prevent negative mixture weights for empty clusters:
Default hyperparameters
where,
is the dimension of the latent space,
is the number of clusters,
and
is the number of interior knots used in fitting a natural cubic spline
for degree heterogeneity (and connection strength heterogeneity if
working with weighted network) models (i.e., "RS"
and
"RSR"
only).
Example of specifying hyperparameters for a single combination of and
# Specify prior hyperparameters
D <- 3
K <- 5
n_interior_knots <- 5L
a <- rep(1, D)
b <- 3
c <- 4
G <- 10*diag(D)
nu <- rep(2, K)
e <- rep(0.5, 1 + (n_interior_knots + 1))
f <- diag(c(0.1, rep(0.5, n_interior_knots + 1)))
my_prior_hyperparameters <- specify_priors(D = D,
K = K,
model = "RS",
n_interior_knots = n_interior_knots,
a = a,
b = b,
c = c,
G = G,
nu = nu,
e = e,
f = f)
# Run JANE using supplied prior hyperparameters (no need to specify D and K
# as function will determine those values from my_prior_hyperparameters)
JANE(A = network_adjacency_matrix,
initialization = "GNN",
model = "RS",
noise_weights = FALSE,
control = list(priors = my_prior_hyperparameters))
Example of specifying hyperparameters for multiple combinations of and
Nested list
# Specify first set of prior hyperparameters
D <- 3
K <- 5
n_interior_knots <- 5L
a <- rep(1, D)
b <- 3
c <- 4
G <- 10*diag(D)
nu <- rep(2, K)
e <- rep(0.5, 1 + (n_interior_knots + 1))
f <- diag(c(0.1, rep(0.5, n_interior_knots + 1)))
my_prior_hyperparameters_1 <- specify_priors(D = D,
K = K,
model = "RS",
n_interior_knots = n_interior_knots,
a = a,
b = b,
c = c,
G = G,
nu = nu,
e = e,
f = f)
# Specify second set of prior hyperparameters
D <- 2
K <- 3
n_interior_knots <- 5L
a <- rep(1, D)
b <- 3
c <- 4
G <- 10*diag(D)
nu <- rep(2, K)
e <- rep(0.5, 1 + (n_interior_knots + 1))
f <- diag(c(0.1, rep(0.5, n_interior_knots + 1)))
my_prior_hyperparameters_2 <- specify_priors(D = D,
K = K,
model = "RS",
n_interior_knots = n_interior_knots,
a = a,
b = b,
c = c,
G = G,
nu = nu,
e = e,
f = f)
# Create nested list
my_prior_hyperparameters <- list(my_prior_hyperparameters_1,
my_prior_hyperparameters_2)
# Run JANE using supplied prior hyperparameters (no need to specify D and K
# as function will determine those values from my_prior_hyperparameters)
JANE(A = network_adjacency_matrix,
initialization = "GNN",
model = "RS",
noise_weights = FALSE,
control = list(priors = my_prior_hyperparameters))
Unevaluated calls
Unevaluated calls using quote()
can be supplied as
values for specific hyperparameters. This is particularly useful when
running JANE()
for multiple combinations of
and
.
# Specify prior hyperparameters as unevaluated calls
n_interior_knots <- 5L
e <- rep(0.5, 1 + (n_interior_knots + 1))
f <- diag(c(0.1, rep(0.5, n_interior_knots + 1)))
my_prior_hyperparameters <- specify_priors(model = "RS",
n_interior_knots = n_interior_knots,
a = quote(rep(1, D)),
b = b,
c = quote(D + 1),
G = quote(10*diag(D)),
nu = quote(rep(2, K)),
e = e,
f = f)
# Run JANE using supplied prior hyperparameters
JANE(A = network_adjacency_matrix,
D = 2:5,
K = 2:10,
initialization = "GNN",
model = "RS",
noise_weights = FALSE,
control = list(priors = my_prior_hyperparameters))
Parallel implementation of JANE()
You can speed up fitting models for multiple combinations of cluster
number K
, latent space dimension D
, and number
of initializations of the EM algorithm n_start
, by running
JANE()
in parallel. This leverages the future
and future.apply
packages to distribute computation across
cores.
Below is an example using 5 parallel workers to run
JANE()
with parallel backend enabled, followed by resetting
to sequential processing.
# Set up parallel plan with 5 workers (cores)
future::plan(future::multisession, workers = 5)
# Run JANE in parallel
res_parallel <- JANE(A = network_adjacency_matrix,
D = 2:5,
K = 3:10,
initialization = "GNN",
model = "RSR")
# Reset to sequential processing
future::plan(future::sequential)
Case-control implementation of JANE()
When working with large sparse networks, the case-control
approximation implemented in JANE()
can reduce
computational cost. This approach leverages approximation methods
developed by Raftery et al. (2012). With
the case-control implementation, the likelihood includes all edges
(where most of the information lies) but uses a Monte Carlo
approximation of terms involving the unconnected dyads of the network
(i.e., non-edges). Thus, the cost of evaluating the likelihood becomes
linear, rather than quadratic, in
.
Below is an example running JANE()
with the case-control
approach, sampling 20 controls (i.e., non-edges) per actor.
Using S3 methods with JANE objects
Once you have fit a model using the JANE()
function, you
can inspect and analyze the results using several built-in S3 methods.
These methods work with the JANE
S3 class object returned
from a model fit.
For the examples in this section, we assume that res
is
the object returned by JANE()
.
print()
The print()
method gives a summary of the optimal model
fit and available components.
print(res)
This displays:
- The model type, number of clusters, and latent space dimension
- Sizes of each cluster, that are estimated using a hard cluster assignment rule defined as , where is the estimated conditional probability that the actor belongs to the cluster. Note, here the optimal number of clusters selected may include empty clusters based on the hard cluster assignments
- Available components in the fitted object
res
summary()
Use the summary()
method to extract and display detailed
information about the estimated model.
summary(res)
You can also compare estimated cluster assignments to known true
assignments using the true_labels
argument, or inspect
starting values using initial_values = TRUE
.
Output includes:
- Coefficients from logistic or GLM components
- Latent positions, cluster means, and precision matrices
- Soft cluster memberships and hard cluster assignments
- Clustering uncertainty
- ARI, NMI, classification error rate (if
true_labels
is supplied) - Model selection criteria
- Summary of noise edges if
noise_weights = TRUE
. This includes the estimated conditional probability that a specific edge is a noise or a non-noise edge, and their corresponding noise edge cluster assignment based on a hard clustering rule of for , where and are the estimated conditional probabilities that the edge is a non-noise and noise edge, respectively, and represents the total number of edges in the network (for undirected networks, only the upper diagonal edges are retained). Noise edge cluster labels are defined as: 1 = non-noise edge and 2 = noise edge
plot()
The plot()
method provides visualizations depending on
the type
argument.
Latent space clustering (default type = "lsnc"
)
plot(res)
- Plots latent positions colored by cluster
- Contours show the fitted Gaussian components when = 2. When > 2, pairwise plots of the dimensions are displayed instead
Misclassified actors (type = "misclassified"
)
plot(res, type = "misclassified", true_labels = true_labels_vec)
- Highlights misclassified actors in black based on supplied true
cluster assignments (i.e.,
true_labels_vec
in this example)
Clustering uncertainty (type = "uncertainty"
)
plot(res, type = "uncertainty")
- Colors nodes by uncertainty (i.e., 1 - )
EM trace plot (type = "trace_plot"
)
plot(res, type = "trace_plot")
- Shows convergence metrics over EM iterations
Additional options
-
alpha_edge
,alpha_node
,zoom
,swap_axes
,rotation_angle
,cluster_cols
, etc. -
remove_noise_edges = TRUE
removes noise edges based on noise edge hard cluster assignment (only applicable ifJANE()
was run withnoise_weights = TRUE
)