Title: | Finite Mixture Modeling for Raw and Binned Data |
---|---|
Description: | Performs maximum likelihood estimation for finite mixture models for families including Normal, Weibull, Gamma and Lognormal by using EM algorithm, together with Newton-Raphson algorithm or bisection method when necessary. It also conducts mixture model selection by using information criteria or bootstrap likelihood ratio test. The data used for mixture model fitting can be raw data or binned data. The model fitting process is accelerated by using R package 'Rcpp'. |
Authors: | Youjiao Yu [aut, cre] |
Maintainer: | Youjiao Yu <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.2.0 |
Built: | 2024-11-13 06:01:20 UTC |
Source: | https://github.com/garybaylor/mixr |
The package mixR
performs maximum likelihood estimation for finite
mixture models for families including Normal, Weibull, Gamma and Lognormal via EM algorithm.
It also conducts model selection by using information criteria or bootstrap likelihood ratio
test. The data used for mixture model fitting can be raw data or binned data. The model fitting
is accelerated by using R package Rcpp.
Finite mixture models can be represented by
where is the probability density function (p.d.f.) or probability mass function
(p.m.f.) of the mixture model,
is the p.d.f. or p.m.f. of the
th
component of the mixture model,
is the proportion of the
th component and
is the parameter of the
th component, which can be a scalar or a vector,
is a vector of all the parameters of the mixture model. The maximum likelihood
estimate of the parameter vector
can be obtained by using
the EM algorithm (Dempster et al, 1977).
The binned data is present sometimes instead of the raw data, for the reason of storage
convenience or necessity. The binned data is recorded in the form of
where
is the lower bound of the
th bin,
is
the upper bound of the
th bin, and
is the number of observations that fall
in the
th bin, for
, and
is the total number of bins.
To obtain maximum likelihood estimate of the finite mixture model for binned data, we can
introduce two types of latent variables and
, where
represents the
value of the unknown raw data, and
is a vector of zeros and one indicating the
component that
belongs to. To use the EM algorithm we first write the complete-data
log-likelihood
where is the expected value of
given the estimated value of
and expected value
at
th iteration. The estimated value of
can be updated iteratively via the E-step, in which we estimate
by maximizing
the complete-data loglikelihood, and M-step, in which we calculate the expected value of
the latent variables
and
. The EM algorithm is terminated by using a stopping
rule.
The M-step of the EM algorithm may or may not have closed-form solution (e.g. the Weibull
mixture model or Gamma mixture model). If not, an iterative approach like Newton's algorithm
or bisection method may be used.
For a given data set, when we have no prior information about the number of components
, its value should be estimated from the data. Because mixture models don't satisfy
the regularity condition for the likelihood ratio test (which requires that the true
parameter under the null hypothesis should be in the interior of the parameter space
of the full model under the alternative hypothesis), a bootstrap approach is usually
used in the literature (see McLachlan (1987, 2004), Feng and McCulloch (1996)). The general
step of bootstrap likelihood ratio test is as follows.
For the given data , estimate
under both the null and the alternative
hypothesis to get
and
. Calculate the observed log-likelihood
and
. The likelihood ratio test
statistic is defined as
Generate random data of the same size as the original data from the model
under the null hypothesis using estimated parameter
, then repeat step
1 using the simulated data. Repeat this process for
times to get a vector of the
simulated likelihood ratio test statistics
.
Calculate the empirical p-value
where is the indicator function.
This package does the following three things.
Fitting finite mixture models for both raw data and binned data by using EM algorithm, together with Newton-Raphson algorithm and bisection method.
Do parametric bootstrap likelihood ratio test for two candidate models.
Do model selection by Bayesian information criterion.
To speed up computation, the EM algorithm is fulfilled in C++ by using Rcpp (Eddelbuettel and Francois (2011)).
Maintainer: Youjiao Yu [email protected]
Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological), pages 1-38, 1977.
Dirk Eddelbuettel and Romain Francois (2011). Rcpp: Seamless R and C++ Integration. Journal of Statistical Software, 40(8), 1-18. URL http://www.jstatsoft.org/v40/i08/.
Efron, B. Bootstrap methods: Another look at the jackknife. Ann. Statist., 7(1):1-26, 01 1979.
Feng, Z. D. and McCulloch, C. E. Using bootstrap likelihood ratios in finite mixture models. Journal of the Royal Statistical Society. Series B (Methodological), pages 609-617, 1996.
Lo, Y., Mendell, N. R., and Rubin, D. B. Testing the number of components in a normal mixture. Biometrika, 88(3):767-778, 2001.
McLachlan, G. J. On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Applied statistics, pages 318-324, 1987.
McLachlan, G. and Jones, P. Fitting mixture models to grouped and truncated data via the EM algorithm. Biometrics, pages 571-578, 1988.
McLachlan, G. and Peel, D. Finite mixture models. John Wiley & Sons, 2004.
This function creates a binned data set from the raw data
bin(x, brks)
bin(x, brks)
x |
a numeric vector of raw data |
brks |
a numeric vector in increasing order, representing the bin values within each of which we want to calculate the frequency of the data |
Given a numeric vector, the function bin
creates a binned data set with bin values provided
by brks
. Fitting mixture models with a large data set may be slow, especially when
we want to fit non-normal mixture models. Binning the data with a relatively
small bin width speeds up the computation of EM algorithm while at the same time keeps the
precision of the estimation result.
The function bin
returns a matrix with three columns, representing the value of the left bin, the value of the right bin and the number of observations in x
that falls in each bin.
set.seed(99) x <- rmixnormal(200, c(0.5, 0.5), c(2, 5), c(1, 1)) data <- bin(x, seq(-2, 10, 0.1)) fit1 <- mixfit(x, ncomp = 2) fit2 <- mixfit(data, ncomp = 2)
set.seed(99) x <- rmixnormal(200, c(0.5, 0.5), c(2, 5), c(1, 1)) data <- bin(x, seq(-2, 10, 0.1)) fit1 <- mixfit(x, ncomp = 2) fit2 <- mixfit(data, ncomp = 2)
This function performs the likelihood ratio test by parametric bootstrapping for two mixture models with different number of components.
bs.test( x, ncomp = c(1, 2), family = c("normal", "weibull", "gamma", "lnorm"), B = 100, ev = FALSE, mstep.method = c("bisection", "newton"), init.method = c("kmeans", "hclust"), tol = 1e-06, max_iter = 500 )
bs.test( x, ncomp = c(1, 2), family = c("normal", "weibull", "gamma", "lnorm"), B = 100, ev = FALSE, mstep.method = c("bisection", "newton"), init.method = c("kmeans", "hclust"), tol = 1e-06, max_iter = 500 )
x |
a numeric vector for the raw data or a three-column matrix for the binned data. |
ncomp |
a vector of two positive integers specifying the number of components of the
mixture model under the null and alternative hypothesis.
The first integer should be smaller than the second one. The default value is
|
family |
a character string specifying the family of the mixture model, which can be one
of |
B |
the number of bootstrap iterations (default 100). |
ev |
a logical value indicating whether the variance of each component should be the same
or not (default |
mstep.method |
the method used in M-step of EM algorithm for |
init.method |
a character string specifying the method used for providing the initial values
for the parameters for the EM algorithm. It can be one of |
tol |
the tolerance for the stopping rule of EM algorithm. It is the value to stop
EM algorithm when the two consecutive iterations produces log-likelihood with difference
less than |
max_iter |
the maximum number of iterations for the EM algorithm (default 500). |
For the given data x
and the specified family, the function bs.test
conducts
a bootstrap likelihood ratio test for two mixture models with the number of components
under the null and the alternative hypothesis specified in ncomp
.
The function bs.test
returns an object of class bootEM
which
contains the following three items.
pvalue |
The p-value of the bootstrap likelihood ratio test |
w0 |
the observed likelihood ratio test statistic |
w1 |
a vector of simulated likelihood ratio test statistics |
## testing normal mixture models with 2 and 3 components set.seed(100) x <- rmixnormal(200, c(0.5, 0.5), c(2, 5), c(1, 0.7)) ret <- bs.test(x, ncomp = c(2, 3), B = 30) ret ## (not run) testing Weibull mixture models with 2 and 3 components ## set.seed(101) ## x <- rmixweibull(200, c(0.3, 0.4, 0.3), c(2, 5, 8), c(1, 0.6, 0.8)) ## ret <- bs.test(x, ncomp = c(2, 3), family = "weibull", B = 30) ## ret ## (not run) testing Gamma mixture models with 1 and 2 components ## set.seed(102) ## x <- rgamma(200, 2, 1) ## ret <- bs.test(x, ncomp = c(1, 2), family = "gamma", B = 30) ## ret
## testing normal mixture models with 2 and 3 components set.seed(100) x <- rmixnormal(200, c(0.5, 0.5), c(2, 5), c(1, 0.7)) ret <- bs.test(x, ncomp = c(2, 3), B = 30) ret ## (not run) testing Weibull mixture models with 2 and 3 components ## set.seed(101) ## x <- rmixweibull(200, c(0.3, 0.4, 0.3), c(2, 5, 8), c(1, 0.6, 0.8)) ## ret <- bs.test(x, ncomp = c(2, 3), family = "weibull", B = 30) ## ret ## (not run) testing Gamma mixture models with 1 and 2 components ## set.seed(102) ## x <- rgamma(200, 2, 1) ## ret <- bs.test(x, ncomp = c(1, 2), family = "gamma", B = 30) ## ret
This function calculates the probability density of a finite mixture model.
## S3 method for class 'mixfitEM' density(x, at, smoothness = 512, cut = 3.8, ...)
## S3 method for class 'mixfitEM' density(x, at, smoothness = 512, cut = 3.8, ...)
x |
an object of class |
at |
a scalar or a numeric vector of locations where densities are calculated |
smoothness |
a positive integer controlling the smoothness of the density curve (default 512). The higher this value is, the more locations of the mixture model the density is calculated. |
cut |
the number of standard deviations away the density is to be computed (default 3.8) |
... |
other arguments |
The function density.mixfitEM
is the method of the generic function
density
for the class mixfitEM
.
This function returns a list of class densityEM
, which contains the following
items.
x |
a scalar or a numeric vector of locations where densities are calculated. |
y |
a vector of the densities of the mixture model at the corresponding locations in |
comp |
a matrix with columns representing the densities of each component in the mixture model at the corresponding locations in |
set.seed(102) x <- rmixnormal(200, c(0.5, 0.5), c(2, 5), c(1, 0.7)) fit1 <- mixfit(x, ncomp = 2) d1 = density(fit1) d2 = density(fit1, at = 0)
set.seed(102) x <- rmixnormal(200, c(0.5, 0.5), c(2, 5), c(1, 0.7)) fit1 <- mixfit(x, ncomp = 2) d1 = density(fit1) d2 = density(fit1, at = 0)
This function returns the mean and standard deviation of each component by using K-means clustering or hierarchical clustering.
initz(x, ncomp, init.method = c("kmeans", "hclust"))
initz(x, ncomp, init.method = c("kmeans", "hclust"))
x |
a numeric vector of the raw data or a three-column matrix of the binned data |
ncomp |
a positive integer specifying the number of components for a mixture model |
init.method |
the method used for providing initial values, which can be one of
|
The function initz
returns the mean and standard deviation of each component
of a mixture model by using K-means clustering algorithm, or hierarchical clustering
method. It is used for automatically selecting initial values for the EM algorithm,
so as to enable mixture model selection by bootstrapping likelihood ratio test or
using information criteria.
initz
returns a list with three items
pi |
a numeric vector of component proportions |
mu |
a numeric vector of component means |
sd |
a numeric vector of component standard deviations |
x <- rmixnormal(500, c(0.5, 0.5), c(2, 5), c(1, 0.7)) data <- bin(x, seq(-2, 8, 0.25)) par1 <- initz(x, 2) par2 <- initz(data, 2)
x <- rmixnormal(500, c(0.5, 0.5), c(2, 5), c(1, 0.7)) data <- bin(x, seq(-2, 8, 0.25)) par1 <- initz(x, 2) par2 <- initz(data, 2)
This function is used to perform the maximum likelihood estimation for a variety of finite mixture models for both raw and binned data by using the EM algorithm, together with Newton-Raphson algorithm or bisection method when necessary.
mixfit( x, ncomp = NULL, family = c("normal", "weibull", "gamma", "lnorm"), pi = NULL, mu = NULL, sd = NULL, ev = FALSE, mstep.method = c("bisection", "newton"), init.method = c("kmeans", "hclust"), tol = 1e-06, max_iter = 500 )
mixfit( x, ncomp = NULL, family = c("normal", "weibull", "gamma", "lnorm"), pi = NULL, mu = NULL, sd = NULL, ev = FALSE, mstep.method = c("bisection", "newton"), init.method = c("kmeans", "hclust"), tol = 1e-06, max_iter = 500 )
x |
a numeric vector for the raw data or a three-column matrix for the binned data |
ncomp |
a positive integer specifying the number of components of the mixture model |
family |
a character string specifying the family of the mixture model. It can only be
one element from |
pi |
a vector of the initial value for the proportion |
mu |
a vector of the initial value for the mean |
sd |
a vector of the initial value for the standard deviation |
ev |
a logical value controlling whether each component has the same variance when
fitting normal mixture models. It is ignored when fitting other mixture models. The default is |
mstep.method |
a character string specifying the method used in M-step of the EM algorithm
when fitting weibull or gamma mixture models. It can be either |
init.method |
a character string specifying the method used for providing initial values
for the parameters for EM algorithm. It can be one of |
tol |
the tolerance for the stopping rule of EM algorithm. It is the value to stop EM algorithm when the two
consecutive iterations produces loglikelihood with difference less than |
max_iter |
the maximum number of iterations for the EM algorithm (default 500). |
The function mixfit
is the core function in this package. It is used to perform
the maximum likelihood estimation for finite mixture models from the families of normal,
weibull, gamma or lognormal by using the EM algorithm. When the family is weibull
or gamma
, the M-step of the EM algorithm has no closed-form solution and we can
use Newton algorithm by specifying method = "newton"
or use bisection method by
specifying method = "bisection"
.
The initial values of the EM algorithm can be provided by specifying the proportion of each
component pi
, the mean of each component mu
and the standard deviation of
each component sd
. If one or more of these initial values are not provided, then
their values are estimated by using K-means clustering method or hierarchical clustering
method. If all of pi
, mu
, and sd
are not provided, then ncomp
should be provided so initial values are automatically
generated. For the normal mixture models, we can
control whether each component has the same variance or not.
the function mixfit
return an object of class mixfitEM
, which contains a list of
different number of items when fitting different mixture models. The common items include
pi |
a numeric vector representing the estimated proportion of each component |
mu |
a numeric vector representing the estimated mean of each component |
sd |
a numeric vector representing the estimated standard deviation of each component |
iter |
a positive integer recording the number of EM iteration performed |
loglik |
the loglikelihood of the estimated mixture model for the data |
aic |
the value of AIC of the estimated model for the data |
bic |
the value of BIC of the estimated model for the data |
data |
the data |
comp.prob |
the probability that |
family |
the family the mixture model belongs to |
For the Weibull mixture model, the following extra items are returned.
k |
a numeric vector representing the estimated shape parameter of each component |
lambda |
a numeric vector representing the estimated scale parameter of each component |
For the Gamma mixture model, the following extra items are returned.
alpha |
a numeric vector representing the estimated shape parameter of each component |
lambda |
a numeric vector representing the estimated rate parameter of each component |
For the lognormal mixture model, the following extra items are returned.
mulog |
a numeric vector representing the estimated logarithm mean of each component |
sdlog |
a numeric vector representing the estimated logarithm standard deviation of each component |
plot.mixfitEM
, density.mixfitEM
,
select
, bs.test
## fitting the normal mixture models set.seed(103) x <- rmixnormal(200, c(0.3, 0.7), c(2, 5), c(1, 1)) data <- bin(x, seq(-1, 8, 0.25)) fit1 <- mixfit(x, ncomp = 2) # raw data fit2 <- mixfit(data, ncomp = 2) # binned data fit3 <- mixfit(x, pi = c(0.5, 0.5), mu = c(1, 4), sd = c(1, 1)) # providing the initial values fit4 <- mixfit(x, ncomp = 2, ev = TRUE) # setting the same variance ## (not run) fitting the weibull mixture models ## x <- rmixweibull(200, c(0.3, 0.7), c(2, 5), c(1, 1)) ## data <- bin(x, seq(0, 8, 0.25)) ## fit5 <- mixfit(x, ncomp = 2, family = "weibull") # raw data ## fit6 <- mixfit(data, ncomp = 2, family = "weibull") # binned data ## (not run) fitting the Gamma mixture models ## x <- rmixgamma(200, c(0.3, 0.7), c(2, 5), c(1, 1)) ## data <- bin(x, seq(0, 8, 0.25)) ## fit7 <- mixfit(x, ncomp = 2, family = "gamma") # raw data ## fit8 <- mixfit(data, ncomp = 2, family = "gamma") # binned data ## (not run) fitting the lognormal mixture models ## x <- rmixlnorm(200, c(0.3, 0.7), c(2, 5), c(1, 1)) ## data <- bin(x, seq(0, 8, 0.25)) ## fit9 <- mixfit(x, ncomp = 2, family = "lnorm") # raw data ## fit10 <- mixfit(data, ncomp = 2, family = "lnorm") # binned data
## fitting the normal mixture models set.seed(103) x <- rmixnormal(200, c(0.3, 0.7), c(2, 5), c(1, 1)) data <- bin(x, seq(-1, 8, 0.25)) fit1 <- mixfit(x, ncomp = 2) # raw data fit2 <- mixfit(data, ncomp = 2) # binned data fit3 <- mixfit(x, pi = c(0.5, 0.5), mu = c(1, 4), sd = c(1, 1)) # providing the initial values fit4 <- mixfit(x, ncomp = 2, ev = TRUE) # setting the same variance ## (not run) fitting the weibull mixture models ## x <- rmixweibull(200, c(0.3, 0.7), c(2, 5), c(1, 1)) ## data <- bin(x, seq(0, 8, 0.25)) ## fit5 <- mixfit(x, ncomp = 2, family = "weibull") # raw data ## fit6 <- mixfit(data, ncomp = 2, family = "weibull") # binned data ## (not run) fitting the Gamma mixture models ## x <- rmixgamma(200, c(0.3, 0.7), c(2, 5), c(1, 1)) ## data <- bin(x, seq(0, 8, 0.25)) ## fit7 <- mixfit(x, ncomp = 2, family = "gamma") # raw data ## fit8 <- mixfit(data, ncomp = 2, family = "gamma") # binned data ## (not run) fitting the lognormal mixture models ## x <- rmixlnorm(200, c(0.3, 0.7), c(2, 5), c(1, 1)) ## data <- bin(x, seq(0, 8, 0.25)) ## fit9 <- mixfit(x, ncomp = 2, family = "lnorm") # raw data ## fit10 <- mixfit(data, ncomp = 2, family = "lnorm") # binned data
This function is the plot method for the class bootEM
.
## S3 method for class 'bootEM' plot(x, ...)
## S3 method for class 'bootEM' plot(x, ...)
x |
an object of class |
... |
the other parameters passed to the function |
The histogram of the bootstrap LRT statistics is plotted, with the
observed LRT statistic imposed in a red vertical line.
## plotting the bootstrap LRT result set.seed(100) x <- rmixnormal(200, c(0.5, 0.5), c(2, 5), c(1, 0.7)) ret <- bs.test(x, ncomp = c(2, 3), B = 30) plot(ret)
## plotting the bootstrap LRT result set.seed(100) x <- rmixnormal(200, c(0.5, 0.5), c(2, 5), c(1, 0.7)) ret <- bs.test(x, ncomp = c(2, 3), B = 30) plot(ret)
This is the plot method for the class mixfitEM
. It is used to plot the fitted mixture models
by using base R plotting system or using the package ggplot2.
## S3 method for class 'mixfitEM' plot( x, theme = NULL, add_hist = TRUE, add_poly = TRUE, add_legend = TRUE, smoothness = 512, trans = 0.5, cut = 3.8, xlab, ylab, title, breaks, plot.title = element_text(hjust = 0.5), axis.text.x = element_text(), axis.text.y = element_text(), axis.title.x = element_text(), axis.title.y = element_text(), legend.title = element_text(), legend.text = element_text(), legend.position = "right", legend.direction = ifelse(legend.position %in% c("top", "bottom"), "horizontal", "vertical"), ... )
## S3 method for class 'mixfitEM' plot( x, theme = NULL, add_hist = TRUE, add_poly = TRUE, add_legend = TRUE, smoothness = 512, trans = 0.5, cut = 3.8, xlab, ylab, title, breaks, plot.title = element_text(hjust = 0.5), axis.text.x = element_text(), axis.text.y = element_text(), axis.title.x = element_text(), axis.title.y = element_text(), legend.title = element_text(), legend.text = element_text(), legend.position = "right", legend.direction = ifelse(legend.position %in% c("top", "bottom"), "horizontal", "vertical"), ... )
x |
an object of class |
theme |
a string the specifies the appearance of the plot, which is from the ggplot2 and could be one of'gray', 'bw' (default), 'linedraw', 'light', 'dark', 'minimal', 'classic', or 'void'. |
add_hist |
a logical value specifying whether a histogram of data should be plotted |
add_poly |
a logical value specifying whether a polygon of each component should be plotted. |
add_legend |
a logical value specifying whether the legend should be plotted. |
smoothness |
a positive integer controlling the smoothness of the density curve in the plot. The default value is 512 and increasing this value will produce smoother curve. |
trans |
the transparency of the polygons if they are plotted (default 0.5) |
cut |
the number of standard deviations from the center of each component we want to plot the density (default 3.8) |
xlab |
the label for x axis |
ylab |
the label for y axis |
title |
the title of the plot |
breaks |
the number of bins used for plotting the histogram |
plot.title |
an object returned by element_text() to specify the appearance of the title |
axis.text.x |
an object returned by element_text() to specify the appearance of the x axis |
axis.text.y |
an object returned by element_text() to specify the appearance of the y axis |
axis.title.x |
an object returned by element_text() to specify the appearance of the label along x axis |
axis.title.y |
an object returned by element_text() to specify the appearance of the label along y axis |
legend.title |
an object returned by element_text() to specify the appearance of the legend title |
legend.text |
an object returned by element_text() to specify the appearance of the legend text |
legend.position |
the position of the legend, could be 'right'(default), 'right', 'top', or 'bottom' |
legend.direction |
the direction of the legend, could be 'vertical' (default) or 'horizontal' |
... |
other arguments |
The function plot.mixfitEM
is used for plotting an object of class mixfitEM
, which is
an output of the function mixfit
. Users can choose base R plotting system or ggplot2
(the package ggplot2 needs to be installed).
plotting system. The plot is a density plot of the fitted mixture model imposed on top of a histogram.
The parameters that control the appearance of the histogram and the density curve can be changed.
The density curve of each component can be shown or hidden.
x <- rmixnormal(200, c(0.3, 0.7), c(2, 5), c(1, 0.7)) mod <- mixfit(x, ncomp = 2) plot(mod) plot(mod, theme = 'classic')
x <- rmixnormal(200, c(0.3, 0.7), c(2, 5), c(1, 0.7)) mod <- mixfit(x, ncomp = 2) plot(mod) plot(mod, theme = 'classic')
selectEM
This function plots the result of mixture model selection by BIC.
## S3 method for class 'selectEM' plot(x, leg.loc = "topright", ...)
## S3 method for class 'selectEM' plot(x, leg.loc = "topright", ...)
x |
an object of class |
leg.loc |
the location of the legend, which is the same as the first argument of the function |
... |
other arguments passed to |
The function plot.selectEM
is the plot method for the class selectEM
. It plots
the number of components against the corresponding value of BIC. It is used to visually display
the mixture model selection result by BIC.
x <- rmixnormal(200, c(0.3, 0.7), c(2, 5), c(1, 1)) res <- select(x, ncomp = 1:3) plot(res)
x <- rmixnormal(200, c(0.3, 0.7), c(2, 5), c(1, 1)) res <- select(x, ncomp = 1:3) plot(res)
mixfitEM
This function is the print method for the mixfitEM
class.
## S3 method for class 'mixfitEM' print(x, digits = getOption("digits"), ...)
## S3 method for class 'mixfitEM' print(x, digits = getOption("digits"), ...)
x |
an object of class |
digits |
the digits to print for the values in the print output. The default
value is from the global option |
... |
other arguments passed to |
print.mixfitEM
prints the value of the parameters of a fitted mixture model, together
with some other information like the number of iterations of the EM algorithm, the loglikelihood,
the value of AIC and BIC.
x <- rmixnormal(200, c(0.5, 0.5), c(2, 5), c(1, 0.7)) fit <- mixfit(x, ncomp = 2) print(x)
x <- rmixnormal(200, c(0.5, 0.5), c(2, 5), c(1, 0.7)) fit <- mixfit(x, ncomp = 2) print(x)
selectEM
The function prints the result of mixture model selection.
## S3 method for class 'selectEM' print(x, ...)
## S3 method for class 'selectEM' print(x, ...)
x |
an object of class |
... |
other arguments passed to |
The function print.selectEM
is the print method for the classselectEM, which is the output
of the function select
. It prints a data frame which contains the following information of each
candidate mixture models: the number of components, whether the variance is the same for each component
in a mixture model (only for normal), the value of BIC, and an indicator of the best model.
This function creates a numeric vector approximating the raw data from binned data
reinstate(data)
reinstate(data)
data |
a three-column matrix representing the raw data |
The function reinstate
creates a numeric vector by generating
random data from the Uniform distribution
for
and then combine all random data together.
are the first, second and the third column of the matrix
data
and is the number of bins.
It is used for enabling parameter initialization for EM algorithm when we fit mixture
models for binned data.
The function returns a numeric vector.
x <- rnorm(100) data <- bin(x, seq(-3, 3, 0.25)) y <- reinstate(data)
x <- rnorm(100) data <- bin(x, seq(-3, 3, 0.25)) y <- reinstate(data)
The function rmixgamma
generates random data from a Gamma mixture model.
rmixgamma(n, pi, mu, sd)
rmixgamma(n, pi, mu, sd)
n |
a positive integer specifying the number of observations we want to generate from the mixture model |
pi |
a numeric vector for the proportion of each component |
mu |
a numeric vector for the mean of each component |
sd |
a numeric vector for the standard deviation of each component |
The number of random data from each component (a vector) is generated from a multinomial
distribution Multinom
. Then the random data from each component is generated with
the sample sized specified in
and parameters of Gamma distributions specified in
mu
and sd
.
The function rmixgamma
returns a numeric vector of random data from the specified Gamma mixture model.
rmixnormal
, rmixweibull
, rmixlnorm
x <- rmixgamma(1000, c(0.4, 0.6), c(2, 5), c(1, 0.5)) hist(x, breaks = 40)
x <- rmixgamma(1000, c(0.4, 0.6), c(2, 5), c(1, 0.5)) hist(x, breaks = 40)
The function rmixlnorm
generates random data from a lognormal mixture model.
rmixlnorm(n, pi, mu, sd)
rmixlnorm(n, pi, mu, sd)
n |
a positive integer specifying the number of observations we want to generate from the mixture model |
pi |
a numeric vector for the proportion of each component |
mu |
a numeric vector for the mean of each component |
sd |
a numeric vector for the standard deviation of each component |
The number of random data from each component (a vector) is generated from a multinomial
distribution Multinom
. Then the random data from each component is generated with
the sample sized specified in
and parameters of lognormal distributions specified in
mu
and sd
.
The function rmixlnorm
returns a numeric vector of random data from the specified lognormal mixture model.
rmixnormal
, rmixweibull
, rmixgamma
x <- rmixlnorm(1000, c(0.4, 0.6), c(2, 5), c(1, 0.5)) hist(x, breaks = 40)
x <- rmixlnorm(1000, c(0.4, 0.6), c(2, 5), c(1, 0.5)) hist(x, breaks = 40)
The function rmixnormal
generates random data from a normal mixture model.
rmixnormal(n, pi, mu, sd)
rmixnormal(n, pi, mu, sd)
n |
a positive integer specifying the number of observations we want to generate from the mixture model |
pi |
a numeric vector for the proportion of each component |
mu |
a numeric vector for the mean of each component |
sd |
a numeric vector for the standard deviation of each component |
The number of random data from each component (a vector) is generated from a multinomial
distribution Multinom
. Then the random data from each component is generated with
the sample sized specified in
and parameters of normal distributions specified in
mu
and sd
.
The function rmixnormal
returns a numeric vector of random data from the specified normal mixture model.
rmixweibull
, rmixgamma
, rmixlnorm
x <- rmixnormal(1000, c(0.4, 0.6), c(2, 5), c(1, 0.5)) hist(x, breaks = 40)
x <- rmixnormal(1000, c(0.4, 0.6), c(2, 5), c(1, 0.5)) hist(x, breaks = 40)
The function rmixweibull
generates random data from a normal Weibull model.
rmixweibull(n, pi, mu, sd)
rmixweibull(n, pi, mu, sd)
n |
a positive integer specifying the number of observations we want to generate from the mixture model |
pi |
a numeric vector for the proportion of each component |
mu |
a numeric vector for the mean of each component |
sd |
a numeric vector for the standard deviation of each component |
The number of random data from each component (a vector) is generated from a multinomial
distribution Multinom
. Then the random data from each component is generated with
the sample sized specified in
and parameters of Weibull distributions specified in
mu
and sd
.
The function rmixweibull
returns a numeric vector of random data from the specified Weibull mixture model.
rmixnormal
, rmixgamma
, rmixlnorm
x <- rmixweibull(1000, c(0.4, 0.6), c(2, 5), c(1, 0.5)) hist(x, breaks = 40)
x <- rmixweibull(1000, c(0.4, 0.6), c(2, 5), c(1, 0.5)) hist(x, breaks = 40)
This function selects the best model from a candidate of mixture models based on the information criterion BIC.
select( x, ncomp, family = c("normal", "weibull", "gamma", "lnorm"), mstep.method = c("bisection", "newton"), init.method = c("kmeans", "hclust"), tol = 1e-06, max_iter = 500 )
select( x, ncomp, family = c("normal", "weibull", "gamma", "lnorm"), mstep.method = c("bisection", "newton"), init.method = c("kmeans", "hclust"), tol = 1e-06, max_iter = 500 )
x |
a numeric vector for raw data or a three-column matrix for the binned data |
ncomp |
a vector of positive integers specifying the number of components of the candidate mixture models |
family |
a character string specifying the family of the mixture model. It can only be
one element from |
mstep.method |
a character string specifying the method used in M-step of the EM algorithm
when fitting weibull or gamma mixture models. It can be either |
init.method |
a character string specifying the method used for providing initial values
for the parameters for EM algorithm. It can be one of |
tol |
the tolerance for the stopping rule of EM algorithm. It is the value to stop EM algorithm when the two
consecutive iterations produces loglikelihood with difference less than |
max_iter |
the maximum number of iterations for the EM algorithm (default 500). |
By specifying different number of components, the function select
fits a series
of mixture models for a given family, and a mixture model with minimum value of BIC
is regarded as the best.
The function returns an object of class selectEM
which contains the following items.
ncomp |
the specified number of components of the candidate mixture models |
equal.var |
a logical vector indicating whether the variances of each component in each mixture model
are constrained to be the same (only for |
bic |
the value of BIC for each mixture model |
best |
an indicator of the best model |
family |
the family of the mixture model |
plot.selectEM
, bs.test
, mixfit
## selecting the optimal normal mixture model by BIC set.seed(105) x <- rmixnormal(1000, c(0.3, 0.4, 0.3), c(-4, 0, 4), c(1, 1, 1)) hist(x, breaks = 40) ret <- select(x, ncomp = 2:5) ## [1] "The final model: normal mixture (equal variance) with 3 components" ## (not run) selecting the optimal Weibull mixture model by BIC ## set.seed(106) ## x <- rmixweibull(1000, c(0.3, 0.4, 0.3), c(2, 5, 8), c(0.7, 0.6, 1)) ## ret <- select(x, ncomp = 2:5, family = "weibull") ## [1] "The final model: weibull mixture with 3 components" ## (not run) selecting the optimal Gamma mixture model by BIC ## set.seed(107) ## x <- rmixgamma(1000, c(0.3, 0.7), c(2, 5), c(0.7, 1)) ## ret <- select(x, ncomp = 2:5, family = "gamma") ## [1] "The final model: gamma mixture with 2 components" ## (not run) selecting the optimal lognormal mixture model by BIC ## set.seed(108) ## x <- rmixlnorm(1000, c(0.2, 0.3, 0.2, 0.3), c(4, 7, 9, 12), c(1, 0.5, 0.7, 1)) ## ret <- select(x, ncomp = 2:6, family = "lnorm") ## [1] "The final model: lnorm mixture with 4 components"
## selecting the optimal normal mixture model by BIC set.seed(105) x <- rmixnormal(1000, c(0.3, 0.4, 0.3), c(-4, 0, 4), c(1, 1, 1)) hist(x, breaks = 40) ret <- select(x, ncomp = 2:5) ## [1] "The final model: normal mixture (equal variance) with 3 components" ## (not run) selecting the optimal Weibull mixture model by BIC ## set.seed(106) ## x <- rmixweibull(1000, c(0.3, 0.4, 0.3), c(2, 5, 8), c(0.7, 0.6, 1)) ## ret <- select(x, ncomp = 2:5, family = "weibull") ## [1] "The final model: weibull mixture with 3 components" ## (not run) selecting the optimal Gamma mixture model by BIC ## set.seed(107) ## x <- rmixgamma(1000, c(0.3, 0.7), c(2, 5), c(0.7, 1)) ## ret <- select(x, ncomp = 2:5, family = "gamma") ## [1] "The final model: gamma mixture with 2 components" ## (not run) selecting the optimal lognormal mixture model by BIC ## set.seed(108) ## x <- rmixlnorm(1000, c(0.2, 0.3, 0.2, 0.3), c(4, 7, 9, 12), c(1, 0.5, 0.7, 1)) ## ret <- select(x, ncomp = 2:6, family = "lnorm") ## [1] "The final model: lnorm mixture with 4 components"
A vector containing the 1872 Hidalgo stamp data
Stamp
Stamp
A vector with 485 measurements of the thickness (nm) of the stamps
Izenman, A. J. and Sommer, C. J. Philatelic mixtures and multimodal densities. Journal of the American Statistical association, 83(404):941-953, 1988.
A dataset containing the 1872 Hidalgo stamp data in the form of binned data
Stamp2
Stamp2
A matrix with 62 rows and 3 columns:
the lower bin values
the upper bin values
the number of observations in each bin
The function to_k_lambda_weibull
converts the mean and standard deviation to the shape and scale for
the Weibull distributions.
to_k_lambda_weibull(mu, sd)
to_k_lambda_weibull(mu, sd)
mu |
a numeric vector representing the means of Weibull distributions |
sd |
a numeric vector representing the standard deviations of Weibull distributions.
|
The purpose of this function is to convert the parameterization of Weibull distribution in the form of mean and standard deviation to the form of shape and scale. It can be used for specifying the initial values for the EM algorithm when the first-hand initial values are in the form of mean and standard deviation from K-means clustering algorithm.
a list of two items
k |
a vector of the shapes of Weibull distributions |
lambda |
a vector of the scales of Weibull distributions |
to_k_lambda_weibull(2, 1) to_k_lambda_weibull(c(2, 5), c(1, 0.7))
to_k_lambda_weibull(2, 1) to_k_lambda_weibull(c(2, 5), c(1, 0.7))
The function to_mu_sd_gamma
converts the shape and rate to the mean and standard deviation
to_mu_sd_gamma(alpha, lambda)
to_mu_sd_gamma(alpha, lambda)
alpha |
a numeric vector representing the shape of one or more than one gamma distributions |
lambda |
a numeric vector representing the rate of one or more than one gamma distributions.
|
The purpose of this function is to convert the parameterization of gamma distribution in the form of shape and rate to the form of mean and standard deviation.
a list of two items
mu |
a vector of the means of gamma distributions |
sd |
a vector of the standard deviations of gamma distributions |
to_mu_sd_gamma(2, 1) to_mu_sd_gamma(c(2, 4), c(1, 1))
to_mu_sd_gamma(2, 1) to_mu_sd_gamma(c(2, 4), c(1, 1))
The function to_mu_sd_lnorm
converts the logarithm mean and logarithm standard deviation to the
mean and standard deviation
to_mu_sd_lnorm(mulog, sdlog)
to_mu_sd_lnorm(mulog, sdlog)
mulog |
a vector of logarithm means of lognormal distributions |
sdlog |
a vector of logarithm standard deviations of lognormal distributions |
The purpose of this function is to convert the parameterization of lognormal distribution in the form of logarithm mean and logarithm standard deviation to the form of mean and standard deviation.
a list of two items
mu |
a vector of the means of lognormal distributions |
sd |
a vector of the standard deviations of lognormal distributions |
to_mu_sd_lnorm(2, 1) to_mu_sd_lnorm(c(2, 4), c(1, 1))
to_mu_sd_lnorm(2, 1) to_mu_sd_lnorm(c(2, 4), c(1, 1))
The function to_mu_sd_weibull
converts the parameters of shape and scale of weibull distributions to
the parameters of the mean and standard deviation.
to_mu_sd_weibull(k, lambda)
to_mu_sd_weibull(k, lambda)
k |
a numeric vector representing the shape of a series of Weibull distributions |
lambda |
a numeric vector representing the scale of a series of Weibull distributions.
|
The purpose of this function is to convert the parameterization of Weibull distribution in the form of shape and scale to the form of mean and standard deviation.
a list of two items
mu |
a vector of the means of Weibull distributions |
sd |
a vector of the standard deviations of Weibull distributions |
to_mu_sd_weibull(2, 1) to_mu_sd_weibull(c(2, 4), c(1, 1))
to_mu_sd_weibull(2, 1) to_mu_sd_weibull(c(2, 4), c(1, 1))
The function to_mulog_sdlog_lnorm
converts the mean and standard deviation to
the logarithm mean and logarithm standard deviation
to_mulog_sdlog_lnorm(mu, sd)
to_mulog_sdlog_lnorm(mu, sd)
mu |
a vector of means of lognormal distributions |
sd |
a vector of standard deviations of lognormal distributions |
The purpose of this function is to convert the parameterization of lognormal distribution in the form of mean and standard deviation to the form of logarithm mean and logarithm standard deviation. It can be used for specifying the initial values for the EM algorithm when the first-hand initial values are in the form of mean and standard deviation from K-means clustering algorithm.
a list of two items
mulog |
a vector of lognormal means of lognormal distributions |
sdlog |
a vector of lognormal standard deviations of lognormal distributions |
to_mulog_sdlog_lnorm(2, 1) to_mulog_sdlog_lnorm(c(2, 4), c(1, 1))
to_mulog_sdlog_lnorm(2, 1) to_mulog_sdlog_lnorm(c(2, 4), c(1, 1))
The function to_shape_rate_gamma
converts the mean and standard deviation to the shape and rate
to_shape_rate_gamma(mu, sd)
to_shape_rate_gamma(mu, sd)
mu |
a numeric vector representing the means of gamma distributions |
sd |
a numeric vector representing the standard deviations of gamma distributions.
|
The purpose of this function is to convert the parameterization of gamma distribution in the form of mean and standard deviation to the form of shape and rate. It can be used for specifying the initial values for the EM algorithm when the first-hand initial values are in the form of mean and standard deviation from K-means clustering algorithm.
a list of two items
alpha |
a vector of the shapes of gamma distributions |
lambda |
a vector of the rates of gamma distributions |
to_shape_rate_gamma(2, 1) to_shape_rate_gamma(c(2, 4), c(1, 1))
to_shape_rate_gamma(2, 1) to_shape_rate_gamma(c(2, 4), c(1, 1))