Sunday, February 21, 2010

What are the inner and outer models in SEM?

In a structural equation modeling (SEM) analysis, the inner model is the part of the model that describes the relationships among the latent variables that make up the model. The outer model is the part of the model that describes the relationships among the latent variables and their indicators.

In this sense, the path coefficients are inner model parameter estimates. The weights and loadings are outer model parameter estimates. The inner and outer models are also frequently referred to as the structural and measurement models, respectively.

More precisely, the mathematical equations that express the relationships among latent variables are referred to as the structural model. The equations that express the relationships among latent variables and indicators are referred to as the measurement model.
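In the conventional (LISREL-style) notation, these two sets of equations can be sketched as follows; the symbols are the textbook ones, not WarpPLS-specific:

```latex
% Structural (inner) model: relationships among latent variables.
% \eta = endogenous latent variables, \xi = exogenous latent variables,
% \zeta = structural error terms.
\eta = B\eta + \Gamma\xi + \zeta

% Measurement (outer) model: relationships between latent variables
% and their indicators (y, x), with measurement errors \varepsilon, \delta.
y = \Lambda_y \eta + \varepsilon, \qquad x = \Lambda_x \xi + \delta
```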

The term structural equation model is used to refer to the structural and measurement models combined.

Nonlinearity and type I and II errors in SEM analysis

Many relationships between variables studied in the natural and behavioral sciences seem to be nonlinear, often following a J-curve pattern (a.k.a. U-curve pattern). Other common relationships include the logarithmic, hyperbolic decay, exponential decay, and exponential. These and other relationships are modeled by WarpPLS.

Yet, the vast majority of statistical analysis methods used in the natural and behavioral sciences, from simple correlation analysis to structural equation modeling, assume relationships to be linear in the estimation of coefficients of association (e.g., Pearson correlations, standardized partial regression coefficients).

This may significantly distort results, especially in multivariate analyses, increasing the likelihood that researchers will commit type I and II errors in the same study. A type I error occurs in SEM analysis when an insignificant (the technical term is "non-significant") association is estimated as being significant (i.e., a “false positive”); a type II error occurs when a significant association is estimated as being insignificant (i.e., an existing association is “missed”).

The figure below shows a distribution of points typical of a J-curve pattern involving two variables, disrupted by uncorrelated error. The pattern, however, is modeled as a linear relationship. The line passing through the points is the best linear approximation of the distribution of points. It yields a correlation coefficient of .582. In this situation, the variable on the horizontal axis explains 33.9 percent of the variance of the variable on the vertical axis.

The figure below shows the same J-curve scatter plot pattern, but this time modeled as a nonlinear relationship. The curve passing through the points is the best nonlinear approximation of the distribution of the underlying J-curve, and excludes the uncorrelated error. That is, the curve does not attempt to model the uncorrelated error, only the underlying nonlinear relationship. It yields a correlation coefficient of .983. Here the variable on the horizontal axis explains 96.7 percent of the variance of the variable on the vertical axis.
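The contrast between the two figures can be reproduced with a small simulation. The sketch below generates a J-curve pattern (here an exponential, one common J-curve shape) disrupted by uncorrelated error, then compares the linear correlation with the correlation obtained after "warping" the predictor. This is only an illustration of the general idea, not WarpPLS's actual algorithm, and the specific function and noise level are assumptions:

```python
import math
import random

random.seed(0)

# Simulated J-curve data: y grows slowly at first, then sharply, in x,
# plus uncorrelated error. (Illustrative choices, not WarpPLS's algorithm.)
n = 200
x = [random.uniform(0.0, 3.0) for _ in range(n)]
y = [math.exp(xi) + random.gauss(0.0, 1.0) for xi in x]

def pearson(a, b):
    """Plain Pearson correlation coefficient."""
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(va * vb)

# Linear model of a nonlinear pattern vs. the same data after "warping" x
# to match the underlying J-curve shape.
r_linear = pearson(x, y)
r_warped = pearson([math.exp(xi) for xi in x], y)

print(round(r_linear, 3), round(r_warped, 3))
```

The warped correlation comes out substantially higher than the linear one, and the variance explained (the square of the correlation) rises accordingly, mirroring the jump from 33.9 to 96.7 percent described above.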

WarpPLS transforms (or “warps”) J-curve relationship patterns like the one above BEFORE the corresponding path coefficients between each pair of variables are calculated. It does the same for many other nonlinear relationship patterns. In multivariate analyses, this may significantly change the values of the path coefficients, reducing the risk that researchers will commit type I and II errors.

The risk of committing type I and II errors is particularly high when: (a) a block of latent variables includes multiple predictor variables pointing at the same criterion variable; (b) one or more relationships between latent variables are significantly nonlinear; and (c) the predictor latent variables are correlated, even if they clearly measure different constructs (as suggested by low variance inflation factors).

Sunday, February 14, 2010

How do I control for the effects of one or more demographic variables in an SEM analysis?

As part of an SEM analysis using WarpPLS, a researcher may want to control for the effects of one or more variables. This is typically the case with what are called “demographic variables”, or variables that measure attributes of a given unit of analysis that are (usually) not expected to influence the results of the SEM analysis.

For example, let us assume that one wants to assess the effect of a technology, whose intensity of use is measured by a latent variable T, on a behavioral variable measured by B. The unit of analysis for B is the individual user; that is, each row in the dataset refers to an individual user of the technology. The researcher hypothesizes that the association between T and B is significant, so a direct link between T and B is included in the model.

If the researcher wants to control for age (A) and gender (G), which have also been collected for each individual, in relation to B, all that is needed is to include the variables A and G in the model, with direct links pointing at B. No hypotheses are made. For that to work, gender (G) has to be included in the dataset as a numeric variable. For example, the gender "male" may be replaced with 1 and "female" with 2, in which case the variable G will essentially measure the "degree of femaleness" of each individual. Sounds odd, but works.
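The logic of controlling for A and G amounts to including them as additional predictors of B. The sketch below shows this with ordinary least squares on simulated data; the variable names T, A, G, and B follow the example above, but the data, the effect size of 0.5, and the plain OLS solver are all assumptions for illustration only (WarpPLS's own estimation works differently):

```python
import random

random.seed(1)

# Simulated data: B depends on T; A (age) and G (gender, coded 1 = male,
# 2 = female, as in the post) are controls with no real effect on B.
n = 300
T = [random.gauss(0, 1) for _ in range(n)]
A = [random.uniform(20, 60) for _ in range(n)]
G = [random.choice([1, 2]) for _ in range(n)]
B = [0.5 * t + random.gauss(0, 1) for t in T]

def ols(X, y):
    """Solve the normal equations (X'X) beta = X'y by Gauss-Jordan
    elimination with partial pivoting."""
    k = len(X[0])
    M = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]

# Columns: intercept, T, A, G -- including A and G is what "controls for"
# them when interpreting the coefficient of T.
X = [[1.0, t, a, g] for t, a, g in zip(T, A, G)]
beta = ols(X, B)
print(round(beta[1], 2))  # effect of T on B, with A and G controlled for
```

The estimated coefficient of T stays close to its true value even with A and G in the model, which is exactly the "regardless of A and G" interpretation described below.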

After the analysis is conducted, let us assume that the path coefficient between T and B is found to be statistically significant, with the variables A and G included in the model as described above. In this case, the researcher can say that the association between T and B is significant, “regardless of A and G” or “when the effects of A and G are controlled for”.

In other words, the technology (T) affects behavior (B) in the hypothesized way regardless of age (A) and gender (G). This conclusion would remain the same whether or not the path coefficients linking A and G to B were themselves significant, because the focus of the analysis is on B, the main dependent variable of the model.

The discussion above is expanded in the publication below, which also contains a graphical representation of a model including control variables.

Kock, N. (2011). Using WarpPLS in e-collaboration studies: Mediating effects, control and second order variables, and algorithm choices. International Journal of e-Collaboration, 7(3), 1-13.

Some special considerations and related analysis decisions usually have to be made in more complex models, with multiple endogenous variables, and also regarding the fit indices.

Tuesday, February 9, 2010

Variance inflation factors: What they are and what they contribute to SEM analyses

Note: This post refers to the use of variance inflation factors to identify "vertical", or classic, multicollinearity. For a broader discussion of variance inflation factors in the context of "lateral" and "full" collinearity tests, please see Kock & Lynn (2012). For a discussion in the context of common method bias tests, see the post linked here.

Variance inflation factors are provided in table format by WarpPLS for each latent variable that has two or more predictors. Each variance inflation factor is associated with one predictor, and relates to the link between that predictor and its latent variable criterion. (Or criteria, when one predictor latent variable points at two or more different latent variables in the model.)

A variance inflation factor is a measure of the degree of multicollinearity among the latent variables that are hypothesized to affect another latent variable. For example, let us assume that there is a block of latent variables in a model, with three latent variables A, B, and C (predictors) pointing at latent variable D. In this case, variance inflation factors are calculated for A, B, and C, and are estimates of the multicollinearity among these predictor latent variables.

Two criteria, one more conservative and one more relaxed, are traditionally recommended in connection with variance inflation factors. More conservatively, it is recommended that variance inflation factors be lower than 5; a more relaxed criterion is that they be lower than 10 (Hair et al., 1987; Kline, 1998). High variance inflation factors usually occur for pairs of predictor latent variables, and suggest that the latent variables measure the same thing. This problem can be solved through the removal of one of the offending latent variables from the block.
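For the common two-predictor case, the variance inflation factor reduces to the textbook formula 1 / (1 - r²), where r is the correlation between the two predictors. The sketch below simulates a pair of highly correlated predictors and computes their VIF this way; the data and the 0.95 correlation level are assumptions for illustration, and this is not WarpPLS output:

```python
import math
import random

random.seed(2)

# Two strongly correlated predictors, A and B, pointing at the same
# criterion. B is built from A so that their correlation is about 0.95.
n = 500
A = [random.gauss(0, 1) for _ in range(n)]
B = [0.95 * a + math.sqrt(1 - 0.95 ** 2) * random.gauss(0, 1) for a in A]

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

r = pearson(A, B)
vif = 1.0 / (1.0 - r ** 2)  # two-predictor VIF
print(round(vif, 1))
```

Here the VIF lands well above 5, flagging the pair as likely measuring the same thing; with more than two predictors, each VIF is computed as 1 / (1 - R²) from a regression of that predictor on all the others.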


Hair, J.F., Anderson, R.E., & Tatham, R.L. (1987). Multivariate data analysis. New York, NY: Macmillan.

Kline, R.B. (1998). Principles and practice of structural equation modeling. New York, NY: The Guilford Press.

Kock, N., & Lynn, G.S. (2012). Lateral collinearity and misleading results in variance-based SEM: An illustration and recommendations. Journal of the Association for Information Systems, 13(7), 546-580.

Two new WarpPLS workshops in March and April 2010

Two new online workshops on WarpPLS will be offered in March and April 2010!

For more information on these and other WarpPLS workshops please visit:

Thursday, February 4, 2010

Reading data into WarpPLS: An easy and flexible step

Through Step 2, you will read the raw data used in the SEM analysis. While this should be a relatively trivial step, it is in fact one of the steps where users have the most problems with other SEM software. Often an SEM software application will abort, or freeze, if the raw data is not in the exact format required by the SEM software, or if there are any problems with the data, such as missing values (empty cells).

WarpPLS employs an import "wizard" that avoids most data reading problems, even if it does not entirely eliminate the possibility that a problem will occur. Click only on the “Next” and “Finish” buttons of the file import wizard, and let the wizard do the rest. Soon after the raw data is imported, it will be shown on the screen, and you will be given the opportunity to accept or reject it. If there are problems with the data, such as missing column names, simply click “No” when asked if the data looks correct.

Raw data can be read directly from Excel files, with extension “.xls” (or newer Excel extensions), or text files where the data is tab-delimited or comma-delimited. When reading from an “.xls” file with multiple worksheets, make sure that your numeric data is in the first worksheet. Raw data files, whether Excel or text files, must have indicator names in the first row, and numeric data in the following rows. They may contain empty cells, or missing values; these will be automatically replaced with column averages in a later step (assuming that the default missing data imputation algorithm is used).
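The expected file layout, and the column-average imputation applied to empty cells, can be sketched as follows. The tiny tab-delimited dataset, the indicator names, and the use of Python's csv module are assumptions for illustration; WarpPLS performs this step internally:

```python
import csv
import io

# A tab-delimited raw data file in the format described above: indicator
# names in the first row, numeric data below, with one empty cell in the
# second data row. io.StringIO stands in for an actual file on disk.
raw = "T1\tT2\tB1\n4\t5\t3\n\t4\t2\n2\t3\t1\n"

rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))
names, data = rows[0], rows[1:]

# Replace each empty cell with its column average, mirroring the default
# missing data imputation described above.
for j in range(len(names)):
    present = [float(r[j]) for r in data if r[j] != ""]
    mean = sum(present) / len(present)
    for r in data:
        if r[j] == "":
            r[j] = str(mean)

print(data[1][0])  # the missing T1 value, imputed from (4 + 2) / 2
```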

One simple test can be used to try to find out if there are problems with a raw data file. Try to open it with spreadsheet software (e.g., Excel), if it is originally a text file; or try to create a tab-delimited text file from it, if it is originally a spreadsheet file. If you try to do either of these things, and the data looks messed up (e.g., corrupted, or missing column names), then it is likely that the original file has problems, which may be hidden from view. For example, a spreadsheet file may be corrupted, but that may not be evident based on a simple visual inspection of the contents of the file.