Monday, July 26, 2010

Testing the significance of mediating effects with WarpPLS using the Baron & Kenny approach

This post discusses how you can use WarpPLS to test a mediating effect using what is often referred to as the classic Baron and Kenny approach (for a recent discussion, see: Kock, 2014). You can also test mediating effects directly with WarpPLS, using its indirect and total effect outputs.

Using WarpPLS, one can test the significance of a mediating effect of a variable M, which is hypothesized to mediate the relationship between two other variables X and Y, by using Baron & Kenny’s (1986) criteria. The procedure is outlined below. It can be easily adapted to test multiple mediating effects, and more complex mediating effects (e.g., with multiple mediators). Please note that we are not referring to moderating effects here; these can be tested directly with WarpPLS, by adding moderating links to a model.

First, two models must be built. The first model should have X pointing at Y, without M being included in the model. (You can have the variable in the WarpPLS model, but there should be no links from or to it.) The second model should have X pointing at Y, X pointing at M, and M pointing at Y. This is a “triangle”-looking model. A WarpPLS analysis must be conducted with both models, which may be saved in two different project files; this analysis may use linear or nonlinear analysis algorithms. The mediating effect will be significant if the three following criteria are met:

- In the first model, the path between X and Y is significant (e.g., P < 0.05, if this is the significance level used).

- In the second model, the path between X and M is significant.

- In the second model, the path between M and Y is significant.

Note that, in the second model, the path between M and Y controls for the effect of X. That is the way it should be. Also note that the effect of X on Y in the second model is irrelevant for this mediation significance test. Nevertheless, if the effect of X on Y in the second model is insignificant (i.e., indistinguishable from zero, statistically speaking), one can say that the case is one of “perfect” mediation. On the other hand, if the effect of X on Y in the second model is significant, one can say that the case is one of “partial” mediation. This of course assumes that the three criteria are met.

Generally, the lower the direct effect of X on Y in the second model, the more “perfect” the mediation is, if the three criteria for mediating effect significance are met.
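The decision rule above can be sketched in a few lines of Python. This is only an illustration, not something WarpPLS produces: the function names, the 0.05 default, and the example numbers are all hypothetical, and the path formulas assume a plain linear analysis with standardized variables. The first function shows why the path between M and Y in the second model "controls for" X; the second applies the three criteria to P values read off the two models.

```python
def paths_from_correlations(r_xm, r_xy, r_my):
    """Standardized linear paths for the "triangle" model (assumed linear case).

    Returns (a, b, c_prime): X->M, M->Y controlling for X, X->Y controlling for M.
    """
    a = r_xm                  # M has a single predictor, so the path is the correlation
    denom = 1.0 - r_xm ** 2   # shared denominator of the two-predictor solution
    b = (r_my - r_xy * r_xm) / denom
    c_prime = (r_xy - r_my * r_xm) / denom
    return a, b, c_prime


def baron_kenny(p_xy_model1, p_xm_model2, p_my_model2, p_xy_model2, alpha=0.05):
    """Classify mediation from the P values of the two models."""
    criteria_met = (
        p_xy_model1 < alpha      # first model: X -> Y is significant
        and p_xm_model2 < alpha  # second model: X -> M is significant
        and p_my_model2 < alpha  # second model: M -> Y is significant
    )
    if not criteria_met:
        return "no significant mediation"
    # The direct X -> Y path in the second model only labels the type of mediation.
    return "partial mediation" if p_xy_model2 < alpha else "perfect mediation"


# Hypothetical correlations and P values, for illustration only.
a, b, c_prime = paths_from_correlations(r_xm=0.5, r_xy=0.3, r_my=0.6)
label = baron_kenny(0.01, 0.001, 0.001, 0.40)
print(a, round(b, 3), round(c_prime, 3), label)
```

With these made-up correlations the direct effect of X on Y vanishes once M is in the model, which is what one would expect to see in a case of "perfect" mediation.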


Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality & Social Psychology, 51(6), 1173-1182.

Kock, N. (2014). Advanced mediating effects tests, multi-group analyses, and measurement model assessments in PLS-based SEM. International Journal of e-Collaboration, 10(1), 1-13.

Friday, July 23, 2010

Use formative latent variables with caution

One should use formative latent variables (LVs) with caution in structural equation modeling analyses using WarpPLS. It is not uncommon to see formative LVs being created simply by casually aggregating indicators, without much concern about whether the indicators are actually facets of the same construct. See this post for more details.

It is also important to stress that formative LVs are better assessed when included as part of a model. This is preferable to analyzing formative LVs individually; that is, as “models” that include one single LV. The loadings and cross-loadings table takes into consideration both formative and reflective LVs in its calculation, and may suggest that some indicators do not “belong” to a formative LV.

Also, certain model parameters may become unstable due to collinearity. High collinearity among indicators is to be expected in reflective LV measurement, but not in formative LV measurement. In the context of formative LV assessment, collinearity may be reflected in unstable weights, where unexpected P values (usually statistically non-significant) are associated with weights.

In formative LVs, indicators are expected to measure different facets of the LV, not the same thing. If two (or more) indicators are collinear in a formative LV, it may be a good idea to collapse them into one indicator. This can be done by defining second order LVs (a two-step, somewhat complex procedure), averaging the indicators, or simply eliminating one of the indicators from the analysis.
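As a rough sketch of the averaging option, the Python below flags pairs of indicators that are highly correlated and collapses one such pair into a single indicator. The indicator names, the data, and the 0.8 threshold are all hypothetical, and a simple pairwise correlation is used here as a stand-in for a fuller collinearity check. Averaging like this assumes the two indicators are measured on the same scale.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def collinear_pairs(indicators, threshold=0.8):
    """Return name pairs whose absolute correlation exceeds the threshold."""
    names = list(indicators)
    return [
        (p, q)
        for i, p in enumerate(names)
        for q in names[i + 1:]
        if abs(pearson(indicators[p], indicators[q])) > threshold
    ]


def average_indicators(x, y):
    """Collapse two indicators into one by averaging them case by case."""
    return [(a + b) / 2 for a, b in zip(x, y)]


# Hypothetical answers from seven respondents to three questionnaire items.
indicators = {
    "ind1": [1, 2, 3, 4, 5, 6, 7],
    "ind2": [1, 2, 3, 5, 4, 6, 7],
    "ind3": [4, 1, 6, 2, 7, 3, 5],
}
pairs = collinear_pairs(indicators)
merged = average_indicators(indicators["ind1"], indicators["ind2"])
print(pairs, merged)
```

Here only the first two indicators are flagged, so they would be replaced by their average while the third stays in the LV as is.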

Thursday, July 22, 2010

Using WarpPLS in an exploratory path analysis of health-related data

There has been quite a lot of debate lately on the findings of a study known as the China Study. One of the key hypotheses of that study is that animal protein consumption (e.g., meat, dairy) causes various types of cancer, including colorectal cancer. Total cholesterol has been proposed as one of the intervening variables in connection with this effect. Given that, I decided to take a look at some of the data from the China Study and do a couple of multivariate data analyses on it using WarpPLS.

First I built a model that explores relationships with the goal of testing the assumption that the consumption of animal protein causes colorectal cancer, via an intermediate effect on total cholesterol. I built the model with various hypothesized associations to explore several relationships simultaneously, including some commonsense ones. Including commonsense relationships is usually a good idea in exploratory multivariate analyses.

The model is shown on the graph below, with the results. (Click on it to enlarge. Use the "CTRL" and "+" keys to zoom in, and "CTRL" and "-" to zoom out.) The arrows explore causative associations between variables. The variables are shown within ovals. The meaning of each variable is the following: aprotein = animal protein consumption; pprotein = plant protein consumption; cholest = total cholesterol; crcancer = colorectal cancer.

The “(R)1i” below the variable names simply means that each of the variables is measured through a single indicator. This characterizes this analysis as a path analysis, rather than a true structural equation modeling (SEM) analysis. The P values were calculated through jackknifing. Like bootstrapping and other nonparametric resampling techniques, jackknifing does not require the assumption that the data be normally distributed. This is good, because I checked the data, and it does not look like it is normally distributed. So what does the model above tell us? It tells us that:

- As animal protein consumption increases, colorectal cancer decreases, but not in a statistically significant way (beta=-0.13; P=0.11).

- As animal protein consumption increases, plant protein consumption decreases significantly (beta=-0.19; P<0.01).

- As plant protein consumption increases, colorectal cancer increases significantly (beta=0.30; P=0.03). This is statistically significant because the P is lower than 0.05.

- As animal protein consumption increases, total cholesterol increases significantly (beta=0.20; P<0.01).

- As plant protein consumption increases, total cholesterol decreases significantly (beta=-0.23; P=0.02).

- As total cholesterol increases, colorectal cancer increases significantly (beta=0.45; P<0.01). Big surprise here!

Why the big surprise with the apparently strong relationship between total cholesterol and colorectal cancer? The reason is that it does not make sense, because animal protein consumption seems to increase total cholesterol, and yet animal protein consumption seems to decrease colorectal cancer.

When something like this happens in a multivariate analysis, it may be due to the model not incorporating a variable that has important relationships with the other variables. In other words, the model is incomplete, hence the nonsensical results. Relationships among variables that are implied by coefficients of association must also make sense to be credible.

Now, it has been pointed out that the missing variable here possibly is schistosomiasis infection. The dataset from the China Study included that variable, even though there were some missing values (about 28 percent of the data for that variable was missing), so I added it to the model in a way that seems to make sense. The new model is shown on the graph below. In the model, schisto = schistosomiasis infection.

So what does this new, and more complete, model tell us? It tells us some of the things that the previous model told us, but a few new things, which make a lot more sense. Note that this model fits the data much better than the previous one, particularly regarding the overall effect on colorectal cancer, which is indicated by the high R-squared value for that variable (R-squared=0.73). Most notably, this new model tells us that:

- As schistosomiasis infection increases, colorectal cancer increases significantly (beta=0.83; P<0.01). This is a MUCH STRONGER relationship than the previous one between total cholesterol and colorectal cancer, even though some data on schistosomiasis infection is missing for a few counties (the relationship might have been even stronger with a complete dataset). And this strong relationship makes sense, because schistosomiasis infection is indeed associated with increased cancer rates. More information on schistosomiasis infections can be found here.

- Schistosomiasis infection has no significant relationship with these variables: animal protein consumption, plant protein consumption, or total cholesterol. This makes sense, as the infection is caused by a worm that is not normally present in plant or animal food, and the infection itself is not specifically associated with abnormalities that would lead one to expect major increases in total cholesterol.

- Total cholesterol has no significant relationship with colorectal cancer (beta=0.24; P=0.11). The beta here is nontrivial, but too low to be significant; i.e., we cannot discard chance within the context of this relatively small dataset.

- Animal protein consumption has no significant relationship with colorectal cancer. The beta here is very low, and negative (beta=-0.03).

- Plant protein consumption has no significant relationship with colorectal cancer. The beta for this association is positive and nontrivial (beta=0.15), but the P value is too high (P=0.20) for us to discard chance within the context of this dataset.

Below is the plot showing the relationship between schistosomiasis infection and colorectal cancer. The values are standardized, which means that the zero on the horizontal axis is the mean of the schistosomiasis infection numbers in the dataset. The shape of the plot is the same as the one with the unstandardized data. As you can see, the data points are very close to a line, which suggests a very strong linear association.
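For readers who want to see why standardizing does not change the shape of the plot, here is a small Python sketch. The numbers are made up and are not from the China Study dataset. Standardizing subtracts the mean and divides by the standard deviation, so the zero ends up at the mean and the strength of the linear association is left untouched.

```python
import math


def standardize(values):
    """Convert values to z-scores (mean 0, sample standard deviation 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical raw values for two variables (illustration only).
raw_x = [0.0, 1.5, 3.0, 4.0, 7.5, 9.0]
raw_y = [2.0, 2.4, 3.1, 3.3, 4.6, 5.2]

z_x, z_y = standardize(raw_x), standardize(raw_y)
print(sum(z_x) / len(z_x))                        # mean of the z-scores
print(pearson(raw_x, raw_y), pearson(z_x, z_y))   # same association either way
```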

In summary, an exploratory path analysis with WarpPLS can shed light on data patterns that would look rather “mysterious” otherwise. Still, one has to use common sense, good theory, and past empirical results to derive conclusions.

Tuesday, July 13, 2010

Using WarpPLS for multiple regression analyses

Using WarpPLS to conduct a multiple regression analysis has two main advantages over a traditional multiple regression analysis, in which the independent and dependent variables are measured through single indicators. With WarpPLS, this would be implemented through the creation of "latent" variables that would each be associated with a single indicator, which means that they would not be true latent variables in the sense normally assumed in structural equation modeling.

The first advantage is that the calculation of P values with WarpPLS is based on nonparametric algorithms (resampling or "stable" algorithms), and thus does not require that the variables be normally distributed. A traditional multiple regression analysis, on the other hand, relies on parametric P values, which assume normality (strictly speaking, of the error terms). In this sense, WarpPLS can be seen as conducting a robust, or nonparametric, multiple regression analysis. This first advantage assumes that all one is doing is a plain linear analysis with WarpPLS, for which one would typically use the algorithm Robust Path Analysis. See the software's User Manual for more details.
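To give a feel for how a resampling-based P value can be obtained without assuming normality, here is a minimal jackknife sketch in Python. It is not the code WarpPLS runs. It assumes a single standardized predictor (so the path coefficient equals the correlation), a two-tailed test, and a normal approximation for the ratio of coefficient to standard error. The simulated data, with deliberately skewed noise, are made up for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def jackknife_p_value(x, y):
    """Return (coefficient, jackknife standard error, two-tailed P value)."""
    n = len(x)
    estimate = pearson(x, y)
    # Re-estimate the coefficient n times, each time leaving one case out.
    leave_one_out = [
        pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(n)
    ]
    mean_loo = sum(leave_one_out) / n
    # Standard jackknife estimate of the standard error.
    se = math.sqrt((n - 1) / n * sum((r - mean_loo) ** 2 for r in leave_one_out))
    t = estimate / se
    # Two-tailed P value under a normal approximation (an assumption).
    p = math.erfc(abs(t) / math.sqrt(2))
    return estimate, se, p


# Simulated, non-normal data: the noise in y is skewed (exponential).
random.seed(1)
x = [random.gauss(0, 1) for _ in range(200)]
y = [0.5 * xi + random.expovariate(1.0) for xi in x]

estimate, se, p = jackknife_p_value(x, y)
print(estimate, se, p)
```

The standard error comes entirely from how much the coefficient moves when individual cases are dropped, which is why no distributional assumption about the variables themselves is needed at that step.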

The second advantage is that WarpPLS allows for nonlinear relationships between the independent and dependent variables to be explicitly modeled. This provides a much richer view of the associations between variables, and sometimes leads to path coefficients that are different from (often higher than) those obtained through a linear analysis (as in a traditional multiple regression analysis). The nonlinear analysis algorithms available are Warp3 and variants, which yield S curves; and Warp2 and variants, which yield U curves. Again, see the software's User Manual for more details.
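The point about nonlinear analyses yielding higher coefficients can be illustrated with a toy example in Python. To be clear, this is not the Warp2 algorithm; it simply uses a squared predictor as a hypothetical stand-in for a "warped" one, on made-up data that follow a perfect U curve.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: y is a perfect U-shaped function of x.
x = [float(v) for v in range(-5, 6)]
y = [v ** 2 for v in x]

# A linear analysis sees no association at all.
linear_coefficient = pearson(x, y)

# Transforming the predictor to follow the U curve (hypothetical stand-in
# for a warped predictor) recovers the full association.
warped_x = [v ** 2 for v in x]
nonlinear_coefficient = pearson(warped_x, y)

print(linear_coefficient, nonlinear_coefficient)
```

In real data the difference is rarely this extreme, but the direction is the same: when the underlying relationship is a U or S curve, a linear coefficient understates it.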