
Thursday, July 22, 2010

Using WarpPLS in an exploratory path analysis of health-related data


There has been quite a lot of debate lately on the findings of a study known as the China Study. One of the key hypotheses of that study is that animal protein consumption (e.g., meat, dairy) causes various types of cancer, including colorectal cancer. Total cholesterol has been proposed as one of the intervening variables in connection with this effect. Given that, I decided to take a look at some of the data from the China Study and run a couple of multivariate data analyses on it using WarpPLS.

First, I built a model to test the assumption that the consumption of animal protein causes colorectal cancer via an intermediate effect on total cholesterol. I built the model with various hypothesized associations to explore several relationships simultaneously, including some commonsense ones. Including commonsense relationships is usually a good idea in exploratory multivariate analyses.

The model is shown on the graph below, with the results. (Click on it to enlarge. Use the "CTRL" and "+" keys to zoom in, and "CTRL" and "-" to zoom out.) The arrows explore causative associations between variables. The variables are shown within ovals. The meaning of each variable is the following: aprotein = animal protein consumption; pprotein = plant protein consumption; cholest = total cholesterol; crcancer = colorectal cancer.


The “(R)1i” below the variable names simply means that each of the variables is measured through a single indicator. This characterizes this analysis as a path analysis, rather than a true structural equation modeling (SEM) analysis. The P values were calculated through jackknifing. Like bootstrapping and other nonparametric resampling techniques, jackknifing does not require the assumption that the data be normally distributed. This is good, because I checked the data, and it does not look like it is normally distributed. So what does the model above tell us? It tells us that:

- As animal protein consumption increases, colorectal cancer decreases, but not in a statistically significant way (beta=-0.13; P=0.11).

- As animal protein consumption increases, plant protein consumption decreases significantly (beta=-0.19; P<0.01).

- As plant protein consumption increases, colorectal cancer increases significantly (beta=0.30; P=0.03). This is statistically significant because the P is lower than 0.05.

- As animal protein consumption increases, total cholesterol increases significantly (beta=0.20; P<0.01).

- As plant protein consumption increases, total cholesterol decreases significantly (beta=-0.23; P=0.02).

- As total cholesterol increases, colorectal cancer increases significantly (beta=0.45; P<0.01). Big surprise here!
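The P values in the list above come from jackknifing. As a rough illustration of the idea, here is a minimal Python sketch of a leave-one-out jackknife for a single standardized path coefficient. It is not WarpPLS's own procedure: the function name `jackknife_beta`, the synthetic data, and the two-sided normal approximation used for the P value are all assumptions made for this example.

```python
import math
import numpy as np

def jackknife_beta(x, y):
    """Leave-one-out jackknife for a standardized slope with one predictor.

    Returns (beta, standard_error, p_value). With a single predictor, the
    standardized slope equals the Pearson correlation between x and y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)

    def beta(xs, ys):
        return float(np.corrcoef(xs, ys)[0, 1])

    full = beta(x, y)
    # Re-estimate the coefficient n times, each time leaving out one case.
    loo = np.array([beta(np.delete(x, i), np.delete(y, i)) for i in range(n)])
    # Standard jackknife estimate of the standard error.
    se = math.sqrt((n - 1) / n * float(np.sum((loo - loo.mean()) ** 2)))
    # ASSUMPTION: two-sided P value from a normal approximation.
    p_value = math.erfc(abs(full / se) / math.sqrt(2))
    return full, se, p_value

# Synthetic data for illustration only (NOT the China Study data).
rng = np.random.default_rng(0)
x = rng.normal(size=60)
y = 0.8 * x + 0.6 * rng.normal(size=60)

beta, se, p_value = jackknife_beta(x, y)
print(f"beta={beta:.2f}, SE={se:.3f}, P={p_value:.4f}")
```

Note that nothing in the resampling step assumes normally distributed data; the standard error is built purely from how much the coefficient moves when each case is left out.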

Why the big surprise with the apparently strong relationship between total cholesterol and colorectal cancer? The reason is that it does not make sense, because animal protein consumption seems to increase total cholesterol, and yet animal protein consumption seems to decrease colorectal cancer.
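One way to see the inconsistency is to multiply the coefficients along each route from aprotein to crcancer and add the products up, as in classic path tracing. The sketch below does that with the betas reported for the first model. It assumes the coefficients can be treated as linear standardized paths, which is a simplification, since WarpPLS can also model nonlinear relationships; the helper names are made up for this example.

```python
import math

# Path coefficients reported for the first model (source -> target).
betas = {
    ("aprotein", "crcancer"): -0.13,
    ("aprotein", "pprotein"): -0.19,
    ("pprotein", "crcancer"): 0.30,
    ("aprotein", "cholest"): 0.20,
    ("pprotein", "cholest"): -0.23,
    ("cholest", "crcancer"): 0.45,
}

def routes(src, dst):
    """Yield every directed route from src to dst as a list of edges.

    Assumes the model has no cycles, which holds for the edges above.
    """
    if src == dst:
        yield []
        return
    for (a, b) in betas:
        if a == src:
            for rest in routes(b, dst):
                yield [(a, b)] + rest

def route_effect(route):
    """Product of the coefficients along one route."""
    return math.prod(betas[edge] for edge in route)

all_routes = list(routes("aprotein", "crcancer"))
for r in all_routes:
    label = " -> ".join([r[0][0]] + [edge[1] for edge in r])
    print(f"{label}: {route_effect(r):+.4f}")

total_effect = sum(route_effect(r) for r in all_routes)
print(f"total: {total_effect:+.4f}")
```

Under these assumptions the route through total cholesterol contributes 0.20 × 0.45 = +0.09, while the direct path is -0.13, so the model says that animal protein pushes colorectal cancer up through cholesterol and down directly at the same time.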

When something like this happens in a multivariate analysis, it may be due to the model not incorporating a variable that has important relationships with the other variables. In other words, the model is incomplete, hence the nonsensical results. Relationships among variables that are implied by coefficients of association must also make sense to be credible.
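The effect of leaving out an important variable can be illustrated with a small simulation. The Python sketch below uses purely synthetic data (not the China Study data) in which a hidden variable `z` drives both `x` and `y`, while `x` has no effect on `y` at all. The variable names, effect sizes, and the use of ordinary least squares on standardized variables are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Synthetic data: z influences both x and y; y does NOT depend on x.
z = rng.normal(size=n)
x = 0.6 * z + rng.normal(size=n)
y = 0.8 * z + rng.normal(size=n)

def standardize(v):
    """Rescale to mean 0 and standard deviation 1."""
    return (v - v.mean()) / v.std()

def standardized_betas(target, *predictors):
    """Least-squares coefficients with all variables standardized."""
    X = np.column_stack([standardize(p) for p in predictors])
    coef, *_ = np.linalg.lstsq(X, standardize(target), rcond=None)
    return coef

# Incomplete model: z is left out, so x appears to be related to y.
beta_incomplete = standardized_betas(y, x)[0]

# More complete model: with z included, the coefficient for x shrinks.
beta_x, beta_z = standardized_betas(y, x, z)

print(f"x -> y without z: {beta_incomplete:.2f}")
print(f"x -> y with z:    {beta_x:.2f}")
print(f"z -> y:           {beta_z:.2f}")
```

In this setup the incomplete model reports a clearly nonzero coefficient for `x`, and that coefficient drops to about zero once `z` is part of the model.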

Now, it has been pointed out that the missing variable here possibly is schistosomiasis infection. The dataset from the China Study included that variable, even though there were some missing values (about 28 percent of the data for that variable was missing), so I added it to the model in a way that seems to make sense. The new model is shown on the graph below. In the model, schisto = schistosomiasis infection.


So what does this new, and more complete, model tell us? It tells us some of the things that the previous model told us, but also a few new things, which make a lot more sense. Note that this model fits the data much better than the previous one, particularly regarding the overall effect on colorectal cancer, which is indicated by the high R-squared value for that variable (R-squared=0.73). Most notably, this new model tells us that:

- As schistosomiasis infection increases, colorectal cancer increases significantly (beta=0.83; P<0.01). This is a MUCH STRONGER relationship than the previous one between total cholesterol and colorectal cancer, even though some data on schistosomiasis infection is missing for a few counties (the relationship might have been even stronger with a complete dataset). And this strong relationship makes sense, because schistosomiasis infection is indeed associated with increased cancer rates. More information on schistosomiasis infections can be found here.

- Schistosomiasis infection has no significant relationship with these variables: animal protein consumption, plant protein consumption, or total cholesterol. This makes sense, as the infection is caused by a worm that is not normally present in plant or animal food, and the infection itself is not specifically associated with abnormalities that would lead one to expect major increases in total cholesterol.

- Total cholesterol has no significant relationship with colorectal cancer (beta=0.24; P=0.11). The beta here is nontrivial, but too low to be significant; i.e., we cannot discard chance within the context of this relatively small dataset.

- Animal protein consumption has no significant relationship with colorectal cancer. The beta here is very low, and negative (beta=-0.03).

- Plant protein consumption has no significant relationship with colorectal cancer. The beta for this association is positive and nontrivial (beta=0.15), but the P value is too high (P=0.20) for us to discard chance within the context of this dataset.

Below is the plot showing the relationship between schistosomiasis infection and colorectal cancer. The values are standardized, which means that the zero on the horizontal axis is the mean of the schistosomiasis infection numbers in the dataset. The shape of the plot is the same as the one with the unstandardized data. As you can see, the data points are very close to a line, which suggests a very strong linear association.
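The point about standardization can be checked with a few lines of Python. The sketch below uses made-up numbers (not values from the China Study dataset) and a hypothetical `standardize` helper; it shows that standardizing only shifts and rescales the axes, so the correlation, and therefore the shape of the plot, is unchanged.

```python
import numpy as np

def standardize(v):
    """Subtract the mean and divide by the standard deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

# Hypothetical raw values, for illustration only.
infection = np.array([0.0, 2.0, 5.0, 9.0, 14.0, 20.0])
cancer = np.array([3.1, 4.0, 5.2, 7.9, 10.5, 13.8])

z_infection = standardize(infection)
z_cancer = standardize(cancer)

r_raw = np.corrcoef(infection, cancer)[0, 1]
r_std = np.corrcoef(z_infection, z_cancer)[0, 1]

# After standardizing, the mean sits at zero on each axis.
print(f"mean of standardized infection: {z_infection.mean():.2f}")
print(f"correlation, raw vs. standardized: {r_raw:.4f} vs. {r_std:.4f}")
```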


In summary, an exploratory path analysis with WarpPLS can shed light on data patterns that would look rather “mysterious” otherwise. Still, one has to use common sense, good theory, and past empirical results to derive conclusions.
