Sunday, January 31, 2010

Project files in WarpPLS: Small but information-rich

Project files in WarpPLS are saved with the “.prj” extension, and contain all of the elements needed to perform an SEM analysis. That is, they contain the original data used in the analysis, the graphical model, the inner and outer model structures, and the results.

Once an original data file is read into a project file, the original data file can be deleted without effect on the project file. The project file will store the original location and file name of the data file, but it will no longer use it.

Project files may be created with one name, and then renamed using Windows Explorer or another file management tool. Upon reading a project file that has been renamed in this fashion, the software will detect that the original name is different from the file name, and will adjust the name of the project file accordingly.

Different users of this software can easily exchange project files electronically if they are collaborating on an SEM analysis project. This way they will have access to all of the original data, intermediate data, and SEM analysis results in a single file.

Project files are relatively small. For example, a complete project file of a model containing 5 latent variables and 32 indicators will typically be only approximately 200 KB in size. Simpler models may be stored in project files as small as 50 KB.

Saturday, January 30, 2010

Reflective and formative latent variable measurement in WarpPLS


A reflective latent variable is one in which all the indicators are expected to be highly correlated with the latent variable score. For example, the answers to certain question-statements by a group of people, measured on a 1 to 7 scale (1 = strongly disagree; 7 = strongly agree) and answered after a meal, are expected to be highly correlated with the latent variable “satisfaction with a meal”. The question-statements are: “I am satisfied with this meal”, and “After this meal, I feel good”. Therefore, the latent variable “satisfaction with a meal” can be said to be reflectively measured through two indicators, which store the answers to the two question-statements. This latent variable could be represented in a model graph as “Satisf”, and the indicators as “Satisf1” and “Satisf2”.

A formative latent variable is one in which the indicators are expected to measure certain attributes of the latent variable, but the indicators are not expected to be highly correlated with the latent variable score, because they (i.e., the indicators) are not expected to be correlated with one another. For example, let us assume that the latent variable “Satisf” (“satisfaction with a meal”) is now measured using the two following question-statements: “I am satisfied with the main course” and “I am satisfied with the dessert”. Here, the meal comprises a main course, say, filet mignon, and a dessert, say, a fruit salad. Both the main course and the dessert make up the meal (i.e., they are part of the same meal), but their satisfaction indicators are not expected to be highly correlated with each other. The reason is that some people may like the main course very much and not like the dessert. Conversely, other people may be vegetarians who hate the main course, but like the dessert very much.

If the indicators are not expected to be highly correlated with each other, they cannot be expected to be highly correlated with their latent variable’s score. So here is a general rule of thumb that can be used to decide if a latent variable is reflectively or formatively measured. If the indicators are expected to be highly correlated, then the measurement model should be set as reflective in WarpPLS. If the indicators are not expected to be highly correlated, even though they clearly refer to the same latent variable, then the measurement model should be set as formative.
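As a rough numerical illustration of this rule of thumb (not a WarpPLS feature), one could inspect the inter-indicator correlations of a block before deciding. In the sketch below, the data, the function name, and the 0.5 cutoff are all illustrative assumptions:

```python
import numpy as np

def suggest_measurement_model(indicators, cutoff=0.5):
    """Suggest 'reflective' if the indicators are highly inter-correlated,
    'formative' otherwise. The cutoff is an illustrative assumption,
    not a rule built into WarpPLS."""
    corr = np.corrcoef(indicators, rowvar=False)   # indicator correlation matrix
    n = corr.shape[0]
    off_diag = corr[~np.eye(n, dtype=bool)]        # drop the 1s on the diagonal
    return "reflective" if np.abs(off_diag).mean() >= cutoff else "formative"

rng = np.random.default_rng(0)
common = rng.normal(size=200)
# Two indicators driven by the same underlying score (e.g., Satisf1, Satisf2)
reflective_block = np.column_stack([common + 0.3 * rng.normal(size=200),
                                    common + 0.3 * rng.normal(size=200)])
# Two indicators that share little variance (e.g., main course vs. dessert)
formative_block = rng.normal(size=(200, 2))

print(suggest_measurement_model(reflective_block))
print(suggest_measurement_model(formative_block))
```

In practice the decision should rest on theory about what the indicators measure, with the observed correlations serving only as a sanity check.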

Thursday, January 28, 2010

Bootstrapping or jackknifing (or both) in WarpPLS?

Arguably jackknifing does a better job at addressing problems associated with the presence of outliers due to errors in data collection. Generally speaking, jackknifing tends to generate more stable resample path coefficients (and thus more reliable P values) with small sample sizes (lower than 100), and with samples containing outliers. In these cases, outlier data points do not appear more than once in the set of resamples, which accounts for the better performance of jackknifing (see, e.g., Chiquoine & Hjalmarsson, 2009).

Bootstrapping tends to generate more stable resample path coefficients (and thus more reliable P values) with larger samples and with samples where the data points are evenly distributed on a scatter plot. The use of bootstrapping with small sample sizes (lower than 100) has been discouraged (Nevitt & Hancock, 2001).

Since the warping algorithms are also sensitive to the presence of outliers, in many cases it is a good idea to estimate P values with both bootstrapping and jackknifing, and use the P values associated with the most stable coefficients. An indication of instability is a high P value (i.e., statistically insignificant) associated with path coefficients that could be reasonably expected to have low P values. For example, with a sample size of 100, a path coefficient of .2 could be reasonably expected to yield a P value that is statistically significant at the .05 level. If that is not the case, there may be a stability problem. Another indication of instability is a marked difference between the P values estimated through bootstrapping and jackknifing.

P values can be easily estimated using both resampling methods, bootstrapping and jackknifing, by following this simple procedure. Run an SEM analysis of the desired model, using one of the resampling methods, and save the project. Then save the project again, this time with a different name, change the resampling method, and run the SEM analysis again. Then save the second project again. Each project file will now have results that refer to one of the two resampling methods. The P values can then be compared, and the most stable ones used in a research report on the SEM analysis.

References:

Chiquoine, B., & Hjalmarsson, E. (2009). Jackknifing stock return predictions. Journal of Empirical Finance, 16(5), 793-803.

Nevitt, J., & Hancock, G.R. (2001). Performance of bootstrapping approaches to model test statistics and parameter standard error estimation in structural equation modeling. Structural Equation Modeling, 8(3), 353-377.

How many resamples to use in bootstrapping?

The default number of resamples is 100 for bootstrapping in WarpPLS. This setting can be modified by entering a different number in the appropriate edit box. (Please note that we are talking about the number of resamples here, not the original data sample size.)

Leaving the number of resamples for bootstrapping as 100 is recommended because it has been shown that higher numbers of resamples lead to negligible improvements in the reliability of P values; in fact, even setting the number of resamples at 50 is likely to lead to fairly reliable P value estimates (Efron et al., 2004).

Conversely, increasing the number of resamples well beyond 100 places a higher computation load on the software, making it noticeably slower to produce results. In very complex models, a high number of resamples may make the software run very slowly.

Some researchers have suggested in the past that a large number of resamples can address problems with the data, such as the presence of outliers due to errors in data collection. This opinion is not shared by the original developer of the bootstrapping method, Bradley Efron (see, e.g., Efron et al., 2004).

Reference:

Efron, B., Rogosa, D., & Tibshirani, R. (2004). Resampling methods of estimation. In N.J. Smelser, & P.B. Baltes (Eds.). International Encyclopedia of the Social & Behavioral Sciences (pp. 13216-13220). New York, NY: Elsevier.

Viewing and changing settings in WarpPLS 1.0 and 2.0

(Note: This post refers to version 1.0 - 2.0 of WarpPLS. See this YouTube video on how to view and change settings for version 3.0.)

The view or change settings window (see figure below, click on it to enlarge) allows you to select an algorithm for the SEM analysis, select a resampling method, and select the number of resamples used, if the resampling method selected was bootstrapping. The analysis algorithms available are Warp3 PLS Regression, Warp2 PLS Regression, PLS Regression, and Robust Path Analysis.


Many relationships in nature, including relationships involving behavioral variables, are nonlinear and follow a pattern known as U-curve (or inverted U-curve). In this pattern, one variable affects another in a way that leads to a minimum (or maximum) point, after which the direction of the effect reverses. This type of relationship is also referred to as a J-curve pattern, a term that is more commonly used in economics and the health sciences.

The Warp2 PLS Regression algorithm tries to identify a U-curve relationship between latent variables, and, if that relationship exists, the algorithm transforms (or “warps”) the scores of the predictor latent variables so as to better reflect the U-curve relationship in the estimated path coefficients in the model. The Warp3 PLS Regression algorithm, the default algorithm used by the software, tries to identify a relationship defined by a function whose first derivative is a U-curve. This type of relationship follows a pattern that is more similar to an S-curve (or a somewhat distorted S-curve), and can be seen as a combination of two connected U-curves, one of which is inverted.

The PLS Regression algorithm does not perform any warping of relationships. It is essentially a standard PLS regression algorithm, whereby indicators’ weights, loadings and factor scores (a.k.a. latent variable scores) are calculated based on a least squares minimization sub-algorithm, after which path coefficients are estimated using a robust path analysis algorithm. A key criterion for the calculation of the weights, observed in virtually all PLS-based algorithms, is that the regression equation expressing the relationship between the indicators and the factor scores has an error term that equals zero. In other words, the factor scores are calculated as exact linear combinations of their indicators. PLS regression is the underlying weight calculation algorithm used in both Warp3 and Warp2 PLS Regression. The warping takes place during the estimation of path coefficients, and after the estimation of all weights and loadings in the model. The weights and loadings of a model with latent variables make up what is often referred to as the outer model, whereas the path coefficients among latent variables make up what is often called the inner model.

Finally, the Robust Path Analysis algorithm is a simplified algorithm in which factor scores are calculated by averaging all of the indicators associated with a latent variable; that is, in this algorithm weights are not estimated through PLS regression. This algorithm is called “Robust” Path Analysis, because, as with most robust statistics methods, the P values are calculated through resampling. If all latent variables are measured with single indicators, the Robust Path Analysis and the PLS Regression algorithms will yield identical results.
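The averaging used by Robust Path Analysis can be sketched as follows. This is an illustrative reconstruction, not the actual WarpPLS code; the standardization steps shown are assumptions:

```python
import numpy as np

def robust_factor_scores(indicator_block):
    """Compute factor scores by averaging the standardized indicators of
    one latent variable, then re-standardizing the average. A sketch of
    the idea behind Robust Path Analysis, not the exact implementation."""
    z = (indicator_block - indicator_block.mean(axis=0)) / indicator_block.std(axis=0)
    scores = z.mean(axis=1)              # simple average: no PLS weights
    return (scores - scores.mean()) / scores.std()

rng = np.random.default_rng(1)
block = rng.normal(size=(50, 3))         # 50 cases, 3 indicators
scores = robust_factor_scores(block)
print(scores.shape)                      # one standardized score per case
```

With a single indicator the average is just the standardized indicator itself, which is why single-indicator models yield identical results under Robust Path Analysis and PLS Regression.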

One of two resampling methods may be selected: bootstrapping or jackknifing. Bootstrapping, the software’s default, is a resampling algorithm that creates a number of resamples (a number that can be selected by the user) by a method known as “resampling with replacement”. This means that each resample contains a random selection of rows from the original dataset, where some rows may appear more than once and others not at all. (The commonly used analogy of a deck of cards being reshuffled, leading to many resample decks, is a good one, but not entirely correct, because in bootstrapping the same card may appear more than once in each of the resample decks.) Jackknifing, on the other hand, creates a number of resamples that equals the original sample size, and each resample has one row removed. That is, the sample size of each resample is the original sample size minus 1. Thus, the choice of number of resamples has no effect on jackknifing, and is only relevant in the context of bootstrapping.
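The two resampling schemes can be sketched in terms of row indices. This is a generic illustration of “resampling with replacement” versus leave-one-out jackknifing, not WarpPLS’s internal implementation:

```python
import numpy as np

def bootstrap_resamples(n_rows, n_resamples, seed=0):
    """Each bootstrap resample has n_rows row indices drawn with
    replacement, so the same row may appear more than once."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_rows, size=n_rows) for _ in range(n_resamples)]

def jackknife_resamples(n_rows):
    """One jackknife resample per row, each with that row removed;
    each resample therefore has n_rows - 1 rows."""
    return [np.delete(np.arange(n_rows), i) for i in range(n_rows)]

boot = bootstrap_resamples(n_rows=10, n_resamples=100)
jack = jackknife_resamples(n_rows=10)

print(len(boot), len(boot[0]))   # 100 resamples, each with 10 rows
print(len(jack), len(jack[0]))   # 10 resamples, each with 9 rows
```

Note that the jackknife resample count is fixed by the sample size, which is why the user-selected number of resamples only matters for bootstrapping.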

Saving and using grouped descriptive statistics in WarpPLS

When the “Save grouped descriptive statistics into a tab-delimited .txt file” option is selected, a data entry window is displayed. There you can choose a grouping variable, number of groups, and the variables to be grouped. This option is useful if one wants to conduct a comparison of means analysis using the software, where one variable (the grouping variable) is the predictor, and one or more variables are the criteria (the variables to be grouped).

The figure below (click on it to enlarge) shows the grouped statistics data saved through the “Save grouped descriptive statistics into a tab-delimited .txt file” option. The tab-delimited .txt file was opened with a spreadsheet program, and contained the data on the left part of the figure.

The data on the left part of the figure was organized as shown above the bar chart; the bar chart was then created using the spreadsheet program’s charting feature. If a simple comparison of means analysis had been conducted using this software, in which the grouping variable (in this case, an indicator called “ECU1”) was the predictor and the indicator called “Effe1” was the criterion, those two variables would have been connected through a path in a simple path model with only one path. Assuming that the path coefficient was statistically significant, the bar chart displayed in the figure, or a similar bar chart, could be added to a report describing the analysis.

Some may think that it is overkill to conduct a comparison of means analysis using an SEM software package such as this, but there are advantages in doing so. One of those advantages is that this software calculates P values using a nonparametric class of estimation techniques, namely resampling estimation techniques. (These are sometimes referred to as bootstrapping techniques, which may lead to confusion since bootstrapping is also the name of a type of resampling technique.) Nonparametric estimation techniques do not require the data to be normally distributed, which is a requirement of other comparison of means techniques (e.g., ANOVA).

Another advantage of conducting a comparison of means analysis using this software is that the analysis can be significantly more elaborate. For example, the analysis may include control variables (or covariates), which would make it equivalent to an ANCOVA test. Finally, the comparison of means analysis may include latent variables, as either predictors or criteria. This is not usually possible with ANOVA or commonly used nonparametric comparison of means tests (e.g., the Mann-Whitney U test).
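To illustrate why resampling-based P values do not require normally distributed data, here is a textbook-style bootstrap comparison of means on deliberately skewed (exponential) data. The function and the data are hypothetical, and this is not the procedure implemented in WarpPLS:

```python
import numpy as np

def bootstrap_mean_diff_p(group_a, group_b, n_resamples=1000, seed=0):
    """Two-sided bootstrap P value for a difference in group means:
    resample each group with replacement and check how often the
    resampled difference falls on the 'wrong' side of zero."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        a = rng.choice(group_a, size=len(group_a), replace=True)
        b = rng.choice(group_b, size=len(group_b), replace=True)
        diffs[i] = a.mean() - b.mean()
    tail = min((diffs <= 0).mean(), (diffs >= 0).mean())
    return min(2 * tail, 1.0)

rng = np.random.default_rng(4)
# Deliberately skewed data: no normality assumption is needed
high_group = rng.exponential(2.0, size=80) + 2.0
low_group = rng.exponential(2.0, size=80)
p = bootstrap_mean_diff_p(high_group, low_group)
print(p < 0.05)
```

The same logic extends to path coefficients: the resampling distribution of the estimate, rather than a normality assumption, is what drives the P value.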

Saturday, January 23, 2010

How is the warping done in WarpPLS?


WarpPLS does linear and nonlinear analyses. That is, users can set WarpPLS to estimate parameters based on a standard linear algorithm, and without any warping. They can also choose one of two nonlinear algorithms, thus taking advantage of the warping capabilities of the software.

In nonlinear analyses, what WarpPLS does is relatively simple at a conceptual level. It identifies a set of functions F1(LVp1), F2(LVp2) … that relate blocks of latent variable predictors (LVp1, LVp2 ...) to a criterion latent variable (LVc) in this way:

LVc = p1*F1(LVp1) + p2*F2(LVp2) + … + E.

In the equation above, p1, p2 ... are path coefficients, and E is the error term of the equation. All variables are standardized. Any model can be decomposed into a set of blocks relating latent variable predictors and criteria in this way.
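In the linear case, where each function Fi is the identity, the path coefficients p1, p2 ... in a block can be illustrated as an ordinary least squares regression on standardized latent variable scores. This is a conceptual sketch only, with made-up scores, not WarpPLS’s actual estimation code:

```python
import numpy as np

def block_path_coefficients(LV_predictors, LV_criterion):
    """Estimate p1, p2, ... in LVc = p1*LVp1 + p2*LVp2 + ... + E
    by least squares on standardized scores (linear case only)."""
    X = (LV_predictors - LV_predictors.mean(axis=0)) / LV_predictors.std(axis=0)
    y = (LV_criterion - LV_criterion.mean()) / LV_criterion.std()
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

rng = np.random.default_rng(2)
LVp = rng.normal(size=(300, 2))                       # two predictor scores
LVc = 0.5 * LVp[:, 0] + 0.3 * LVp[:, 1] + 0.2 * rng.normal(size=300)
print(np.round(block_path_coefficients(LVp, LVc), 2))
```

In the nonlinear modes, each predictor column would first be replaced by its warped counterpart Fi(LVpi) before this step.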

In the Warp2 mode, the functions F1(LVp1), F2(LVp2) ... take the form of U curves (also known as J curves); defaulting to lines, if the relationships are linear. The term "U curve" is used here for simplicity, as noncyclical nonlinear relationships (e.g., exponential growth) can be represented through sections of straight or rotated U curves; the term "S curve" is also used here for simplicity.

In the Warp3 mode, the functions F1(LVp1), F2(LVp2) ... take the form of S curves; defaulting to U curves or lines, if the relationships follow U-curve patterns or are linear, respectively.

S curves are curves whose first derivative is a U curve. Similarly, U curves are curves whose first derivative is a line. For instance, the cubic y = x^3 - 3x traces an S curve; its first derivative, 3x^2 - 3, traces a U curve; and that curve’s own derivative, 6x, is a line. U curves seem to be the most commonly found in natural and behavioral phenomena. S curves are also found, but apparently not as frequently as U curves.

U curves can be used to model most of the commonly seen functions in natural and behavioral studies, such as logarithmic, exponential, and hyperbolic decay functions. For these common types of functions, S-curve approximations will usually default to U curves.

Other types of curves beyond S curves might be found in specific types of situations, and require specialized analysis methods that are typically outside the scope of structural equation modeling. Examples are time series and Fourier analyses. Therefore these are beyond the scope of application of WarpPLS.

Typically, the more the functions F1(LVp1), F2(LVp2) ... look like curves, and unlike lines, the greater is the difference between the path coefficients p1, p2 ... and those that would have been obtained through a strictly linear analysis.

So, what WarpPLS does is not unlike what a researcher would do if he or she transformed predictor latent variables prior to the calculation of path coefficients, using a function such as the logarithm. This is shown in the equation below, where a log transformation is applied to LVp1.

LVc = p1*log(LVp1) + p2*LVp2 + … + E.

However, WarpPLS does that automatically, and for a much wider range of functions, since a fairly wide range of functions can be modeled as U or S curves. Exceptions are complex trigonometric functions, where the dataset comprises many cycles. These require different methods to be properly modeled, such as the Fourier analyses methods mentioned above, and are usually outside the scope of structural equation modeling (SEM; which is the analysis method that WarpPLS automates).

Often the path coefficients p1, p2 ... will go up in value due to warped analysis, but that may not always be the case. Given the nature of multivariate analysis, an increase in a path coefficient may lead to a decrease in a path coefficient for an arrow pointing at the same criterion latent variable, because each path coefficient in a block is calculated in a way that controls for the effects of the other predictor latent variables.
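As a toy illustration of how warping can change a path coefficient, the sketch below fits a quadratic (a U curve) to a predictor-criterion relationship and compares the resulting coefficient with the strictly linear one. This stands in for the idea only; it is not the proprietary WarpPLS warping algorithm:

```python
import numpy as np

def warp2_like_coefficient(lv_p, lv_c):
    """Illustrative 'warping': approximate F(LVp) with a fitted U curve
    (2nd-degree polynomial), then correlate F(LVp) with LVc. With one
    standardized predictor, the path coefficient is that correlation."""
    poly = np.polynomial.Polynomial.fit(lv_p, lv_c, deg=2)
    return np.corrcoef(poly(lv_p), lv_c)[0, 1]

rng = np.random.default_rng(3)
x = rng.normal(size=400)
y = x**2 + 0.3 * rng.normal(size=400)   # a pure U-curve relationship

linear = abs(np.corrcoef(x, y)[0, 1])   # near zero: a line misses the U
warped = warp2_like_coefficient(x, y)   # close to 1: the U curve captures it
print(round(linear, 2), round(warped, 2))
```

Here the linear coefficient understates a strong, but nonlinear, relationship; the warped coefficient recovers it.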

How are the model fit indices calculated by WarpPLS?


Three of the main model fit indices calculated by WarpPLS are the following: average path coefficient (APC), average R-squared (ARS), and average variance inflation factor (AFVIF).

They are discussed in the WarpPLS User Manual, which is available separately from the software, as a standalone document, on the WarpPLS web site.

The fit indices are calculated as their names imply, that is, as averages of the (absolute values of the) path coefficients in the model, the R-squared values in the model, and the variance inflation factors in the model. All of these are also provided individually by the software.
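The averaging itself is straightforward. With hypothetical values standing in for a model’s individual coefficients, the three indices would be computed as follows (the resampling-based P values and their correction are not reproduced here):

```python
import numpy as np

# Hypothetical outputs from an SEM run: path coefficients, R-squared
# values, and variance inflation factors. These numbers are made up.
path_coefficients = [0.32, -0.21, 0.45, 0.18]
r_squared = [0.24, 0.31]
vifs = [1.3, 1.8, 1.5]

APC = np.mean(np.abs(path_coefficients))   # average of absolute values
ARS = np.mean(r_squared)
AFVIF = np.mean(vifs)

print(round(APC, 3), round(ARS, 3), round(AFVIF, 3))
```

Note that APC uses absolute values, so negative path coefficients contribute their magnitudes rather than canceling positive ones.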

The P values for APC and ARS are calculated through resampling. A correction is made to account for the fact that these indices are calculated based on other parameters, which leads to a biasing effect: a variance reduction associated with the central limit theorem.

Typically the addition of new latent variables into a model will increase the ARS, even if those latent variables are weakly associated with the existing latent variables in the model. However, that will generally lead to a decrease in APC, since the path coefficients associated with the new latent variables will be low. Thus, the APC and ARS will counterbalance each other, and will only increase together if the latent variables that are added to the model enhance the overall predictive and explanatory quality of the model.

The AFVIF index will increase if new latent variables are added to the model in such a way as to add multicollinearity to the model, which may result from the inclusion of new latent variables that overlap in meaning with existing latent variables. It is generally undesirable to have different latent variables in the same model that measure the same thing; those should be combined into one single latent variable. Thus, the AFVIF brings in a new dimension that adds to a comprehensive assessment of a model’s overall predictive and explanatory quality.

As a final note, I would like to point out that the interpretation of the model fit indices depends on the goal of the SEM analysis. If the goal is to test hypotheses, where each arrow represents a hypothesis, then the model fit indices are of little importance. However, if the goal is to find out whether one model has a better fit with the original data than another, then the model fit indices are a useful set of measures related to model quality.

Friday, January 22, 2010

Why are pattern cross-loadings so low in WarpPLS?

I have recently received a few related questions from WarpPLS users. Essentially, they noted that the pattern loadings generated by WarpPLS were very similar to those generated by other PLS-based SEM software. However, they wanted to know why the pattern cross-loadings were so much lower in WarpPLS, compared to other PLS-based SEM software.

Low cross-loadings suggest good discriminant validity; a type of validity that is usually tested via WarpPLS using a separate procedure, involving tabulation of latent variable correlations and average variances extracted.

Nevertheless, low cross-loadings, combined with high loadings, are a good thing in the context of a PLS-based SEM analysis.

The pattern loadings and cross-loadings provided by WarpPLS are from a pattern matrix, which is obtained after the transformation of a structure matrix through an oblique rotation (similar to Promax).

The structure matrix contains the Pearson correlations between indicators and latent variables, which are not particularly meaningful prior to rotation in the context of measurement instrument validation (e.g., validity and reliability assessment).

In an oblique rotation the loadings shown on the pattern matrix are very similar to those on the structure matrix. The latter are the ones that other PLS-based SEM software usually report, which is why the loadings obtained through WarpPLS and other PLS-based SEM software are very similar. The cross-loadings though, can be very different in the pattern (rotated) matrix, as these WarpPLS users noted.

In short, the reason for the comparatively low cross-loadings is the oblique rotation employed by WarpPLS.

Here is a bit more information regarding rotation methods:

Because an oblique rotation is employed by WarpPLS, in some (relatively rare) cases pattern loadings may be higher than 1, which should have no effect on their interpretation. The expectation is that pattern loadings, which are shown within parentheses (on the "View indicator loadings and cross-loadings" option), will be high; and cross-loadings will be low.

The combined loadings and cross-loadings table always shows loadings lower than 1, because that table combines structure loadings with pattern cross-loadings. This obviates the need for a normalization step, which can distort loadings and cross-loadings somewhat.

Also, let me add that the main difference between oblique and orthogonal rotation methods (e.g., Varimax) is that the former assume that there are correlations, some of which may be strong, among latent variables.

Arguably oblique rotation methods are the most appropriate in PLS-based SEM analysis, because by definition latent variables are expected to be correlated. Otherwise, no path coefficient would be significant.

Technically speaking, it is possible that a research study will hypothesize only neutral relationships between latent variables, which could call for an orthogonal rotation. However, this is rarely, if ever, the case.

References:

Kock, N. (2010). WarpPLS 1.0 User Manual. Laredo, Texas: ScriptWarp Systems.

Kock, N. (2011). WarpPLS 2.0 User Manual. Laredo, Texas: ScriptWarp Systems.

Thursday, January 21, 2010

WarpPLS running on a Mac? Sure!

When WarpPLS was first made available, I told a colleague of mine that it would probably run on a Mac without problems. Without trying, he said: No way Jose!

Then he really tried (using virtualization software, more below); it worked, and his response: Maybe u wuz royt eh!?

I have since received a few emails from WarpPLS users who own Mac computers. They run WarpPLS on those computers, without problems, even though WarpPLS was designed to be used with Windows.

How is that possible?

Those users have virtualization (a.k.a. virtual machine) software installed on their computers, which allows them to run WarpPLS on different types of computers, including Mac computers.

The virtualization software actually allows them to run the Windows operating system (typically the XP or 7 versions) on a Mac computer. They then install WarpPLS on the Windows virtual machine created by the virtualization software.

It seems that VMware is one of the most popular virtualization software systems in this respect.

Monday, January 11, 2010

March 2010 online workshop on WarpPLS

PLS-SEM.com will conduct an online workshop on WarpPLS in March 2010!

The direct link to the workshop site is:

http://www.regonline.com/builder/site/Default.aspx?eventid=811252

The list of upcoming workshops is on:

http://pls-sem.com/cgi-bin/p/awtp-custom.cgi?d=plssem&page=10403

Saturday, January 2, 2010

Solve collinearity problems in WarpPLS: YouTube video

A new YouTube video for WarpPLS is available; please see link below.

http://www.youtube.com/watch?v=avPWO324E0g

The video shows how problems associated with latent variable collinearity, suggested by unstable path coefficients and high variance inflation factors, can be solved in a structural equation modeling (SEM) analysis using the software WarpPLS.

Note that this type of problem is different from problems related to indicators having low loadings and high cross-loadings. The problem here is associated with collinearity among latent variables.

Having said that, in many cases these two types of problems happen together: latent variable collinearity (often referred to as multicollinearity) and poor loadings/cross-loadings.

Happy New Year!