CS14 Statistical techniques for identifying consumer segments: Guide

Overview

This guide will help you understand the different statistical techniques used in the consumer segmentation process.

The key messages are:

  • Cluster analysis and CHAID (or classification trees) are the most commonly used techniques for consumer segmentation. Less commonly used techniques are discriminant analysis, factor analysis, and conjoint analysis.
  • Cluster analysis is the most commonly used statistical technique for psychographic profiling, while conjoint analysis is widely used for needs-based segmentation.
  • From time to time, the research agency should update the client on new statistical techniques and their utility.

In this guide, you will learn about the different statistical techniques, and the software available, that help identify the target segments for a brand effectively.

Sections:

Popular statistical techniques for identifying segments

Technique Description Uses Advantages Disadvantages Sub-Types
Cluster Analysis

Description: An exploratory data analysis technique that identifies and classifies objects or individuals based on similarities in attitudinal or behavioral characteristics (motivation, aspiration, personality traits, etc.). It sorts cases (people, things, events, etc.) into groups, or clusters, such that individuals within a group are very similar while the differences between groups are substantial.

By establishing the needs, requirements, opportunities and threats presented by each group identified through the analysis, one can ascertain each group's current and future worth to the business.

Uses:
  • For exploratory research where no hypothesis is available.
  • To describe and size consumer segments.

Advantages:
  • Helps identify consumer segments based on subjective parameters such as attitude, motivation, aspiration, etc., unlike conventional 'demographic' classifications that are based on tangible characteristics such as sex, age, and social class.
  • The main advantage is that it can take complex inputs and reveal associations and structures in data that would not otherwise be apparent.

Disadvantages:
  • While the analysis provides distinct groupings, it does not explain why or how the groups are different. Some judgment is required to interpret the output of the analysis.
  • Members of each cluster are similar, but not necessarily identical, to every other member on the selected characteristics.
  • The output may not be robust and can differ when the analysis is run repeatedly on the same input data. The analysis therefore needs to be re-run a few times on the same data to confirm that the classifications are stable.

Sub-types:
  • Hierarchical cluster analysis: Used for smaller samples. Either all respondents are treated as a single large cluster at the outset and then divided into smaller clusters (divisive), or each respondent is first treated as a separate cluster and these are then merged into bigger clusters (agglomerative).
  • Non-hierarchical (K-means) clustering: More suitable for clustering large amounts of data. In contrast to the hierarchical method, this technique allows objects to change group membership during cluster formation, based on some optimizing criterion.

While there is no definite rule on which type of clustering to use, it is suggested that both be used. Start with hierarchical clustering to generate and profile the clusters, and then use non-hierarchical clustering, with its ability to switch members between clusters, to fine-tune the cluster membership.
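
The suggestion to combine the two types of clustering can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the respondent scores, the function names, the choice of three clusters, and the use of centroid distance for merging are all hypothetical and do not come from any particular package.

```python
import math

# Hypothetical respondents: (price-sensitivity score, brand-loyalty score), each 1-10.
respondents = [
    (1.0, 9.0), (2.0, 8.5), (1.5, 9.5),   # loyal, not price sensitive
    (8.0, 2.0), (9.0, 1.5), (8.5, 2.5),   # bargain hunters
    (5.0, 5.0), (5.5, 4.5), (4.5, 5.5),   # middle of the road
]

def centroid(points):
    """Mean of a list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def agglomerate(points, k):
    """Hierarchical step: start with one cluster per respondent and keep
    merging the two clusters with the closest centroids until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def k_means(points, start_centres, max_iter=100):
    """Non-hierarchical step: respondents may switch cluster on every pass
    until the assignment stops changing."""
    centres = list(start_centres)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = centroid(members)
    return labels, centres

# Use the hierarchical clusters as starting points, then fine-tune with k-means.
seeds = [centroid(c) for c in agglomerate(respondents, k=3)]
labels, centres = k_means(respondents, seeds)
print(labels)
```

Seeding k-means from the hierarchical result, rather than from random starting points, is one way to reduce the run-to-run variation mentioned under the disadvantages.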

Factor Analysis

Description: Aims to describe a large number of variables or questions by using a much smaller set of underlying variables, called factors. Unlike cluster analysis, which classifies respondents, factor analysis groups variables.

Uses:
  • As a data reduction tool, i.e., to reduce the number of variables. For example, when a lengthy questionnaire needs to be shortened with its key questions retained, factor analysis will indicate which questions can be omitted without losing too much information.
  • To detect structure in the relationships between variables, i.e., to classify variables. For example, factor analysis is often used in consumer satisfaction studies to identify underlying service dimensions, and in profiling studies to determine core attitudes.

Advantages:
  • Both objective and subjective attributes can be used.
  • It is fairly easy, inexpensive, and accurate, and is based on direct inputs from consumers.

Disadvantages:
  • Factor analysis will always produce a pattern between variables, no matter how random they seem.
  • Its utility depends on the researchers' ability to develop a complete and accurate set of product attributes. If important attributes are missed, the procedure is of no value.
  • Naming the factors can be difficult: multiple attributes can be highly correlated for no apparent reason.

Sub-types:
  • Exploratory factor analysis: This is the most common form of factor analysis. There is no prior theory, and the technique seeks to uncover the underlying structure of a relatively large set of variables. The researcher's assumption is that any indicator may be associated with any factor, i.e., the data determine the factors.
  • Confirmatory factor analysis: This tests and confirms hypotheses. It seeks to determine whether the number of factors, and the indicator variables that load on them, conform to what is expected on the basis of pre-established theory. Indicator variables are selected on the basis of prior theory, and factor analysis is used to see whether they fit as predicted.
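
Factor analysis works from the correlations between variables, so a rough feel for "grouping variables" can be had from the correlation matrix alone. The Python sketch below computes only those correlations and lists the strongly related question pairs; it is not a full factor extraction. The question names, the ratings, and the 0.7 threshold are hypothetical.

```python
import math

# Hypothetical 1-5 ratings from six respondents on four questionnaire items.
ratings = {
    "staff_friendly": [5, 4, 2, 1, 4, 2],
    "staff_helpful":  [5, 5, 2, 1, 4, 1],
    "store_clean":    [2, 5, 4, 1, 1, 5],
    "store_tidy":     [1, 5, 5, 2, 1, 4],
}

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

names = list(ratings)
corr = {(a, b): pearson(ratings[a], ratings[b]) for a in names for b in names}

# Questions that move together are candidates for the same underlying factor.
strong_pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if corr[(a, b)] > 0.7  # illustrative threshold, not a standard cut-off
]
for a, b in strong_pairs:
    print(f"{a} ~ {b}: r = {corr[(a, b)]:.2f}")
```

In this made-up data the two staff questions and the two store questions pair up, which is the kind of structure a factor analysis would summarize as a "staff" factor and a "store" factor. A statistical package would go further and estimate the loadings of each question on each factor.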

Conjoint Analysis

Description: Measures respondent preferences about the attributes of a product or service. It aims to decompose preference into component parts, such as brand name, quality, price, etc. By viewing products as bundles of attributes, it asks respondents to make choices in the same fashion as consumers normally do, by trading off attributes against one another, either by ranking or by choosing one of several attribute combinations. For example, a trade-off could be: do you prefer "a flight that is cramped, costs £250 and has one stop" or "a flight that is spacious, costs £500 and is direct"? The analysis then works out how much each attribute contributes to preference.

Conjoint analysis helps determine the relative importance of each attribute (spaciousness, price, number of stops, etc.) as well as which levels of each attribute are most preferred (how much more is a price of £250 preferred over a price of £500?).

Uses:
  • Used in needs-based segmentation to understand key drivers for individuals.
  • Frequently used in new product development, since it helps identify the features most valued by consumers and helps understand consumer trade-offs. For example, will consumers prefer a cheaper bike with less speed, or pay a premium price for a bike with more storage?
  • Also used in product positioning, for the same reasons.

Advantages:
  • It allows physical objects to be used as stimuli.
  • It measures preferences at an individual level.

Disadvantages:
  • Only a limited set of features can be used, because the number of combinations increases very quickly as more features are added.
  • The information-gathering stage is complex.
  • Respondents may be unable to react to features or articulate attitudes in the case of new categories.

Sub-types: None listed.

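
The flight example can be turned into a small worked calculation. The Python sketch below assumes one respondent has rated all eight possible flight profiles on a 0-10 scale (the ratings are invented), and estimates each level's part-worth as the average rating of the profiles containing that level minus the overall average. For a balanced design in which every combination is rated, this matches a main-effects estimate; real studies usually use fractional designs and dedicated software. Prices are written as "GBP 250" and "GBP 500" in the code.

```python
from itertools import product

# Attributes and levels taken from the flight example in the text.
levels = {
    "space": ["cramped", "spacious"],
    "price": ["GBP 250", "GBP 500"],
    "stops": ["one stop", "direct"],
}

# All 2 x 2 x 2 = 8 profiles, in the order generated by itertools.product.
profiles = list(product(*levels.values()))

# Hypothetical 0-10 ratings from ONE respondent, one per profile above.
ratings = [5, 7, 1, 3, 7, 10, 3, 4]

grand_mean = sum(ratings) / len(ratings)

# Part-worth of a level = mean rating of profiles with that level - grand mean.
part_worths = {}
for position, (attribute, attribute_levels) in enumerate(levels.items()):
    part_worths[attribute] = {}
    for level in attribute_levels:
        scores = [r for p, r in zip(profiles, ratings) if p[position] == level]
        part_worths[attribute][level] = sum(scores) / len(scores) - grand_mean

# Relative importance = an attribute's part-worth range as a share of all ranges.
ranges = {a: max(pw.values()) - min(pw.values()) for a, pw in part_worths.items()}
importance = {a: rng / sum(ranges.values()) for a, rng in ranges.items()}

for attribute in levels:
    print(attribute, part_worths[attribute], f"importance={importance[attribute]:.0%}")
```

The number of profiles is the product of the number of levels per attribute, which is why adding features quickly makes the exercise unmanageable for respondents.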
Discriminant/Logistic Regression Analysis

Description: Determines which variables best separate clusters or segments. It explains why respondents belong to a certain group, and classifies new respondents based on their ratings. In other words, it both explains and predicts classification.

Logistic regression is often preferred to discriminant analysis as it requires fewer assumptions in its theory, is statistically more robust in practice, and is easier to use. It is also more flexible in the types of data that can be analyzed: logistic regression can take any type of variable, while discriminant analysis needs rating scales.

Uses:
  • Determines the most parsimonious way (i.e., the fewest dimensions) to distinguish between groups.
  • Infers the meaning of the dimensions that distinguish groups from each other.

Advantages:
  • It can predict membership of two or more mutually exclusive groups from a set of predictors when there is no natural ordering of the groups.

Disadvantages: None listed.

Sub-types:
  • Multiple discriminant function analysis (MDA): Used when the analysis involves more than two groups and more than one variable.

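
As an illustration of "explaining and predicting classification", the Python sketch below fits a two-group logistic regression by plain gradient descent and then classifies a new respondent. It is a minimal sketch, not a substitute for a statistical package: the ratings, segment labels, learning rate, and number of passes are all hypothetical choices.

```python
import math

# Hypothetical training data: (value-for-money rating, premium-quality rating),
# each on a 1-10 scale, with the segment assigned earlier
# (0 = "budget" segment, 1 = "premium" segment).
X = [(9, 2), (8, 3), (7, 2), (8, 4), (3, 8), (2, 9), (4, 7), (3, 9)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Estimated probability that respondent x belongs to segment 1."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def fit_logistic(X, y, learning_rate=0.05, epochs=2000):
    """Fit weights w and intercept b by batch gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = predict(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        w = [wj - learning_rate * g / n for wj, g in zip(w, grad_w)]
        b -= learning_rate * grad_b / n
    return w, b

w, b = fit_logistic(X, y)

# "Explain": the sign of each weight shows which way a rating pushes membership.
print("weights:", [round(wj, 2) for wj in w], "intercept:", round(b, 2))

# "Predict": classify a respondent who was not in the training data.
new_respondent = (8, 2)
segment = 1 if predict(w, b, new_respondent) >= 0.5 else 0
print("predicted segment:", segment)
```

A negative weight on the value-for-money rating and a positive weight on the premium-quality rating would indicate that these two ratings pull respondents toward the budget and premium segments respectively.
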

CHAID (Chi-squared Automatic Interaction Detection) Analysis

Description: An exploratory data analysis technique which breaks down large samples (respondents) into homogeneous subsets (segments). It is also referred to as Classification Trees, Answer Trees or Regression Trees.

CHAID hierarchically segments a sample of respondents based on a dependent variable (e.g., buy/did not buy, or satisfied/unsatisfied) using a host of explanatory or predictor variables (demographics, attitudes, previous behavior). The resulting output is represented graphically as a tree, with the branches representing the predictor variables that split the sample into discriminating groups.

Uses:
  • CHAID is typically used in the direct marketing industry to identify the type of people who have reacted to a specific campaign.
  • It is very often used to understand the characteristics of the most and least satisfied or interested consumers or employees, thus allowing the organization to target its (potential) clients more efficiently and successfully.
  • It is also used as an alternative exploratory technique to multiple regression, especially when the data set is not well suited to regression analysis.

Advantages:
  • CHAID is especially useful when the sample size is very large and there are many explanatory variables.
  • The output is highly visual, with no equations.

Disadvantages:
  • It does not work well with small sample sizes, as respondent groups can quickly become too small for reliable analysis.

Sub-types: None listed.
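
The core step of CHAID, choosing which predictor to split on, can be sketched in Python as a chi-squared test of each predictor against the dependent variable. This is only the first split-selection step under simplifying assumptions: the campaign records are invented, both predictors have two categories so their raw statistics are directly comparable, and the category merging and significance adjustments that full CHAID performs are omitted.

```python
from collections import Counter

# Hypothetical campaign records: (age_band, region, outcome).
records = (
    [("under 35", "north", "buy")] * 16 + [("under 35", "south", "buy")] * 14
    + [("under 35", "north", "no buy")] * 5 + [("under 35", "south", "no buy")] * 5
    + [("35+", "north", "buy")] * 6 + [("35+", "south", "buy")] * 4
    + [("35+", "north", "no buy")] * 14 + [("35+", "south", "no buy")] * 16
)
predictors = {"age_band": 0, "region": 1}  # column index of each predictor

def chi_squared(records, predictor_index, target_index=-1):
    """Pearson chi-squared statistic for predictor vs. dependent variable."""
    observed = Counter((r[predictor_index], r[target_index]) for r in records)
    row_totals = Counter(r[predictor_index] for r in records)
    col_totals = Counter(r[target_index] for r in records)
    n = len(records)
    statistic = 0.0
    for row in row_totals:
        for col in col_totals:
            expected = row_totals[row] * col_totals[col] / n
            statistic += (observed[(row, col)] - expected) ** 2 / expected
    return statistic

# The predictor most strongly associated with the outcome forms the first split.
scores = {name: chi_squared(records, index) for name, index in predictors.items()}
best = max(scores, key=scores.get)
print(scores, "-> split on", best)

# Each branch of the split is a candidate segment with its own response rate.
best_index = predictors[best]
for level in sorted({r[best_index] for r in records}):
    branch = [r for r in records if r[best_index] == level]
    rate = sum(r[-1] == "buy" for r in branch) / len(branch)
    print(f"{best} = {level}: n={len(branch)}, buy rate={rate:.0%}")
```

Repeating the same step within each branch is what produces the tree, and it also shows why small samples are a problem: every split divides the respondents again, so the groups shrink quickly.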

Comparing techniques

A comparison of the different techniques based on their applications is provided below:

Discriminant Analysis and Factor Analysis

Both techniques look for underlying dimensions in responses to questions about product attributes. However, discriminant analysis builds these underlying dimensions based on differences between the attributes rather than similarities between them. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique; a distinction between independent variables and dependent variables must be made in discriminant analysis.

Discriminant Analysis and Cluster Analysis

In discriminant analysis the groups (clusters) are determined beforehand and the objective is to determine the combination of independent variables which best discriminates among the groups. Thus, it provides answers to what makes the groups different from each other. In cluster analysis the groups (clusters) are not predetermined and in fact the objective is to determine the best way in which cases may be clustered into groups. But cluster analysis does not provide any explanation of why/how the groups are distinct.

Conjoint Analysis and Cluster Analysis

Conjoint analysis is often an input into cluster analysis. Conjoint analysis identifies what product features/ attributes drive preferences for different individuals. Cluster analysis then groups individuals with similar needs/ requirements to produce distinct segments.

CHAID and Discriminant/Logistic Regression Analysis

CHAID is an exploratory tool that identifies which explanatory variables have the most influence on a dependent variable. Discriminant analysis or logistic regression is a confirmatory tool, used after CHAID to quantify and test the significance of the relationship between the explanatory variables identified by CHAID and the dependent variable.

Statistical software

Software Description Highlights
SPSS (Statistical Package for the Social Sciences)

Description: Among the most widely used programs for statistical analysis across sectors. The program has a base package with data management and data documentation features; advanced analysis is available through separate modules.

Highlights:
  • Of all the packages, it appears to be the easiest to use for the most widely used statistical techniques.
  • Can be used either through a Windows point-and-click interface or through syntax (i.e., written SPSS commands).

SAS (Statistical Analysis System)

Description: The SAS System comprises products for managing large databases and for statistical analysis of most classical statistical problems, including multivariate analysis, linear models, and clustering, as well as data visualization and plotting features. All SAS statistical analyses may be interfaced with the graphical products to produce relevant graphical descriptions of the data. The SAS System is available on PC and UNIX-based platforms, as well as on mainframe computers.

Highlights:
  • Many applications can be accomplished using simple point-and-click operations.
  • It also includes interface routines for linking with other statistical packages.

MINITAB

Description: Provides an array of general statistical tools for analyzing data across a variety of disciplines and users, including science, business and industry, and education. Data can be imported directly from a variety of file formats, including Lotus, Excel, Symphony, Quattro Pro, dBase and text (ASCII) files. It is available for most computer platforms, including Windows, DOS, Macintosh, OpenVMS, and Unix. Users can move transparently from the Macintosh version of MINITAB to the Windows version.

Highlights:
  • It is easy to learn and use, with pull-down menus and dialog boxes to assist.

S-PLUS

Description: A high-level programming language designed for easy implementation of statistical functions, with capabilities for multivariate analysis and cluster analysis. It also offers extensive graphics and hardcopy capability.

Highlights:
  • Flexible with regard to the implementation of user-defined functions and the customization of one's environment.
  • It has dedicated modules targeted at specific application areas.
  • S-PLUS runs on both PC and UNIX-based platforms.