Market Segmentation Using SAS and Market Surveys

Charles J. Schwartz, Intelligent Analytical Services, Inc., Los Angeles, CA


Market segmentation is a combination of art and science. SAS provides the tools to create robust market segments and to evaluate their effectiveness using SAS/STAT's clustering and discriminant analysis tools. This paper describes a typical segmentation project from data preparation through cluster analysis, reporting, and application. It examines problems and pitfalls and presents several SAS macros to make the analysis quicker and easier.


Marketing research fads come and go, but market segmentation remains a staple of the business. With media fragmentation and the Internet, niche marketing is becoming more and more important, and with it the demand for segmentation studies has exploded.

Like many things in marketing research, segmentation is not a specific technique, but a family of analyses. Marketers have used everything from gut feelings and cross tabs to closet sociological theorizing and multivariate statistics to form their segments. Since a market segment is a group of people who are similar to one another on a variety of criteria and dissimilar from people in other segments, cluster analysis seems the natural technique to use and SAS the natural tool to implement it.

What of data? Many segmentation projects over the past decade or two have used demographics as a basis to group people. PRIZM clusters and competing schemes are ubiquitous evidence of this. Who in marketing has not heard of "Pools and Patios" and "Furs and Station Wagons"? With more fragmented media, however, marketers are concentrating on tailoring their messages. To do so, they have been relying more and more on attitudinal data derived from marketing surveys to group people on what they think of a product rather than who they are. Once these groups are determined, it is an easy task to target the segments most likely to buy the product, and target the product to that group's specific view of it.

This paper will go through the steps of a project of this kind. It will outline the techniques that I use and some of the tricks that need to be used to meet a client's objectives for this kind of research. It will not be a statistical treatment. In fact, some of it has been known to produce apoplectic fits among the statistically orthodox. It does, however, seem to get the job done - clients seem to get the guidance they need and the increased marketing effectiveness they want.

The Sample

Marketing surveys are designed to be multipurpose and cheap. Sampling schemes may be less than ideal, and response rates are low and often biased. For a segmentation study, certain segments may not show up in the sample and those that do may not have the same distribution that they do in the population. This is something that analysts have to live with. They are often called in after the data collection is done. Even if the analyst gets in on the ground floor, clients balk at the cost and time involved in doing things right. So dubious samples are something that the segmenter has to deal with.

This puts two burdens on the analyst. One is client education. The client has to know how far to push the results and more important, be aware that he or she may be missing important segments that the survey may not tap. The client needs to know how the survey may be biased and what kind of segments may be missed. It is then the client's burden to assess whether the analysis is worth the candle.

The second burden lies most heavily on the analyst. With a potentially biased sample, it is up to the analyst to determine whether or not a segmentation scheme is valid. Even if a clear segmentation solution pops up on the first try, it needs at least to make sense on its face. Beyond that, it must have construct validity - it needs to be easily interpretable by the client based on his or her knowledge of the market and by the analyst in terms of his or her sociological or psychological knowledge and marketing research experience. The stress here is on easily. SAS' clustering procedures will always produce output, and we all know how fun it is to produce clever ex-post-facto explanations. If you cannot explain your results to a ninth grader, you are probably putting a clever gloss on an artifact. If it isn't simple you don't have a solution.

Measurement and Data Preparation

Ideally, all clustering variables should be measured on the same interval scale. Attitudinal data, particularly in marketing surveys, rarely has this luxury. Most commonly, attitudes are measured on a four- or five-point Likert scale ranging from "strongly agree" to "strongly disagree". The five-point scale contains a neutral point; the four-point scale does not. If the analyst is lucky, the survey will have used a ten-point semantic differential:

Strongly                            Strongly
Disagree         Neutral            Agree

0   1   2   3   4   5   6   7   8   9   10

I prefer the semantic differential because it graphically imposes an interval scale. In general the more points, the better.


A big problem, particularly when you are asking questions about importance, is the tendency for respondents to be agreeable (or disagreeable). Respondents often think everything is important. Give them a ten point scale and everything is eight, nine, or ten. Another group may think everything unimportant and answer one, two, or three. Unfortunately, put this into a cluster analysis and you will get a segment that thinks everything is important, and a segment that thinks nothing is important. Naturally this is an artifact of response bias. If this is the case, you will need to center the data. By centering I mean to standardize the responses of a single respondent to a battery of questions. Suppose you ask a battery of ten importance questions, for example. In your data step you will have to include the following to produce centered variables:

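/* Center each respondent's answers across the ten importance questions */
mn = mean(of qst6a_01-qst6a_10);   /* respondent's own mean */
st = std(of qst6a_01-qst6a_10);    /* respondent's own standard deviation */
array qq{*} qst6a_01-qst6a_10;
do i = 1 to dim(qq);
   if st gt 0 then qq{i} = (qq{i} - mn) / st;
   else qq{i} = .;                 /* no variation, so the answers cannot be standardized */
end;
drop i;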

Implicitly this assumes that respondents are answering a set of questions on the basis of different internal measurement scales. Centering forces the scales to be the same for each respondent.
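To see what centering does to response bias, here is a minimal self-contained sketch. It assumes the divisor st is meant to be each respondent's standard deviation across the battery. The three respondents in the data set are made up for illustration: one who rates everything high, one who rates everything low, and one who gives the same answer to every question.

```sas
/* Hypothetical data: three respondents, ten importance items */
data survey;
   input id qst6a_01-qst6a_10;
   datalines;
1 8 9 10 9 8 10 9 8 9 10
2 1 2 3 2 1 3 2 1 2 3
3 5 5 5 5 5 5 5 5 5 5
;
run;

/* Standardize each respondent's answers within the battery */
data centered;
   set survey;
   mn = mean(of qst6a_01-qst6a_10);   /* respondent's own mean */
   st = std(of qst6a_01-qst6a_10);    /* respondent's own standard deviation */
   array qq{*} qst6a_01-qst6a_10;
   do i = 1 to dim(qq);
      if st gt 0 then qq{i} = (qq{i} - mn) / st;
      else qq{i} = .;                 /* flat answers carry no relative information */
   end;
   drop i;
run;

proc print data=centered noobs;
run;
```

Respondents 1 and 2 differ only in the level of their answers, so after centering their profiles are identical and they can no longer be split into an "everything is important" and a "nothing is important" segment. Respondent 3 shows no variation at all and ends up with missing values on every item.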

In practice, I will almost always center importance questions. For agree/disagree questions, I will go back to square one and center data scales when initial analyses produce all agree or all disagree clusters.
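One way to spot all-agree or all-disagree clusters is to look at the cluster means on the raw items after a first pass. The sketch below is only an illustration under stated assumptions: the data are simulated, the item names q1-q5 are hypothetical, and PROC FASTCLUS with two clusters stands in for whatever clustering procedure and cluster count the project actually uses.

```sas
/* Simulated agree/disagree battery: items q1-q5 on a 1-5 scale (hypothetical) */
data attitudes;
   call streaminit(2024);
   array q{5} q1-q5;
   do id = 1 to 200;
      /* half the sample leans toward agreeing with everything, half toward disagreeing */
      if id le 100 then shift = 1;
      else shift = -1;
      do j = 1 to 5;
         q{j} = min(5, max(1, round(3 + shift + rand('normal', 0, 0.7))));
      end;
      output;
   end;
   drop j shift;
run;

/* First-pass clustering on the uncentered items */
proc fastclus data=attitudes maxclusters=2 out=clus noprint;
   var q1-q5;
run;

/* Item means by cluster: one cluster high on every item and another low on
   every item suggests response bias rather than a real segment */
proc means data=clus mean maxdec=2;
   class cluster;
   var q1-q5;
run;
```

If the cluster profiles differ only in overall level and not in shape, that is the signal to go back, center the battery, and rerun the analysis.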

Missing Data

Missing values are a big problem with survey data. Given the multivariate nature of the analysis, ascription (imputation) of missing values is almost always necessary in order to keep a sufficient portion of the sample in the analysis. If more than about ten percent of the respondents would be excluded from the cluster analysis due to missing values, I will ascribe using mean substitution or a random assignment technique. In general, for each battery of questions, I will ascribe missing values for a respondent who has answered at least half of the questions in the set. I consider those who have answered fewer as not providing sufficient information to include in the analysis. If you are going to use centering, center the variables first and then apply your ascription technique to the centered data.
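The half-answered rule combined with mean substitution might be sketched as follows. This is an assumption-laden illustration, not a prescribed method: the data are invented, the substitution uses each question's mean across the retained respondents, and PROC STDIZE with the REPONLY option is just one convenient way to do it. The example works on raw scores for brevity; with centering, the same steps would run on the centered variables.

```sas
/* Hypothetical data: ten-item battery with scattered missing values */
data survey;
   input id qst6a_01-qst6a_10;
   datalines;
1 8 9 . 9 8 10 9 8 9 10
2 1 . . 2 1 3 2 1 2 3
3 5 . . . . . . 4 6 .
4 7 6 8 7 9 8 7 6 8 7
;
run;

/* Keep respondents who answered at least half of the ten questions */
data eligible;
   set survey;
   if n(of qst6a_01-qst6a_10) ge 5;
run;

/* Mean substitution: REPONLY replaces missing values only and
   leaves the observed answers unchanged */
proc stdize data=eligible out=filled reponly method=mean;
   var qst6a_01-qst6a_10;
run;
```

Respondent 3 answered only three of the ten questions and is dropped. The remaining respondents keep their observed answers and receive the question mean wherever they left a blank.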

Factor Analysis

Factor analysis is crucial to this form of segmentation for practical and for theoretical reasons. On the practical side it has several advantages: