## counting distinct values over a range of variables

Solved
Super Contributor
Posts: 455

# counting distinct values over a range of variables

Hi,

suppsoe I have the following data:

 var1 var2 var3 var4 var5 abc d45g abc 1 2 a c rrrrrr g9 g9

var1 - var3 is one range, and var4 - var5 is another.

For each row, I would like to count the number of distinct values for each range of variables. So for the above case, here is the count:

 var1-var3 var4-var5 2 2 3 1

Thanks!

Accepted Solutions
Solution
‎01-26-2017 07:04 PM
PROC Star
Posts: 8,111

## Re: counting distinct values over a range of variables

Here is a brute force approach:

```data want;
set have;
array v1(*) \$ var1-var3;
array v2(*) \$ var4-var5;
call sortc(of v1(*));
do i=1 to dim(v1);
if not missing(v1(i)) then do;
if i lt dim(v1) then do;
if v1(i)ne v1(i+1) then var1_var3=sum(var1_var3,1);
end;
else var1_var3=sum(var1_var3,1);
end;
end;
call sortc(of v2(*));
do i=1 to dim(v2);
if not missing(v2(i)) then do;
if i lt dim(v2) then do;
if v2(i)ne v2(i+1) then var4_var5=sum(var4_var5,1);
end;
else var4_var5=sum(var4_var5,1);
end;
end;
run;
```

HTH,

Art, CEO, AnalystFinder.com

All Replies
Solution
‎01-26-2017 07:04 PM
PROC Star
Posts: 8,111

## Re: counting distinct values over a range of variables

Here is a brute force approach:

```data want;
set have;
array v1(*) \$ var1-var3;
array v2(*) \$ var4-var5;
call sortc(of v1(*));
do i=1 to dim(v1);
if not missing(v1(i)) then do;
if i lt dim(v1) then do;
if v1(i)ne v1(i+1) then var1_var3=sum(var1_var3,1);
end;
else var1_var3=sum(var1_var3,1);
end;
end;
call sortc(of v2(*));
do i=1 to dim(v2);
if not missing(v2(i)) then do;
if i lt dim(v2) then do;
if v2(i)ne v2(i+1) then var4_var5=sum(var4_var5,1);
end;
else var4_var5=sum(var4_var5,1);
end;
end;
run;
```

HTH,

Art, CEO, AnalystFinder.com

Posts: 3,832

## Re: counting distinct values over a range of variables

[ Edited ]

art297 wrote:

Here is a brute force approach:

```data want;
set have;
array v1(*) \$ var1-var3;
array v2(*) \$ var4-var5;
call sortc(of v1(*));
do i=1 to dim(v1);
if not missing(v1(i)) then do;
if i lt dim(v1) then do;
if v1(i)ne v1(i+1) then var1_var3=sum(var1_var3,1);
end;
else var1_var3=sum(var1_var3,1);
end;
end;
call sortc(of v2(*));
do i=1 to dim(v2);
if not missing(v2(i)) then do;
if i lt dim(v2) then do;
if v2(i)ne v2(i+1) then var4_var5=sum(var4_var5,1);
end;
else var4_var5=sum(var4_var5,1);
end;
end;
run;
```

HTH,

Art, CEO, AnalystFinder.com

@art297 I thought you might enjoy this Rube Goldberg machine with implicit and explicit arrays, arrays of implicit arrays and temporary arrays.

``````data t;
array a \$8 var1-var3;
array b \$8 var4-var5;
array d[2] _temporary_;
d[1]=dim(a); d[2]=dim(b);
array c(j) a b;
input (var:)(:);
array t[10] \$8 _temporary_;
array count[2];
do over c;
call missing(k,of t[*]);
do _i_ = 1 to d[j];
if c in t then continue;
k + 1;
t[k]=c;
end;
count[j] = k;
end;
drop j k;
cards;
abc d45g abc 1 2
a c rrrrrr g9 g9
;;;;
run;
proc print;
run;``````

Super Contributor
Posts: 258

## Re: counting distinct values over a range of variables

Can you please provide more data? My initial generic response (using proc transpose) may or may not be viable, depending on the number of observations.

If the source data are small enough, transpose would be quite good. If it's over 32k observations, it won't!

Posts: 3,832

## Re: counting distinct values over a range of variables

[ Edited ]

LaurieF wrote:

Can you please provide more data? My initial generic response (using proc transpose) may or may not be viable, depending on the number of observations.

If the source data are small enough, transpose would be quite good. If it's over 32k observations, it won't!

@LaurieF this is the second or third time @ilikesas has asked a similar question.  If @ilikesas would normalized their data they could easily deal with this type of problem and more, without resorting to ARRAYs and other clunky data step code.

Super Contributor
Posts: 258

## Re: counting distinct values over a range of variables

<nods enthusiastically>

Super Contributor
Posts: 455

## Re: counting distinct values over a range of variables

[ Edited ]

Hi data_null_,

could you please elaborate more about what you mean by "normalizing" the data? It is indeed true that arrays and do loops tend to be messy and complicated (at least for me)...

Thanks!

Super Contributor
Posts: 258

## Re: counting distinct values over a range of variables

The long answer is: Wikipedia Data Normalisation

The short answer: SAS works inherently better when it's processing very long, very narrow tables. One of the primary functions of proc transpose, for example, is to rotate data so that columns become rows and vice versa. That way you can do queries over columns, which proc sql naturally supports.

You can, of course, use data steps to walk along arrays, and that can work very well. But if you can manage to pre-process your data so that you can avoid that, the long term solution will tend to be more effective.

Super Contributor
Posts: 260

## Re: counting distinct values over a range of variables

@ilikesas

Without a little more feel for your data its hard to say what normalization you need to do.

If you had survey data, for example with each survey question a variable (column) and each survey responder an observation (row) like:

 ques1 ques2 ques3 responder1 responser2 responder3

that would be a pretty typical format in my experience.  For me (and I think for many) this sort of layout lends itself to differences between responders over a particular question easier than to differences in questions for a particular responder, although both are possible.

I think the suggestion is that perhaps a look at the data/database design/scheme and what you want to do with it could save you some work in the long run.

Super Contributor
Posts: 455

## Re: counting distinct values over a range of variables

Hi HB,

thanks for the insight. In my case my data is code names (such as abc, gf4). Could this be fine?

Super Contributor
Posts: 455

## Re: counting distinct values over a range of variables

Hi LaurieF,

here is some data that I prepared:

``````data have;
input var1 \$4. var2 \$4. var3 \$4. var4 \$4. var5 \$4.;
datalines;
abc abc def g55 g56
abc def ttt hhh hhh
ddd ddd ddd ddd ddd
111 111 112 111 111
a   a   a   a   a
ab  ab  ac  ac  ac
;
run;``````

I guess that in order to transpose more datalines are needed than variables. Here I also made al variables to be of character type and length 4, but I guess that this specification is not very important when data is imported from an external source such as Excel.

Thanks!

Super Contributor
Posts: 258

## Re: counting distinct values over a range of variables

Again, that's all very well, but it still doesn't define the problem very well: it's all a bit theoretical. Do you have some real-world data to play with?

Super Contributor
Posts: 455

## Re: counting distinct values over a range of variables

actually I don't have real data, but I am rather thinking of building a data which will consist of names or name codes (such as abc, gf5 etc.)

Super Contributor
Posts: 258

## Re: counting distinct values over a range of variables

I'd recommend, then, constructing your data so that you can avoid using arrays, and store the values in observations instead. You're making a rod for your own back otherwise.

Posts: 4,543

## Re: counting distinct values over a range of variables

@ilikesas

It's a bit hard to give you advice without any context and without knowing how you plan to use this data.

On a generic level: You could go for a key/value pair construct with one column holding the name of your variables and a second column holding the value. With such a structure it's then quite simple to implement distinct counts. You could even use such data to construct SAS Formats (via Proc Format, cntlin).

☑ This topic is solved.