DATA Step, Macro, Functions and more

counting distinct values over a range of variables

Accepted Solution Solved
Reply
Super Contributor
Posts: 413
Accepted Solution

counting distinct values over a range of variables

Hi,

 

suppsoe I have the following data:

 

var1 var2 var3 var4 var5
abc d45g abc 1 2
c rrrrrr g9 g9

 

var1 - var3 is one range, and var4 - var5 is another.

For each row, I would like to count the number of distinct values for each range of variables. So for the above case, here is the count:

 

var1-var3 var4-var5
2 2
3 1

 

Thanks!


Accepted Solutions
Solution
‎01-26-2017 07:04 PM
PROC Star
Posts: 7,364

Re: counting distinct values over a range of variables

Here is a brute force approach:

 

data want;
  set have;
  array v1(*) $ var1-var3;
  array v2(*) $ var4-var5;
  call sortc(of v1(*));
  do i=1 to dim(v1);
    if not missing(v1(i)) then do;
      if i lt dim(v1) then do;
        if v1(i)ne v1(i+1) then var1_var3=sum(var1_var3,1);
      end;
      else var1_var3=sum(var1_var3,1);
    end;
  end;
  call sortc(of v2(*));
  do i=1 to dim(v2);
    if not missing(v2(i)) then do;
      if i lt dim(v2) then do;
        if v2(i)ne v2(i+1) then var4_var5=sum(var4_var5,1);
      end;
      else var4_var5=sum(var4_var5,1);
    end;
  end;
run;

HTH,

Art, CEO, AnalystFinder.com

 

View solution in original post


All Replies
Solution
‎01-26-2017 07:04 PM
PROC Star
Posts: 7,364

Re: counting distinct values over a range of variables

Here is a brute force approach:

 

data want;
  set have;
  array v1(*) $ var1-var3;
  array v2(*) $ var4-var5;
  call sortc(of v1(*));
  do i=1 to dim(v1);
    if not missing(v1(i)) then do;
      if i lt dim(v1) then do;
        if v1(i)ne v1(i+1) then var1_var3=sum(var1_var3,1);
      end;
      else var1_var3=sum(var1_var3,1);
    end;
  end;
  call sortc(of v2(*));
  do i=1 to dim(v2);
    if not missing(v2(i)) then do;
      if i lt dim(v2) then do;
        if v2(i)ne v2(i+1) then var4_var5=sum(var4_var5,1);
      end;
      else var4_var5=sum(var4_var5,1);
    end;
  end;
run;

HTH,

Art, CEO, AnalystFinder.com

 

Respected Advisor
Posts: 3,777

Re: counting distinct values over a range of variables

[ Edited ]

art297 wrote:

Here is a brute force approach:

 

data want;
  set have;
  array v1(*) $ var1-var3;
  array v2(*) $ var4-var5;
  call sortc(of v1(*));
  do i=1 to dim(v1);
    if not missing(v1(i)) then do;
      if i lt dim(v1) then do;
        if v1(i)ne v1(i+1) then var1_var3=sum(var1_var3,1);
      end;
      else var1_var3=sum(var1_var3,1);
    end;
  end;
  call sortc(of v2(*));
  do i=1 to dim(v2);
    if not missing(v2(i)) then do;
      if i lt dim(v2) then do;
        if v2(i)ne v2(i+1) then var4_var5=sum(var4_var5,1);
      end;
      else var4_var5=sum(var4_var5,1);
    end;
  end;
run;

HTH,

Art, CEO, AnalystFinder.com

 


@art297 I thought you might enjoy this Rube Goldberg machine with implicit and explicit arrays, arrays of implicit arrays and temporary arrays. 

 

 

data t;
   array a $8 var1-var3;
   array b $8 var4-var5;
   array d[2] _temporary_;
   d[1]=dim(a); d[2]=dim(b);
   array c(j) a b;
   input (var:)(:);
   array t[10] $8 _temporary_;
   array count[2];
   do over c;
      call missing(k,of t[*]);
      do _i_ = 1 to d[j];
         if c in t then continue;
         k + 1;
         t[k]=c;
         end;
      count[j] = k;
      end;
   drop j k;
   cards;
abc d45g abc 1 2
a c rrrrrr g9 g9
;;;;
   run;
proc print;
   run;

Capture.PNG

 

Super Contributor
Posts: 251

Re: counting distinct values over a range of variables

Can you please provide more data? My initial generic response (using proc transpose) may or may not be viable, depending on the number of observations.

 

If the source data are small enough, transpose would be quite good. If it's over 32k observations, it won't!

Respected Advisor
Posts: 3,777

Re: counting distinct values over a range of variables

[ Edited ]

LaurieF wrote:

Can you please provide more data? My initial generic response (using proc transpose) may or may not be viable, depending on the number of observations.

 

If the source data are small enough, transpose would be quite good. If it's over 32k observations, it won't!


@LaurieF this is the second or third time @ilikesas has asked a similar question.  If @ilikesas would normalized their data they could easily deal with this type of problem and more, without resorting to ARRAYs and other clunky data step code.

Super Contributor
Posts: 251

Re: counting distinct values over a range of variables

<nods enthusiastically>

Super Contributor
Posts: 413

Re: counting distinct values over a range of variables

[ Edited ]

Hi data_null_,

 

could you please elaborate more about what you mean by "normalizing" the data? It is indeed true that arrays and do loops tend to be messy and complicated (at least for me)...

 

Thanks! 

Super Contributor
Posts: 251

Re: counting distinct values over a range of variables

The long answer is: Wikipedia Data Normalisation

 

The short answer: SAS works inherently better when it's processing very long, very narrow tables. One of the primary functions of proc transpose, for example, is to rotate data so that columns become rows and vice versa. That way you can do queries over columns, which proc sql naturally supports.

 

You can, of course, use data steps to walk along arrays, and that can work very well. But if you can manage to pre-process your data so that you can avoid that, the long term solution will tend to be more effective.

Frequent Contributor
Frequent Contributor
Posts: 89

Re: counting distinct values over a range of variables

 

@ilikesas

 

Without a little more feel for your data its hard to say what normalization you need to do.

 

If you had survey data, for example with each survey question a variable (column) and each survey responder an observation (row) like:

  ques1 ques2 ques3
responder1      
responser2      
responder3      

 

that would be a pretty typical format in my experience.  For me (and I think for many) this sort of layout lends itself to differences between responders over a particular question easier than to differences in questions for a particular responder, although both are possible. 

 

I think the suggestion is that perhaps a look at the data/database design/scheme and what you want to do with it could save you some work in the long run.

Super Contributor
Posts: 413

Re: counting distinct values over a range of variables

Hi HB,

 

thanks for the insight. In my case my data is code names (such as abc, gf4). Could this be fine?

Super Contributor
Posts: 413

Re: counting distinct values over a range of variables

Hi LaurieF,

 

here is some data that I prepared:

 

data have;
input var1 $4. var2 $4. var3 $4. var4 $4. var5 $4.;
datalines;
abc abc def g55 g56
abc def ttt hhh hhh
ddd ddd ddd ddd ddd
111 111 112 111 111
a   a   a   a   a  
ab  ab  ac  ac  ac 
;
run;

I guess that in order to transpose more datalines are needed than variables. Here I also made al variables to be of character type and length 4, but I guess that this specification is not very important when data is imported from an external source such as Excel.

Thanks!

Super Contributor
Posts: 251

Re: counting distinct values over a range of variables

Again, that's all very well, but it still doesn't define the problem very well: it's all a bit theoretical. Do you have some real-world data to play with?

Super Contributor
Posts: 413

Re: counting distinct values over a range of variables

actually I don't have real data, but I am rather thinking of building a data which will consist of names or name codes (such as abc, gf5 etc.)

Super Contributor
Posts: 251

Re: counting distinct values over a range of variables

I'd recommend, then, constructing your data so that you can avoid using arrays, and store the values in observations instead. You're making a rod for your own back otherwise.

Respected Advisor
Posts: 3,900

Re: counting distinct values over a range of variables

@ilikesas

It's a bit hard to give you advice without any context and without knowing how you plan to use this data.

 

On a generic level: You could go for a key/value pair construct with one column holding the name of your variables and a second column holding the value. With such a structure it's then quite simple to implement distinct counts. You could even use such data to construct SAS Formats (via Proc Format, cntlin).

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 19 replies
  • 404 views
  • 16 likes
  • 7 in conversation