BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ilikesas
Barite | Level 11

Hi,

 

suppsoe I have the following data:

 

var1 var2 var3 var4 var5
abc d45g abc 1 2
c rrrrrr g9 g9

 

var1 - var3 is one range, and var4 - var5 is another.

For each row, I would like to count the number of distinct values for each range of variables. So for the above case, here is the count:

 

var1-var3 var4-var5
2 2
3 1

 

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

Here is a brute force approach:

 

data want;
  set have;
  array v1(*) $ var1-var3;
  array v2(*) $ var4-var5;
  call sortc(of v1(*));
  do i=1 to dim(v1);
    if not missing(v1(i)) then do;
      if i lt dim(v1) then do;
        if v1(i)ne v1(i+1) then var1_var3=sum(var1_var3,1);
      end;
      else var1_var3=sum(var1_var3,1);
    end;
  end;
  call sortc(of v2(*));
  do i=1 to dim(v2);
    if not missing(v2(i)) then do;
      if i lt dim(v2) then do;
        if v2(i)ne v2(i+1) then var4_var5=sum(var4_var5,1);
      end;
      else var4_var5=sum(var4_var5,1);
    end;
  end;
run;

HTH,

Art, CEO, AnalystFinder.com

 

View solution in original post

20 REPLIES 20
art297
Opal | Level 21

Here is a brute force approach:

 

data want;
  set have;
  array v1(*) $ var1-var3;
  array v2(*) $ var4-var5;
  call sortc(of v1(*));
  do i=1 to dim(v1);
    if not missing(v1(i)) then do;
      if i lt dim(v1) then do;
        if v1(i)ne v1(i+1) then var1_var3=sum(var1_var3,1);
      end;
      else var1_var3=sum(var1_var3,1);
    end;
  end;
  call sortc(of v2(*));
  do i=1 to dim(v2);
    if not missing(v2(i)) then do;
      if i lt dim(v2) then do;
        if v2(i)ne v2(i+1) then var4_var5=sum(var4_var5,1);
      end;
      else var4_var5=sum(var4_var5,1);
    end;
  end;
run;

HTH,

Art, CEO, AnalystFinder.com

 

data_null__
Jade | Level 19

@art297 wrote:

Here is a brute force approach:

 

data want;
  set have;
  array v1(*) $ var1-var3;
  array v2(*) $ var4-var5;
  call sortc(of v1(*));
  do i=1 to dim(v1);
    if not missing(v1(i)) then do;
      if i lt dim(v1) then do;
        if v1(i)ne v1(i+1) then var1_var3=sum(var1_var3,1);
      end;
      else var1_var3=sum(var1_var3,1);
    end;
  end;
  call sortc(of v2(*));
  do i=1 to dim(v2);
    if not missing(v2(i)) then do;
      if i lt dim(v2) then do;
        if v2(i)ne v2(i+1) then var4_var5=sum(var4_var5,1);
      end;
      else var4_var5=sum(var4_var5,1);
    end;
  end;
run;

HTH,

Art, CEO, AnalystFinder.com

 


@art297 I thought you might enjoy this Rube Goldberg machine with implicit and explicit arrays, arrays of implicit arrays and temporary arrays. 

 

 

data t;
   array a $8 var1-var3;
   array b $8 var4-var5;
   array d[2] _temporary_;
   d[1]=dim(a); d[2]=dim(b);
   array c(j) a b;
   input (var:)(:);
   array t[10] $8 _temporary_;
   array count[2];
   do over c;
      call missing(k,of t[*]);
      do _i_ = 1 to d[j];
         if c in t then continue;
         k + 1;
         t[k]=c;
         end;
      count[j] = k;
      end;
   drop j k;
   cards;
abc d45g abc 1 2
a c rrrrrr g9 g9
;;;;
   run;
proc print;
   run;

Capture.PNG

 

LaurieF
Barite | Level 11

Can you please provide more data? My initial generic response (using proc transpose) may or may not be viable, depending on the number of observations.

 

If the source data are small enough, transpose would be quite good. If it's over 32k observations, it won't!

data_null__
Jade | Level 19

@LaurieF wrote:

Can you please provide more data? My initial generic response (using proc transpose) may or may not be viable, depending on the number of observations.

 

If the source data are small enough, transpose would be quite good. If it's over 32k observations, it won't!


@LaurieF this is the second or third time @ilikesas has asked a similar question.  If @ilikesas would normalized their data they could easily deal with this type of problem and more, without resorting to ARRAYs and other clunky data step code.

LaurieF
Barite | Level 11

<nods enthusiastically>

ilikesas
Barite | Level 11

Hi data_null_,

 

could you please elaborate more about what you mean by "normalizing" the data? It is indeed true that arrays and do loops tend to be messy and complicated (at least for me)...

 

Thanks! 

LaurieF
Barite | Level 11

The long answer is: Wikipedia Data Normalisation

 

The short answer: SAS works inherently better when it's processing very long, very narrow tables. One of the primary functions of proc transpose, for example, is to rotate data so that columns become rows and vice versa. That way you can do queries over columns, which proc sql naturally supports.

 

You can, of course, use data steps to walk along arrays, and that can work very well. But if you can manage to pre-process your data so that you can avoid that, the long term solution will tend to be more effective.

HB
Barite | Level 11 HB
Barite | Level 11

 

@ilikesas

 

Without a little more feel for your data its hard to say what normalization you need to do.

 

If you had survey data, for example with each survey question a variable (column) and each survey responder an observation (row) like:

  ques1 ques2 ques3
responder1      
responser2      
responder3      

 

that would be a pretty typical format in my experience.  For me (and I think for many) this sort of layout lends itself to differences between responders over a particular question easier than to differences in questions for a particular responder, although both are possible. 

 

I think the suggestion is that perhaps a look at the data/database design/scheme and what you want to do with it could save you some work in the long run.

ilikesas
Barite | Level 11

Hi HB,

 

thanks for the insight. In my case my data is code names (such as abc, gf4). Could this be fine?

ilikesas
Barite | Level 11

Hi LaurieF,

 

here is some data that I prepared:

 

data have;
input var1 $4. var2 $4. var3 $4. var4 $4. var5 $4.;
datalines;
abc abc def g55 g56
abc def ttt hhh hhh
ddd ddd ddd ddd ddd
111 111 112 111 111
a   a   a   a   a  
ab  ab  ac  ac  ac 
;
run;

I guess that in order to transpose more datalines are needed than variables. Here I also made al variables to be of character type and length 4, but I guess that this specification is not very important when data is imported from an external source such as Excel.

Thanks!

LaurieF
Barite | Level 11

Again, that's all very well, but it still doesn't define the problem very well: it's all a bit theoretical. Do you have some real-world data to play with?

ilikesas
Barite | Level 11

actually I don't have real data, but I am rather thinking of building a data which will consist of names or name codes (such as abc, gf5 etc.)

LaurieF
Barite | Level 11

I'd recommend, then, constructing your data so that you can avoid using arrays, and store the values in observations instead. You're making a rod for your own back otherwise.

Patrick
Opal | Level 21

@ilikesas

It's a bit hard to give you advice without any context and without knowing how you plan to use this data.

 

On a generic level: You could go for a key/value pair construct with one column holding the name of your variables and a second column holding the value. With such a structure it's then quite simple to implement distinct counts. You could even use such data to construct SAS Formats (via Proc Format, cntlin).

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 20 replies
  • 2701 views
  • 16 likes
  • 8 in conversation