Hi,
suppose I have the following data:
var1 | var2 | var3   | var4 | var5
abc  | d45g | abc    | 1    | 2
a    | c    | rrrrrr | g9   | g9
var1 - var3 is one range, and var4 - var5 is another.
For each row, I would like to count the number of distinct values for each range of variables. So for the above case, here is the count:
var1-var3 | var4-var5
2         | 2
3         | 1
Thanks!
Here is a brute force approach:
data want;
   set have;
   array v1(*) $ var1-var3;
   array v2(*) $ var4-var5;
   call sortc(of v1(*));
   do i=1 to dim(v1);
      if not missing(v1(i)) then do;
         if i lt dim(v1) then do;
            if v1(i) ne v1(i+1) then var1_var3=sum(var1_var3,1);
         end;
         else var1_var3=sum(var1_var3,1);
      end;
   end;
   call sortc(of v2(*));
   do i=1 to dim(v2);
      if not missing(v2(i)) then do;
         if i lt dim(v2) then do;
            if v2(i) ne v2(i+1) then var4_var5=sum(var4_var5,1);
         end;
         else var4_var5=sum(var4_var5,1);
      end;
   end;
run;
HTH,
Art, CEO, AnalystFinder.com
@art297 I thought you might enjoy this Rube Goldberg machine with implicit and explicit arrays, arrays of implicit arrays and temporary arrays.
data t;
array a $8 var1-var3;
array b $8 var4-var5;
array d[2] _temporary_;
d[1]=dim(a); d[2]=dim(b);
array c(j) a b;
input (var:)(:);
array t[10] $8 _temporary_;
array count[2];
do over c;
   call missing(k, of t[*]);
   do _i_ = 1 to d[j];
      if c in t then continue;
      k + 1;
      t[k] = c;
   end;
   count[j] = k;
end;
drop j k;
cards;
abc d45g abc 1 2
a c rrrrrr g9 g9
;;;;
run;
proc print;
run;
Can you please provide more data? My initial generic response (using proc transpose) may or may not be viable, depending on the number of observations.
If the source data are small enough, transpose would be quite good. If it's over 32k observations, it won't!
@LaurieF wrote:
Can you please provide more data? My initial generic response (using proc transpose) may or may not be viable, depending on the number of observations.
If the source data are small enough, transpose would be quite good. If it's over 32k observations, it won't!
@LaurieF this is the second or third time @ilikesas has asked a similar question. If @ilikesas normalized their data, they could easily deal with this type of problem and more, without resorting to ARRAYs and other clunky DATA step code.
<nods enthusiastically>
Hi data_null_,
could you please elaborate on what you mean by "normalizing" the data? It is indeed true that arrays and do loops tend to be messy and complicated (at least for me)...
Thanks!
The long answer is: Wikipedia Data Normalisation
The short answer: SAS works inherently better when it's processing very long, very narrow tables. One of the primary functions of proc transpose, for example, is to rotate data so that columns become rows and vice versa. That way you can do queries over columns, which proc sql naturally supports.
You can, of course, use data steps to walk along arrays, and that can work very well. But if you can manage to pre-process your data so that you can avoid that, the long term solution will tend to be more effective.
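To make the transpose-then-query idea concrete, here is a minimal sketch using the var1-var5 layout from the original question. The dataset name long, the rowid variable, and the range labels are my own illustrative choices, not from the thread:

```sas
/* Rotate the wide data so each value becomes its own row,          */
/* then a distinct count per row and per range is a one-line query. */
data long;
   set have;
   rowid = _n_;                       /* remember the source row    */
   length range $9 value $8;
   array r1(*) $ var1-var3;
   array r2(*) $ var4-var5;
   do i = 1 to dim(r1);
      range = 'var1_var3'; value = r1(i); output;
   end;
   do i = 1 to dim(r2);
      range = 'var4_var5'; value = r2(i); output;
   end;
   keep rowid range value;
run;

proc sql;
   create table counts as
   select rowid, range, count(distinct value) as n_distinct
   from long
   where not missing(value)
   group by rowid, range;
quit;
```

Once the data are long, the array bookkeeping disappears and any new grouping is just a different GROUP BY.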
Without a little more feel for your data, it's hard to say what normalization you need to do.
If you had survey data, for example with each survey question a variable (column) and each survey responder an observation (row) like:
           | ques1 | ques2 | ques3
responder1 |       |       |
responder2 |       |       |
responder3 |       |       |
that would be a pretty typical format in my experience. For me (and I think for many), this sort of layout lends itself more readily to comparing responders on a particular question than to comparing questions for a particular responder, although both are possible.
I think the suggestion is that perhaps a look at the data/database design/scheme and what you want to do with it could save you some work in the long run.
Hi HB,
thanks for the insight. In my case, my data consists of code names (such as abc, gf4). Would that layout work for this?
Hi LaurieF,
here is some data that I prepared:
data have;
input (var1-var5) (:$4.);
datalines;
abc abc def g55 g56
abc def ttt hhh hhh
ddd ddd ddd ddd ddd
111 111 112 111 111
a a a a a
ab ab ac ac ac
;
run;
I guess that in order to transpose, more datalines than variables are needed. Here I also made all variables character type with length 4, but I guess this specification is not very important when the data is imported from an external source such as Excel.
Thanks!
Again, that's all very well, but it still doesn't define the problem very well: it's all a bit theoretical. Do you have some real-world data to play with?
actually I don't have real data, but I am rather thinking of building a dataset which will consist of names or name codes (such as abc, gf5, etc.)
I'd recommend, then, constructing your data so that you can avoid using arrays, and store the values in observations instead. You're making a rod for your own back otherwise.
It's a bit hard to give you advice without any context and without knowing how you plan to use this data.
On a generic level: You could go for a key/value pair construct with one column holding the name of your variables and a second column holding the value. With such a structure it's then quite simple to implement distinct counts. You could even use such data to construct SAS Formats (via Proc Format, cntlin).
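As a rough sketch of that key/value idea (all dataset and variable names here are illustrative, not from the post):

```sas
/* One row per (key, value) pair instead of a wide table. */
data lookup;
   length key $8 value $8;
   input key $ value $;
   datalines;
item1 abc
item1 gf4
item1 abc
item2 g9
item2 g9
;
run;

/* Distinct counts then become a simple aggregation. */
proc sql;
   select key, count(distinct value) as n_distinct
   from lookup
   group by key;
quit;
```

The same long table could also be turned into a CNTLIN= dataset for PROC FORMAT, as noted above, by adding the FMTNAME/START/LABEL columns it expects.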