Solved: counting distinct values over a range of variables - Page 2

HB · Posted 01-26-2017 04:24 PM

It's hard to know what abc and gf5 represent or what the desired outcome is or anything at all actually so it's hard to give advice.

Let's assume we have a number of restaurants and a number of restaurant health examinations (not the same exam conducted on multiple dates but different exam types) and further that each restaurant undergoes one of the examtypes on each given testdate. One could structure the data to look like

restaurantid	testdate1	testdate2	testdate3
restaurant1	examtype2	examtype2	examtype1
restaurant2	examtype3	examtype1	examtype2
restaurant3	examtype2	examtype1	examtype1

From this you could tell what restaurants underwent what kind of exam on testdate2, and whether restaurant2 has over the three exam dates completed all exam types, but it is a little cumbersome. Plus you have to add a new column for every new test date.

If the data looked like

restaurantid	testdates	examtype
restaurant1	testdate1	2
restaurant1	testdate2	2
restaurant1	testdate3	1
restaurant2	testdate1	3
restaurant2	testdate2	1
restaurant2	testdate3	2
restaurant3	testdate1	2
restaurant3	testdate2	1
restaurant3	testdate3	1

You can also tell what exams were given on any given date and you can tell whether a particular restaurant has completed all the required exams and its easier and you don't have to change the table structure for an additional test date.

As @LaurieF says, "it's all a bit theoretical" at this point.

data_null__ · Posted 01-26-2017 04:17 PM

@ilikesas wrote:

Hi LaurieF,

here is some data that I prepared:
data have;
input var1 $4. var2 $4. var3 $4. var4 $4. var5 $4.;
datalines;
abc abc def g55 g56
abc def ttt hhh hhh
ddd ddd ddd ddd ddd
111 111 112 111 111
a   a   a   a   a  
ab  ab  ac  ac  ac 
;
run;
I guess that in order to transpose more datalines are needed than variables. Here I also made al variables to be of character type and length 4, but I guess that this specification is not very important when data is imported from an external source such as Excel.

Thanks!

For your first example if your data were normalized this is the code you could use to count distinct values per group. It will do any number of groups.

ods select none;
ods output nlevels=nlevels;
proc freq nlevels;
   by id group;
   table value / noprint;
   run;
ods select all;

Patrick · Posted 01-26-2017 03:51 PM

Others pointed out already that you probably should structure your data differently.

It's a bit of an "overkill" but should you have to do this for many variable combinations with your current data structure then the following would be an option.

data have;
  input (var1 var2 var3) ($) var4 var5;
  datalines;
abc d45g abc 1 2
a c 9 9 9
;
run;

%macro count_dist_vals(invars,outvar);
  %local _num;
  %let _num=%sysfunc(ceil(%sysevalf(%sysfunc(ranuni(0))*10000000000)),z10.);
  if _n_=1 then
    do;
      length _tmp_var_12345 $32767;
      dcl hash h_&_num.(multidata:'n', hashexp:2);
      h_&_num..defineKey('_tmp_var_12345');
      h_&_num..defineDone();
    end;
    %do i= 1 %to %sysfunc(countw(&invars));
      _tmp_var_12345=catt(%scan(&invars,&i));
      h_&_num..ref();
    %end;
    &outvar=h_&_num..num_items;
    h_&_num..clear();
    drop _tmp_var_12345 ;
%mend;

data want;
  set have;
  %count_dist_vals(var1 var2 var3, n_dist_vals_1)
  %count_dist_vals(var3 var4 var5, n_dist_vals_2)
run;

art297 · Posted 01-26-2017 06:28 PM

I happen to disagree with my colleagues on this point. Normalizing one's data doesn't always provide the expected benefits .. especially with large data sets.

@data_null__: John, neither one of us would write such Goldberg-type code unless we were challenging each other to see who could come up with the most obfuscatious solution (job security maybe?).

And, I'm definitely not saying that I'm proud of the code I offered. However, like your Goldberg contraption, it does what the OP wanted (p.s., mine ran faster than yours).

Here is the code I ran, the log follows:

data have (drop=i);
  input (var1-var5) ($);
  do i=1 to 10000;output;end;
  cards;
abc d45g abc 1 2
a c rrrrrr g9 g9
abc abc def g55 g56
abc def ttt hhh hhh
ddd ddd ddd ddd ddd
111 111 112 111 111
a   a   a   a   a  
ab  ab  ac  ac  ac 
;

data want1 (keep=var1_var3 var4_var5);
  set have;
  array v1(*) $ var1-var3;
  array v2(*) $ var4-var5;
  call sortc(of v1(*));
  do i=1 to dim(v1);
    if not missing(v1(i)) then do;
      if i lt dim(v1) then do;
        if v1(i)ne v1(i+1) then var1_var3=sum(var1_var3,1);
      end;
      else var1_var3=sum(var1_var3,1);
    end;
  end;
  call sortc(of v2(*));
  do i=1 to dim(v2);
    if not missing(v2(i)) then do;
      if i lt dim(v2) then do;
        if v2(i)ne v2(i+1) then var4_var5=sum(var4_var5,1);
      end;
      else var4_var5=sum(var4_var5,1);
    end;
  end;
run;
data have;
  set have;
  id=_n_;
run;

proc transpose data=have out=need;
  var var1-var5;
  by id;
run;

data need (keep=id group col1);
  set need;
  if _name_ in ('var1', 'var2', 'var3') then group=1;
  else group=2;
run;

ods select none;
ods output nlevels=nlevels;
proc freq data=need nlevels;
   by id group;
   table col1 / noprint;
   run;
ods select all;

data want2 (keep=var1_var3 var4_var5);
  set nlevels (drop=TableVar);
  retain var1_var3;
  by id;
  if mod(_n_,2) eq 1 then var1_var3=nlevels;
  else do;
    var4_var5=nlevels;
    output;
  end;
run;

data want3 (keep=var1_var3 var4_var5);
   set have;
   array a $8 var1-var3;
   array b $8 var4-var5;
   array d[2] _temporary_;
   d[1]=dim(a); d[2]=dim(b);
   array c(j) a b;
   array t[10] $8 _temporary_;
   array count[2] var1_var3 var4_var5;
   do over c;
      call missing(k,of t[*]);
      do _i_ = 1 to d[j];
         if c in t then continue;
         k + 1;
         t[k]=c;
         end;
      count[j] = k;
      end;
run;
%macro count_dist_vals(invars,outvar);
  %local _num;
  %let _num=%sysfunc(ceil(%sysevalf(%sysfunc(ranuni(0))*10000000000)),z10.);
  if _n_=1 then
    do;
      length _tmp_var_12345 $32767;
      dcl hash h_&_num.(multidata:'n', hashexp:2);
      h_&_num..defineKey('_tmp_var_12345');
      h_&_num..defineDone();
    end;
    %do i= 1 %to %sysfunc(countw(&invars));
      _tmp_var_12345=catt(%scan(&invars,&i));
      h_&_num..ref();
    %end;
    &outvar=h_&_num..num_items;
    h_&_num..clear();
    drop _tmp_var_12345 ;
%mend;

data want4;
  set have;
  %count_dist_vals(var1 var2 var3, n_dist_vals_1)
  %count_dist_vals(var3 var4 var5, n_dist_vals_2)
run;

and now the log:

 1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 NOTE: ODS statements in the SAS Studio environment may disable some output features.
 56         
 57         data have (drop=i);
 58           input (var1-var5) ($);
 59           do i=1 to 10000;output;end;
 60           cards;
 
 NOTE: The data set WORK.HAVE has 80000 observations and 5 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.01 seconds
       cpu time            0.00 seconds
       
 69         ;
 
 70         
 71         data want1 (keep=var1_var3 var4_var5);
 72           set have;
 73           array v1(*) $ var1-var3;
 74           array v2(*) $ var4-var5;
 75           call sortc(of v1(*));
 76           do i=1 to dim(v1);
 77             if not missing(v1(i)) then do;
 78               if i lt dim(v1) then do;
 79                 if v1(i)ne v1(i+1) then var1_var3=sum(var1_var3,1);
 80               end;
 81               else var1_var3=sum(var1_var3,1);
 82             end;
 83           end;
 84           call sortc(of v2(*));
 85           do i=1 to dim(v2);
 86             if not missing(v2(i)) then do;
 87               if i lt dim(v2) then do;
 88                 if v2(i)ne v2(i+1) then var4_var5=sum(var4_var5,1);
 89               end;
 90               else var4_var5=sum(var4_var5,1);
 91             end;
 92           end;
 93         run;
 
 NOTE: There were 80000 observations read from the data set WORK.HAVE.
 NOTE: The data set WORK.WANT1 has 80000 observations and 2 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.03 seconds
       cpu time            0.04 seconds
       
 
 94         data have;
 95           set have;
 96           id=_n_;
 97         run;
 
 NOTE: There were 80000 observations read from the data set WORK.HAVE.
 NOTE: The data set WORK.HAVE has 80000 observations and 6 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.02 seconds
       cpu time            0.02 seconds
       
 
 98         
 99         proc transpose data=have out=need;
 100          var var1-var5;
 101          by id;
 102        run;
 
 NOTE: There were 80000 observations read from the data set WORK.HAVE.
 NOTE: The data set WORK.NEED has 400000 observations and 3 variables.
 NOTE: PROCEDURE TRANSPOSE used (Total process time):
       real time           0.07 seconds
       cpu time            0.07 seconds
       
 
 103        
 104        data need (keep=id group col1);
 105          set need;
 106          if _name_ in ('var1', 'var2', 'var3') then group=1;
 107          else group=2;
 108        run;
 
 NOTE: There were 400000 observations read from the data set WORK.NEED.
 NOTE: The data set WORK.NEED has 400000 observations and 3 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.04 seconds
       cpu time            0.05 seconds
       
 
 109        
 110        ods select none;
 111        ods output nlevels=nlevels;
 112        proc freq data=need nlevels;
 113           by id group;
 114           table col1 / noprint;
 115           run;
 
 NOTE: The data set WORK.NLEVELS has 160000 observations and 4 variables.
 NOTE: There were 400000 observations read from the data set WORK.NEED.
 NOTE: PROCEDURE FREQ used (Total process time):
       real time           14.90 seconds
       cpu time            15.02 seconds
       
 
 116        ods select all;
 117        
 118        data want2 (keep=var1_var3 var4_var5);
 119          set nlevels (drop=TableVar);
 120          retain var1_var3;
 121          by id;
 122          if mod(_n_,2) eq 1 then var1_var3=nlevels;
 123          else do;
 124            var4_var5=nlevels;
 125            output;
 126          end;
 127        run;
 
 NOTE: There were 160000 observations read from the data set WORK.NLEVELS.
 NOTE: The data set WORK.WANT2 has 80000 observations and 2 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.02 seconds
       cpu time            0.02 seconds
       
 
 128        
 129        data want3 (keep=var1_var3 var4_var5);
 130           set have;
 131           array a $8 var1-var3;
 132           array b $8 var4-var5;
 133           array d[2] _temporary_;
 134           d[1]=dim(a); d[2]=dim(b);
 135           array c(j) a b;
 136           array t[10] $8 _temporary_;
 137           array count[2] var1_var3 var4_var5;
 138           do over c;
 139              call missing(k,of t[*]);
 140              do _i_ = 1 to d[j];
 141                 if c in t then continue;
 142                 k + 1;
 143                 t[k]=c;
 144                 end;
 145              count[j] = k;
 146              end;
 147        run;
 
 NOTE: There were 80000 observations read from the data set WORK.HAVE.
 NOTE: The data set WORK.WANT3 has 80000 observations and 2 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.06 seconds
       cpu time            0.06 seconds
       
 
 148        %macro count_dist_vals(invars,outvar);
 149          %local _num;
 150          %let _num=%sysfunc(ceil(%sysevalf(%sysfunc(ranuni(0))*10000000000)),z10.);
 151          if _n_=1 then
 152            do;
 153              length _tmp_var_12345 $32767;
 154              dcl hash h_&_num.(multidata:'n', hashexp:2);
 155              h_&_num..defineKey('_tmp_var_12345');
 156              h_&_num..defineDone();
 157            end;
 158            %do i= 1 %to %sysfunc(countw(&invars));
 159              _tmp_var_12345=catt(%scan(&invars,&i));
 160              h_&_num..ref();
 161            %end;
 162            &outvar=h_&_num..num_items;
 163            h_&_num..clear();
 164            drop _tmp_var_12345 ;
 165        %mend;
 166        
 167        data want4;
 168          set have;
 169          %count_dist_vals(var1 var2 var3, n_dist_vals_1)
 170          %count_dist_vals(var3 var4 var5, n_dist_vals_2)
 171        run;
 
 NOTE: There were 80000 observations read from the data set WORK.HAVE.
 NOTE: The data set WORK.WANT4 has 80000 observations and 8 variables.
 NOTE: DATA statement used (Total process time):
       real time           38.12 seconds
       cpu time            37.94 seconds
       
 
 172        
 173        
 174        OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 186

devsas · Posted 01-27-2017 03:26 PM

I used the following code and got what you wanted. Its a simple code using PROC sql and transpose etc rather than more complex arrays. Not sure if it will be worth if there are hundreds of variables.

data have;
input var1 $	var2 $	var3 $	var4 $	var5 $;
cards;
abc	d45g	abc	1	2
a 	c	rrrrrr	g9	g9
;
run;


proc sql;
create table var1_3 as select var1, var2, var3 from have;
quit;

proc sort data = var1_3;
by var1 var2 var3;
run;
proc transpose data = var1_3 out=var1_3_ (DROP =_NAME_);
VAR var1 var2 var3;
run;

PROC SQL;
CREATE TABLE VAR1_VAR3 AS SELECT COUNT(DISTINCT COL1) AS VAR1, COUNT(DISTINCT COL2) AS VAR2 FROM var1_3_;
QUIT;

PROC TRANSPOSE DATA = VAR1_VAR3 OUT = VAR1_VAR3_ (DROP=_NAME_) PREFIX = VAR1_VAR3;
VAR VAR1 VAR2;
RUN;

proc sql;
create table var4_5 as select var4, var5 from have;
quit;

proc sort data = var4_5;
by var4 VAR5;
run;
proc transpose data = var4_5 out=var4_5_ (DROP =_NAME_);
VAR VAR4 VAR5;
run;

PROC SQL;
CREATE TABLE VAR4_VAR5 AS SELECT COUNT(DISTINCT COL1) AS VAR4, COUNT(DISTINCT COL2) AS VAR5 FROM var4_5_ ;
QUIT;

PROC TRANSPOSE DATA = VAR4_VAR5 OUT = VAR4_VAR5_ (DROP=_NAME_) PREFIX = VAR4_VAR5;
VAR VAR4 VAR5;
RUN;

DATA WANT;
MERGE VAR1_VAR3_ VAR4_VAR5_;
RUN;

viren · Posted 09-09-2017 03:08 PM

Hi everyone I am beginner/new to SAS & learning for my certification. I was practicising on a particular dataset which has different variables and for some variables/columns there are "0s" as values.

What I am trying to do is generate a table that lists the columns horizontally with the number of "0s" they have in their column. I have attached the dataset screenshot and also the output I am trying to achieve.

** I am using WPS environment for practising my codes.

Any help/guidance will really be appreciated.

variable label nmiss / count

Total Amt Total Amt 0

Installments Insss 5

Fitness fit 8

Re: counting distinct values over a range of variables

Re: counting distinct values over a range of variables

Re: counting distinct values over a range of variables

Re: counting distinct values over a range of variables

Re: counting distinct values over a range of variables

Need help on a code for counting values

Re: counting distinct values over a range of variables

Re: counting distinct values over a range of variables

Re: counting distinct values over a range of variables

Re: counting distinct values over a range of variables

Re: counting distinct values over a range of variables

Need help on a code for counting values

Register Today!

SAS Training: Just a Click Away