It's hard to know what abc and gf5 represent or what the desired outcome is or anything at all actually so it's hard to give advice.
Let's assume we have a number of restaurants and a number of restaurant health examinations (not the same exam conducted on multiple dates but different exam types) and further that each restaurant undergoes one of the examtypes on each given testdate. One could structure the data to look like
restaurantid | testdate1 | testdate2 | testdate3 |
restaurant1 | examtype2 | examtype2 | examtype1 |
restaurant2 | examtype3 | examtype1 | examtype2 |
restaurant3 | examtype2 | examtype1 | examtype1 |
From this you could tell what restaurants underwent what kind of exam on testdate2, and whether restaurant2 has over the three exam dates completed all exam types, but it is a little cumbersome. Plus you have to add a new column for every new test date.
If the data looked like
restaurantid | testdates | examtype |
restaurant1 | testdate1 | 2 |
restaurant1 | testdate2 | 2 |
restaurant1 | testdate3 | 1 |
restaurant2 | testdate1 | 3 |
restaurant2 | testdate2 | 1 |
restaurant2 | testdate3 | 2 |
restaurant3 | testdate1 | 2 |
restaurant3 | testdate2 | 1 |
restaurant3 | testdate3 |
1 |
You can also tell what exams were given on any given date and you can tell whether a particular restaurant has completed all the required exams and its easier and you don't have to change the table structure for an additional test date.
As @LaurieF says, "it's all a bit theoretical" at this point.
@ilikesas wrote:
Hi LaurieF,
here is some data that I prepared:
data have; input var1 $4. var2 $4. var3 $4. var4 $4. var5 $4.; datalines; abc abc def g55 g56 abc def ttt hhh hhh ddd ddd ddd ddd ddd 111 111 112 111 111 a a a a a ab ab ac ac ac ; run;
I guess that in order to transpose more datalines are needed than variables. Here I also made al variables to be of character type and length 4, but I guess that this specification is not very important when data is imported from an external source such as Excel.
Thanks!
For your first example if your data were normalized this is the code you could use to count distinct values per group. It will do any number of groups.
ods select none;
ods output nlevels=nlevels;
proc freq nlevels;
by id group;
table value / noprint;
run;
ods select all;
Others pointed out already that you probably should structure your data differently.
It's a bit of an "overkill" but should you have to do this for many variable combinations with your current data structure then the following would be an option.
data have;
input (var1 var2 var3) ($) var4 var5;
datalines;
abc d45g abc 1 2
a c 9 9 9
;
run;
%macro count_dist_vals(invars,outvar);
%local _num;
%let _num=%sysfunc(ceil(%sysevalf(%sysfunc(ranuni(0))*10000000000)),z10.);
if _n_=1 then
do;
length _tmp_var_12345 $32767;
dcl hash h_&_num.(multidata:'n', hashexp:2);
h_&_num..defineKey('_tmp_var_12345');
h_&_num..defineDone();
end;
%do i= 1 %to %sysfunc(countw(&invars));
_tmp_var_12345=catt(%scan(&invars,&i));
h_&_num..ref();
%end;
&outvar=h_&_num..num_items;
h_&_num..clear();
drop _tmp_var_12345 ;
%mend;
data want;
set have;
%count_dist_vals(var1 var2 var3, n_dist_vals_1)
%count_dist_vals(var3 var4 var5, n_dist_vals_2)
run;
I happen to disagree with my colleagues on this point. Normalizing one's data doesn't always provide the expected benefits .. especially with large data sets.
@data_null__: John, neither one of us would write such Goldberg-type code unless we were challenging each other to see who could come up with the most obfuscatious solution (job security maybe?).
And, I'm definitely not saying that I'm proud of the code I offered. However, like your Goldberg contraption, it does what the OP wanted (p.s., mine ran faster than yours).
Here is the code I ran, the log follows:
data have (drop=i); input (var1-var5) ($); do i=1 to 10000;output;end; cards; abc d45g abc 1 2 a c rrrrrr g9 g9 abc abc def g55 g56 abc def ttt hhh hhh ddd ddd ddd ddd ddd 111 111 112 111 111 a a a a a ab ab ac ac ac ; data want1 (keep=var1_var3 var4_var5); set have; array v1(*) $ var1-var3; array v2(*) $ var4-var5; call sortc(of v1(*)); do i=1 to dim(v1); if not missing(v1(i)) then do; if i lt dim(v1) then do; if v1(i)ne v1(i+1) then var1_var3=sum(var1_var3,1); end; else var1_var3=sum(var1_var3,1); end; end; call sortc(of v2(*)); do i=1 to dim(v2); if not missing(v2(i)) then do; if i lt dim(v2) then do; if v2(i)ne v2(i+1) then var4_var5=sum(var4_var5,1); end; else var4_var5=sum(var4_var5,1); end; end; run; data have; set have; id=_n_; run; proc transpose data=have out=need; var var1-var5; by id; run; data need (keep=id group col1); set need; if _name_ in ('var1', 'var2', 'var3') then group=1; else group=2; run; ods select none; ods output nlevels=nlevels; proc freq data=need nlevels; by id group; table col1 / noprint; run; ods select all; data want2 (keep=var1_var3 var4_var5); set nlevels (drop=TableVar); retain var1_var3; by id; if mod(_n_,2) eq 1 then var1_var3=nlevels; else do; var4_var5=nlevels; output; end; run; data want3 (keep=var1_var3 var4_var5); set have; array a $8 var1-var3; array b $8 var4-var5; array d[2] _temporary_; d[1]=dim(a); d[2]=dim(b); array c(j) a b; array t[10] $8 _temporary_; array count[2] var1_var3 var4_var5; do over c; call missing(k,of t[*]); do _i_ = 1 to d[j]; if c in t then continue; k + 1; t[k]=c; end; count[j] = k; end; run; %macro count_dist_vals(invars,outvar); %local _num; %let _num=%sysfunc(ceil(%sysevalf(%sysfunc(ranuni(0))*10000000000)),z10.); if _n_=1 then do; length _tmp_var_12345 $32767; dcl hash h_&_num.(multidata:'n', hashexp:2); h_&_num..defineKey('_tmp_var_12345'); h_&_num..defineDone(); end; %do i= 1 %to %sysfunc(countw(&invars)); _tmp_var_12345=catt(%scan(&invars,&i)); h_&_num..ref(); %end; &outvar=h_&_num..num_items; h_&_num..clear(); drop _tmp_var_12345 ; %mend; data want4; set have; %count_dist_vals(var1 var2 var3, n_dist_vals_1) %count_dist_vals(var3 var4 var5, n_dist_vals_2) run;
and now the log:
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; NOTE: ODS statements in the SAS Studio environment may disable some output features. 56 57 data have (drop=i); 58 input (var1-var5) ($); 59 do i=1 to 10000;output;end; 60 cards; NOTE: The data set WORK.HAVE has 80000 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.00 seconds 69 ; 70 71 data want1 (keep=var1_var3 var4_var5); 72 set have; 73 array v1(*) $ var1-var3; 74 array v2(*) $ var4-var5; 75 call sortc(of v1(*)); 76 do i=1 to dim(v1); 77 if not missing(v1(i)) then do; 78 if i lt dim(v1) then do; 79 if v1(i)ne v1(i+1) then var1_var3=sum(var1_var3,1); 80 end; 81 else var1_var3=sum(var1_var3,1); 82 end; 83 end; 84 call sortc(of v2(*)); 85 do i=1 to dim(v2); 86 if not missing(v2(i)) then do; 87 if i lt dim(v2) then do; 88 if v2(i)ne v2(i+1) then var4_var5=sum(var4_var5,1); 89 end; 90 else var4_var5=sum(var4_var5,1); 91 end; 92 end; 93 run; NOTE: There were 80000 observations read from the data set WORK.HAVE. NOTE: The data set WORK.WANT1 has 80000 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 0.03 seconds cpu time 0.04 seconds 94 data have; 95 set have; 96 id=_n_; 97 run; NOTE: There were 80000 observations read from the data set WORK.HAVE. NOTE: The data set WORK.HAVE has 80000 observations and 6 variables. NOTE: DATA statement used (Total process time): real time 0.02 seconds cpu time 0.02 seconds 98 99 proc transpose data=have out=need; 100 var var1-var5; 101 by id; 102 run; NOTE: There were 80000 observations read from the data set WORK.HAVE. NOTE: The data set WORK.NEED has 400000 observations and 3 variables. NOTE: PROCEDURE TRANSPOSE used (Total process time): real time 0.07 seconds cpu time 0.07 seconds 103 104 data need (keep=id group col1); 105 set need; 106 if _name_ in ('var1', 'var2', 'var3') then group=1; 107 else group=2; 108 run; NOTE: There were 400000 observations read from the data set WORK.NEED. NOTE: The data set WORK.NEED has 400000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.04 seconds cpu time 0.05 seconds 109 110 ods select none; 111 ods output nlevels=nlevels; 112 proc freq data=need nlevels; 113 by id group; 114 table col1 / noprint; 115 run; NOTE: The data set WORK.NLEVELS has 160000 observations and 4 variables. NOTE: There were 400000 observations read from the data set WORK.NEED. NOTE: PROCEDURE FREQ used (Total process time): real time 14.90 seconds cpu time 15.02 seconds 116 ods select all; 117 118 data want2 (keep=var1_var3 var4_var5); 119 set nlevels (drop=TableVar); 120 retain var1_var3; 121 by id; 122 if mod(_n_,2) eq 1 then var1_var3=nlevels; 123 else do; 124 var4_var5=nlevels; 125 output; 126 end; 127 run; NOTE: There were 160000 observations read from the data set WORK.NLEVELS. NOTE: The data set WORK.WANT2 has 80000 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 0.02 seconds cpu time 0.02 seconds 128 129 data want3 (keep=var1_var3 var4_var5); 130 set have; 131 array a $8 var1-var3; 132 array b $8 var4-var5; 133 array d[2] _temporary_; 134 d[1]=dim(a); d[2]=dim(b); 135 array c(j) a b; 136 array t[10] $8 _temporary_; 137 array count[2] var1_var3 var4_var5; 138 do over c; 139 call missing(k,of t[*]); 140 do _i_ = 1 to d[j]; 141 if c in t then continue; 142 k + 1; 143 t[k]=c; 144 end; 145 count[j] = k; 146 end; 147 run; NOTE: There were 80000 observations read from the data set WORK.HAVE. NOTE: The data set WORK.WANT3 has 80000 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 0.06 seconds cpu time 0.06 seconds 148 %macro count_dist_vals(invars,outvar); 149 %local _num; 150 %let _num=%sysfunc(ceil(%sysevalf(%sysfunc(ranuni(0))*10000000000)),z10.); 151 if _n_=1 then 152 do; 153 length _tmp_var_12345 $32767; 154 dcl hash h_&_num.(multidata:'n', hashexp:2); 155 h_&_num..defineKey('_tmp_var_12345'); 156 h_&_num..defineDone(); 157 end; 158 %do i= 1 %to %sysfunc(countw(&invars)); 159 _tmp_var_12345=catt(%scan(&invars,&i)); 160 h_&_num..ref(); 161 %end; 162 &outvar=h_&_num..num_items; 163 h_&_num..clear(); 164 drop _tmp_var_12345 ; 165 %mend; 166 167 data want4; 168 set have; 169 %count_dist_vals(var1 var2 var3, n_dist_vals_1) 170 %count_dist_vals(var3 var4 var5, n_dist_vals_2) 171 run; NOTE: There were 80000 observations read from the data set WORK.HAVE. NOTE: The data set WORK.WANT4 has 80000 observations and 8 variables. NOTE: DATA statement used (Total process time): real time 38.12 seconds cpu time 37.94 seconds 172 173 174 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 186
I used the following code and got what you wanted. Its a simple code using PROC sql and transpose etc rather than more complex arrays. Not sure if it will be worth if there are hundreds of variables.
data have;
input var1 $ var2 $ var3 $ var4 $ var5 $;
cards;
abc d45g abc 1 2
a c rrrrrr g9 g9
;
run;
proc sql;
create table var1_3 as select var1, var2, var3 from have;
quit;
proc sort data = var1_3;
by var1 var2 var3;
run;
proc transpose data = var1_3 out=var1_3_ (DROP =_NAME_);
VAR var1 var2 var3;
run;
PROC SQL;
CREATE TABLE VAR1_VAR3 AS SELECT COUNT(DISTINCT COL1) AS VAR1, COUNT(DISTINCT COL2) AS VAR2 FROM var1_3_;
QUIT;
PROC TRANSPOSE DATA = VAR1_VAR3 OUT = VAR1_VAR3_ (DROP=_NAME_) PREFIX = VAR1_VAR3;
VAR VAR1 VAR2;
RUN;
proc sql;
create table var4_5 as select var4, var5 from have;
quit;
proc sort data = var4_5;
by var4 VAR5;
run;
proc transpose data = var4_5 out=var4_5_ (DROP =_NAME_);
VAR VAR4 VAR5;
run;
PROC SQL;
CREATE TABLE VAR4_VAR5 AS SELECT COUNT(DISTINCT COL1) AS VAR4, COUNT(DISTINCT COL2) AS VAR5 FROM var4_5_ ;
QUIT;
PROC TRANSPOSE DATA = VAR4_VAR5 OUT = VAR4_VAR5_ (DROP=_NAME_) PREFIX = VAR4_VAR5;
VAR VAR4 VAR5;
RUN;
DATA WANT;
MERGE VAR1_VAR3_ VAR4_VAR5_;
RUN;
Hi everyone I am beginner/new to SAS & learning for my certification. I was practicising on a particular dataset which has different variables and for some variables/columns there are "0s" as values.
What I am trying to do is generate a table that lists the columns horizontally with the number of "0s" they have in their column. I have attached the dataset screenshot and also the output I am trying to achieve.
** I am using WPS environment for practising my codes.
Any help/guidance will really be appreciated.
variable label nmiss / count
Total Amt Total Amt 0
Installments Insss 5
Fitness fit 8
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.