Hi guys, I'm new to SAS and I'm trying to obtain the distinct values for every variable in a large table (+300 Columns), for that reason I can't do it manually with every single column.
I have this
Var1 | Var2 | Var3 |
a | 1 | 1 |
b | 1 | 2 |
b | 2 | 3 |
a | 3 | 4 |
I want to get something like this:
Var1 | a,b |
Var2 | 1,2,3 |
Var3 | 1,2,3,4 |
I tried doing an iterative process using a do within a proc sql to at least get vectors for the unique values.
Thanks a lot!!
data have;
input Var1 $ Var2 $ Var3 $;
cards;
a 1 1
b 1 2
b 2 3
a 3 4
;
run;
proc transpose data=have out=temp name=name;
var var:;
run;
data want;
set temp;
length value $10.;
array d_[*] col:;
value='';
do _n_=1 to dim(d_);
if ^ find(value,d_[_n_],'it') then value=catx(',',value,d_[_n_]);
end;
drop col:;
run;
data have;
input Var1 $ Var2 $ Var3 $;
cards;
a 1 1
b 1 2
b 2 3
a 3 4
;
run;
proc transpose data=have out=temp name=name;
var var:;
run;
data want;
set temp;
length value $10.;
array d_[*] col:;
value='';
do _n_=1 to dim(d_);
if ^ find(value,d_[_n_],'it') then value=catx(',',value,d_[_n_]);
end;
drop col:;
run;
data have;
input Var1 $ Var2 $ Var3 $;
cards;
a 1 1
b 1 2
b 2 3
a 3 4
;
run;
proc sql;
create table temp as
select distinct 'var1' as v length=40,var1 as val from have
union all
select distinct 'var2' as v length=40,var2 as val from have
union all
select distinct 'var3' as v length=40,var3 as val from have
;
quit;
data want;
do until(last.v);
set temp;
by v notsorted;
length want $ 200;
want=catx(',',want,val);
end;
drop val;
run;
I first thought of something similar, but I have +300 variables, so is not viable for my usecase. Do you know if there is a way of running a query for every column?, thanks anyway, the last part of the code is usefull for other thing I'm working on.
@canino15 wrote:
I first thought of something similar, but I have +300 variables, so is not viable for my usecase. Do you know if there is a way of running a query for every column?, thanks anyway, the last part of the code is usefull for other thing I'm working on.
You could try PROC SUMMARY. Make sure to use WAYS 1 or else you will definitely run out of memory. You might still run out of memory with really large data with essentially continuous variables.
For example let's look at the first 4 observations of SASHELP.CLASS
proc summary data=sashelp.class(obs=5) missing chartype ;
class _all_;
ways 1;
output out=summary;
run;
Output:
Obs Name Sex Age Height Weight _TYPE_ _FREQ_ 1 . . 84.0 00001 1 2 . . 98.0 00001 1 3 . . 102.5 00001 2 4 . . 112.5 00001 1 5 . 56.5 . 00010 1 6 . 62.8 . 00010 1 7 . 63.5 . 00010 1 8 . 65.3 . 00010 1 9 . 69.0 . 00010 1 10 13 . . 00100 2 11 14 . . 00100 3 12 F . . . 01000 3 13 M . . . 01000 2 14 Alfred . . . 10000 1 15 Alice . . . 10000 1 16 Barbara . . . 10000 1 17 Carol . . . 10000 1 18 Henry . . . 10000 1
In the past I have used a macro, %dbcon(), to basically transpose the data and then analyze it.
%dbcon
/*----------------------------------------------------------------------
Summarize the contents of a dataset.
----------------------------------------------------------------------*/
(sashelp.class /* Dataset name */
,maxchar=40 /* Maximum length of character variables */
,maxobs=100000 /* Maximum observations before using sampling */
,select= /* Variable names to select for analysis */
,exclude= /* Variable names to exclude from analysis */
,outval=_dbvals /* Dataset name for values output */
,outsum=_dbvars /* Dataset name for variable summary output */
,fname= print /* Fileref or filename in quotes for text file */
,nval=10 /* Number of distinct values to print */
,printn=YES /* Include value frequency when text file is made */
);
Results:
~===================================================================== CLASS NOBS=19 [Display limited to 10 values] ====================================================================== N AGE LEN=8 nval=6 ---------------------------------------------------------------------- 2 11 5 12 3 13 4 14 4 15 1 16 ______________________________________________________________________ N HEIGHT LEN=8 nval=17 ---------------------------------------------------------------------- 1 51.3 1 56.3 1 56.5 1 57.3 1 57.5 ............... 1 65.3 2 66.5 1 67 1 69 1 72 ______________________________________________________________________ N NAME LEN=$8 nval=19 maxlen=7 ---------------------------------------------------------------------- 1 Alfred 1 Alice 1 Barbara 1 Carol 1 Henry ............... 1 Philip 1 Robert 1 Ronald 1 Thomas 1 William ______________________________________________________________________ N SEX LEN=$1 nval=2 maxlen=1 ---------------------------------------------------------------------- 9 F 10 M ______________________________________________________________________ N WEIGHT LEN=8 nval=15 ---------------------------------------------------------------------- 1 50.5 1 77 1 83 2 84 1 84.5 ............... 2 112 2 112.5 1 128 1 133 1 150 ______________________________________________________________________
Make a macro , it is not a big deal.
Since you have 300+ variables ,and SQL only support 255 UNION , so I make two part SQL to make it happen.
data have;
input Var1 $ Var2 $ Var3 $;
cards;
a 1 1
b 1 2
b 2 3
a 3 4
;
run;
proc transpose data=have(obs=0) out=vname;
var _all_;
run;
data vname1 vname2;
set vname nobs=nobs;
if _n_ < nobs/2 then output vname1;
else output vname2;
run;
proc sql noprint;
select catt('select distinct "',_name_,'" as v length=40,',_name_,' as val from have')
into : part1 separated by 'union all'
from vname1;
select catt('select distinct "',_name_,'" as v length=40,',_name_,' as val from have')
into : part2 separated by ' union all '
from vname2;
create table want1 as
&part1 ;
create table want2 as
&part2 ;
create table want as
select * from want1
union
select * from want2;
quit;
data final_want;
do until(last.v);
set want;
by v notsorted;
length want $ 400;
want=catx(',',want,val);
end;
drop val;
run;
Before even starting something like this I ask myself: do I have variables that are unique or close to unique for each record. Examples: most identification type variables, phone numbers, Addresses, billed amount, account balance, instrument reading.
If you have 10000 records in a data set with a unique identifier like social security number (9 or 11 characters depending on how stored) your example output would be asking for a single variable of close to 90,000 characters (plus close to 10,000 commas separating them). Which exceeds the length of a SAS character variable.
Then there is the usefulness of looking at 10000 values.
If you have such variables you really want to consider if they get this treatment.
You can get an idea for how many values are involved by running code like this:
Proc freq data=sashelp.class nlevels; ods exclude onewayfreqs; run;
Which will create a table with each variable name the number of levels (unique values) and if you have missing values some information about that as well.
Hey thanks for answersing, in case of having a lot of distinct values, I would only need 25 instances.
Could become long running but code as below should give you what you're asking for.
data test;
do i=1 to 100;
output;
end;
do j=1 to 10;
output;
end;
run;
proc freq data=test order=freq nlevels;
table _all_ /maxlevels=25 missing;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.