Solved: Re: Get unique values for every variable in a table

canino15 · Posted 05-05-2021 05:14 PM

Hi guys, I'm new to SAS and I'm trying to obtain the distinct values for every variable in a large table (+300 Columns), for that reason I can't do it manually with every single column.

I have this

Var1	Var2	Var3
a	1	1
b	1	2
b	2	3
a	3	4

I want to get something like this:

Var1	a,b
Var2	1,2,3
Var3	1,2,3,4

I tried doing an iterative process using a do within a proc sql to at least get vectors for the unique values.

Thanks a lot!!

r_behata · Posted 05-05-2021 06:08 PM

data have;
input Var1 $ Var2 $ Var3 $;
cards;
a 1 1
b 1 2
b 2 3
a 3 4
;
run;

proc transpose data=have out=temp name=name;
var var:;
run;


data want;
	set temp;
	
	length value $10.;
	array d_[*] col:;
	
	value='';
	do _n_=1 to dim(d_);
		if ^ find(value,d_[_n_],'it') then value=catx(',',value,d_[_n_]);
	end;

	drop col:;
run;

View solution in original post

r_behata · Posted 05-05-2021 06:08 PM

data have;
input Var1 $ Var2 $ Var3 $;
cards;
a 1 1
b 1 2
b 2 3
a 3 4
;
run;

proc transpose data=have out=temp name=name;
var var:;
run;


data want;
	set temp;
	
	length value $10.;
	array d_[*] col:;
	
	value='';
	do _n_=1 to dim(d_);
		if ^ find(value,d_[_n_],'it') then value=catx(',',value,d_[_n_]);
	end;

	drop col:;
run;

Ksharp · Posted 05-06-2021 08:23 AM

data have;
input Var1 $ Var2 $ Var3 $;
cards;
a 1 1
b 1 2
b 2 3
a 3 4
;
run;

proc sql;
create table temp as
select distinct 'var1' as v length=40,var1 as val from have
union all
select distinct 'var2' as v length=40,var2 as val from have
union all
select distinct 'var3' as v length=40,var3 as val from have
;
quit;
data want;
 do until(last.v);
  set temp;
  by v notsorted;
  length want $ 200;
  want=catx(',',want,val);
 end;
 drop val;
 run;

canino15 · Posted 05-07-2021 10:40 AM

I first thought of something similar, but I have +300 variables, so is not viable for my usecase. Do you know if there is a way of running a query for every column?, thanks anyway, the last part of the code is usefull for other thing I'm working on.

Tom · Posted 05-07-2021 02:57 PM

@canino15 wrote:

I first thought of something similar, but I have +300 variables, so is not viable for my usecase. Do you know if there is a way of running a query for every column?, thanks anyway, the last part of the code is usefull for other thing I'm working on.

You could try PROC SUMMARY. Make sure to use WAYS 1 or else you will definitely run out of memory. You might still run out of memory with really large data with essentially continuous variables.

For example let's look at the first 4 observations of SASHELP.CLASS

proc summary data=sashelp.class(obs=5) missing chartype ;
  class _all_;
  ways 1;
  output out=summary;
run;

Output:

Obs     Name      Sex    Age    Height    Weight    _TYPE_    _FREQ_

  1                        .       .        84.0    00001        1
  2                        .       .        98.0    00001        1
  3                        .       .       102.5    00001        2
  4                        .       .       112.5    00001        1
  5                        .     56.5         .     00010        1
  6                        .     62.8         .     00010        1
  7                        .     63.5         .     00010        1
  8                        .     65.3         .     00010        1
  9                        .     69.0         .     00010        1
 10                       13       .          .     00100        2
 11                       14       .          .     00100        3
 12                F       .       .          .     01000        3
 13                M       .       .          .     01000        2
 14    Alfred              .       .          .     10000        1
 15    Alice               .       .          .     10000        1
 16    Barbara             .       .          .     10000        1
 17    Carol               .       .          .     10000        1
 18    Henry               .       .          .     10000        1

In the past I have used a macro, %dbcon(), to basically transpose the data and then analyze it.

%dbcon
/*----------------------------------------------------------------------
Summarize the contents of a dataset.
----------------------------------------------------------------------*/
(sashelp.class      /* Dataset name */
,maxchar=40         /* Maximum length of character variables */
,maxobs=100000      /* Maximum observations before using sampling */
,select=            /* Variable names to select for analysis */
,exclude=           /* Variable names to exclude from analysis */
,outval=_dbvals     /* Dataset name for values output */
,outsum=_dbvars     /* Dataset name for variable summary output */
,fname= print            /* Fileref or filename in quotes for text file */
,nval=10            /* Number of distinct values to print */
,printn=YES           /* Include value frequency when text file is made */
);

Results:

~=====================================================================
CLASS NOBS=19 [Display limited to 10 values]
======================================================================
 N AGE LEN=8 nval=6
----------------------------------------------------------------------
 2           11
 5           12
 3           13
 4           14
 4           15
 1           16
______________________________________________________________________
 N HEIGHT LEN=8 nval=17
----------------------------------------------------------------------
 1         51.3
 1         56.3
 1         56.5
 1         57.3
 1         57.5
...............
 1         65.3
 2         66.5
 1           67
 1           69
 1           72
______________________________________________________________________
 N NAME LEN=$8 nval=19 maxlen=7
----------------------------------------------------------------------
 1 Alfred
 1 Alice
 1 Barbara
 1 Carol
 1 Henry
...............
 1 Philip
 1 Robert
 1 Ronald
 1 Thomas
 1 William
______________________________________________________________________
 N SEX LEN=$1 nval=2 maxlen=1
----------------------------------------------------------------------
 9 F
10 M
______________________________________________________________________
 N WEIGHT LEN=8 nval=15
----------------------------------------------------------------------
 1         50.5
 1           77
 1           83
 2           84
 1         84.5
...............
 2          112
 2        112.5
 1          128
 1          133
 1          150
______________________________________________________________________

Ksharp · Posted 05-13-2021 08:05 AM

Make a macro , it is not a big deal.

Since you have 300+ variables ,and SQL only support 255 UNION , so I make two part SQL to make it happen.

data have;
input Var1 $ Var2 $ Var3 $;
cards;
a 1 1
b 1 2
b 2 3
a 3 4
;
run;

proc transpose data=have(obs=0) out=vname;
var _all_;
run;

data vname1 vname2;
 set vname nobs=nobs;
 if _n_ < nobs/2 then output vname1;
  else output vname2;
run;

proc sql noprint;
select catt('select distinct "',_name_,'" as v length=40,',_name_,' as val from have')
       into : part1 separated by 'union all'   
 from vname1;

 
select catt('select distinct "',_name_,'" as v length=40,',_name_,' as val from have')
       into : part2 separated by ' union all '   
 from vname2;

 create table want1 as
 &part1 ;

 
 create table want2 as
 &part2 ;

 create table want as
 select * from want1
 union
 select * from want2;
 quit;

data final_want;
 do until(last.v);
  set want;
  by v notsorted;
  length want $ 400;
  want=catx(',',want,val);
 end;
 drop val;
 run;

ballardw · Posted 05-06-2021 12:54 PM

Before even starting something like this I ask myself: do I have variables that are unique or close to unique for each record. Examples: most identification type variables, phone numbers, Addresses, billed amount, account balance, instrument reading.

If you have 10000 records in a data set with a unique identifier like social security number (9 or 11 characters depending on how stored) your example output would be asking for a single variable of close to 90,000 characters (plus close to 10,000 commas separating them). Which exceeds the length of a SAS character variable.

Then there is the usefulness of looking at 10000 values.

If you have such variables you really want to consider if they get this treatment.

You can get an idea for how many values are involved by running code like this:

Proc freq data=sashelp.class nlevels;
   ods exclude onewayfreqs;
run;

Which will create a table with each variable name the number of levels (unique values) and if you have missing values some information about that as well.

canino15 · Posted 05-07-2021 10:35 AM

Hey thanks for answersing, in case of having a lot of distinct values, I would only need 25 instances.

Patrick · Posted 05-08-2021 12:07 AM

Could become long running but code as below should give you what you're asking for.

data test;
  do i=1 to 100;
    output;
  end;
  do j=1 to 10;
    output;
  end;
run;

proc freq data=test order=freq nlevels;
  table _all_ /maxlevels=25 missing;
run;

Registration is open