BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
canino15
Calcite | Level 5

Hi guys, I'm new to SAS and I'm trying to obtain the distinct values for every variable in a large table (+300 Columns), for that reason I can't do it manually with every single column. 

 

I have this 

Var1Var2Var3
a11
b12
b23
a34

 

I want to get something like this:

Var1a,b
Var21,2,3
Var31,2,3,4

 

I tried doing an iterative process using a do within a proc sql to at least get vectors for the unique values.

 

Thanks a lot!!

1 ACCEPTED SOLUTION

Accepted Solutions
r_behata
Barite | Level 11
data have;
input Var1 $ Var2 $ Var3 $;
cards;
a 1 1
b 1 2
b 2 3
a 3 4
;
run;

proc transpose data=have out=temp name=name;
var var:;
run;


data want;
	set temp;
	
	length value $10.;
	array d_[*] col:;
	
	value='';
	do _n_=1 to dim(d_);
		if ^ find(value,d_[_n_],'it') then value=catx(',',value,d_[_n_]);
	end;

	drop col:;
run;

View solution in original post

8 REPLIES 8
r_behata
Barite | Level 11
data have;
input Var1 $ Var2 $ Var3 $;
cards;
a 1 1
b 1 2
b 2 3
a 3 4
;
run;

proc transpose data=have out=temp name=name;
var var:;
run;


data want;
	set temp;
	
	length value $10.;
	array d_[*] col:;
	
	value='';
	do _n_=1 to dim(d_);
		if ^ find(value,d_[_n_],'it') then value=catx(',',value,d_[_n_]);
	end;

	drop col:;
run;
Ksharp
Super User
data have;
input Var1 $ Var2 $ Var3 $;
cards;
a 1 1
b 1 2
b 2 3
a 3 4
;
run;

proc sql;
create table temp as
select distinct 'var1' as v length=40,var1 as val from have
union all
select distinct 'var2' as v length=40,var2 as val from have
union all
select distinct 'var3' as v length=40,var3 as val from have
;
quit;
data want;
 do until(last.v);
  set temp;
  by v notsorted;
  length want $ 200;
  want=catx(',',want,val);
 end;
 drop val;
 run;
canino15
Calcite | Level 5

I first thought of something similar, but I have +300 variables, so is not viable for my usecase. Do you know if there is a way of running a query for every column?, thanks anyway, the last part of the code is usefull for other thing I'm working on. 

Tom
Super User Tom
Super User

@canino15 wrote:

I first thought of something similar, but I have +300 variables, so is not viable for my usecase. Do you know if there is a way of running a query for every column?, thanks anyway, the last part of the code is usefull for other thing I'm working on. 


You could try PROC SUMMARY.  Make sure to use WAYS 1 or else you will definitely run out of memory.  You might still run out of memory with really large data with essentially continuous variables.

 

For example let's look at the first 4 observations of SASHELP.CLASS

proc summary data=sashelp.class(obs=5) missing chartype ;
  class _all_;
  ways 1;
  output out=summary;
run;

Output:

Obs     Name      Sex    Age    Height    Weight    _TYPE_    _FREQ_

  1                        .       .        84.0    00001        1
  2                        .       .        98.0    00001        1
  3                        .       .       102.5    00001        2
  4                        .       .       112.5    00001        1
  5                        .     56.5         .     00010        1
  6                        .     62.8         .     00010        1
  7                        .     63.5         .     00010        1
  8                        .     65.3         .     00010        1
  9                        .     69.0         .     00010        1
 10                       13       .          .     00100        2
 11                       14       .          .     00100        3
 12                F       .       .          .     01000        3
 13                M       .       .          .     01000        2
 14    Alfred              .       .          .     10000        1
 15    Alice               .       .          .     10000        1
 16    Barbara             .       .          .     10000        1
 17    Carol               .       .          .     10000        1
 18    Henry               .       .          .     10000        1

 

In the past I have used a macro, %dbcon(), to basically transpose the data and then analyze it.

%dbcon
/*----------------------------------------------------------------------
Summarize the contents of a dataset.
----------------------------------------------------------------------*/
(sashelp.class      /* Dataset name */
,maxchar=40         /* Maximum length of character variables */
,maxobs=100000      /* Maximum observations before using sampling */
,select=            /* Variable names to select for analysis */
,exclude=           /* Variable names to exclude from analysis */
,outval=_dbvals     /* Dataset name for values output */
,outsum=_dbvars     /* Dataset name for variable summary output */
,fname= print            /* Fileref or filename in quotes for text file */
,nval=10            /* Number of distinct values to print */
,printn=YES           /* Include value frequency when text file is made */
);

Results:

~=====================================================================
CLASS NOBS=19 [Display limited to 10 values]
======================================================================
 N AGE LEN=8 nval=6
----------------------------------------------------------------------
 2           11
 5           12
 3           13
 4           14
 4           15
 1           16
______________________________________________________________________
 N HEIGHT LEN=8 nval=17
----------------------------------------------------------------------
 1         51.3
 1         56.3
 1         56.5
 1         57.3
 1         57.5
...............
 1         65.3
 2         66.5
 1           67
 1           69
 1           72
______________________________________________________________________
 N NAME LEN=$8 nval=19 maxlen=7
----------------------------------------------------------------------
 1 Alfred
 1 Alice
 1 Barbara
 1 Carol
 1 Henry
...............
 1 Philip
 1 Robert
 1 Ronald
 1 Thomas
 1 William
______________________________________________________________________
 N SEX LEN=$1 nval=2 maxlen=1
----------------------------------------------------------------------
 9 F
10 M
______________________________________________________________________
 N WEIGHT LEN=8 nval=15
----------------------------------------------------------------------
 1         50.5
 1           77
 1           83
 2           84
 1         84.5
...............
 2          112
 2        112.5
 1          128
 1          133
 1          150
______________________________________________________________________

 

Ksharp
Super User

Make a macro , it is not a big deal.

Since you have 300+ variables ,and SQL only support 255 UNION , so I make two part SQL to make it happen.

 

data have;
input Var1 $ Var2 $ Var3 $;
cards;
a 1 1
b 1 2
b 2 3
a 3 4
;
run;

proc transpose data=have(obs=0) out=vname;
var _all_;
run;

data vname1 vname2;
 set vname nobs=nobs;
 if _n_ < nobs/2 then output vname1;
  else output vname2;
run;

proc sql noprint;
select catt('select distinct "',_name_,'" as v length=40,',_name_,' as val from have')
       into : part1 separated by 'union all'   
 from vname1;

 
select catt('select distinct "',_name_,'" as v length=40,',_name_,' as val from have')
       into : part2 separated by ' union all '   
 from vname2;

 create table want1 as
 &part1 ;

 
 create table want2 as
 &part2 ;

 create table want as
 select * from want1
 union
 select * from want2;
 quit;

data final_want;
 do until(last.v);
  set want;
  by v notsorted;
  length want $ 400;
  want=catx(',',want,val);
 end;
 drop val;
 run;
ballardw
Super User

Before even starting something like this I ask myself: do I have variables that are unique or close to unique for each record. Examples: most identification type variables, phone numbers, Addresses, billed amount, account balance, instrument reading.

 

If you have 10000 records in a data set with a unique identifier like social security number (9 or 11 characters depending on how stored) your example output would be asking for a single variable of close to 90,000 characters (plus close to 10,000 commas separating them). Which exceeds the length of a SAS character variable.

Then there is the usefulness of looking at 10000 values.

If you have such variables you really want to consider if they get this treatment.

 

You can get an idea for how many values are involved by running code like this:

Proc freq data=sashelp.class nlevels;
   ods exclude onewayfreqs;
run;

Which will create a table with each variable name the number of levels (unique values) and if you have missing values some information about that as well.

canino15
Calcite | Level 5

Hey thanks for answersing, in case of having a lot of distinct values, I would only need 25 instances. 

Patrick
Opal | Level 21

Could become long running but code as below should give you what you're asking for.

data test;
  do i=1 to 100;
    output;
  end;
  do j=1 to 10;
    output;
  end;
run;

proc freq data=test order=freq nlevels;
  table _all_ /maxlevels=25 missing;
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 3635 views
  • 7 likes
  • 6 in conversation