In that document proc means is used for numeric vars.
Another solution could be proc tabulate using the count (N) . Having some information about variables (varname) and counts a direct file or execute call could do the work defining needed lists ina many ways. What is your problem you are trying to solve?
There is probably a more elegant way to do this than my solution, but here are my steps:
Example code:
/* create sample with a duplicate column */
data class;
set sashelp.class;
xDupHeight = Height;
zDupAge = age;
run;
/* create summary stats for all numeric columns */
proc means data=class noprint;
output out=checkdup ;
run;
/* pivot the summary stats to be "row major" */
proc transpose data=checkdup (drop=_type_ _freq_)
out=work.transposed
prefix=stat
name=source
label=label
;
id _stat_;
var _numeric_;
run;
/* identify the duplicates */
proc sort data=transposed nodupkey dupout=duplicates;
by _numeric_;
run;
In my example, this is the result showing the duplicate numeric variables.
Chris
This is seeming to be some data-mining / analytics at :
1- getting knowing your data
2- reduce the number of vars by eliminating combine by correlations and for the meaningless.
That is the question topic.
3- analysing
4 -...
With the growing number of datasources some automatisation could be done. I am missing the step1 and the correlation/combing options. I could give some ideas to automatise steps.
step1: (allocs)
libname test "/folders/myfolders/test";
filename test "/folders/myfolders/test";
filename macro "/folders/myfolders/macro";
filename work "%sysfunc(pathname(work))";
filename tmp "/tmp" ;
%let analyse_dr_vars=Class_var;
%let analyse_dr_miss=Class_sum;
%let analyse_data=Class;
%let analyse_Lloc=test;
Running the code classic in one the same OS process it is the first step. Wanting tod do the next steps in parallel, this will become some startup/autoexec code.
step 2: (create_test data).
data &analyse_lloc..&analyse_data ;
set sashelp.class ;
if (_n_ < 8 ) then do;
dummy_num2=1 ;
dummy_chr2="A" ;
end;
else do;
dummy_num2=. ;
dummy_chr2=" " ;
end;
dummy_num1= .;
dummy_chr1="" ;
run;
You could define or use your own data as you like.
Take care about the missings and nulls difference (they are not teh same) wiht a RDBMS.
step 3: (create nobs formats)
/* get no obs */
/* https://support.sas.com/documentation/cdl/en/lefunctionsref/67960/HTML/default/viewer.htm#p12uchp7hm... */
/* http://support.sas.com/documentation/cdl/en/mcrolref/67912/HTML/default/viewer.htm#p1o13d7wb2zfcnn19... */
options mprint mlogic symbolgen source2 ;
%macro nobsdsn (dsn=,m_obs=,m_var= ) ;
%global &m_obs &m_var ;
%let dsid= %sysfunc(open(&dsn)) ;
%if &dsid %then %do;
%let &m_obs = %sysfunc(attrn(&dsid,NOBS));
%let &m_var = %sysfunc(attrn(&dsid,NVARS));
%let rc=%sysfunc(close(&dsid)) ;
%end;
%else %do;
%put Open for data set &dset failed - %sysfunc(sysmsg());
%end;
%mend ;
/* https://support.sas.com/documentation/cdl/en/proc/67916/HTML/default/viewer.htm#n1c16dxnndwfzyn14o1k... */
proc format ;
value nonmiss ( Default=3 )
. = "-.-"
low-high = "VNM" ;
value $ nonmiss ( Default=3 )
" " = "-.-"
other = "VCH" ;
run;
%nobsdsn(dsn=&analyse_lloc..&analyse_data,m_obs=analyse_obs,m_var=analyse_var ) ;
This one is reducing alls possible values into just two. unique keys and other very unique values will not harm anymore the basics inventory. Using formats will avoid a dataconversion. The nobs retrieval is direct from the sas-dataset header
That is where the challenge for further steps is. What are the identifying unique varaibles (reduce those) and wat are the measures in limited ranges.
step 4 (tabulate)
/* tabulate output dataset with formats */
/* procs - http://support.sas.com/documentation/cdl/en/proc/67916/HTML/default/viewer.htm#n1hpbwr9acrbmen1e2d6h... */
/* macro -http://support.sas.com/documentation/cdl/en/mcrolref/67912/HTML/default/viewer.htm#p09y28i2d1kn8qn1p...
/* resutlting order of variables in out-dataset is numeric char followed by the three tabinformations and statistic N */
proc tabulate data=&analyse_lloc..&analyse_data out=&analyse_lloc..&analyse_dr_miss ;
class _numeric_ _char_ /missing ;
format _numeric_ nonmiss. _char_ $nonmiss. ;
table ( _all_ ) (N * f=8.0 ) ;
run;
data _null_ ;
set &analyse_lloc..&analyse_data (obs=1);
array analnum {*} _numeric_ ; array analchar {*} _character_ ; /* same order as created _type_ is last */
numn=dim(analnum);
numc=dim(analchar);
call symput("analyse_numn", put(numn,8.) ) ;
call symput("analyse_numc", put(numc,8.) ) ;
run;
%put number of numeric vars: &analyse_numn character vars: &analyse_numc ;
proc datasets library=&analyse_lloc noprint nolist ;
modify &analyse_dr_miss ;
format _type_ $w. _page_ _table_ N best. ;
run;
The tabulate is createing a resulting dataset wiht a _type_ indicating the variable it refers to.
The Columns are still the original variable but the rows are having the analyses. N in this case for all varaibles as analysed in a class approach (reformatted vaules).
step 5 (analyse varmis)
/* http://support.sas.com/documentation/cdl/en/lestmtsref/68024/HTML/default/viewer.htm#p08do6szetrxe2n... */
/* https://support.sas.com/documentation/cdl/en/lefunctionsref/67960/HTML/default/viewer.htm#p18toxpk18... */
/* _type_ 00010000 single var is the position of the variable in the created dataset */
/* as the dataset got more variables, define new ones after teh set and array def, use previous definede sizes of of _type_ */
%let nums_add=3; /* correction of added number vars _table_ _page_ N */
data &analyse_lloc..&analyse_dr_vars (keep=varname vartype varmiss_pct) ;
set &analyse_lloc..&&analyse_dr_miss ;
array analnum {*} _numeric_ ; array analchar {*} _character_ ; /* same order as created _type_ is last */
length varname $32 ; /* sas limit 9.4, define after set as of ordering */
varpos=index(_type_,"1") ;
select ;
when (varpos = 0 ) ;
when (varpos < 0 ) ;
when (varpos <= &analyse_numn ) do;
vartype="NUM";
Var_recode= putn(analnum{varpos},"nonmiss.");
varname=vname(analnum(varpos));
Select ;
when ( Var_recode = "-.-" ) do;
varmiss_pct=N /&analyse_obs ;
IF ( varmiss_pct > .4 ) then Output &analyse_lloc..&analyse_dr_vars ;
end;
otherwise ;
end;
end;
when (varpos > &analyse_numn ) do;
vartype="CHR";
varpos = varpos-&analyse_numn ;
Var_recode= putc(analchar(varpos),"$nonmiss.");
varname=vname(analchar(varpos)) ;
Select ;
when ( Var_recode = "-.-" ) do;
varmiss_pct=N /&analyse_obs ;
IF ( varmiss_pct > .4 ) then Output &analyse_lloc..&analyse_dr_vars ;
end;
otherwise ;
end;
end;
otherwise put "errord internal " ;
end;
/* put _all_ ; */
run;
Having that information off all variables, working it back on inforamtion on each variable wiht some criteria
Multiple output dataset could be created dependent on next steps. Having those datasets as the selected / reviewed ones a variabel list (the keep) or Drop or could be a generated peice of code to be used in the analysis steps.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.