Re: Numeric Field Reduction

AMcGill · Posted 09-09-2015 05:12 AM

Hi, I am using SAS Enterprise Guide and a colleague showed me a paper (http://www.lexjansen.com/nesug/nesug13/90_Final_Paper.pdf) on a macro to drop character variables if the number of levels are the same (NLevels=1) . I was wondering if anyone knew how to do something similar, but for numeric variables? And as an added extra, one of the tables produced by the code from the above paper gives the number of distinct levels per variable. Does anyone know how to incorporate into the code say if there we had a variable Age but 80% of the observations were assigned to Age=10 and remaining 20% to Age = 12 etc.. then drop? Any advice would be greatly appreciated. Thanks.

jakarman · Posted 09-09-2015 11:31 AM

In that document proc means is used for numeric vars.

Another solution could be proc tabulate using the count (N) . Having some information about variables (varname) and counts a direct file or execute call could do the work defining needed lists ina many ways. What is your problem you are trying to solve?

---->-- ja karman --<-----

AMcGill · Posted 09-10-2015 05:13 AM

Basically, I am dealing with a very large dataset and wanted to find a way of reducing the variables without having to know the variable names for a drop statement. My issue is with the proc means in the document is that my dataset contains character strings in numeric fields, so for these variables a value of 0 returns and I could drop all 0s but I actually don't want to drop these. Is there such a thing of finding out if observations in say variable 1 are similar to observations in variable 2, but drop the least populated variable or something like that? That may deal with my issue when trying to drop numeric vars.

ChrisHemedinger · Posted 09-10-2015 08:07 AM

There is probably a more elegant way to do this than my solution, but here are my steps:

Use PROC MEANS to create an output data set of the summary stats (N, Min, Max, Mean, StdDev -- can add others if you want)
Transpose the MEANS output to be "row major" so that each variable summary is on its own record.
Use PROC SORT to identify the duplicate variables and place them in a different data set.

Example code:

/* create sample with a duplicate column */
data class;
  set sashelp.class;
  xDupHeight = Height;
  zDupAge = age;
run;

/* create summary stats for all numeric columns */
proc means data=class noprint;
  output out=checkdup  ;
run;

/* pivot the summary stats to be "row major" */
proc transpose data=checkdup (drop=_type_ _freq_)
	out=work.transposed
	prefix=stat
	name=source
	label=label
;
	id _stat_;
	var _numeric_;
run;

/* identify the duplicates */
proc sort data=transposed nodupkey dupout=duplicates;
  by _numeric_;
run;

In my example, this is the result showing the duplicate numeric variables.

Chris

SAS Hackathon registration is open! Build your skills. Make connections. Enjoy creative freedom. Maybe change the world.

jakarman · Posted 09-12-2015 05:26 AM

This is seeming to be some data-mining / analytics at :

1- getting knowing your data

2- reduce the number of vars by eliminating combine by correlations and for the meaningless.

That is the question topic.

3- analysing

4 -...

With the growing number of datasources some automatisation could be done. I am missing the step1 and the correlation/combing options. I could give some ideas to automatise steps.

step1: (allocs)

libname test "/folders/myfolders/test";
filename test "/folders/myfolders/test";
filename macro "/folders/myfolders/macro";
filename work "%sysfunc(pathname(work))";
filename tmp "/tmp" ;

%let analyse_dr_vars=Class_var;
%let analyse_dr_miss=Class_sum;
%let analyse_data=Class;
%let analyse_Lloc=test;

Running the code classic in one the same OS process it is the first step. Wanting tod do the next steps in parallel, this will become some startup/autoexec code.

step 2: (create_test data).

data &analyse_lloc..&analyse_data ;
   set sashelp.class ;
   if (_n_ < 8 ) then do;
      dummy_num2=1 ;
      dummy_chr2="A" ;
   end;
   else do;
      dummy_num2=. ;
      dummy_chr2=" " ;
   end;
   dummy_num1= .;
   dummy_chr1="" ;
run;

You could define or use your own data as you like.

Take care about the missings and nulls difference (they are not teh same) wiht a RDBMS.

step 3: (create nobs formats)

/* get no obs */
/* https://support.sas.com/documentation/cdl/en/lefunctionsref/67960/HTML/default/viewer.htm#p12uchp7hm... */
/* http://support.sas.com/documentation/cdl/en/mcrolref/67912/HTML/default/viewer.htm#p1o13d7wb2zfcnn19... */

options mprint mlogic symbolgen source2 ;

%macro nobsdsn (dsn=,m_obs=,m_var= ) ;
   %global &m_obs &m_var ;
   %let dsid= %sysfunc(open(&dsn)) ;
   %if &dsid %then    %do;
     %let &m_obs = %sysfunc(attrn(&dsid,NOBS));
     %let &m_var = %sysfunc(attrn(&dsid,NVARS));
     %let rc=%sysfunc(close(&dsid)) ;
   %end;
   %else %do;
     %put Open for data set &dset failed - %sysfunc(sysmsg());
   %end;
%mend ;

/* https://support.sas.com/documentation/cdl/en/proc/67916/HTML/default/viewer.htm#n1c16dxnndwfzyn14o1k... */
proc format ;
       value nonmiss ( Default=3 )
             . = "-.-"
       low-high = "VNM" ;
       value $ nonmiss ( Default=3 )
             " " = "-.-"
       other = "VCH" ;
run;

%nobsdsn(dsn=&analyse_lloc..&analyse_data,m_obs=analyse_obs,m_var=analyse_var ) ;

This one is reducing alls possible values into just two. unique keys and other very unique values will not harm anymore the basics inventory. Using formats will avoid a dataconversion. The nobs retrieval is direct from the sas-dataset header

That is where the challenge for further steps is. What are the identifying unique varaibles (reduce those) and wat are the measures in limited ranges.

step 4 (tabulate)

/* tabulate output dataset with formats */
/* procs - http://support.sas.com/documentation/cdl/en/proc/67916/HTML/default/viewer.htm#n1hpbwr9acrbmen1e2d6h... */
/* macro -http://support.sas.com/documentation/cdl/en/mcrolref/67912/HTML/default/viewer.htm#p09y28i2d1kn8qn1p...

/* resutlting order of variables in out-dataset is numeric char followed by the three tabinformations and statistic N */

proc tabulate data=&analyse_lloc..&analyse_data out=&analyse_lloc..&analyse_dr_miss ;
class _numeric_ _char_ /missing ;
format _numeric_ nonmiss. _char_ $nonmiss. ;
table ( _all_ ) (N * f=8.0 ) ;
run;

data _null_ ;
set &analyse_lloc..&analyse_data (obs=1);
    array analnum {*} _numeric_ ; array analchar {*} _character_ ; /* same order as created _type_ is last */
    numn=dim(analnum);
    numc=dim(analchar);
    call symput("analyse_numn", put(numn,8.) ) ;
    call symput("analyse_numc", put(numc,8.) ) ;
run;
%put number of    numeric vars: &analyse_numn    character vars: &analyse_numc ;

proc datasets library=&analyse_lloc noprint nolist ;
modify &analyse_dr_miss ;
format _type_ $w. _page_ _table_ N best. ;
run;

The tabulate is createing a resulting dataset wiht a _type_ indicating the variable it refers to.
The Columns are still the original variable but the rows are having the analyses. N in this case for all varaibles as analysed in a class approach (reformatted vaules).

step 5 (analyse varmis)

/* http://support.sas.com/documentation/cdl/en/lestmtsref/68024/HTML/default/viewer.htm#p08do6szetrxe2n... */
/* https://support.sas.com/documentation/cdl/en/lefunctionsref/67960/HTML/default/viewer.htm#p18toxpk18... */

/* _type_ 00010000 single var is the position of the variable in the created dataset */
/* as the dataset got more variables, define new ones after teh set and array def, use previous definede sizes of of _type_ */
%let nums_add=3; /* correction of added number vars _table_ _page_ N */

data &analyse_lloc..&analyse_dr_vars (keep=varname vartype varmiss_pct) ;
set &analyse_lloc..&&analyse_dr_miss ;
array analnum {*} _numeric_ ; array analchar {*} _character_ ; /* same order as created _type_ is last */
length varname $32 ; /* sas limit 9.4, define after set as of ordering   */
varpos=index(_type_,"1") ;
select ;
when (varpos = 0 ) ;
when (varpos < 0 ) ;
when (varpos <= &analyse_numn ) do;
    vartype="NUM";
    Var_recode= putn(analnum{varpos},"nonmiss.");
    varname=vname(analnum(varpos));
    Select ;
    when ( Var_recode = "-.-" ) do;
    varmiss_pct=N /&analyse_obs ;
    IF ( varmiss_pct > .4 ) then Output &analyse_lloc..&analyse_dr_vars ;
    end;
    otherwise ;
    end;
end;
when (varpos > &analyse_numn ) do;
    vartype="CHR";
    varpos = varpos-&analyse_numn ;
    Var_recode= putc(analchar(varpos),"$nonmiss.");
    varname=vname(analchar(varpos)) ;
    Select ;
    when ( Var_recode = "-.-" ) do;
    varmiss_pct=N /&analyse_obs ;
    IF ( varmiss_pct > .4 ) then Output &analyse_lloc..&analyse_dr_vars ;
    end;
    otherwise ;
    end;
end;
   otherwise put "errord internal   " ;
end;

/* put _all_ ; */
run;

Having that information off all variables, working it back on inforamtion on each variable wiht some criteria

Multiple output dataset could be created dependent on next steps. Having those datasets as the selected / reviewed ones a variabel list (the keep) or Drop or could be a generated peice of code to be used in the analysis steps.

---->-- ja karman --<-----

Numeric Field Reduction