Solved: Re: How do I list all unused columns in a dataset?

jthy · Posted 09-01-2011 09:37 AM

I'm trying to create a view of a huge dataset and I want to keep it as small as possible.

It's a view of a SPD dataset and I create it with proc sql.

Instead of using SELECT * I would like to use list all variables that actually have a value in the huge SPD dataset, and I'm therefore desperately trying to find a simple way of listing either all columns which actually has a value on at least one row, or all columns that are empty on all rows.

I have a dstinct feeling that there must be a really simple way to find out...

Ksharp · Posted 09-01-2011 11:46 PM

How about:

data have;
   missing a b;
   input name$ a   b   c   d   e;
   cards;
 x 2 . . . .
 y . . 3 . .
 z . 2 . . .
 p . . . . .
 q . . . . .
;;;;
run;

proc sql noprint;
 select 'nmiss('||strip(name)||') as '||strip(name) 
  into : list separated by ',' 
   from dictionary.columns 
    where libname='WORK' and memname='HAVE';

 create table n_miss as
  select &list from have;
quit;
%let dsid=%sysfunc(open(work.have));
%let nobs=%sysfunc(attrn(&dsid,nobs));
%let dsid=%sysfunc(close(&dsid));
data _null_;
 set n_miss;
 length var_list $ 2000;
 array _miss{*} _numeric_;
 do i=1 to dim(_miss);
  if _miss{i} ne &nobs then do;
                           var_list=catx(',',var_list,vname(_miss{i}));
                           found=1;
                           end;
 end;
 if not found then do;
                     do j=1 to dim(_miss);
                     var_list=catx(',',var_list,vname(_miss{j}));
                     end;
                   end;
 call symputx('v_list',var_list); stop;
run;
%put &v_list; 
proc sql;
 create view want_view as 
  select &v_list from have;
quit;

Ksharp

View solution in original post

art297 · Posted 09-01-2011 09:47 AM

There are numerous ways to do it. E.g., take a look at the paper at: http://support.sas.com/resources/papers/proceedings10/048-2010.pdf

Doc_Duke · Posted 09-01-2011 10:17 AM

jthy,

I don' think that the %DROPMISS macro that Art mentions will work for you. It depends on PROC CONTENTS to provide the summary and the data it uses is not presented for database tables. You couild copy the entire data into a SAS dataset and then apply the macto, but that is a bit of a kluge.

I've not used SPD, but most database products that I have worked with (Oracle, SQL/Server) have internal metadata tables that include the number of entries in each column. You could query them to build your SQL SELECT statement.

However, I have to wonder if it is worth the bother. If you are actually creating a "view", then there is not very much space used by the columns with no data.

Doc Muhlbaier

Duke

FriedEgg · Posted 09-01-2011 10:38 AM

By SPD, I assume he means that he is using the Scalable Performance Data Engine, or Server (SPDE/SPDS). These are both SAS products. SPDE being a storage engine avaialble with SAS/BASE and SPDS being a individual product offering. In both cases proc contents will work for your needs. Additionally proc contents also works on all of the database products I have worked with: Oracle, SQL/Server, Terradata, Netezza, so I assume it would work with any SAS/ACCESS works with, but couldn't be certain.

Next point. A table view does not really require space to store, so by limiting the variables will help save some cpu and other resources just like reading directly from the dataset it will not affect the disk space to store the view. The SPDE in my experience requires slightly more space to store than the v9 engine. Not a substantial difference though. Typically I utilize SPDE libraries for extrememly large tables (exceeding several hundred GB). These tables are usually accessed by many processes simultaneously by production operations and need the partitioned structure for better read access and I/O. The structure of the reads in these situations becomes more random than sequential which being able to spread the dataset across multiple I/O channels becomes highly beneficial to performance.

I do not know of any simple way to gather stats on all columns in a dataset as far as their content in concerned. If you need to look through a dataset looking for variables with no population you will need to actually read through the entire file and analyize the contents yourself. You could use any of many, many methods to accomplish this. I would probably use proc freq with special 'catch all' formats and look in the output sets for varaibles where only my no value has count>0, but there are many ways to accomplish this manually.

proc format

value $charallfmt

'' = 'no'

other = 'yes';

value numallfmt

.,0 = 'no'

other = 'yes';

run;

proc freq data=spdlib.spdset noprint;

tables myvar1 /out=myvar1;

tables myvar2 /out=myvar2;

/* etc... */

format _character_ $charallfmt. _numeric_ numallfmt.;

run;

/* etc... */

data_null__ · Posted 09-01-2011 12:12 PM

FriedEgg consider this generalization of your idea. With PROC FREQ NLEVELS you don't need any of the individual frequency tables.

data have;
   missing a b;
   input name$ a   b   c   d   e;
   cards;
 x 2 . . . .
 y . . 3 . .
 z . a . . .
 p . . . b .
 q . . . . .
;;;;
   run;
proc print;
   run;
 
 
proc format;
   value allmiss ._-.z=. other=1;
   value $allmiss ' '=' ' other='1';
   run;
 
ods select nlevels;
ods output nlevels=nlevels;
proc freq data=have nlevels;
   format  _character_ $allmiss. _numeric_ allmiss.;
   run;
ods output close;
 
data nlevels; /*NNonMissLevels will not exist when there are no variables with missing values*/
   retain NNonMissLevels -1; 
   set nlevels;
   run;
 
%let allMiss=;
proc sql noprint;
   select tableVar into :allmiss separated by ' '
   from Nlevels where NNonMissLevels = 0;
   quit;
   run;
%put allMiss=&allMiss;

FriedEgg · Posted 09-01-2011 12:44 PM

Thanks _null_, I had forgotten about nlevels!

jthy · Posted 09-02-2011 02:16 AM

Many thanks for your input. Your assumption is correct. "SPD" should have been "SPDS" in my post.

And the sizing issue I'm struggeling with is not storage, it is more a question of making it the handling and viewing of the view easier by getting rid of the unused columns. When less than 50% of the columns are actually used, it is easier to work with the table if yu don't have to see the unused ones...

The reason to why only 50% of the columns are used: The tables come from a standard solution covering the whole banking sector and I work only with credit card business.

Peter_C · Posted 09-01-2011 12:26 PM

jthy

stats about indexed columns appear with proc contents.

For other columns there is no quick-and-easy option.

Three strategies come to mind.

1 proc freq with nLevels for numeric variables

2 for character variables

proc summary missing ;

class _character_ ;

ways 0 1 ;

output out= frequencies ;

run ;

the output table frequencies will look a bit disorganised, but carries the info ;

data empties ;

set frequencies ;

if _type_ = 0 then whole + _freq_ ;

if _type_ ;

if _freq_ = whole ;

run ;

The empties data set defines columns with a constant value.

Those two methods analyse every variable on every observation.

The next doesn't.

3

this alternative is suitable when you expect that all, or most vars have content:

A data step to flag vars having non-missing value.

On each observation, check only those vars which have not yet had non-missing values.

Once all vars have been flagged, there is no need to check further.

If you get to end-of-data, unflagged vars are those with no content.

data_null__ · Posted 09-01-2011 02:47 PM

Peter.C wrote:

3

this alternative is suitable when you expect that all, or most vars have content:

A data step to flag vars having non-missing value.

On each observation, check only those vars which have not yet had non-missing values.

Once all vars have been flagged, there is no need to check further.

If you get to end-of-data, unflagged vars are those with no content.

I think this program address your method 3. This is a mildly interesting example using a couple of associative arrays to hold the relevant information. This program stops checking variables that have been shown to have a non missing values, and the program stops checking obs when all variables are determined to have non missing values.

data have;
   infile cards missover;
   missing a b;
   input name$ a   b   c   d   e f$ i;
   cards;
 x 2 . . . .
 y . . 3 . .
 z . a . . .
 p . . . b .
 q . . . . .
;;;;
   run;
proc print;
   run;
 
%let syslast=sashelp.class;
%let syslast=have;
data _null_;
   if 0 then set _last_;
   length _dummyN_ 8 _dummyC_ $1;
   array _c
 _character_;
   array _n
 
_numeric_;
 
   declare hash vr();
   declare hiter hvr('VR');
   vr.definekey('vName');
   vr.definedata('vName','vType','_i_');
   vr.definedone();
   vType = 'C';
   do _i_ = 1 to dim(_c)-1;
      vName    = vName(_c[_i_]);
      vr.add();
      end;
   vType = 'N';
   do _i_ = 1 to dim(_n)-1;
      vName    = vName(_n[_i_]);
      vr.add();
      end;
   *vr.output(dataset:'vr1');
 
   declare hash vrem();
   declare hiter hvrem('VREM');
   vrem.definekey('_j_');
   vrem.definedata('vName');
   vrem.definedone();
 
   do until(eof);
      set _last_ end=eof;
      _j_ = 0;
      vrem.clear();
      do rc=hvr.first() by 0 while(rc eq 0);
         select(vtype);
            when('C') if not missing(_c[_i_]) then do _j_=_j_+1; vrem.add(); end;
            when('N') if not missing(_n[_i_]) then do _j_=_j_+1; vrem.add(); end;
            otherwise;
            end; 
         rc = hvr.next();
         end;
      if vrem.NUM_ITEMS then do rc=hvrem.first() by 0 while(rc eq 0);
         vr.remove();
         rc = hvrem.next();
         end;
      if vr.NUM_ITEMS eq 0 then leave;
      end;
   vr.output(dataset:'AllMissingVars');
   stop;      
   run;
proc print data=vr1;
proc print data=AllMissingVars;
   run;

MikeZdeb · Posted 09-01-2011 12:35 PM

Hi ... you can use the NLEVELS option in PROC FREQ to see if any variable has all

non-missing (or all missing) values. If there are no missing data for any of the variables,

the LOG for PROC SQL will show ...

ERROR: The following columns were not found in the contributing tables: nnonmisslevels.

If there are no variables with all missing values, PROC SQL is a waste of time since you just copy the

data set. You can try ...

proc print data=tables;

where nnonmisslevels eq 0;

run;

after PROC FREQ and if you see ...

NOTE: No observations were selected from data set WORK.TABLES.

stop there since every variable has some non-missing value(s).

data test;

input name : $10. gender : $1.

age height weight;

datalines;

MIKE . 21 . .L 200

MARY . .Z . .H 120

MARK . 45 . .H 110

;

run;

proc format;

value nm low - high = 'OK' other = . ;

value $ch ' ' = ' ' other = 'OK';

run;

ods listing close;

ods output nlevels=tables;

proc freq data=test nlevels;

format _numeric_ nm. _character_ $ch.;

run;

ods output close;

ods listing;

proc sql noprint;

select tablevar into :keeplist separated by ','

from tables

where nnonmisslevels ne 0;

create table new as

select &keeplist

from test;

quit;

proc print data=new noobs;

run;

name age

MIKE 21

MARY Z

MARK 45

ps ... courtesy of data _null_'s postings about NLEVELS on SAS-L

Ksharp · Posted 09-01-2011 11:46 PM

How about:

data have;
   missing a b;
   input name$ a   b   c   d   e;
   cards;
 x 2 . . . .
 y . . 3 . .
 z . 2 . . .
 p . . . . .
 q . . . . .
;;;;
run;

proc sql noprint;
 select 'nmiss('||strip(name)||') as '||strip(name) 
  into : list separated by ',' 
   from dictionary.columns 
    where libname='WORK' and memname='HAVE';

 create table n_miss as
  select &list from have;
quit;
%let dsid=%sysfunc(open(work.have));
%let nobs=%sysfunc(attrn(&dsid,nobs));
%let dsid=%sysfunc(close(&dsid));
data _null_;
 set n_miss;
 length var_list $ 2000;
 array _miss{*} _numeric_;
 do i=1 to dim(_miss);
  if _miss{i} ne &nobs then do;
                           var_list=catx(',',var_list,vname(_miss{i}));
                           found=1;
                           end;
 end;
 if not found then do;
                     do j=1 to dim(_miss);
                     var_list=catx(',',var_list,vname(_miss{j}));
                     end;
                   end;
 call symputx('v_list',var_list); stop;
run;
%put &v_list; 
proc sql;
 create view want_view as 
  select &v_list from have;
quit;

Ksharp

jthy · Posted 09-02-2011 07:07 AM

Many thanks! It works!

data_null__ · Posted 09-02-2011 10:11 AM

Using CODEGEN to generate SQL NMISS expressions is a acceptable solution assuming the generated code fits in a macro variable. I would not expect the macro variable length to be a problem too often.

I don't understand why you us SAS File I/O to create a macro variable to hold a number that may not be correct and then use a complicated data step just to create a list of names. I would count the observations while counting the NMISS. Counting will correct for any "REMOVEd" observations. To further simplify the the creation of the variable list, data N_MISS could be transposed to easily create a list of names to keep or a list of names to drop. Consider the following modification to your program.

data have;
   missing a b;
   input name$ a   b   c   d   e;
   cards;
 x 2 . . . .
 y . . 3 a .
 z . 2 . . .
 p . . . . .
 q . . . . .
;;;;
run;
 
%let LIST=;
proc sql noprint;
   select cat('nmiss(',strip(name),') as ',strip(name)) into :list separated by ',' 
      from dictionary.columns where libname eq 'WORK' and memname eq 'HAVE';
   %put NOTE: LIST=&LIST;
   create table n_miss as select count(*) as nobs, &list from have;
   quit;
   run;
proc transpose data=n_miss out=n_miss;
   by nobs;
   run;
 
%let v_list=;
proc sql noprint;
   select _name_ into :v_list separated by ','
      from n_miss 
      where nobs ne col1;
   %put NOTE: V_LIST=&V_LIST;
 
   create view want_view as select &v_list from have;
   describe view want_view;
   quit;
   run;

Ksharp · Posted 09-02-2011 12:29 PM

Null.

You missed another demand of OP.

or all columns that are empty on all rows.

That is to mean the macro variable &v_list will be generated by two different situation.

So this is to need some code to judge which situation.

Ksharp

data_null__ · Posted 09-02-2011 01:19 PM

I understand the requirements. I do not understand your next statement.

"That is to mean the macro variable &v_list will be generated by two different situation.

So this is to need some code to judge which situation."

If all obs for all vars are missing you make a list of all vars. For that situation the list becomes a DROP list not a KEEP list. How can that be helpful?

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!