BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jthy
Calcite | Level 5

I'm trying to create a view of a huge dataset and I want to keep it as small as possible.

It's a view of a SPD dataset and I create it with proc sql.

Instead of using SELECT * I would like to use list all variables that actually have a value in the huge SPD dataset, and I'm therefore desperately trying to find a simple way of listing either all columns which actually has a value on at least one row, or all columns that are empty on all rows.

I have a dstinct feeling that there must be a really simple way to find out...

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

How about:

data have;
   missing a b;
   input name$ a   b   c   d   e;
   cards;
 x 2 . . . .
 y . . 3 . .
 z . 2 . . .
 p . . . . .
 q . . . . .
;;;;
run;

proc sql noprint;
 select 'nmiss('||strip(name)||') as '||strip(name) 
  into : list separated by ',' 
   from dictionary.columns 
    where libname='WORK' and memname='HAVE';

 create table n_miss as
  select &list from have;
quit;
%let dsid=%sysfunc(open(work.have));
%let nobs=%sysfunc(attrn(&dsid,nobs));
%let dsid=%sysfunc(close(&dsid));
data _null_;
 set n_miss;
 length var_list $ 2000;
 array _miss{*} _numeric_;
 do i=1 to dim(_miss);
  if _miss{i} ne &nobs then do;
                           var_list=catx(',',var_list,vname(_miss{i}));
                           found=1;
                           end;
 end;
 if not found then do;
                     do j=1 to dim(_miss);
                     var_list=catx(',',var_list,vname(_miss{j}));
                     end;
                   end;
 call symputx('v_list',var_list); stop;
run;
%put &v_list; 
proc sql;
 create view want_view as 
  select &v_list from have;
quit;


Ksharp

View solution in original post

15 REPLIES 15
art297
Opal | Level 21

There are numerous ways to do it.  E.g., take a look at the paper at: http://support.sas.com/resources/papers/proceedings10/048-2010.pdf

Doc_Duke
Rhodochrosite | Level 12

jthy,

I don' think that the %DROPMISS macro that Art mentions will work for you.  It depends on PROC CONTENTS to provide the summary and the data it uses is not presented for database tables.  You couild copy the entire data into a SAS dataset and then apply the macto, but that is a bit of a kluge.

I've not used SPD, but most database products that I have worked with (Oracle, SQL/Server) have internal metadata tables that include the number of entries in each column.  You could query them to build your SQL SELECT statement.

However, I have to wonder if it is worth the bother.  If you are actually creating a "view", then there is not very much space used by the columns with no data.

Doc Muhlbaier

Duke

FriedEgg
SAS Employee

By SPD, I assume he means that he is using the Scalable Performance Data Engine, or Server (SPDE/SPDS).  These are both SAS products.  SPDE being a storage engine avaialble with SAS/BASE and SPDS being a individual product offering.  In both cases proc contents will work for your needs.  Additionally proc contents also works on all of the database products I have worked with: Oracle, SQL/Server, Terradata, Netezza, so I assume it would work with any SAS/ACCESS works with, but couldn't be certain.

Next point.  A table view does not really require space to store, so by limiting the variables will help save some cpu and other resources just like reading directly from the dataset it will not affect the disk space to store the view.  The SPDE in my experience requires slightly more space to store than the v9 engine.  Not a substantial difference though.  Typically I utilize SPDE libraries for extrememly large tables (exceeding several hundred GB).  These tables are usually accessed by many processes simultaneously by production operations and need the partitioned structure for better read access and I/O.  The structure of the reads in these situations becomes more random than sequential which being able to spread the dataset across multiple I/O channels becomes highly beneficial to performance.

I do not know of any simple way to gather stats on all columns in a dataset as far as their content in concerned.  If you need to look through a dataset looking for variables with no population you will need to actually read through the entire file and analyize the contents yourself.  You could use any of many, many methods to accomplish this.  I would probably use proc freq with special 'catch all' formats and look in the output sets for varaibles where only my no value has count>0, but there are many ways to accomplish this manually.

proc format

value $charallfmt

'' = 'no'

other = 'yes';

value numallfmt

.,0 = 'no'

other = 'yes';

run;

proc freq data=spdlib.spdset noprint;

tables myvar1 /out=myvar1;

tables myvar2 /out=myvar2;

/* etc... */

format _character_ $charallfmt. _numeric_ numallfmt.;

run;

/* etc... */

data_null__
Jade | Level 19

FriedEgg consider this generalization of your idea.  With PROC FREQ NLEVELS you don't need any of the individual frequency tables.

data have;

   missing a b;

   input name$ a   b   c   d   e;

   cards;

x 2 . . . .

y . . 3 . .

z . a . . .

p . . . b .

q . . . . .

;;;;

   run;

proc print;

   run;

proc format;

   value allmiss ._-.z=. other=1;

   value $allmiss ' '=' ' other='1';

   run;

ods select nlevels;

ods output nlevels=nlevels;

proc freq data=have nlevels;

   format  _character_ $allmiss. _numeric_ allmiss.;

   run;

ods output close;

data nlevels; /*NNonMissLevels will not exist when there are no variables with missing values*/

   retain NNonMissLevels -1;

   set nlevels;

   run;

%let allMiss=;

proc sql noprint;

   select tableVar into :allmiss separated by ' '

   from Nlevels where NNonMissLevels = 0;

   quit;

   run;

%put allMiss=&allMiss;

FriedEgg
SAS Employee

Thanks _null_, I had forgotten about nlevels! Smiley Happy

jthy
Calcite | Level 5

Many thanks for your input. Your assumption is correct. "SPD" should have been "SPDS" in my post.

And the sizing issue I'm struggeling with is not storage, it is more a question of making it the handling and viewing of the view easier by getting rid of the unused columns. When less than 50% of the columns are actually used, it is easier to work with the table if yu don't have to see the unused ones...

The reason to why only 50% of the columns are used: The tables come from a standard solution covering the whole banking sector and I work only with credit card business.

Peter_C
Rhodochrosite | Level 12

jthy

stats about indexed columns appear with proc contents.

For other columns there is no quick-and-easy option.

Three strategies come to mind.

1 proc freq with nLevels for numeric variables

2 for character variables

proc summary missing ;

class _character_ ;

ways 0 1 ;

output out=  frequencies ;

run ;

the output table frequencies will look a bit disorganised, but carries the info ;

data empties ;

set frequencies ;

if _type_ = 0 then whole + _freq_ ;

if _type_ ;

if _freq_ = whole ;

run ;

The empties data set defines columns with a constant value.

Those two methods analyse every variable on every observation.

The next doesn't.

3

this alternative is suitable when you expect that all, or most vars have content:

A data step to flag vars having non-missing value.

On each observation, check only those vars which have not yet had non-missing values.

Once all vars have been flagged, there is no need to check further.

If you get to end-of-data, unflagged vars are those with no content.

data_null__
Jade | Level 19

Peter.C wrote:

3

this alternative is suitable when you expect that all, or most vars have content:

A data step to flag vars having non-missing value.

On each observation, check only those vars which have not yet had non-missing values.

Once all vars have been flagged, there is no need to check further.

If you get to end-of-data, unflagged vars are those with no content.


I think this program address your method 3.  This is a mildly interesting example using a couple of associative arrays to hold the relevant information.  This program stops checking variables that have been shown to have a non missing values, and the program stops checking obs when all variables are determined to have non missing values.

data have;

   infile cards missover;

   missing a b;

   input name$ a   b   c   d   e f$ i;

   cards;

x 2 . . . .

y . . 3 . .

z . a . . .

p . . . b .

q . . . . .

;;;;

   run;

proc print;

   run;

%let syslast=sashelp.class;

%let syslast=have;

data _null_;

   if 0 then set _last_;

   length _dummyN_ 8 _dummyC_ $1;

   array _c

  • _character_;
  •    array _n

  • _numeric_;

       declare hash vr();

       declare hiter hvr('VR');

       vr.definekey('vName');

       vr.definedata('vName','vType','_i_');

       vr.definedone();

       vType = 'C';

       do _i_ = 1 to dim(_c)-1;

          vName    = vName(_c[_i_]);

          vr.add();

          end;

       vType = 'N';

       do _i_ = 1 to dim(_n)-1;

          vName    = vName(_n[_i_]);

          vr.add();

          end;

       *vr.output(dataset:'vr1');

       declare hash vrem();

       declare hiter hvrem('VREM');

       vrem.definekey('_j_');

       vrem.definedata('vName');

       vrem.definedone();

       do until(eof);

          set _last_ end=eof;

          _j_ = 0;

          vrem.clear();

          do rc=hvr.first() by 0 while(rc eq 0);

             select(vtype);

                when('C') if not missing(_c[_i_]) then do _j_=_j_+1; vrem.add(); end;

                when('N') if not missing(_n[_i_]) then do _j_=_j_+1; vrem.add(); end;

                otherwise;

                end;

             rc = hvr.next();

             end;

          if vrem.NUM_ITEMS then do rc=hvrem.first() by 0 while(rc eq 0);

             vr.remove();

             rc = hvrem.next();

             end;

          if vr.NUM_ITEMS eq 0 then leave;

          end;

       vr.output(dataset:'AllMissingVars');

       stop;     

       run;

    proc print data=vr1;

    proc print data=AllMissingVars;

       run;

    MikeZdeb
    Rhodochrosite | Level 12

    Hi ... you can use the NLEVELS option in PROC FREQ to see if any variable has all

    non-missing (or all missing) values.  If there are no missing data for any of the variables,

    the LOG for PROC SQL will show ...

    ERROR: The following columns were not found in the contributing tables: nnonmisslevels.

    If there are no variables with all missing values, PROC SQL is a waste of time since you just copy the

    data set.  You can try ...

    proc print data=tables;

    where nnonmisslevels eq 0;

    run;

    after PROC FREQ and if you see ...

    NOTE: No observations were selected from data set WORK.TABLES.

    stop there since every variable has some non-missing value(s).

    data test;

    input name : $10. gender : $1.

          age height weight;

    datalines;

    MIKE . 21 . .L 200

    MARY . .Z . .H 120

    MARK . 45 . .H 110

    ;

    run;

    proc format;

    value nm  low - high = 'OK'    other    = . ;

    value $ch    ' '  = ' '   other = 'OK';

    run;

    ods listing close;

    ods output nlevels=tables;

    proc freq data=test nlevels;

    format _numeric_ nm.  _character_ $ch.;

    run;

    ods output close;

    ods listing;

    proc sql noprint;

    select tablevar into :keeplist separated by ','

    from tables

    where nnonmisslevels ne 0;

    create table new as

    select &keeplist

    from test;

    quit;

    proc print data=new noobs;

    run;

    name    age

    MIKE     21

    MARY      Z

    MARK     45

    ps ... courtesy of data _null_'s postings about NLEVELS on SAS-L

    Ksharp
    Super User

    How about:

    data have;
       missing a b;
       input name$ a   b   c   d   e;
       cards;
     x 2 . . . .
     y . . 3 . .
     z . 2 . . .
     p . . . . .
     q . . . . .
    ;;;;
    run;
    
    proc sql noprint;
     select 'nmiss('||strip(name)||') as '||strip(name) 
      into : list separated by ',' 
       from dictionary.columns 
        where libname='WORK' and memname='HAVE';
    
     create table n_miss as
      select &list from have;
    quit;
    %let dsid=%sysfunc(open(work.have));
    %let nobs=%sysfunc(attrn(&dsid,nobs));
    %let dsid=%sysfunc(close(&dsid));
    data _null_;
     set n_miss;
     length var_list $ 2000;
     array _miss{*} _numeric_;
     do i=1 to dim(_miss);
      if _miss{i} ne &nobs then do;
                               var_list=catx(',',var_list,vname(_miss{i}));
                               found=1;
                               end;
     end;
     if not found then do;
                         do j=1 to dim(_miss);
                         var_list=catx(',',var_list,vname(_miss{j}));
                         end;
                       end;
     call symputx('v_list',var_list); stop;
    run;
    %put &v_list; 
    proc sql;
     create view want_view as 
      select &v_list from have;
    quit;
    
    
    

    Ksharp

    jthy
    Calcite | Level 5

    Many thanks! It works!

    data_null__
    Jade | Level 19

    Using CODEGEN to generate SQL NMISS expressions is a acceptable solution assuming the generated code fits in a macro variable. I would not expect the macro variable length to be a problem too often. 

    I don't understand why you us SAS File I/O to create a macro variable to hold a number that may not be correct and then use a complicated data step just to create a list of names.  I would count the observations while counting the NMISS.  Counting will correct for any "REMOVEd" observations.  To further simplify the the creation of the variable list, data N_MISS could be transposed to easily create a list of names to keep or a list of names to drop.  Consider the following modification to your program.

    data have;

       missing a b;

       input name$ a   b   c   d   e;

       cards;

    x 2 . . . .

    y . . 3 a .

    z . 2 . . .

    p . . . . .

    q . . . . .

    ;;;;

    run;

    %let LIST=;

    proc sql noprint;

       select cat('nmiss(',strip(name),') as ',strip(name)) into :list separated by ','

          from dictionary.columns where libname eq 'WORK' and memname eq 'HAVE';

       %put NOTE: LIST=&LIST;

       create table n_miss as select count(*) as nobs, &list from have;

       quit;

       run;

    proc transpose data=n_miss out=n_miss;

       by nobs;

       run;

    %let v_list=;

    proc sql noprint;

       select _name_ into :v_list separated by ','

          from n_miss

          where nobs ne col1;

       %put NOTE: V_LIST=&V_LIST;

       create view want_view as select &v_list from have;

       describe view want_view;

       quit;

       run;

    Ksharp
    Super User

    Null.

    You missed another demand of OP.

    or all columns that are empty on all rows.

    That is to mean the macro variable &v_list will be generated by two different situation.

    So this is to need some code to judge which situation.

    Ksharp

    data_null__
    Jade | Level 19

    I understand the requirements.  I do not understand your next statement. 

    "That is to mean the macro variable &v_list will be generated by two different situation.

    So this is to need some code to judge which situation."

    If all obs for all vars are missing you make a list of all vars.  For that situation the list becomes a DROP list not a KEEP list.  How can that be helpful? 

    sas-innovate-2024.png

    Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

    Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

     

    Register now!

    What is Bayesian Analysis?

    Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

    Find more tutorials on the SAS Users YouTube channel.

    Click image to register for webinarClick image to register for webinar

    Classroom Training Available!

    Select SAS Training centers are offering in-person courses. View upcoming courses for:

    View all other training opportunities.

    Discussion stats
    • 15 replies
    • 1959 views
    • 3 likes
    • 8 in conversation