Identify columns with missing data - multiple datasets

Reply
Regular Contributor
Posts: 152

Identify columns with missing data - multiple datasets

[ Edited ]

Hello Experts,

I have got 100 SAS tables and they are all stored under same library.

I want to check if some of the columns are populated correctly(do a missing value check,)

there are about 8 columns I want to check however the 8 columns are not all presented in all the table.

,so some tables may have 2 out of the 8 columns,some may have 3 .

all the 8 columns I want to check are in numeric value.

 

The output table I want to have should have the following columns

The table name that has been checked

the library name the table stored 

column name that has at least one missing value

 

I hope I made my question clear.

 

 

THanks

Super User
Posts: 19,800

Re: Identify columns with missing data - multiple datasets

What's are you counting as a missing value, a single value or all values in the column? Are all the columns numeric? Or character? A mix?

 

Are the datasets consistent in any manner, same number of variables, similar number of rows? Is there a naming convention? Are those datasets the only ones in the library or are there others that don't need to be processed. 

 

The process here is : develop code for one dataset. Take code and make it dynamic using SASHELP.VTABLE or information from PROC CONTENTS/DATASETS.

 

Once you get that working, use SASHELP.VTABLE to drive the macro to run it for each dataset. Or use DOSUBL.  If you're new to macro's/automation I wouldn't recommend DOSUBL personally. 

 

 

Regular Contributor
Posts: 152

Re: Identify columns with missing data - multiple datasets

yes, the dataset are consistent and always has the columns. however the column can be either numeric or date or character format.
Super User
Posts: 7,782

Re: Identify columns with missing data - multiple datasets


gyambqt wrote:
yes, the dataset are consistent and always has the columns. however the column can be either numeric or date or character format.

You have yet to answer ~80% of @Reeza's questions.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Regular Contributor
Posts: 152

Re: Identify columns with missing data - multiple datasets

I modified my question hope it is more clear.

i build the code my self but I want to see if experts have better way of handling this problem.

sorry can't share my code as I am typing using my phone.

Super User
Posts: 19,800

Re: Identify columns with missing data - multiple datasets

@gyambqt Please use a more descriptive title in the future. This forum is intended as a resource to both answer your current question and for future readers to find solutions to problems they have that are similar. Additionally, taking the time to both detail your question will save the time it takes me to ask all the questions and then you still have to answer them, rather than just being able to answer your question. 

 

Regular Contributor
Posts: 152

Re: Identify columns with missing data - multiple datasets

sure, I will notice that in the future.

 

Super User
Super User
Posts: 7,955

Re: Identify columns with missing data - multiple datasets

Quick and dirty example:

data work.test1;
  a=1;
  b=4;
  c="1";
run;

data work.test2;
  a=.;
  b="xyz";
  c="RTY";
run;

data work.test3;
  a=9;
  b="";
  c="";
run;

proc sql;
  create table RESULTS (DSNAME char(200),VAR char(200),TEST char(200));
quit;

data _null_;
  set sashelp.vtable (where=(libname="WORK" and substr(memname,1,4)="TEST"));
  call execute('ods output onewayfreqs=onewayfreqs_'||strip(memname)||';
                proc freq data='||strip(memname)||';
                  tables _all_;
                run;
                data onewayfreqs_'||strip(memname)||' (keep=dsname var result);
                  set onewayfreqs_'||strip(memname)||';
                  dsname="'||strip(memname)||'";
                  var=table;
                  result=ifc(percent=100,"Non Missing","Missing");
                run;
                proc sql; insert into RESULTS select * from ONEWAYFREQS_'||strip(memname)||'; quit;');
run;

Regular Contributor
Posts: 152

Re: Identify columns with missing data - multiple datasets

Hi expert,  if any of my table contain a mixture of character and numeric variable and I only want to test if any of the 8 columns has missing value. how do I do that?

for example: I have 100 columns in one of my table A and 3 of the 8 columns are presented there and I want to test if the 3 columns are missing.

Regular Contributor
Posts: 152

Re: Identify columns with missing data - multiple datasets

Also I found your code will check every single records. but if my data contain 100 millitons records it will perform a proc fre for all 100millions records and return a result which is not that efficent. Is there any way to improve the efficency?

 

 

Super User
Super User
Posts: 7,955

Re: Identify columns with missing data - multiple datasets

So what you are saying is, 

You have a 100 datasets which are 100million records, you don't know exactly what the structure is in each, and you want a general program to work with this?  This is not a Q&A then.  First you need a system to be able to work with big data, without the resources no code will process 100*100million in any good time.  Secondly you need to map out the structure of your data so that you know what is in each, formulate a plan that consistently work across the data, proc freq was given in the example as SAS provided, compiled functions will operate faster than any code you write by hand.  But what you can do is, once you have a list of variables in each dataset, filter freq for only those:

data work.test1;
  a=1;
  b=4;
  c="1";
run;

data work.test2;
  a=.;
  b="xyz";
  c="RTY";
run;

data work.test3;
  a=9;
  b="";
  c="";
run;

proc sql;
  create table RESULTS (DSNAME char(200),VAR char(200),TEST char(200));
quit;

data cols (keep=memname var_list);
  set sashelp.vcolumn (where=(libname="WORK" and substr(memname,1,4)="TEST" and NAME in ("a","b")));
  by libname memname;
  length var_list $2000;
  retain var_list;
  var_list=ifc(first.memname,name,catx(' ',var_list,name));
  if last.memname then output;
run;

data _null_;
  merge sashelp.vtable (where=(libname="WORK" and substr(memname,1,4)="TEST") in=a)
        work.cols (in=b);
  by memname;
  if a;
  call execute('ods output onewayfreqs=onewayfreqs_'||strip(memname)||';
                proc freq data='||strip(memname)||' (keep='||strip(var_list)||');
                  tables _all_;
                run;
                data onewayfreqs_'||strip(memname)||' (keep=dsname var result);
                  set onewayfreqs_'||strip(memname)||';
                  dsname="'||strip(memname)||'";
                  var=table;
                  result=ifc(percent=100,"Non Missing","Missing");
                run;
                proc sql; insert into RESULTS select * from ONEWAYFREQS_'||strip(memname)||'; quit;');
run;

Note I create a dataset cols which contains only the variables I want within the list of A,B.

Valued Guide
Posts: 2,177

Re: Identify columns with missing data - multiple datasets

I wrote a routine which searched for empty columns in a data step. (It was enhanced to restrict the search to a subset of columns). Once a column has a non-empty cell, it no longer needs checking. In this way non-empty columns are quickly eliminated from investigation. A table pass stops as soon as all columns show non-missing (need not be all on the same row). Unfortunately, tables with even just one empty column, demand a complete pass.
There seems no way to avoid the I/O problem for large tables.
Valued Guide
Posts: 2,177

Re: Identify columns with missing data - multiple datasets

 
Ask a Question
Discussion stats
  • 12 replies
  • 250 views
  • 0 likes
  • 5 in conversation