BookmarkSubscribeRSS Feed
GreggB
Pyrite | Level 9

How can I find columns that are completely blank in a data set that I created by importing a csv file? Running PROC FREQ on all variables is not practical due to file size.

13 REPLIES 13
noling
SAS Employee

This approach does 1 pass through your data with a datastep, and then some cleanup. If it usually takes 20 minutes for one pass through your data, this may take about as long. This assumes that your field names are all 26 chars or less (I add the text "_count" to the field names):

 

*my practice dataset;
data temp;
	set sashelp.cars;
	call missing(make, model, cylinders);
run;

*your table;
%let table_name=work.temp;

proc contents data=&table_name noprint out=work.contents;
run;
proc sql noprint;
	select name into :fields separated by ' '
	from work.contents;
quit;
%put &fields;

%macro find_missing;
	%let to_loop = %sysfunc(countw(&fields)); 
	data result;
		set &table_name end=last;
		retain %do i = 1 %to &to_loop;
				   %scan(&fields,&i)_count	
			   %end;
			   0;
		%do i = 1 %to &to_loop;
		   if not missing(%scan(&fields,&i)) then %scan(&fields,&i)_count +1;
	    %end;
		if last then output;
	run;
%mend;
%find_missing

*rotate dataset to make it look nice;
proc transpose data=result out=rotated(where=(COL1=0));
run;
*and restore normal var names.;
data final;
	set rotated;
	length field $32;
	*cut the "_count" off of <variable name>_count;
	field = substr(_name_,1,length(_name_)-6);
	keep field;
run;

Results are in the final dataset.


Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF

View now: on-demand content for SAS users

ballardw
Super User

ODS Select can restrict which tables are generated. The Proc Freq NLEVELS may be the briefest approach:

 

data example;
   do i= 1 to 10;
    x=.;
    output;
   end;
run;

ods select nlevels;
proc freq data=example nlevels;
run;

A variable with 0 Nonmissing Levels would have all values missing.

novinosrin
Tourmaline | Level 20

Sir @ballardw  once again you are proving you are genius. If you have a moment, can you offer a couple of notes on how that works nlevels. I just saw the doc and I got more confused. A request to you at your own convenience. Thanks

 

Also, would that work for char var?

ballardw
Super User

Basically the NLEVELS option creates a table of counts of unique values. In the example above:

          Number of Variable Levels

                         Missing    Nonmissing
Variable      Levels      Levels        Levels
----------------------------------------------
i                 10           0            10
x                  1           1             0

The variable I has 10 unique values (Levels) all non-missing, x has 1 unique value, with 1 missing level and no non-missing levels.

The "Missing Levels" can be important if you use the special missing .A through .Z.

 

In another example with data:

data example;
   do i= 1 to 10;
    x=.;
    output;
   end;
   do i= 1 to 10;
      if mod(i,3)=0 then x=.A;
      output;
      if mod(i,4)=0 then x=i;
      else x=.;
      output;
   end;
run;

The result would be:

          Number of Variable Levels

                         Missing    Nonmissing
Variable      Levels      Levels        Levels
----------------------------------------------
i                 10           0            10
x                  4           2             2

Which again shows I with 10 unique values even though they are repeated and no nonmissing. X now shows 4 levels because it has values of ., .A (special missing), 4 and 8. The missing levels shows 2 because of the use of .A as well as simple . missing, as well as showing that 2 of the levels are nonmissing.

 

This can also be used to check if the variable values are repeated by comparing to the number of records in the dataset. If a value should be unique and present for each record then the nonmissing levels should equal the number of records in a set (easily found in the Log from the number of observations read.

novinosrin
Tourmaline | Level 20

Thank you sir @ballardw 

data_null__
Jade | Level 19

Applying a FORMAT to all variable such that all non-missing values are considered equal should make it a bit faster.

 

data class;
	set sashelp.class;
   call missing(height,sex);
   run;
proc format ;
	value $missfmt    ' '="Missing" other="Not Missing";
	value missfmt  ._-.Z ="Missing" other="Not Missing";
   run;
ods select none;
proc freq data=class nlevels;
	format _numeric_ missfmt. _character_ $missfmt.;
   ods output nlevels=nlevels;
   run;
ods select all;
data nlevels;
   length TableVar $32 NLevels NMissLevels NNonMissLevels 8;  
   set nlevels;
   if NNonMissLevels eq 0 then Flag=1;
   run;
proc print;
   run;

Capture.PNG

Astounding
PROC Star

I would bet that imported variables that are always blank must be numeric.  So let's take the easy route:

 

proc summary data=have;

var _numeric_;

output out=maxvals (drop=_type_ _freq_) max=;

run;

 

That gives you the maximum value for each numeric variable, but uses the original variable name to hold the result.  There are a few structures that might be considered a "list".  One way to proceed from there:

 

data always_missing;

length var $ 32;

set maxvals;

array nums {*} _numeric_;

do _n_=1 to dim(nums);

   if nums{_n_} = . then do;

      var = vname(nums{_n_});

      output;

   end;

end;

run;

ballardw
Super User

@Astounding wrote:

I would bet that imported variables that are always blank must be numeric.  So let's take the easy route:

 

Most of the CSV I have imported using Proc Import with all missing values end up with $1. informats for those columns.

FreelanceReinh
Jade | Level 19

If the number of variables is large, but only a few of them are suspected to be completely or mostly blank, I would first draw a reasonably sized random sample from the dataset in order to determine those few "candidate" variables and then restrict the full analysis (e.g. with NLEVELS) accordingly.

novinosrin
Tourmaline | Level 20

Sir @FreelanceReinh   You've got me interested. 

"I would first draw a reasonably sized random sample from the dataset in order to determine those few "candidate" variables "

Can you explain this a bit further when you have some time plz

 

FreelanceReinh
Jade | Level 19

I was just thinking: If there were, say, 10 million observations with a total of 1000 variables, but only 5 of them were either sparsely populated or perhaps completely blank, why should we spend computing time to find out that the remaining 995 variables have 7654321, 5566778, ... distinct non-missing levels, respectively?

s_lassen
Meteorite | Level 14

You can do it with a datastep, like this:

data have;
  retain a . b 3 t 5 j ' ' name 'øfdø';
  do i=1 to 300;
    output;
    end;
run;

data missingvars;
  if 0 then set have;
  array nums _numeric_;
  array char _character_;
  type='N'; /* numeric */
  do _N_=1 to dim(nums);
    name=vname(nums(_N_));
    dsid=open(catt('have(where=(',name,' is not null))'));
    select(fetch(dsid));
      when(-1)  /* no data */
        output;
      when(0); /* data read OK */
      otherwise do;
        msg=sysmsg();
        put msg=;
        end;
      end;
    dsid=close(dsid);
    end;
  type='C'; /* numeric */
  do _N_=1 to dim(char);
    name=vname(char(_N_));
    dsid=open(catt('have(where=(',name,' is not null))'));
    select(fetch(dsid));
      when(-1)  /* no data */
        output;
      when(0); /* data read OK */
      otherwise do;
        msg=sysmsg();
        put msg=;
        end;
      end;
    dsid=close(dsid);
    end;
  stop;
  keep type name;
run;

This approach is probably relatively fast, as it just uses a where clause to look for non-missing values of each variable.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 9563 views
  • 12 likes
  • 9 in conversation