BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
AieuYuhara
Fluorite | Level 6

Hi, my data have more than 1,000 variables and I need to check for missing and filling rate for each variable. 

My aim is to output a sas table which lists all variables, along with number of missing, non missing and the rate of missing and non missing. 

 

Sample data as per below:

data missingval;
length Country $ 10 TourGuide $ 10 Trip_Num 8;
input Country TourGuide Trip_Num;
datalines;
Japan Yamada 3
. Militello .
Australia Edney 4
Venezuela . 2
Brazil Cardoso .
;run;

Expected output:

Variable_NameVariable_TypeFill_NumMiss_NumFill_RateMiss_Rate
CountryChar410.80.2
TourGuideChar410.80.2
Trip_NumNum320.60.4

 

Really appreciate your help on this! 

1 ACCEPTED SOLUTION

Accepted Solutions
andreas_lds
Jade | Level 19

@Kurt_Bremser wrote:

Mind that both methods require you to have enough memory to store a number of variables equal to the number of observations in your original dataset during the transpose.


So, do you think that solution skipping transposing would be better?

That one got a bit lengthy, but i am out of time right now.

 

proc sql noprint;
   select quote(trim(Name)) into :varList separated by ','
      from sashelp.vcolumn
         where LibName = 'WORK' and MemName = 'MISSINGVAL'
   ;
quit;


data _null_;
   set missingval end=lastObs;

   length 
      Variable_Name $ 32
      Variable_Type $ 4
      Fill_Num Miss_Num 8
      _isMissing 8
   ;

   if _n_ = 1 then do;
      declare hash h(ordered:'yes');
      declare hiter bob('h');
      h.defineKey('Variable_Name');
      h.defineData('Variable_Name', 'Variable_Type', 'Fill_Num', 'Miss_Num');
      h.defineDone();

      do Variable_Name = &varList;
         Variable_Type = vtypex(Variable_Name);
         Fill_Num = 0;
         Miss_Num = 0;
         h.add();
      end;
   end;

   do Variable_Name = &varList;
      _rc = h.find();

      if Variable_Type = 'C' then do;
         _isMissing = missing(vvaluex(Variable_Name));
      end;
      else do;
         _isMissing = (strip(vvaluex(Variable_Name)) = '.');
      end;
      
      Fill_Num = sum(Fill_Num, (not _isMissing));
      Miss_Num = sum(Miss_Num, _isMissing);

      h.replace();
   end;

   if lastObs then do;
      h.output(dataset: 'want');
   end;
run;


data work.want;
   set work.want;

   length Fill_Rate Miss_Rate 8;

   Fill_Rate = Fill_Num / (Fill_Num + Miss_Num);
   Miss_Rate = Miss_Num / (Fill_Num + Miss_Num);
run;

proc print data=work.want;run;

 

View solution in original post

8 REPLIES 8
Kurt_Bremser
Super User

Use a double transpose, followed by a summarizing data step that merges the types in:

data missingval;
length Country $ 10 TourGuide $ 10 Trip_Num 8;
input Country TourGuide Trip_Num;
datalines;
Japan Yamada 3
. Militello .
Australia Edney 4
Venezuela . 2
Brazil Cardoso .
;

proc sql;
create table types as
select name as _name_, type
from dictionary.columns
where libname = 'WORK' and memname = 'MISSINGVAL';
quit;

proc transpose data=missingval out=trans;
var _all_;
run;

proc transpose data=trans out=freq;
by _name_;
var col:;
run;

data want (rename=(_name_=variable_name));
length _name_ $32; * prevents WARNING;
merge
  freq
  types
;
by _name_;
if first._name_
then do;
  fill_num = 0;
  miss_num = 0;
end;
if strip(col1) not in (' ','.')
then fill_num + 1;
else miss_num + 1;
if last._name_;
fill_rate = fill_num / (fill_num + miss_num);
miss_rate = miss_num / (fill_num + miss_num);
keep _name_ type fill_num miss_num fill_rate miss_rate;
run;

proc print data=want noobs;
run;

Result:

variable_                                    fill_    miss_
  name       type    fill_num    miss_num     rate     rate

Country      char        4           1        0.8      0.2 
TourGuide    char        4           1        0.8      0.2 
Trip_Num     num         3           2        0.6      0.4 
Jagadishkatam
Amethyst | Level 16

alternately please try

 

data missingval;
length Country $ 10 TourGuide $ 10 Trip_Num 8;
input Country TourGuide Trip_Num;
datalines;
Japan Yamada 3
. Militello .
Australia Edney 4
Venezuela . 2
Brazil Cardoso .
;
run;

proc transpose data=missingval out=test;
var Country TourGuide Trip_Num;
run;

data want;
set test;
array cols(*) col1-col5;
do i = 1 to dim(cols);
obs=sum(i);
if compress(cols(i))='.' then cols(i)=' ';
if compress(cols(i))^=' ' then cols(i)='1';
end;
miss_Num=cmiss(of col1-col5);
fill_Num=sum(of col1-col5);
fill_rate=fill_Num/obs;
miss_rate=miss_Num/obs;
drop col1-col5;
run;
Thanks,
Jag
Kurt_Bremser
Super User

Mind that both methods require you to have enough memory to store a number of variables equal to the number of observations in your original dataset during the transpose.

andreas_lds
Jade | Level 19

@Kurt_Bremser wrote:

Mind that both methods require you to have enough memory to store a number of variables equal to the number of observations in your original dataset during the transpose.


So, do you think that solution skipping transposing would be better?

That one got a bit lengthy, but i am out of time right now.

 

proc sql noprint;
   select quote(trim(Name)) into :varList separated by ','
      from sashelp.vcolumn
         where LibName = 'WORK' and MemName = 'MISSINGVAL'
   ;
quit;


data _null_;
   set missingval end=lastObs;

   length 
      Variable_Name $ 32
      Variable_Type $ 4
      Fill_Num Miss_Num 8
      _isMissing 8
   ;

   if _n_ = 1 then do;
      declare hash h(ordered:'yes');
      declare hiter bob('h');
      h.defineKey('Variable_Name');
      h.defineData('Variable_Name', 'Variable_Type', 'Fill_Num', 'Miss_Num');
      h.defineDone();

      do Variable_Name = &varList;
         Variable_Type = vtypex(Variable_Name);
         Fill_Num = 0;
         Miss_Num = 0;
         h.add();
      end;
   end;

   do Variable_Name = &varList;
      _rc = h.find();

      if Variable_Type = 'C' then do;
         _isMissing = missing(vvaluex(Variable_Name));
      end;
      else do;
         _isMissing = (strip(vvaluex(Variable_Name)) = '.');
      end;
      
      Fill_Num = sum(Fill_Num, (not _isMissing));
      Miss_Num = sum(Miss_Num, _isMissing);

      h.replace();
   end;

   if lastObs then do;
      h.output(dataset: 'want');
   end;
run;


data work.want;
   set work.want;

   length Fill_Rate Miss_Rate 8;

   Fill_Rate = Fill_Num / (Fill_Num + Miss_Num);
   Miss_Rate = Miss_Num / (Fill_Num + Miss_Num);
run;

proc print data=work.want;run;

 

Kurt_Bremser
Super User

Nice. One of the things that work fine with a hash. And it is equally free of any assumptions as my double transpose code.

And it will outperform the double transpose quite well with increasing dataset size.

hashman
Ammonite | Level 13

@andreas_lds:

It's actually not that lengthy, and because of the hash dynamics can be done in a single step:

data have ;                                                                                                                             
  input (country tourguide) (:$10.) trip_num ;                                                                                          
  cards ;                                                                                                                               
Japan     Yamada    3                                                                                                                   
.         Militello .                                                                                                                   
Australia Edney     4                                                                                                                   
Venezuela .         2                                                                                                                   
Brazil    Cardoso   .                                                                                                                   
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = country tourguide trip_num)  ;                                                                                        
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h () ;                                                                                                                     
    h.definekey  ("variable_name") ;                                                                                                    
    h.definedata ("variable_name", "variable_type", "fill_num") ;                                                                       
    h.definedone () ;                                                                                                                   
    dcl hiter hi ("h") ;                                                                                                                
  end ;                                                                                                                                 
  set have nobs = n end = z ;                                                                                                           
  array cc _char_ ;                                                                                                                     
  array nn _numeric_ ;                                                                                                                  
  length variable_name $ 32 variable_type $ 4 ;                                                                                         
  call missing (variable_name, variable_type) ;                                                                                         
  do over nn ;                                                                                                                          
    if h.find (key:vname(nn)) ne 0 then fill_num = 0 ;                                                                                  
    fill_num + not cmiss(nn) ;                                                                                                          
    h.replace (key:vname(nn), data:vname(nn), data:vtype(nn), data:fill_num) ;                                                          
  end ;                                                                                                                                 
  do over cc ;                                                                                                                          
    if h.find (key:vname(cc)) ne 0 then fill_num = 0 ;                                                                                  
    fill_num + not cmiss(cc) ;                                                                                                          
    h.replace (key:vname(cc), data:vname(cc), data:vtype(cc), data:fill_num) ;                                                          
  end ;                                                                                                                                 
  if z then do while (hi.next() = 0) ;                                                                                                  
    variable_type = ifc (variable_type eq "N", "Num", "Char") ;                                                                         
    miss_num = n - fill_num ;                                                                                                           
    miss_rate = divide (miss_num, n) ;                                                                                                  
    output ;                                                                                                                            
  end ;                                                                                                                                 
run ;                      

Kind regards

Paul D. 

AieuYuhara
Fluorite | Level 6

Thanks for your help!

 

All the codes given by others are working too, and i ran your codes, it works way faster, and I do like it when we dont have to specify length or dropping any variables, because it wont help when working with more than 1,000 variables. 

 

Thank you everyone!

Ksharp
Super User
data missingval;
length Country $ 10 TourGuide $ 10 Trip_Num 8;
input Country TourGuide Trip_Num;
datalines;
Japan Yamada 3
. Militello .
Australia Edney 4
Venezuela . 2
Brazil Cardoso .
;run;

data want;
 set sashelp.vcolumn(keep=libname memname name type
 where=(lowcase(libname)='work' and lowcase(memname)='missingval')) end=last;
 if _n_=1 then call execute('proc sql;create table want as');
 call execute(catt('select "',name,'" as name length=40,"',type,'" as type length=10,
 n(',name,') as fill_n,nmiss(',name,') as miss_n,
 calculated fill_n/count(*) as Fill_Rate,1- calculated Fill_Rate as Miss_Rate 
 from missingval'));
if last then call execute(';quit;');
 else call execute('union');
run;

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1397 views
  • 4 likes
  • 6 in conversation