BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.

I have a datasets which have combination of numeric as well as character variables. I would like to know if there is any easy way to find out if 'all' the values of that variable are missing. 

 

Thanks in advance!

 

- Dr. Abhijeet Safai

Dr. Abhijeet Safai
Certified Base and Clinical SAS Programmer
Associate Data Analyst
Actu-Real
1 ACCEPTED SOLUTION

Accepted Solutions
DrAbhijeetSafai
Pyrite | Level 9

Many thanks to @ballardw , @Mazi , @data_null__ @yabwon @tom_grant , @PaigeMiller and all others for your response.

 

I was able to solve the issue but not easy way but by macro way. Somehow finally I found that the approach by macro will be better for me so I went ahead with that and the issue is solved. I would share the steps of what I did below.

 

  1. First I used cmiss function (works for numeric too) to find out if particular value is missing in a variable.
  2. Then by creating another variable I found accumulating total of that variable (value of cmiss is 1 if the value is missing and 0 if it it not missing). In this way now I have count of number of values that are missing in a variable.
  3. I kept the last row of this dataset (by end= option) and kept only those variable which will have counts of missing values.
  4. By proc contents I found out how many rows are there in the given dataset. 
  5. I joined that dataset of proc contents with this dataset of one row.
  6. If the dataset contains say 168 observations and if there 168 missing values for a variable, that means all the values are missing for that variable. (And that is what I was searching for: variables having all missing values).
  7.  Once found, I created a flag for those variables for which number of missing values are number of records in the dataset. 
  8. All this I put in the macro which will do it for all the variables one by one by using do while loop.

 

And in this way the issue was solved. I would have liked to share the code for this but I will not be able to do it because I will need to type out the code again and that is very time consuming. 

 

Many thanks to all those who responded. It was a fun exercise for me. I thought I will be able to avoid macro but later I found that macro is a better way to solve it. As I found this answer working for me, I will mark this as accepted answer. 

 

Thank you.

 

- Dr. Abhijeet Safai

 

 

Dr. Abhijeet Safai
Certified Base and Clinical SAS Programmer
Associate Data Analyst
Actu-Real

View solution in original post

19 REPLIES 19
DrAbhijeetSafai
Pyrite | Level 9

Thanks @PaigeMiller . 

 

I was also looking at this. I found this one simpler. I am just trying to take all the variables out by using out= option in proc freq in the code below. Please look at the out= part which I have bolded.

 

/* create a format to group missing and nonmissing */
proc format;
 value $missfmt ' '='Missing' other='Not Missing';
 value  missfmt  . ='Missing' other='Not Missing';
run;
 
proc freq data=one; 
format _CHAR_ $missfmt.; /* apply format for the duration of this PROC */
tables _CHAR_ / out=test1 missing missprint nocum nopercent;
format _NUMERIC_ missfmt.;
tables _NUMERIC_ / missing missprint nocum nopercent;
run;

Thank you. 

 

- Dr. Abhijeet Safai

Dr. Abhijeet Safai
Certified Base and Clinical SAS Programmer
Associate Data Analyst
Actu-Real
PaigeMiller
Diamond | Level 26

Is there something wrong with the part you have in bold? If so, please explain.

--
Paige Miller
DrAbhijeetSafai
Pyrite | Level 9

Hi @PaigeMiller ,

 

When I used only out=test1, it is not showing all the variables in the dataset. The variables can be seen in the output window, but I was not able to see them in the dataset where I can further process them. To further process them, I need them in the dataset instead in output window. 

 

To find out which file is getting created at the background to store this information (which is available in output window), I used following code with ods trace on.

 

ods trace on;
ods output onewayfreqs=xxx;
proc freq
data=one; format _CHAR_ $missfmt.; /* apply format for the duration of this PROC */ tables _CHAR_ / out=test1 missing missprint nocum nopercent; run;
ods trace off;

As a result, I can now see the dataset in xxx but it is too complex to comprehend. It can be comprehended but there are other ways like macros to find out. I was thinking if there is any simpler way. I found the above approach simplest but that is not giving me information in dataset but only in output window and when I got the dataset which is creating that output window, it is again too complex.

 

Thank you.

 

- Dr. Abhijeet Safai 

Dr. Abhijeet Safai
Certified Base and Clinical SAS Programmer
Associate Data Analyst
Actu-Real
DrAbhijeetSafai
Pyrite | Level 9

Thanks @PaigeMiller .

 

I was searching some simple way. Thanks for your response but I was thinking if there is any other simple way. 

 

- Dr. Abhijeet Safai

Dr. Abhijeet Safai
Certified Base and Clinical SAS Programmer
Associate Data Analyst
Actu-Real
PaigeMiller
Diamond | Level 26

I'm not sure I understand your definition of "simple". The code has been written by others, you just need to plug it into your program, that seems pretty simple to me.

--
Paige Miller
Tom
Super User Tom
Super User

@DrAbhijeetSafai wrote:

Thanks @PaigeMiller .

 

I was searching some simple way. Thanks for your response but I was thinking if there is any other simple way. 

 

- Dr. Abhijeet Safai


A simple way is to use the NLEVELS output of PROC FREQ.

 

Example:

ods select none;
ods output nlevels=nlevels;
proc freq data=sashelp.cars nlevels;
  tables _all_ / noprint;
run;
ods select all;

proc print data=nlevels;
run;

Output:

Tom_0-1718194710894.png

So to find the variables that have only missing values look for those that have NNonMissLevels=0.

proc sql noprint;
select nliteral(tablevar) into :missvar_list separated by ' '
from nlevels
where nnonmisslevels=0
;
quit;

 

There are a couple of problems/issues with this method:

 

The biggest one is that for large datasets with variables that have a lot of distinct values PROC FREQ might not be able to complete the analysis due to lack of memory.

 

The second is just the nature of using ODS OUTPUT datasets.  They are built for DISPLAY and not really as DATA.  So which variables appear and how they are defined in the output dataset will depend on what is in the dataset being analyzed.  You can fix that by adding a second step that defines a consistent data structure.  For example by setting the variable name and label fields to the maximum value that SAS supports for those attributes.  And you can remove the unneeded formats that it attaches.

data nlevels;
  length TableVar $32 NLevels NMissLevels NNonMissLevels 8 TableVarLabel $256 ;
  set nlevels;
  format _all_;
run;

Tom_1-1718195685082.png

 

data_null__
Jade | Level 19

Using this or similar formats will solve the memory problem.  Still kind of slow for really large data sets. 

 

   proc format;
      value $_miss(default=1) ' '=' '   other='X';
      value  _miss(default=1) ._-.z=' ' other='X';
      quit;
yabwon
Onyx | Level 15

Just for fun:

%let size=100;
data have;
  array noMissC[&size.] $1 (&size.*"A");
  array noMissN[&size.]   (&size.*1)  ;
  array   MissC[&size.] $1            ;
  array   MissN[&size.]               ;
  do i = 1 to 5e6;
    output;
  end;
run;
/*
proc print;
run;
*/

options ls=max;

%macro anyNonMissingData(ds);
proc format;
  value missing
    ._-.z = " "
    other = "*"
  ;
  value $missing
    " " = " "
    other = "*"
  ;
run;
data _null_;
  if 0 then set &ds.;
  format _numeric_ missing. _character_ $missing.;
  length _NAME_ $ 32 _TYPE_ $ 1;
  declare hash _H_();
  _H_.defineKey("_NAME_");
  _H_.DefineDONE();

  do until(_E_);
    set &ds. end=_E_;

    do until(_NAME_='_NAME_');
      if _NAME_ NE '_NAME_' then 
        do;
          call vnext(_NAME_, _TYPE_, _N_);
          if NOT cmiss(vvaluex(_NAME_)) then rc=_H_.add();
        end;
    end;
  end;
  _H_.output(dataset:"work.onlyMissingData(where=((_NAME_ NE '_NAME_')))");
stop;
run;
%mend anyNonMissingData;

%anyNonMissingData(have)
proc print data = onlyMissingData;
run; 

Have is 8.47GB, log says:

NOTE: The data set WORK.ONLYMISSINGDATA has 201 observations and 1 variables.
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
      real time           1.45 seconds
      user cpu time       0.28 seconds
      system cpu time     1.17 seconds
      memory              1861.31k
      OS Memory           26652.00k

I'd say, not bad (both time and memory), but still 8GB is not a big data.

 

But we can't do to much about that "all missing" search, its always O(n*k) time process(n=#of obs, k=#of vars).

 

Bart

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



data_null__
Jade | Level 19

@yabwon this is really nice.  Great to see application of CALL VNEXT and VVALUEX

 


@yabwon wrote:

Just for fun:

%let size=100;
data have;
  array noMissC[&size.] $1 (&size.*"A");
  array noMissN[&size.]   (&size.*1)  ;
  array   MissC[&size.] $1            ;
  array   MissN[&size.]               ;
  do i = 1 to 5e6;
    output;
  end;
run;
/*
proc print;
run;
*/

options ls=max;

%macro anyNonMissingData(ds);
proc format;
  value missing
    ._-.z = " "
    other = "*"
  ;
  value $missing
    " " = " "
    other = "*"
  ;
run;
data _null_;
  if 0 then set &ds.;
  format _numeric_ missing. _character_ $missing.;
  length _NAME_ $ 32 _TYPE_ $ 1;
  declare hash _H_();
  _H_.defineKey("_NAME_");
  _H_.DefineDONE();

  do until(_E_);
    set &ds. end=_E_;

    do until(_NAME_='_NAME_');
      if _NAME_ NE '_NAME_' then 
        do;
          call vnext(_NAME_, _TYPE_, _N_);
          if NOT cmiss(vvaluex(_NAME_)) then rc=_H_.add();
        end;
    end;
  end;
  _H_.output(dataset:"work.onlyMissingData(where=((_NAME_ NE '_NAME_')))");
stop;
run;
%mend anyNonMissingData;

%anyNonMissingData(have)
proc print data = onlyMissingData;
run; 

Have is 8.47GB, log says:

NOTE: The data set WORK.ONLYMISSINGDATA has 201 observations and 1 variables.
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
      real time           1.45 seconds
      user cpu time       0.28 seconds
      system cpu time     1.17 seconds
      memory              1861.31k
      OS Memory           26652.00k

I'd say, not bad (both time and memory), but still 8GB is not a big data.

 

But we can't do to much about that "all missing" search, its always O(n*k) time process(n=#of obs, k=#of vars).

 

Bart

 


 

DrAbhijeetSafai
Pyrite | Level 9

@yabwon , Thanks a lot!

 

Macro and do while loop is the approach I used. Many thanks once again!

 

- Dr. Abhijeet Safai

Dr. Abhijeet Safai
Certified Base and Clinical SAS Programmer
Associate Data Analyst
Actu-Real
data_null__
Jade | Level 19

You don't need the frequency tables OUT=.  All you need is NLEVELS.  I believe this is about as simple as it can be.

 

proc format;
   value $_miss(default=1) ' '=' '   other='X';
   value  _miss(default=1) ._-.z=' ' other='X';
   quit;
data class;
   set sashelp.class;
   name = ' '||name;
   allmiss = ' ';
   x = .d;
   run;
proc freq nlevels;
   tables _all_ / noprint;
   format _character_ $_miss. _numeric_ _miss.;
   ods select nlevels;
   ods output nlevels=nlevels;
   run;
data nlevels;
   length TableVar $32 NLevels NMissLevels NNonMissLevels 8;
   set nlevels;
   run;
proc print;
   run;
proc sql noprint;
   select TableVar into :allmissvars separated by ' ' 
      from nlevels where nnonmisslevels eq 0;
   quit;
%put NOTE: &=allmissvars;
 

Capture.PNG

ballardw
Super User

@DrAbhijeetSafai wrote:

Thanks @PaigeMiller . 

 

I was also looking at this. I found this one simpler. I am just trying to take all the variables out by using out= option in proc freq in the code below. Please look at the out= part which I have bolded.

 

/* create a format to group missing and nonmissing */
proc format;
 value $missfmt ' '='Missing' other='Not Missing';
 value  missfmt  . ='Missing' other='Not Missing';
run;
 
proc freq data=one; 
format _CHAR_ $missfmt.; /* apply format for the duration of this PROC */
tables _CHAR_ / out=test1 missing missprint nocum nopercent;
format _NUMERIC_ missfmt.;
tables _NUMERIC_ / missing missprint nocum nopercent;
run;

Thank you. 

 

- Dr. Abhijeet Safai


The OUT= option of the Tables data set only outputs the LAST result.

You would have to make a call for each variable separately, with a separate output data set for each, for this to do what you want. Much better ways are shown.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 803 views
  • 20 likes
  • 8 in conversation