BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
robertrao
Quartz | Level 8

Hello,

 

I have multiple fields in a dataset to perform a QC test to check how many records are having missing values for that variable etc etc.

Some fields are numeric(. for missing value) and some fields are charecter("" for missing value).The ultimate goal is to show what percentage of the records are having unwanted values.

 

First I thought I will do a Proc Freq but did not quite understand how to get the TOTAL_RECORDS even though it gives us the unwanted_records and the percentages

Next, frequency would not give us a record if we do not have a unwanted record at all (Mortalityflag below)

 

Whats the easiest way to acheive this ???

Thank you

 

FIELD      UNWANTED_RECORDS   TOTAL_RECORDS    %UNWANTED

Age(num)                   5                                       550                           0.90           ---checking if age field is missing(.)

Gender(char)             25                                    550                           4.54           ---checking if gender is missing("")

Race(char)                 15                                     550                           2.72         --cehcking if race is missing("")

cost (num)                  10                                     550                           1.81        --checking if cost is missing(.)

Mortalityflag(char)       0                                      550                           0.00         --cehcking if they are dead( identified by "1")

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

And another way using informats:

Proc format library=work;
invalue nmisunwanted
  . = 1
  other=0
;
invalue sexunwanted
 'M','F'=0
 other=1
;
invalue oneunwanted
1=1
other=0
;
run;

data temp;
   informat Age nmisunwanted.  SEX sexunwanted.     Surgery oneunwanted.;
   input age sex surgery;
datalines;
54       M           1        
56     .             0
.         M           0 
99      .           0
;
run;

proc tabulate data=temp;
   var age sex surgery;
   table age sex surgery,
         sum='Unwanted'*f=best6.
         n='Total Records'
         mean='% unwanted'*f=percent8.2
   ;
run;

View solution in original post

10 REPLIES 10
ballardw
Super User

Add the option MISSING to the Tables statement in proc freq. Then Missing is treated as category and the count is included for perencentage calculations.

 

Proc freq data=have;

   tables _all_ / missing;

run;

robertrao
Quartz | Level 8

Heelo Ballard,

 

Thanks for the reply. The method you suggested is giving me the frequncy of all the values in that field including missing.

I want the frequency/Percent of only those records that are missing a value or unwanted value...

 

Thanks

Astounding
PROC Star

Do you have a variable UNWANTED in your data set with values of 0 or 1?  Otherwise how are  you counting the unwanteds?  You will have to show a bit of what is in your data here and how you know whether a value is wanted or unwanted.

robertrao
Quartz | Level 8

Hello Astounding,

 

I mentioned it in the table i listed but it might not have been very clear.

So UNWAANTED is not on my dataset and I wanted it in my output.

It requires a WHERE condition to check if a field has unwanted . Even if we do not have any then we need to show it as 0.

All fields may not have a blank value as UNWANTED as shown below (SURGERY is 0 or 1 but 1 is UNWANTED IN THIS case

 

Example:

 

Age   SEX     Surgery

54       M           1        

56      --             0

--         M           0 

99        --           0

 

 

 

OUTPUT

 

VAR       UNWANTED       TOTAL_RECORDS      %UNWANTED

Age                1                              4                              1/4*100

sex                  2                              4                             2/4*100

surgery            1                              4                             1/4*100

 

 

 

ballardw
Super User

And another way using informats:

Proc format library=work;
invalue nmisunwanted
  . = 1
  other=0
;
invalue sexunwanted
 'M','F'=0
 other=1
;
invalue oneunwanted
1=1
other=0
;
run;

data temp;
   informat Age nmisunwanted.  SEX sexunwanted.     Surgery oneunwanted.;
   input age sex surgery;
datalines;
54       M           1        
56     .             0
.         M           0 
99      .           0
;
run;

proc tabulate data=temp;
   var age sex surgery;
   table age sex surgery,
         sum='Unwanted'*f=best6.
         n='Total Records'
         mean='% unwanted'*f=percent8.2
   ;
run;
robertrao
Quartz | Level 8

Hello,

 

Thanks for all the replies. This is very helpful to me.

Could you please explain why you have used INVALUE and INFORMAT?

I have 0's and 1's which are charecter strings.. Do I need to convert all those variables to numeric for this proc tablulate code to summarize? or that is taken care in the formats??

 

Thank you

 

 

 

ballardw
Super User

@robertrao wrote:

Hello,

 

Thanks for all the replies. This is very helpful to me.

Could you please explain why you have used INVALUE and INFORMAT?

I have 0's and 1's which are charecter strings.. Do I need to convert all those variables to numeric for this proc tablulate code to summarize? or that is taken care in the formats??

 

Thank you

 

 

 


This is making a data set to use one specific approach to get a count (sum or 1 values) and percent (mean of a 0/1 coded variable)

Your example data showed that you have character values M for instance. You cannot sum them so create another dataset with the variables valued in a way that allows that approach. I use INFORMAT because 1) if you have similar rules then you can apply the same informat to multiple variables 2) it may force you to think of all of the values you may have to address 3) Informat rules are much easier when combining lists and ranges of values than writing a long nested If/then/else structure to assign the desired value ( consider if you wanted a specific variable with missing and values from 3 to 8 as unwanted) 4) Informats can be used easier with multiple data set

 

You didn't post data in the form of a data step to provide a more targetted solution.

 

Proc tabulate is more limited what it can do with character valued variables. By default a Class variable, the only option for characters in Tabulate, entire records are excluded if the class variable is missing. Tabulate breaks the data into multiple rows/columns for every (formatted) level of a class variable. Which would mean that the N is incorrect (only that Value not the variable) and the percent calculation would have the wrong denominator for your purpose. And to provide a table that looks like your desired output it is much easier to handle numeric only variables.

robertrao
Quartz | Level 8

Thank you Ballard. That was very detailed explanation. appreciate your time.

So I will first convert my variables to numeric and then follow your steps. Most/all of my variables in the data set have a value of  '0'/'1' (but charecters). i will also have to read proc tabulate to have overall percentafes of unwanted .

 

 

I was in the opinion, Invalue and informat is used while dealing only with data in the form of datalines?

 

Thanks again

 

ballardw
Super User

@robertrao wrote:

 

 

I was in the opinion, Invalue and informat is used while dealing only with data in the form of datalines?

 

Thanks again

 


The example I used with datalines was because you did not provide a data step for existing data and I had to make one. I combined the read and conversion in one step to demonstrate the count and percent calculation.

 

A custom informat is the same as any and can be used to create new dataset variables in a data step using existing data.

This example uses a sas supplied format but a custom format for character data could be used in place of F1.

data have;
   x='0';
run;

data want;
   set have;
   numx = input(x,f1.);
run;

 

Note that the variable on the left of =, numx in the example above, CANNOT be the name of an existing Character varialbe. SAS does not allow changing variable data type after it is created.

 

Existing numerics are only slightly trickier as INPUT wants a character value in the first postion and would look more like:

numvar = input(put(numvar,best5.),MyCustomInformat.); where the Best5 should create values that look like those you specifiy on the left part of the definitions of the Invalue .

Since we are talking about creating a different numeric value for an existing numeric variable this can use the same varaible. I would actually recommend some standard for a data checking program to use a consistent prefix for the variables used to receive the recoded values so that a variable list can be used to reference all of them. For instance if you create variables that have all start with CHK_ (and none of your existing variables start that way) then the tabulate code can use lists in the Var and table statement:

 

Var CHK_: ; <= all variables starting with CHK_ are Var varaibles

Table CHK_: * (sum n mean); <= all the variables will have the same statistics calculated.

 

For better use assign labels such as for the original variable.

label Chk_Age = 'Age'

         Chk_Sex = 'Sex'

         Chk_Something = 'Measure of something at occurence'

;

art297
Opal | Level 21

@robertrao: Here is one way:

proc format;
  value surgery
  1=.
  ;
run;

data have;
  infile cards dlm=',';
  input Age   sex  $   surgery;
  format surgery surgery.;
  cards;
54,       M,           1        
56,       ,             0
.,         M,           0 
99 , ,           0
;

proc transpose data=have out=need;
  var age sex surgery;
run;

proc sql noprint;
  select count(*) 
    into :total
      from have
  ;
quit;


data want (drop=_:col:);
  set need (rename=_name_=var);
  array cols $ col:;
  do over cols;
    if strip(cols) eq '.' then call missing(cols);
  end;
  unwanted=cmiss(of col:);
  total_records=&total.;
  pct_unwanted=100*unwanted/&total;
run;

proc print data=want noobs;
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 2766 views
  • 2 likes
  • 4 in conversation