BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
kristiepauly
Obsidian | Level 7

Hi All, 

I have a large dataset with 10,198 observations.  I ultimately want to remove the observations from the analysis that have missing data for any of the four numeric variables: diabetes_P1 (1=yes/2=no), diabetes_P3 (1=yes/2=no), finalgold_P1 (-1 - 4), and finalgold_P3 (-1 - 4). If I could separate the N count by yes/no and the finalgold stages (-1 -4) that would be even better. My thought process was to write code for the number of observations for only these four variables (I assume these would all be 10,198 since I'm not removing any of the obs yet and I haven't figured out how to stratify even further by yes/no and -1 -4 stages).

PROC MEANS DATA=temp1 ;
  VAR diabetes_P1 diabetes_P3 finalgold_P1 finalgold_P3;  
  OUTPUT OUT=tempnew ;
RUN;

 this is what I got...

kristiepauly_0-1711219360405.png

That clearly didn't work so I just attempted code to remove the missing observations and see what my N was. 

******delete missing data*******************************************************************;
data tempnew;
	set temp1;
	if diabetes_P1 = '.' then delete;
	if diabetes_P3 ='.' then delete; 
	if finalgold_P1 = '.' then delete; 
	if finalgold_P3 ='.' then delete;
run; 

Proc means data=tempnew;
	VAR diabetes_P1 diabetes_P3 finalgold_P1 finalgold_P3; 
	run;

From the log:
NOTE: There were 5 observations read from the data set WORK.TEMPNEW.
NOTE: The data set WORK.TEMPNEW1 has 5 observations and 7 variables
 

I've screwed up which data sets I'm calling so I went back and changed those from 'tempnew' to temp1' and now in the log every variable is "uninitialized" and each "variable in the list does not match type prescribed for the list." 

Is there a way to 1. print the number of observations for select variables? 2. Remove the observations with missing data for any of the selected variables and then reprint the "new" number of observations. 

 

Side question: I have the original data set diabetes_kp.sas7bdat. I made a 'temp' data set so as to not overwrite the original. From there, I made another data file 'temp1' that I code in. I have used 'temp1' for everything I've coded thinking that all the changes would be made to one new file called 'temp1'. I don't think that is what's happening though given that every time I type 'data' numerous 'temp1' files are displayed to choose from. Do I now have an excessive number of temp1 files that each have only 1 change to the code made? Do I have to go back to the start and start from scratch with a new data file name since I've overwritten my temp1 file that had all my changes?  

data temp;
	set capstone.diabetes_kp;
	run;

data temp1;
	set temp;
	format age_P1 age_range_group.;
	run;

data temp1;
	set temp;
	format corsterinhal_P1 cortsterinhal_group.;
run;
etc. etc. with numerous other lines of code all using "data temp1;"

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Let's start with a some concepts that you may not quite be understanding.

When you do code like this:

data temp1;
	set temp;
	format age_P1 age_range_group.;
	run;

data temp1;
	set temp;
	format corsterinhal_P1 cortsterinhal_group.;
run;

The second data step result Temp1 completely replaces the one from the previous step.

Second you do not need to separate selecting observations from other steps like formatting or labeling variables.

If you want to add different default formats but not change your source data set then

data temp;
	set capstone.diabetes_kp;
        format age_p1 age_range_group.;
       format corsterinhal_p1 cortsterinhal_group.;
      /* one suspects a spelling issue the above line*/
     /* add as many formats in this step as you want, or variable labels,
        or manipulation code needed*/
run;

It is a good idea to use the proper forms for variables though SAS will attempt to help.

This treats those 4 variables as character causing notes about variable conversion in your log (you do read the logs don't you). Not every time that SAS applies such a conversion will work as desired. Your other information implies those variables a NUMERIC. Single variable comparison would be better as : "if diabetes_P1 =. then delete;"

Better, as it completely avoids the variable worry is the MISSING function: "if missing(diabetes_p1) then delete.

data tempnew;
	set temp1;
	if diabetes_P1 = '.' then delete;
	if diabetes_P3 ='.' then delete; 
	if finalgold_P1 = '.' then delete; 
	if finalgold_P3 ='.' then delete;
run; 

Then since the variables are numeric there is a function NMISS that will return the number of missing variables in a list allowing the above to be rewritten as:

data tempnew;
	set temp1;
	if nmiss(diabetes_P1, diabetes_P3 , finalgold_P1, finalgold_P3) = 0 then delete;
run; 

Which I would actually add the code creating your previous TEMP data set above all in one step adding the formats and selecting the records. You do want to preserve your starting data set because, especially while new to SAS, you can make logic errors that make it hard to determine what happened if you modify the same data set.

data temp;
   set capstone.diabetes_kp;
   format age_p1 age_range_group.;
   format corsterinhal_p1 cortsterinhal_group.;
   if nmiss(diabetes_P1, diabetes_P3 , finalgold_P1, finalgold_P3) = 0 then delete;
run;

 

Procedures like Proc means, summary, tabulate and report will all allow reporting on the number of NUMERIC values.

Example with a data set you should have in your installation for practice:

Proc means data=sashelp.class n;
   var height; /* place as many of the numeric variables you are interested here*/
run;

or Tabulate

Proc tabulate data=sashelp.class ;
   var height;
   table height, n;
run;

 

View solution in original post

3 REPLIES 3
PaigeMiller
Diamond | Level 26
data tempnew;
	set temp1;
	if diabetes_P1 = '.' then delete;
	if diabetes_P3 ='.' then delete; 
	if finalgold_P1 = '.' then delete; 
	if finalgold_P3 ='.' then delete;
run; 

 

This doesn't work because your variables are numeric, but you are comparing them to a character string. SAS doesn't allow this.

 

However this works:

 

data tempnew;
    set temp1;
    if nmiss(diabetes_p1,diabetes_p3,finalgold_p1,finalgold_p3)=0;
run;

 

 

"Reprint the number of observations": well the easiest way is to look in the log for the number of observations in tempnew.

 

Regarding your side question, there are times when creating multiple data sets is useful, but based on what I see, you need only create one data set.

 

data temp;
	set capstone.diabetes_kp;
	format age_P1 age_range_group.
	    corsterinhal_P1 cortsterinhal_group.;
run;

 

--
Paige Miller
ballardw
Super User

Let's start with a some concepts that you may not quite be understanding.

When you do code like this:

data temp1;
	set temp;
	format age_P1 age_range_group.;
	run;

data temp1;
	set temp;
	format corsterinhal_P1 cortsterinhal_group.;
run;

The second data step result Temp1 completely replaces the one from the previous step.

Second you do not need to separate selecting observations from other steps like formatting or labeling variables.

If you want to add different default formats but not change your source data set then

data temp;
	set capstone.diabetes_kp;
        format age_p1 age_range_group.;
       format corsterinhal_p1 cortsterinhal_group.;
      /* one suspects a spelling issue the above line*/
     /* add as many formats in this step as you want, or variable labels,
        or manipulation code needed*/
run;

It is a good idea to use the proper forms for variables though SAS will attempt to help.

This treats those 4 variables as character causing notes about variable conversion in your log (you do read the logs don't you). Not every time that SAS applies such a conversion will work as desired. Your other information implies those variables a NUMERIC. Single variable comparison would be better as : "if diabetes_P1 =. then delete;"

Better, as it completely avoids the variable worry is the MISSING function: "if missing(diabetes_p1) then delete.

data tempnew;
	set temp1;
	if diabetes_P1 = '.' then delete;
	if diabetes_P3 ='.' then delete; 
	if finalgold_P1 = '.' then delete; 
	if finalgold_P3 ='.' then delete;
run; 

Then since the variables are numeric there is a function NMISS that will return the number of missing variables in a list allowing the above to be rewritten as:

data tempnew;
	set temp1;
	if nmiss(diabetes_P1, diabetes_P3 , finalgold_P1, finalgold_P3) = 0 then delete;
run; 

Which I would actually add the code creating your previous TEMP data set above all in one step adding the formats and selecting the records. You do want to preserve your starting data set because, especially while new to SAS, you can make logic errors that make it hard to determine what happened if you modify the same data set.

data temp;
   set capstone.diabetes_kp;
   format age_p1 age_range_group.;
   format corsterinhal_p1 cortsterinhal_group.;
   if nmiss(diabetes_P1, diabetes_P3 , finalgold_P1, finalgold_P3) = 0 then delete;
run;

 

Procedures like Proc means, summary, tabulate and report will all allow reporting on the number of NUMERIC values.

Example with a data set you should have in your installation for practice:

Proc means data=sashelp.class n;
   var height; /* place as many of the numeric variables you are interested here*/
run;

or Tabulate

Proc tabulate data=sashelp.class ;
   var height;
   table height, n;
run;

 

kristiepauly
Obsidian | Level 7

Thank you @ballardw. Your explanation was very helpful.  I did as you suggested and all is working....mostly.  I will revisit it again tomorrow. 

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 3 replies
  • 831 views
  • 0 likes
  • 3 in conversation