BookmarkSubscribeRSS Feed
kristiepauly
Obsidian | Level 7

I have data in an excel file that has ~80 variables starting with N, subject ID, deceased.....  There are multiple lines for subject ID as each follow up visit is added in another row and aligned under the 'follow up visit' variable.  My issue now is when I'm running reports, the number of "missing" is much higher than it should be because it is counting every repeated subject_id as missing.  For example, subject id AMC001 below, is showing sex is missing 4 times.  How can I get says to only look at the first subject id and ignore the rest if they are the same subject_id? I tried truncover without an input statement because I have ~80 variables and didn't want to list them all. I figured there has to be an easier way.

variables	N	subject_id	Deceased	Reason_for_Death	Sex	Ethnicity
	       12	   008IGG	        0		                  1	 
	       13	   009IGG	        0		                  1	 
	       14	   AMC001	        0		                  0           5
		          AMC001				
		          AMC001				
		          AMC001				
		          AMC001

data=datasets.redcap_pde1;
Infile miss truncover;
run;
5 REPLIES 5
PaigeMiller
Diamond | Level 26

Once you read this monstrosity of an Excel file into SAS, you can create a new data set using only the first record for each subject_id and then do your statistics on this new data set.

 

data want;
    set have;
    by subject_id;
    if first.subject_id;
run;
--
Paige Miller
Tom
Super User Tom
Super User

It is not clear in what form your data is now.  You mentioned EXCEL and then show code using INFILE which is not valid for reading a binary file like an Excel workbook.  

 

Let's assume you actually have an EXCEL file and have converted it into a SAS dataset using PROC IMPORT.

Let's also assume you have a variable like SUBJID_ID that is non-missing on every observation that ca be used to group the data.

 

In that case you could treat the whole file like a series of transactions using the UPDATE statement.  That will ignore missing values on the transaction records to that the existing value of the variable is not replaced.  So essentially values will be carried forward until a non-missing value appears for that variable.  If there are variables that you do not want carried forward the add them to the KEEP= option on the extra SET statement.

So here is an example using HAVE as the name of the existing dataset, SUBJECT_ID as the key variable, and VISIT_DATE as an example of a variable where you do not want carried forward.

data want;
  update have(obs=0) have;
  by subject_id;
  set have(keep=visit_date);
  output;
run;
kristiepauly
Obsidian | Level 7

@Tom Sorry, that was unclear.  Your assumptions were correct. The original data was an xlsx file and I imported to SAS. I tried your code and then ran a proc freq on MRI's but I'm still showing 471 'missing' . My total N is 116, so it is still counting the multiple entries for each subject_id.

data work.PDE_by_subjectid;
  update datasets.REDCap_PDE0(obs=0)datasets.REDCap_PDE0;
  by subject_id;
  set datasets.REDCap_PDE0; *(keep=redcap_event_name);
  output;
run;

*MRI performed;
Proc freq data=work.PDE_by_subjectid;
	tables MRI_performed /missing;
	format mri_performed $mri_performed.;
run;
*RESULTS; The FREQ Procedure MRI performed mri_performed Frequency Percent Cum Frequency Cum Percent 471 80.79 471 80.79 No 14 2.40 485 83.19 Yes 98 16.81 583 100.00
Tom
Super User Tom
Super User

Try it without that extra SET and see if you like the results. 

The point I was trying to make by including is that it might be that you only want to carry forward the non-missing values on SOME of the variables, like SEX, and not others. Having the extra SET statement re-read ALL of the variables like you did completely undoes anything the UPDATE statement might have done to carry forward values.  

 

Reeza
Super User

@kristiepauly wrote:

I have data in an excel file that has ~80 variables starting with N, subject ID, deceased.....  There are multiple lines for subject ID as each follow up visit is added in another row and aligned under the 'follow up visit' variable.  My issue now is when I'm running reports, the number of "missing" is much higher than it should be because it is counting every repeated subject_id as missing.  For example, subject id AMC001 below, is showing sex is missing 4 times.  How can I get says to only look at the first subject id and ignore the rest if they are the same subject_id? I tried truncover without an input statement because I have ~80 variables and didn't want to list them all. I figured there has to be an easier way.

variables	N	subject_id	Deceased	Reason_for_Death	Sex	Ethnicity
	       12	   008IGG	        0		                  1	 
	       13	   009IGG	        0		                  1	 
	       14	   AMC001	        0		                  0           5
		          AMC001				
		          AMC001				
		          AMC001				
		          AMC001

data=datasets.redcap_pde1;
Infile miss truncover;
run;

  • No = sign in data step statement required for this
  • Import Excel files with PROC IMPORT not a data step
  • Import CSV files with a data step or proc import
  • If you have a large file you do have to list all the variables and types, use the code generated from proc import (from the log) to make this quicker but if you're setting up a repeatable process you definitely want to use a data step not proc import. Proc import guesses at types and not always correctly so you can end up with issues.

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 1112 views
  • 0 likes
  • 4 in conversation