multiple lines of data for each subject

kristiepauly · Posted 07-29-2022 10:58 AM

I have data in an excel file that has ~80 variables starting with N, subject ID, deceased..... There are multiple lines for subject ID as each follow up visit is added in another row and aligned under the 'follow up visit' variable. My issue now is when I'm running reports, the number of "missing" is much higher than it should be because it is counting every repeated subject_id as missing. For example, subject id AMC001 below, is showing sex is missing 4 times. How can I get says to only look at the first subject id and ignore the rest if they are the same subject_id? I tried truncover without an input statement because I have ~80 variables and didn't want to list them all. I figured there has to be an easier way.

variables	N	subject_id	Deceased	Reason_for_Death	Sex	Ethnicity
	       12	   008IGG	        0		                  1	 
	       13	   009IGG	        0		                  1	 
	       14	   AMC001	        0		                  0           5
		          AMC001				
		          AMC001				
		          AMC001				
		          AMC001

data=datasets.redcap_pde1; 
Infile miss truncover;
run;

PaigeMiller · Posted 07-29-2022 11:02 AM

Once you read this monstrosity of an Excel file into SAS, you can create a new data set using only the first record for each subject_id and then do your statistics on this new data set.

data want;
    set have;
    by subject_id;
    if first.subject_id;
run;

--
Paige Miller

Tom · Posted 07-29-2022 11:07 AM

It is not clear in what form your data is now. You mentioned EXCEL and then show code using INFILE which is not valid for reading a binary file like an Excel workbook.

Let's assume you actually have an EXCEL file and have converted it into a SAS dataset using PROC IMPORT.

Let's also assume you have a variable like SUBJID_ID that is non-missing on every observation that ca be used to group the data.

In that case you could treat the whole file like a series of transactions using the UPDATE statement. That will ignore missing values on the transaction records to that the existing value of the variable is not replaced. So essentially values will be carried forward until a non-missing value appears for that variable. If there are variables that you do not want carried forward the add them to the KEEP= option on the extra SET statement.

So here is an example using HAVE as the name of the existing dataset, SUBJECT_ID as the key variable, and VISIT_DATE as an example of a variable where you do not want carried forward.

data want;
  update have(obs=0) have;
  by subject_id;
  set have(keep=visit_date);
  output;
run;

kristiepauly · Posted 07-29-2022 12:13 PM

@Tom Sorry, that was unclear. Your assumptions were correct. The original data was an xlsx file and I imported to SAS. I tried your code and then ran a proc freq on MRI's but I'm still showing 471 'missing' . My total N is 116, so it is still counting the multiple entries for each subject_id.

data work.PDE_by_subjectid;
  update datasets.REDCap_PDE0(obs=0)datasets.REDCap_PDE0;
  by subject_id;
  set datasets.REDCap_PDE0; *(keep=redcap_event_name);
  output;
run;

*MRI performed;
Proc freq data=work.PDE_by_subjectid;
	tables MRI_performed /missing;
	format mri_performed $mri_performed.;
run;

*RESULTS;
The FREQ Procedure
MRI performed
mri_performed	Frequency	Percent	Cum Frequency	Cum Percent
 	           471	         80.79	      471	       80.79
No	            14	          2.40	      485	       83.19
Yes	            98	          16.81	      583	      100.00

Tom · Posted 07-29-2022 03:34 PM

Try it without that extra SET and see if you like the results.

The point I was trying to make by including is that it might be that you only want to carry forward the non-missing values on SOME of the variables, like SEX, and not others. Having the extra SET statement re-read ALL of the variables like you did completely undoes anything the UPDATE statement might have done to carry forward values.

Reeza · Posted 07-29-2022 12:10 PM

@kristiepauly wrote:

I have data in an excel file that has ~80 variables starting with N, subject ID, deceased..... There are multiple lines for subject ID as each follow up visit is added in another row and aligned under the 'follow up visit' variable. My issue now is when I'm running reports, the number of "missing" is much higher than it should be because it is counting every repeated subject_id as missing. For example, subject id AMC001 below, is showing sex is missing 4 times. How can I get says to only look at the first subject id and ignore the rest if they are the same subject_id? I tried truncover without an input statement because I have ~80 variables and didn't want to list them all. I figured there has to be an easier way.
variables	N	subject_id	Deceased	Reason_for_Death	Sex	Ethnicity
	       12	   008IGG	        0		                  1	 
	       13	   009IGG	        0		                  1	 
	       14	   AMC001	        0		                  0           5
		          AMC001				
		          AMC001				
		          AMC001				
		          AMC001

data=datasets.redcap_pde1; 
Infile miss truncover;
run;				

No = sign in data step statement required for this
Import Excel files with PROC IMPORT not a data step
Import CSV files with a data step or proc import
If you have a large file you do have to list all the variables and types, use the code generated from proc import (from the log) to make this quicker but if you're setting up a repeatable process you definitely want to use a data step not proc import. Proc import guesses at types and not always correctly so you can end up with issues.

multiple lines of data for each subject

Re: multiple lines of data for each subject

Re: multiple lines of data for each subject

Re: multiple lines of data for each subject

Re: multiple lines of data for each subject

Re: multiple lines of data for each subject

multiple lines of data for each subject

Re: multiple lines of data for each subject

Re: multiple lines of data for each subject

Re: multiple lines of data for each subject

Re: multiple lines of data for each subject

Re: multiple lines of data for each subject

The 2025 SAS Hackathon has begun!