Solved: Re: Delete overlapping observation by participant

JLang055 · Posted 11-19-2022 10:09 AM

I have a patient dataset with multiple observations and variables. I need to delete overlapping (duplicate) observations for each participant. See the example below.

HAVE

FID	MOOD	ANXIETY	EAT	DEVELOP
1		1
1			1
1	1
1	1
2			1
2		1
2			1
3		1
3	1

WANT

FID	MOOD	ANXIETY	EAT	DEVELOP
1		1
1			1
1	1
2			1
2		1
3		1
3	1

fja · Posted 11-19-2022 10:21 AM

Hello!
Would you like to filter an existing dataset, or do you want to prevent superfluous observations from being inserted in the first place?
--fja

JLang055 · Posted 11-19-2022 10:31 AM

I'm trying to prevent double-counting a participant in specific categories.

fja · Posted 11-19-2022 11:01 AM

What about defining an index with the UNIQUE option?

--fja
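
One way to read the unique-index suggestion, as a rough sketch only: create an empty table with the same structure as HAVE, give it a composite unique index on all five variables, and load the rows with PROC APPEND, which should reject any row whose key combination is already present. The names work.target and combo are made up for illustration, and the code assumes HAVE exists with the columns from the original post.

```sas
/* Hypothetical names: work.target and the index name combo are illustrative only. */
/* Empty copy of HAVE that carries a composite unique index on all variables.      */
data work.target(index=(combo=(FID MOOD ANXIETY EAT DEVELOP) / unique));
  set have(obs=0);
run;

/* Rows that repeat an existing key combination are rejected; the log reports how many. */
proc append base=work.target data=have;
run;
```

Rows are appended in their original order, so no sort is needed, though maintaining the index adds some overhead on large tables.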

fja · Posted 11-19-2022 11:23 AM

OK, if you just want to have a "clean" intermediary table, then you could use PROC SORT:

data work.TestData;
	infile datalines dsd;
	input FID MOOD ANXIETY EAT DEVELOP;
datalines;	
1, ,1, , 
1, , ,1, 
1,1, , , 
1,1, , , 
2, , ,1, 
2, ,1, , 
2, , ,1, 
3, ,1, , 
3,1, , , 
;
run;

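/* NODUPKEY keeps one row per distinct BY combination. FID has to be part of the key,
   otherwise identical rows belonging to different participants are dropped as well. */
PROC SORT DATA = work.TestData NODUPKEY out=work.testdata2;
BY FID MOOD ANXIETY EAT DEVELOP;
RUN;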

Tom · Posted 11-19-2022 02:41 PM

For data like that with values that are either 1 or missing I would probably just collapse to one observation per subject.

data want;
  update have(obs=0) have;
  by fid;
run;

Obs    FID    MOOD    ANXIETY    EAT    DEVELOP

 1      1       1        1        1        .
 2      2       .        1        1        .
 3      3       1        1        .        .
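
For comparison, roughly the same collapse can be sketched with PROC SQL by taking the maximum of each flag within a participant. This is not the UPDATE approach shown above, just an alternative under the assumption that every flag is either 1 or missing; the output name want_sql is made up.

```sas
/* Alternative sketch: one row per FID; a flag is 1 if any row had it, otherwise missing. */
proc sql;
  create table want_sql as
  select FID,
         max(MOOD)    as MOOD,
         max(ANXIETY) as ANXIETY,
         max(EAT)     as EAT,
         max(DEVELOP) as DEVELOP
  from have
  group by FID;
quit;
```

MAX ignores missing values, so a participant with no 1 in a column stays missing in that column.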

fja · Posted 11-19-2022 05:14 PM

That does result in a saner-looking dataset, agreed. It is just that @JLang055 asked for a different kind of output.
--fja

JLang055 · Posted 11-20-2022 05:12 PM

I think this could be a good option. When I run the code you suggested, I run into the following error:

NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
24         
25         GOPTIONS ACCESSIBLE;
26         data DIAG_FULL_F;
27         	update have(obs=0) have;
ERROR: File WORK.HAVE.DATA does not exist.
ERROR: File WORK.HAVE.DATA does not exist.
28         	by FID;
29         run;

NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.DIAG_FULL_F may be incomplete.  When this step was stopped there were 0 observations and 0 variables.
WARNING: Data set WORK.DIAG_FULL_F was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

Do you have any idea what's going on with it?

Kurt_Bremser · Posted 11-20-2022 06:13 PM

ERROR: File WORK.HAVE.DATA does not exist.

What could be more clear than this? You need to create the HAVE (the name you used in your initial post) dataset before you can use it.
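
Put differently, HAVE in the suggested code is a placeholder for the real input table. A sketch of the adapted step, assuming the actual dataset is the DIAG_FULL that appears elsewhere in this thread and that it may not yet be sorted by FID:

```sas
/* Assumption: the actual input table is WORK.DIAG_FULL; substitute your own name. */
proc sort data=DIAG_FULL;
  by FID;   /* UPDATE with a BY statement needs the data sorted by FID */
run;

data DIAG_FULL_F;
  /* Empty master (obs=0) plus every row as a transaction:          */
  /* non-missing values accumulate into one observation per FID.    */
  update DIAG_FULL(obs=0) DIAG_FULL;
  by FID;
run;
```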

Patrick · Posted 11-19-2022 05:29 PM

Below should create what you're asking for.

proc sort data=have out=want nodupkey;
  by _all_;
run;

Code not tested, because the data was not provided in a form that can be used directly in code (i.e. a fully working SAS data step that creates the data).

mkeintz · Posted 11-19-2022 09:48 PM

This program will keep the first instance of each unique combination of variables:

data have;
	infile datalines dsd;
	input FID MOOD ANXIETY EAT DEVELOP;
datalines;	
1, ,1, , 
1, , ,1, 
1,1, , , 
1,1, , , 
2, , ,1, 
2, ,1, , 
2, , ,1, 
3, ,1, , 
3,1, , , 
;
run;
data want;
  set have;
  if _n_=1 then do;
    declare hash h (dataset:'have (obs=0)');
      h.definekey(all:'Y');
      h.definedone();
  end;
  if h.add()=0 ;
run;

The hash object is "keyed" on all the variables in dataset have (think of it as using a compound index based on the combination of all variables).

The hash method ADD will be successful (i.e. h.add()=0) only when there is not already a dataitem (i.e. a "row") in it with the same combination of variables. As a result the dataset does not even need to be sorted, even by FID.

JLang055 · Posted 11-20-2022 09:28 PM

Hi All,

I took @Tom's suggestion to merge all the data into a single row per participant, and ran some code that worked for me.

%macro count(var); /* create a new diagnosis count variable for &var */

/* The BY statement requires DIAG_FULL to be sorted by FID */
proc summary data=DIAG_FULL;
	var &var;
	by FID;
	output out=&var sum=;
run;

data &VAR;
	set &VAR;

	IF &VAR >= 1 then &VAR =1;
	ELSE &VAR =0;

	drop _TYPE_ _FREQ_;
run;

%mend count;

%count(MOOD);
%count(ANXIETY);
%count(DEVELOP);
%count(CONDUCT);
%count(ADHD);

data CLINIC_DIAG;
	merge MOOD ANXIETY DEVELOP CONDUCT ADHD DIAG_MULTI;
	by FID;
run;

fja · Posted 11-21-2022 02:28 AM

Congratulations on your first solution then ... but you could at least spare Tom's post a like. 😉
