Solved: Merging on Varying Variables

djbateman · Posted 11-29-2018 12:09 PM

I have a situation where I have a parent study and a roll-over study. I need to merge any AE records from the parent study that are missing an end date with all AEs in the roll-over study to see if the parent AEs got entered into the roll-over study. I'm having issues due to how data is entered.

I want to merge by SUBJECT, LLT_NAME, and AESTDAT. However, sometimes, LLT_NAME is missing. In those cases, I want to merge on PARAEID (Note: PARAEID is in the roll-over study and matches AESPID in the parent study). I need to find a way to look ahead and see if I should merge by LLT_NAME or PARAEID on an individual record level.

I was doing something like:

data ae_parent;
	set ae_parent;
	if llt_name^='' then mergevar=strip(llt_name);
		else if aespid^=. then mergevar=compress(put(aespid,best.));
run;

data ae_ro;
	set ae_ro;
	if llt_name^='' then mergevar=strip(llt_name);
		else if paraeid^=. then mergevar=compress(put(paraeid,best.));
run;

The downfall is that MERGEVAR will not always properly match. If the parent study has LLT_NAME and the roll-over study does not, then MERGEVAR will be LLT_NAME in the parent study and PARAEID (or blank if PARAEID is also missing) in the roll-over study.

If you look at my full sample code below, you'll see in the final output (AE_MERGED) that there are 6 records with a status of "NOT Copied Over" when only the last 2 of those 6 should have that status because the other 4 records are creating different values for MERGEVAR. So, is there any way that someone can help me find an algorithm for properly matching up based on differing variables?

data ae_parent;
	input SUBJECT $11. LLT_NAME $34. AESPID 49-50 AESTDAT date9.;
	format AESTDAT date9.;
	aestdat=mdy(month(aestdat),day(aestdat),2018);	/* AESTDAT reading in as 2020. Forcing to 2018. */
	cards;
103-049-001	CRP increased						2	09MAY2018
103-049-001	Bacteria urine identified			7	21JUN2018
103-049-001	Urine WBC increased					8	21JUN2018
103-049-001	Urine epithelial cells increased	6	21JUN2018
103-049-001	Urine leukocyte esterase positive	9	21JUN2018
103-049-003	APTT increased						3	02MAY2018
103-049-003	INR increased						2	02MAY2018
103-049-003	PT increased						1	02MAY2018
103-049-003	Nasal mucus increased				5	19JUN2018
103-049-004										3	02JUL2018
103-049-004	APTT increased						9	17JUL2018
103-049-004	Blood glucose increased				6	17JUL2018
103-049-004	Decreased hemoglobin				10	17JUL2018
103-049-004	Glucose urine increased				11	17JUL2018
103-049-004	INR increased						8	17JUL2018
103-049-004	PT increased						7	17JUL2018
;
run;


data ae_ro;
	input SUBJECT $11. LLT_NAME $34. AESPID 49-50 PARAEID 53-54 AESTDAT date9.;
	format AESTDAT date9.;
	aestdat=mdy(month(aestdat),day(aestdat),2018);	/* AESTDAT reading in as 2020. Forcing to 2018. */
	cards;
103-049-001	CRP increased						3	2	09MAY2018
103-049-001										5	7	21JUN2018
103-049-001										6	8	21JUN2018
103-049-001	Urine epithelial cells increased	4	6	21JUN2018
103-049-001	Urine leukocyte esterase positive	7	9	21JUN2018
103-049-001	Skin irritation						1		24JUL2018
103-049-001	Skin irritation						2		22AUG2018
103-049-003										4		02MAY2018
103-049-003	INR increased						3	2	02MAY2018
103-049-003	PT increased						2	1	02MAY2018
103-049-003	Nasal mucus increased				5	5	19JUN2018
103-049-003	Upper respiratory infection			1		30SEP2018
103-049-004	Creatine kinase increased			6	3	02JUL2018
103-049-004	APTT increased						10	9	17JUL2018
103-049-004	Blood glucose increased				7	6	17JUL2018
103-049-004	INR increased						9	8	17JUL2018
103-049-004	PT increased						8	7	17JUL2018
103-049-004	Ear pain							1		29JUL2018
103-049-004	Creatinine increased				2		31JUL2018
103-049-004	AST increased						3		11SEP2018
103-049-004	Creatine kinase increased			4		11SEP2018
103-049-004	Upper respiratory infection			5		06OCT2018
;
run;


data ae_parent;
	set ae_parent;
	if llt_name^='' then mergevar=strip(llt_name);
		else if aespid^=. then mergevar=compress(put(aespid,best.));
run;
data ae_ro;
	set ae_ro;
	if llt_name^='' then mergevar=strip(llt_name);
		else if paraeid^=. then mergevar=compress(put(paraeid,best.));
run;

proc sort data=ae_parent;
	by subject mergevar aestdat;
proc sort data=ae_ro;
	by subject mergevar aestdat;
run;

data ae_merged;
	merge ae_parent (in=a) ae_ro (in=b);
	by subject mergevar aestdat;
	if a;
	if b=0 then STATUS='NOT Copied Over';
	if b=1 then STATUS='Copied Over';
run;

heather_g · Posted 11-30-2018 10:59 AM

I think I've got it. See if this attached code works.

View solution in original post

heather_g · Posted 11-29-2018 06:23 PM

My code isn't quite ready to properly illustrate, but here's the gist of it:

Since you seem to be familiar with DATA steps, and your data are small, here's how I propose you do it, approximately.

(I could write up a more elegant solution with PROC SQL).

- Sort the data sets on SUBJECT, LLT_NAME, and AESTDAT. Merge as you did (or else inner join, or merge only keeping total matches). That's merge 1.

- Create a new variable in the parent data set for PARAEID, or rename AESPID to PARAEID. Sort the data sets on SUBJECT, PARAEID, and AESTDAT. Merge as you did (or else inner join, that is, merge only keeping total matches). That's merge 2.

- Merge those 2 resulting data sets a third time. From merge 1's resulting data set, you should have AESPID from the parent data set. That variable would also be in merge 2's resulting data set. Then, you can merge both data sets on the parent AESPID.

Would that get you what you are looking for? You would need to sort out the variables you want to end up in the final data set, renaming them or creating new ones as appropriate, to get all the results from the RO data set properly aligned in the final data set.

djbateman · Posted 11-30-2018 09:38 AM

I don't think I followed your thoughts very well because I have no idea what I ended up with:


/*Sort the data sets on SUBJECT, LLT_NAME, and AESTDAT. Merge as you did (or else inner join, or merge only keeping total matches). That's merge 1.*/
proc sort data=ae_parent;
	by subject llt_name aestdat;
proc sort data=ae_ro;
	by subject llt_name aestdat;
run;

data merge1;
	merge ae_parent ae_ro;
	by subject llt_name aestdat;
run;


/*Create a new variable in the parent data set for PARAEID, or rename AESPID to PARAEID. Sort the data sets on SUBJECT, PARAEID, and AESTDAT. Merge as you did (or else inner join, that is, merge only keeping total matches). That's merge 2.*/
proc sort data=ae_parent out=ae_parent (rename=(aespid=PARAEID));
	by subject aespid aestdat;
proc sort data=ae_ro;
	by subject paraeid aestdat;
run;

data merge2;
	merge ae_parent ae_ro;
	by subject paraeid aestdat;
run;


/*Merge those 2 resulting data sets a third time. From merge 1's resulting data set, you should have AESPID from the parent data set. That variable would also be in merge 2's resulting data set. Then, you can merge both data sets on the parent AESPID.*/
proc sort data=merge1;
	by subject llt_name paraeid aestdat;
proc sort data=merge2;
	by subject llt_name paraeid aestdat;
run;

data ae_merged;
	merge merge1 merge2;
	by subject llt_name paraeid aestdat;
run;

But I can read SQL (though I'm not as fluent as DATA steps), so if you have a solution that would work better that way, I am willing to take a look.

heather_g · Posted 11-30-2018 10:24 AM

I think I should step back and ensure I understand what you want to do and what you want to end up with in the final data set.

"I need to merge any AE records from the parent study that are missing an end date with all AEs in the roll-over study to see if the parent AEs got entered into the roll-over study."

From what you said, I'm gathering what defines an "AE" is the SUBJECT variable; either the LLT_NAME, or this variable that is named AESPID in the parent data set and PARAEID in the roll-over data set; and the AESTDAT, which apparently is the end date. Is that correct?

None of the rows in the parent data set are missing an end date. Are they just not in the data set, if they are "missing an end date"?

So is this correct: all you're looking to end up with in your final data set, are all the variables from the roll-over study, and a new STATUS variable that indicates whether the "AE" was in the parent study?

djbateman · Posted 11-30-2018 10:30 AM

Sorry for the confusion. The data I provided in my original post is a small subset of prepared data. I did all the data cleaning to get all the AEs that were needed (i.e., only AEs with no AE end date are present). AESTDAT is "AE Start Date" (the end date is named AEENDAT and is excluded since it is missing for all records).

There are 16 records in my sample parent dataset. All 16 of those records should show up in the final output. The only think I need from the rollover dataset is to see if there is a matching record (based on LLT_NAME first and PARAEID second if LLT_NAME is missing in either dataset). If a matching record is found, then I can flag it as being copied from the parent study to the rollover study. If a matching record is NOT found, then I can flag is as NOT being copied from the parent study to the rollover study.

heather_g · Posted 11-30-2018 10:59 AM

I think I've got it. See if this attached code works.

heather_g · Posted 11-30-2018 11:03 AM

The comments are a little off in places, but the code works.

djbateman · Posted 11-30-2018 01:46 PM

It's people like you that make people like me so happy! I've been working on this for almost 2 weeks. It's good to just get a fresh set of eyes and a new perspective. All my methods were leading to a dead end, but yours seems to have done the trick. Thank you so much!

heather_g · Posted 12-03-2018 05:33 PM

Thanks! I'm glad I could help.

Merging on Varying Variables

Re: Merging on Varying Variables

Re: Merging on Varying Variables

Re: Merging on Varying Variables

Re: Merging on Varying Variables

Re: Merging on Varying Variables

Re: Merging on Varying Variables

Re: Merging on Varying Variables

Re: Merging on Varying Variables

Re: Merging on Varying Variables

SAS Innovate 2025: Call for Content

Classroom Training Available!