Solved: Re: Merging two datasets

Solvej · Posted 05-15-2018 08:47 AM

Hi

Thank you for all the great advise in this group.

I have two datasets that I want to merge, however i'm merging on dates and closest proximity to dates so this is causing me some problems.

Data

A1

data haveA1;

input Record_id admission_number (Admission_date Discharge_date date_of_diagnosis) (:ddmmyy10.) diagnosis $;format Admission_date Discharge_date date_of_diagnosis ddmmyyd10. numbermeet;

datalines;

1 1 01-01-2010 01-02-2010 02-01-2010 df200 1

1 2 03-03-2010 31-03-2010 04-03-2010 df147 1

2 1 03-04-2010 31-08-2010 04-03-2010 df201 1 ;

run;

Data A2

data haveA1;

input Record_id date_of_ques ddmmyy10. numberquest var1 var2 var3;

datalines;

1 07-01-2010 1 A B D

1 09-01-2010 2 F D G

1 09-02-2010 3 A D G

1 09-03-2010 4 F D G

1 09-03-2017 5 X T V

2 07-02-2009 1 C R V

2 03-09-2010 2 C T V

;

run;

I would like to use the following criteria for merging:

merge on record_id;

merge data A2 on closest admission in data A1 based on admission_date and discharge_date in A1 and date_of_ques in A2.

The date_of_ques has to be after admission_date

The date of questionnaire (date_of_ques) can maximum be 365 days from discharge_date, but can not be within next admission for same person (record_id).

Want:

1 1 01-01-2010 01-02-2010 02-01-2010 df200 1 07-01-2010 1 A B D

1 1 01-01-2010 01-02-2010 02-01-2010 df200 1 09-01-2010 2 F D G

1 1 01-01-2010 01-02-2010 02-01-2010 df200 1 09-02-2010 3 A D G

1 2 03-03-2010 31-03-2010 04-03-2010 df147 1 09-03-2010 4 F D G

2 1 03-04-2010 31-08-2010 04-03-2010 df201 1 03-09-2010 2 C T V ;

I hope this makes sence or let me know.

Regards

Solvej

ChrisNZ · Posted 05-15-2018 10:17 PM

1. It would be good if you tested your program before providing it. It cannot run as supplied.

2. Post code using {i}

3. This does some of the work.

Not too sure why 1.2 matches only one record in your result.

You can probably extrapolate and add to my code to get rid of the unwanted matches.

data have1;
input  Record_id  admission_number (Admission_date Discharge_date date_of_diagnosis) (:ddmmyy10.)  diagnosis $;
format Admission_date Discharge_date date_of_diagnosis ddmmyyd10. ;
cards;
1 1 01-01-2010 01-02-2010 02-01-2010 df200 1
1 2 03-03-2010 31-03-2010 04-03-2010 df147 1
2 1 03-04-2010 31-08-2010 04-03-2010 df201 1
; 
run;

data have2;
input  Record_id date_of_ques ddmmyy10.  numberquest var1 $ var2 $ var3 $;
format date_of_ques ddmmyyd10. ;
datalines;
1 07-01-2010 1 A B D
1 09-01-2010 2 F D G
1 09-02-2010 3 A D G
1 09-03-2010 4 F D G
1 09-03-2017 5 X T V
2 07-02-2009 1 C R V
2 03-09-2010 2 C T V
;
run;
proc sql;
  select have1.*, have2.*
  from have1 
        left join 
      have2
        on  have1.Record_id = have2.Record_id 
        and date_of_ques - Discharge_date between -365 and 365
        left join 
      have1 next 
        on  have1.Record_id          = next.Record_id 
        and have1.admission_number+1 = next.admission_number 
        and next.Admission_date      < have2.date_of_ques 
  where next.Record_id is null     
  order by record_id, admission_number, numberquest                         
;

Record_id	admission_number	Admission_date	Discharge_date	date_of_diagnosis	diagnosis	Record_id	date_of_ques	numberquest	var1	var2	var3
1	1	01-01-2010	01-02-2010	02-01-2010	df200	1	07-01-2010	1	A	B	D
1	1	01-01-2010	01-02-2010	02-01-2010	df200	1	09-01-2010	2	F	D	G
1	1	01-01-2010	01-02-2010	02-01-2010	df200	1	09-02-2010	3	A	D	G
1	2	03-03-2010	31-03-2010	04-03-2010	df147	1	07-01-2010	1	A	B	D
1	2	03-03-2010	31-03-2010	04-03-2010	df147	1	09-01-2010	2	F	D	G
1	2	03-03-2010	31-03-2010	04-03-2010	df147	1	09-02-2010	3	A	D	G
1	2	03-03-2010	31-03-2010	04-03-2010	df147	1	09-03-2010	4	F	D	G
2	1	03-04-2010	31-08-2010	04-03-2010	df201	2	03-09-2010	2	C	T	V

High-Performance SAS Coding - Third Edition

View solution in original post

ChrisNZ · Posted 05-15-2018 10:17 PM

1. It would be good if you tested your program before providing it. It cannot run as supplied.

2. Post code using {i}

3. This does some of the work.

Not too sure why 1.2 matches only one record in your result.

You can probably extrapolate and add to my code to get rid of the unwanted matches.

data have1;
input  Record_id  admission_number (Admission_date Discharge_date date_of_diagnosis) (:ddmmyy10.)  diagnosis $;
format Admission_date Discharge_date date_of_diagnosis ddmmyyd10. ;
cards;
1 1 01-01-2010 01-02-2010 02-01-2010 df200 1
1 2 03-03-2010 31-03-2010 04-03-2010 df147 1
2 1 03-04-2010 31-08-2010 04-03-2010 df201 1
; 
run;

data have2;
input  Record_id date_of_ques ddmmyy10.  numberquest var1 $ var2 $ var3 $;
format date_of_ques ddmmyyd10. ;
datalines;
1 07-01-2010 1 A B D
1 09-01-2010 2 F D G
1 09-02-2010 3 A D G
1 09-03-2010 4 F D G
1 09-03-2017 5 X T V
2 07-02-2009 1 C R V
2 03-09-2010 2 C T V
;
run;
proc sql;
  select have1.*, have2.*
  from have1 
        left join 
      have2
        on  have1.Record_id = have2.Record_id 
        and date_of_ques - Discharge_date between -365 and 365
        left join 
      have1 next 
        on  have1.Record_id          = next.Record_id 
        and have1.admission_number+1 = next.admission_number 
        and next.Admission_date      < have2.date_of_ques 
  where next.Record_id is null     
  order by record_id, admission_number, numberquest                         
;

Record_id	admission_number	Admission_date	Discharge_date	date_of_diagnosis	diagnosis	Record_id	date_of_ques	numberquest	var1	var2	var3
1	1	01-01-2010	01-02-2010	02-01-2010	df200	1	07-01-2010	1	A	B	D
1	1	01-01-2010	01-02-2010	02-01-2010	df200	1	09-01-2010	2	F	D	G
1	1	01-01-2010	01-02-2010	02-01-2010	df200	1	09-02-2010	3	A	D	G
1	2	03-03-2010	31-03-2010	04-03-2010	df147	1	07-01-2010	1	A	B	D
1	2	03-03-2010	31-03-2010	04-03-2010	df147	1	09-01-2010	2	F	D	G
1	2	03-03-2010	31-03-2010	04-03-2010	df147	1	09-02-2010	3	A	D	G
1	2	03-03-2010	31-03-2010	04-03-2010	df147	1	09-03-2010	4	F	D	G
2	1	03-04-2010	31-08-2010	04-03-2010	df201	2	03-09-2010	2	C	T	V

High-Performance SAS Coding - Third Edition

Solvej · Posted 05-16-2018 08:23 AM

Thank you so much ChrisNZ.

I have tweaked the code a bit in order to only include questionnaires that were filled in after admission date. See the complete code below. I have only changed this line

and date_of_ques between admission_date and Discharge_date + 365 .

It seems to be working however I get this message when running the code. Should I be concerned?

WARNING: Column named record_id is duplicated in a select expression (or a view). Explicit

references to it will be to the first one.

WARNING: Variable Record_id already exists on file WORK.WANT.

Complete code:

{

proc sql;

create table want as

select have1.*, have2.*

from have1

left join

have2

on have1.Record_id = have2.Record_id

and date_of_ques between admission_date and Discharge_date + 365

left join

have1 next

on have1.Record_id = next.Record_id

and have1.admission_number+1 = next.admission_number

and next.Admission_date < have2.date_of_ques

where next.Record_id is null

order by record_id, admission_number, numberquest

;

quit; }

ChrisNZ · Posted 05-16-2018 05:26 PM

Just read the message, it's very clear.

You run

select have1.*, have2.*

so extract 2 variables with the same name from each of the tables.

Rewrite the select statement to only select the columns you want.

High-Performance SAS Coding - Third Edition

Solvej · Posted 05-22-2018 05:48 AM

HI ChrisNZ

I have a minor problem with the code.

If the first part regarding a maximum of 365 days from dicharge date (admission 1) to questionnaire date is correct but it turns out that the questionnaire is linked with an admission which is after admission one and this specific admission is within the 365 days. Hence this part is not satisfied

{i}
and next.Admission_date < have2.date_of_ques
{i}

then I loose admission 1 in the output data . Can this be fixed?

I hope this makes sence

Kind regards Solvej

ChrisNZ · Posted 05-22-2018 05:31 PM

I do not understand your requirement.

Know that you can use this syntax if the variables contain SAS dates:

and next.Admission_date + 1 < have2.date_of_ques

to shift by one day.

High-Performance SAS Coding - Third Edition

Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away