Hi
Thank you for all the great advise in this group.
I have two datasets that I want to merge, however i'm merging on dates and closest proximity to dates so this is causing me some problems.
Data
A1
data haveA1;
input Record_id admission_number (Admission_date Discharge_date date_of_diagnosis) (:ddmmyy10.) diagnosis $;format Admission_date Discharge_date date_of_diagnosis ddmmyyd10. numbermeet;
datalines;
1 1 01-01-2010 01-02-2010 02-01-2010 df200 1
1 2 03-03-2010 31-03-2010 04-03-2010 df147 1
2 1 03-04-2010 31-08-2010 04-03-2010 df201 1 ;
run;
Data A2
data haveA1;
input Record_id date_of_ques ddmmyy10. numberquest var1 var2 var3;
datalines;
1 07-01-2010 1 A B D
1 09-01-2010 2 F D G
1 09-02-2010 3 A D G
1 09-03-2010 4 F D G
1 09-03-2017 5 X T V
2 07-02-2009 1 C R V
2 03-09-2010 2 C T V
;
run;
I would like to use the following criteria for merging:
merge on record_id;
merge data A2 on closest admission in data A1 based on admission_date and discharge_date in A1 and date_of_ques in A2.
The date_of_ques has to be after admission_date
The date of questionnaire (date_of_ques) can maximum be 365 days from discharge_date, but can not be within next admission for same person (record_id).
Want:
1 1 01-01-2010 01-02-2010 02-01-2010 df200 1 07-01-2010 1 A B D
1 1 01-01-2010 01-02-2010 02-01-2010 df200 1 09-01-2010 2 F D G
1 1 01-01-2010 01-02-2010 02-01-2010 df200 1 09-02-2010 3 A D G
1 2 03-03-2010 31-03-2010 04-03-2010 df147 1 09-03-2010 4 F D G
2 1 03-04-2010 31-08-2010 04-03-2010 df201 1 03-09-2010 2 C T V ;
I hope this makes sence or let me know.
Regards
Solvej
1. It would be good if you tested your program before providing it. It cannot run as supplied.
2. Post code using {i}
3. This does some of the work.
Not too sure why 1.2 matches only one record in your result.
You can probably extrapolate and add to my code to get rid of the unwanted matches.
data have1;
input Record_id admission_number (Admission_date Discharge_date date_of_diagnosis) (:ddmmyy10.) diagnosis $;
format Admission_date Discharge_date date_of_diagnosis ddmmyyd10. ;
cards;
1 1 01-01-2010 01-02-2010 02-01-2010 df200 1
1 2 03-03-2010 31-03-2010 04-03-2010 df147 1
2 1 03-04-2010 31-08-2010 04-03-2010 df201 1
;
run;
data have2;
input Record_id date_of_ques ddmmyy10. numberquest var1 $ var2 $ var3 $;
format date_of_ques ddmmyyd10. ;
datalines;
1 07-01-2010 1 A B D
1 09-01-2010 2 F D G
1 09-02-2010 3 A D G
1 09-03-2010 4 F D G
1 09-03-2017 5 X T V
2 07-02-2009 1 C R V
2 03-09-2010 2 C T V
;
run;
proc sql;
select have1.*, have2.*
from have1
left join
have2
on have1.Record_id = have2.Record_id
and date_of_ques - Discharge_date between -365 and 365
left join
have1 next
on have1.Record_id = next.Record_id
and have1.admission_number+1 = next.admission_number
and next.Admission_date < have2.date_of_ques
where next.Record_id is null
order by record_id, admission_number, numberquest
;
Record_id | admission_number | Admission_date | Discharge_date | date_of_diagnosis | diagnosis | Record_id | date_of_ques | numberquest | var1 | var2 | var3 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 01-01-2010 | 01-02-2010 | 02-01-2010 | df200 | 1 | 07-01-2010 | 1 | A | B | D |
1 | 1 | 01-01-2010 | 01-02-2010 | 02-01-2010 | df200 | 1 | 09-01-2010 | 2 | F | D | G |
1 | 1 | 01-01-2010 | 01-02-2010 | 02-01-2010 | df200 | 1 | 09-02-2010 | 3 | A | D | G |
1 | 2 | 03-03-2010 | 31-03-2010 | 04-03-2010 | df147 | 1 | 07-01-2010 | 1 | A | B | D |
1 | 2 | 03-03-2010 | 31-03-2010 | 04-03-2010 | df147 | 1 | 09-01-2010 | 2 | F | D | G |
1 | 2 | 03-03-2010 | 31-03-2010 | 04-03-2010 | df147 | 1 | 09-02-2010 | 3 | A | D | G |
1 | 2 | 03-03-2010 | 31-03-2010 | 04-03-2010 | df147 | 1 | 09-03-2010 | 4 | F | D | G |
2 | 1 | 03-04-2010 | 31-08-2010 | 04-03-2010 | df201 | 2 | 03-09-2010 | 2 | C | T | V |
1. It would be good if you tested your program before providing it. It cannot run as supplied.
2. Post code using {i}
3. This does some of the work.
Not too sure why 1.2 matches only one record in your result.
You can probably extrapolate and add to my code to get rid of the unwanted matches.
data have1;
input Record_id admission_number (Admission_date Discharge_date date_of_diagnosis) (:ddmmyy10.) diagnosis $;
format Admission_date Discharge_date date_of_diagnosis ddmmyyd10. ;
cards;
1 1 01-01-2010 01-02-2010 02-01-2010 df200 1
1 2 03-03-2010 31-03-2010 04-03-2010 df147 1
2 1 03-04-2010 31-08-2010 04-03-2010 df201 1
;
run;
data have2;
input Record_id date_of_ques ddmmyy10. numberquest var1 $ var2 $ var3 $;
format date_of_ques ddmmyyd10. ;
datalines;
1 07-01-2010 1 A B D
1 09-01-2010 2 F D G
1 09-02-2010 3 A D G
1 09-03-2010 4 F D G
1 09-03-2017 5 X T V
2 07-02-2009 1 C R V
2 03-09-2010 2 C T V
;
run;
proc sql;
select have1.*, have2.*
from have1
left join
have2
on have1.Record_id = have2.Record_id
and date_of_ques - Discharge_date between -365 and 365
left join
have1 next
on have1.Record_id = next.Record_id
and have1.admission_number+1 = next.admission_number
and next.Admission_date < have2.date_of_ques
where next.Record_id is null
order by record_id, admission_number, numberquest
;
Record_id | admission_number | Admission_date | Discharge_date | date_of_diagnosis | diagnosis | Record_id | date_of_ques | numberquest | var1 | var2 | var3 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 01-01-2010 | 01-02-2010 | 02-01-2010 | df200 | 1 | 07-01-2010 | 1 | A | B | D |
1 | 1 | 01-01-2010 | 01-02-2010 | 02-01-2010 | df200 | 1 | 09-01-2010 | 2 | F | D | G |
1 | 1 | 01-01-2010 | 01-02-2010 | 02-01-2010 | df200 | 1 | 09-02-2010 | 3 | A | D | G |
1 | 2 | 03-03-2010 | 31-03-2010 | 04-03-2010 | df147 | 1 | 07-01-2010 | 1 | A | B | D |
1 | 2 | 03-03-2010 | 31-03-2010 | 04-03-2010 | df147 | 1 | 09-01-2010 | 2 | F | D | G |
1 | 2 | 03-03-2010 | 31-03-2010 | 04-03-2010 | df147 | 1 | 09-02-2010 | 3 | A | D | G |
1 | 2 | 03-03-2010 | 31-03-2010 | 04-03-2010 | df147 | 1 | 09-03-2010 | 4 | F | D | G |
2 | 1 | 03-04-2010 | 31-08-2010 | 04-03-2010 | df201 | 2 | 03-09-2010 | 2 | C | T | V |
Thank you so much ChrisNZ.
I have tweaked the code a bit in order to only include questionnaires that were filled in after admission date. See the complete code below. I have only changed this line
and date_of_ques between admission_date and Discharge_date + 365 .
It seems to be working however I get this message when running the code. Should I be concerned?
WARNING: Column named record_id is duplicated in a select expression (or a view). Explicit
references to it will be to the first one.
WARNING: Variable Record_id already exists on file WORK.WANT.
Complete code:
{
proc sql;
create table want as
select have1.*, have2.*
from have1
left join
have2
on have1.Record_id = have2.Record_id
and date_of_ques between admission_date and Discharge_date + 365
left join
have1 next
on have1.Record_id = next.Record_id
and have1.admission_number+1 = next.admission_number
and next.Admission_date < have2.date_of_ques
where next.Record_id is null
order by record_id, admission_number, numberquest
;
quit; }
Just read the message, it's very clear.
You run
select have1.*, have2.*
so extract 2 variables with the same name from each of the tables.
Rewrite the select statement to only select the columns you want.
I do not understand your requirement.
Know that you can use this syntax if the variables contain SAS dates:
and next.Admission_date + 1 < have2.date_of_ques
to shift by one day.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.