Please, I need help merging data with proc sql. It had worked previously but currently I noticed that I was losing observations after merging the two data sets with proc sql. I did not amend my codes. I am losing about 4,431 observations out of 21,513. Thank you in advance.
Please, find attached the codes and the log message after the merge
proc sql;
create table Newdata as
select *
from New6, scheduling_data5
where New6.provider = scheduling_data5.provider
and New6.start_date between scheduling_data5.start_date and scheduling_data5.Shift_end_date;
quit;
The problem was with the way SAS was importing the datetime variable. I was using SAS(English with DBCS) instead of SAS(English) and it was converting some mmddyyyy observations to ddmmyyyy. It completely altered the datetime variable. Thank you all for your assistance
where New6.provider = scheduling_data5.provider
and New6.start_date between scheduling_data5.start_date and scheduling_data5.Shift_end_date;
Clearly, this is where the observations are being deleted.
To debug this: try commenting out each condition, one at a time, to see if you can identify the records that are being removed (which should not be removed)
Hello Paige,
Thank you for your prompt response. Please, I am not sure I understand what you mean by "commenting out each condition, one at a time".
where /* New6.provider = scheduling_data5.provider
and */ New6.start_date between scheduling_data5.start_date and scheduling_data5.Shift_end_date;
See what observations return to the results with the above code. Compare to the observations you got when you didn't comment out that part. Are the right observations removed?
Continue the process with the next parts of the WHERE clause.
You might get better performance with
proc sql;
create table Newdata as
select *
from New6 left join scheduling_data5
on New6.provider = scheduling_data5.provider
where New6.start_date between scheduling_data5.start_date and scheduling_data5.Shift_end_date;
quit;
And you would have all of the providers from New6 even if they do not appear (yet) in the scheduling_data5 data set.
Your current code is matching every single record with every single record in the other set and then filtering. So you are wasting a lot of clock cycles. The Join on will reduce the number of comparisons as it starts with the providers then filters on the the dates.
Since we have the word "provider" I suspect this is medical data and you may find that some "provider" has changed spelling, business identification code or similar in one (if not both) of the data sets if the dates aren't the issue. When code doesn't change and results do the most likely culprit is changes in the data. Be aware that as simple a change as inclusion/removal of a single leading space in value will render what you think should be equal not.
Hello Ballardw,
Thank you for your response. The data has not changed. I only switched computer CPU and reinstalled SAS(same version). The code is still giving me the same
result. I am still losing variables after the merge.
proc sql;
create table Newdata as
select *
from New6 left join scheduling_data5
on New6.provider = scheduling_data5.provider
where New6.start_date between scheduling_data5.start_date and scheduling_data5.Shift_end_date;
quit;
I mean losing observations not variables.
Hello @UcheOkoro,
@UcheOkoro wrote:
The code is still giving me the same result. I am still losing [observations] after the merge.
proc sql; create table Newdata as select * from New6 left join scheduling_data5 on New6.provider = scheduling_data5.provider where New6.start_date between scheduling_data5.start_date and scheduling_data5.Shift_end_date; quit;
This is because the providers from New6 that you may keep additionally by replacing the (implicit) inner join with a left join will hardly meet the WHERE condition with missing scheduling_data5.start_date and Shift_end_date as they are not contained in scheduling_data5.
Here's a small example demonstrating how observations get "lost" with your initial code -- because the provider ID is not contained in both datasets or the condition involving the dates is not met -- and how you can keep some or all of them with other types of joins:
/* Create sample data for demonstration */
data n6;
input prov startd;
cards;
1 20
2 10
3 30
4 15
;
data schd5;
input startd endd prov;
cards;
22 29 1
17 31 3
10 24 5
;
/* Inner join similar to your initial code */
proc sql;
create table nd as
select *
from n6, schd5
where n6.prov = schd5.prov
and n6.startd between schd5.startd and schd5.endd;
quit; /* 1 obs., two warnings in the log */
/* Left join */
proc sql;
create table nd(drop=sprov) as
select *
from n6(rename=(startd=nstartd)) left join schd5(rename=(prov=sprov))
on prov = sprov and nstartd between schd5.startd and schd5.endd;
quit; /* 4 obs., no warnings */
/* Right join */
proc sql;
create table nd(drop=nprov) as
select *
from n6(rename=(startd=nstartd prov=nprov)) right join schd5
on nprov = prov and nstartd between schd5.startd and schd5.endd;
quit; /* 3 obs., no warnings */
/* Full join */
proc sql;
create table nd as
select *
from n6(rename=(startd=nstartd prov=nprov)) full join schd5
on nprov = prov and nstartd between schd5.startd and schd5.endd;
quit; /* 6 obs., no warnings */
I attached serial numbers to the unmerged data(main data) and after the merged, I noticed it was deleting serial number 1- 4366. So my data started at serial number 4367.
@UcheOkoro wrote:
I attached serial numbers to the unmerged data(main data) and after the merged, I noticed it was deleting serial number 1- 4366. So my data started at serial number 4367.
@UcheOkoro we don't want to do this work for you, we want you to do the debugging and dig through the data sets and find out why certain observations are deleted.
The problem was with the way SAS was importing the datetime variable. I was using SAS(English with DBCS) instead of SAS(English) and it was converting some mmddyyyy observations to ddmmyyyy. It completely altered the datetime variable. Thank you all for your assistance
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.