BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
UcheOkoro
Lapis Lazuli | Level 10

Please, I need help merging data with proc sql. It had worked previously but currently I noticed that I was losing observations after merging the two data sets with proc sql. I did not amend my codes. I am losing about 4,431 observations out of 21,513. Thank you in advance.

 

Please, find attached the codes and the log message after the merge

proc sql;
     create table Newdata as
	 select *
         from New6, scheduling_data5 
         where New6.provider = scheduling_data5.provider
          and New6.start_date between scheduling_data5.start_date and scheduling_data5.Shift_end_date;
quit;

UcheOkoro_0-1626367548739.png

 

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
UcheOkoro
Lapis Lazuli | Level 10

The problem was with the way SAS was importing the datetime variable. I was using SAS(English with DBCS) instead of SAS(English) and it was converting some mmddyyyy observations to ddmmyyyy. It completely altered the datetime variable. Thank you all for your assistance

View solution in original post

10 REPLIES 10
PaigeMiller
Diamond | Level 26
where New6.provider = scheduling_data5.provider
          and New6.start_date between scheduling_data5.start_date and scheduling_data5.Shift_end_date;

Clearly, this is where the observations are being deleted.

 

To debug this: try commenting out each condition, one at a time, to see if you can identify the records that are being removed (which should not be removed)

--
Paige Miller
UcheOkoro
Lapis Lazuli | Level 10

Hello Paige,

 

Thank you for your prompt response. Please, I am not sure I understand what you mean by "commenting out each condition, one at a time".

 

 

PaigeMiller
Diamond | Level 26
where /* New6.provider = scheduling_data5.provider
          and */ New6.start_date between scheduling_data5.start_date and scheduling_data5.Shift_end_date;

See what observations return to the results with the above code. Compare to the observations you got when you didn't comment out that part. Are the right observations removed?


Continue the process with the next parts of the WHERE clause.

--
Paige Miller
ballardw
Super User

You might get better performance with

 

proc sql;
     create table Newdata as
	 select *
         from New6 left join scheduling_data5 
         on New6.provider = scheduling_data5.provider
         where New6.start_date between scheduling_data5.start_date and scheduling_data5.Shift_end_date;
quit;

And you would have all of the providers from New6 even if they do not appear (yet) in the scheduling_data5 data set.

Your current code is matching every single record with every single record in the other set and then filtering. So you are wasting a lot of clock cycles. The Join on will reduce the number of comparisons as it starts with the providers then filters on the the dates.

 

Since we have the word "provider" I suspect this is medical data and you may find that some "provider" has changed spelling, business identification code or similar in one (if not both) of the data sets if the dates aren't the issue. When code doesn't change and results do the most likely culprit is changes in the data. Be aware that as simple a change as inclusion/removal of a single leading space in value will render what you think should be equal not.

UcheOkoro
Lapis Lazuli | Level 10

Hello Ballardw,

 

Thank you for your response. The data has not changed. I only switched  computer CPU and reinstalled SAS(same version). The code is still giving me the same 

result. I am still losing variables after the merge.


proc sql;
     create table Newdata as
	 select *
         from New6 left join scheduling_data5 
         on New6.provider = scheduling_data5.provider
         where New6.start_date between scheduling_data5.start_date and scheduling_data5.Shift_end_date;
quit;

UcheOkoro_0-1626369899043.png

 

UcheOkoro
Lapis Lazuli | Level 10

I mean losing observations not variables.

FreelanceReinh
Jade | Level 19

Hello @UcheOkoro,


@UcheOkoro wrote:

The code is still giving me the same result. I am still losing [observations] after the merge.


proc sql;
     create table Newdata as
	 select *
         from New6 left join scheduling_data5 
         on New6.provider = scheduling_data5.provider
         where New6.start_date between scheduling_data5.start_date and scheduling_data5.Shift_end_date;
quit;

This is because the providers from New6 that you may keep additionally by replacing the (implicit) inner join with a left join will hardly meet the WHERE condition with missing scheduling_data5.start_date and Shift_end_date as they are not contained in scheduling_data5.

 

Here's a small example demonstrating how observations get "lost" with your initial code -- because the provider ID is not contained in both datasets or the condition involving the dates is not met -- and how you can keep some or all of them with other types of joins:

/* Create sample data for demonstration */

data n6;
input prov startd;
cards;
1 20
2 10
3 30
4 15
;

data schd5;
input startd endd prov;
cards;
22 29 1
17 31 3
10 24 5
;

/* Inner join similar to your initial code */

proc sql;
create table nd as
select *
from n6, schd5 
where n6.prov = schd5.prov
      and n6.startd between schd5.startd and schd5.endd;
quit; /* 1 obs., two warnings in the log */

/* Left join */

proc sql;
create table nd(drop=sprov) as
select *
from n6(rename=(startd=nstartd)) left join schd5(rename=(prov=sprov))
on prov = sprov and nstartd between schd5.startd and schd5.endd;
quit; /* 4 obs., no warnings */

/* Right join */

proc sql;
create table nd(drop=nprov) as
select *
from n6(rename=(startd=nstartd prov=nprov)) right join schd5
on nprov = prov and nstartd between schd5.startd and schd5.endd;
quit; /* 3 obs., no warnings */

/* Full join */

proc sql;
create table nd as
select *
from n6(rename=(startd=nstartd prov=nprov)) full join schd5
on nprov = prov and nstartd between schd5.startd and schd5.endd;
quit; /* 6 obs., no warnings */
UcheOkoro
Lapis Lazuli | Level 10

I attached serial numbers to the unmerged data(main data) and after the merged, I noticed it was deleting serial number 1- 4366.  So my data started at serial number 4367.

PaigeMiller
Diamond | Level 26

@UcheOkoro wrote:

I attached serial numbers to the unmerged data(main data) and after the merged, I noticed it was deleting serial number 1- 4366.  So my data started at serial number 4367.


@UcheOkoro we don't want to do this work for you, we want you to do the debugging and dig through the data sets and find out why certain observations are deleted.

--
Paige Miller
UcheOkoro
Lapis Lazuli | Level 10

The problem was with the way SAS was importing the datetime variable. I was using SAS(English with DBCS) instead of SAS(English) and it was converting some mmddyyyy observations to ddmmyyyy. It completely altered the datetime variable. Thank you all for your assistance

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1353 views
  • 0 likes
  • 4 in conversation