DATA Step, Macro, Functions and more

Keep obs in one dataset based on ID match and other variables match condition in another dataset

Accepted Solution Solved
Reply
Contributor
Posts: 38
Accepted Solution

Keep obs in one dataset based on ID match and other variables match condition in another dataset

[ Edited ]

I have two datasets x and y.

 

I want to subset dataset 2 (y) based on the ID match and condition that if Date in dataset 1 (x) is at or in between Session Begin and End Date of the dataset y then keep those observation in Dataset y. Where End date is blank (because "End" event has not occurred)

, then apply condition that if Date in dataset x is at or after End Date in dataset y then also keep these observation in Dataset y. Any suggestion to correct the code or any other method to do this?

 

Code I used:

Proc SQL;

Create Table want as select y.*

from dataset1 x, dataset2 y

where x.ID=y.ID and

((x.Date between y.SessionBeginDate and y.SessionEndDate) /* Do this when both Begin date and End date is available*/

or

x.Date GE y.SessionBeginDate);/* Do this When only Begin date is available but End date is blank because "End" event has not occurred*/

quit;

 

Dataset I have:

Dataset 1: x 
IDDate
00000001012017-03-22
00000001042017-06-20
00000001052017-05-17
00000001062017-06-20
00000001082017-10-12
00000001092017-11-01
00000001102017-11-08
00000001112017-08-01
00000001122017-06-14
00000001132017-09-19
00000001142017-07-23

 

 

Dataset 2: y   
IDSessionCodeSessionBeginDateSessionEndDate
0000000101E2018-03-262018-04-09
0000000101W2018-04-09 
0000000102C2018-01-312018-05-30
0000000102M2018-03-052018-05-30
0000000102S2018-01-312018-05-30
0000000103U2016-10-18 
0000000103W2018-03-23 
0000000103C2016-10-19 
0000000103M2016-10-19 
0000000103S2016-10-26 
0000000104J2017-04-28 
0000000104F2017-04-28 
0000000105J2017-12-112018-04-30
0000000105F2017-12-112018-04-30
0000000106M2018-03-13 
0000000106S2015-05-14 
0000000106W2018-03-192018-05-22
0000000107M2017-01-18 
0000000107S2017-01-17 
0000000108M2017-12-18 
0000000108S2017-10-03 
0000000109J2017-09-22 
0000000109S2016-01-04 
0000000109F2017-09-22 
0000000110C2017-08-28 
0000000110M2017-08-28 
0000000110S2013-02-14 
0000000111C2017-11-132018-04-17
0000000111C2018-04-18 
0000000111M2017-11-132018-04-17
0000000111M2018-04-18 
0000000111S2017-11-132018-04-17
0000000111S2018-04-18 
0000000112M2017-08-092018-04-05
0000000112S2017-07-312018-04-05
0000000113A2018-04-17 
0000000113M2018-04-17 
0000000113C2017-09-192018-04-16
0000000113C2017-09-192018-04-16
0000000113M2017-09-192018-04-16
0000000113M2018-04-17 
0000000113S2016-09-20 
0000000114M2018-02-19 
0000000114S2017-07-07 
0000000115E2018-03-292018-05-15
0000000115J2018-03-30 
0000000115P2018-03-302018-05-14
0000000115E2018-03-292018-05-15
0000000115J2018-03-30 
0000000115P2018-03-302018-05-14

 

 

Result: for example lets say for matched ID  0000000106, 0000000110 and 0000000113, obs I want to keep in Dataset 2 (Y) looks like:

IDSessionCodeSessionBeginDateSessionEndDate
0000000106S2015-05-14 
0000000110C2017-08-28 
0000000110M2017-08-28 
0000000110S2013-02-14 
0000000113C2017-09-192018-04-16
0000000113C2017-09-192018-04-16
0000000113M2017-09-192018-04-16
0000000113S2016-09-20 

Accepted Solutions
Solution
2 weeks ago
Trusted Advisor
Posts: 1,250

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Posted in reply to FreelanceReinhard

Example providing evidence that the code does work as intended:

data dataset1;
length ID $10;
input ID Date :yymmdd10.;
format Date yymmdd10.;
cards;
0000000001 2017-03-22
0000000002 2017-05-01
0000000003 2018-01-01
0000000004 2018-05-01
;

data dataset2;
length ID $10;
infile cards missover;
input ID SessionCode $ (SessionBeginDate SessionEndDate) (:yymmdd10.);
format _numeric_ yymmdd10.;
cards;
0000000001 A 2017-05-10
0000000001 B 2017-05-10 2017-12-31
0000000002 C 2017-03-22
0000000002 D 2017-03-22 2017-12-31
0000000003 E 2018-01-01
0000000003 F 2017-03-22 2017-12-31
0000000003 G 2017-12-01 2018-01-01
0000000005 H 2017-03-22
0000000005 I 2017-03-22 2017-12-31
;

proc sql;
create table want as select y.*
from dataset1 x, dataset2 y
where x.ID=y.ID and
(.z<y.SessionBeginDate<=x.date<=y.SessionEndDate
or
x.Date>=y.SessionBeginDate>.z & y.SessionEndDate is missing);
quit;

Result:

                     Session     Session         Session
Obs        ID         Code      BeginDate        EndDate

 1     0000000002       C       2017-03-22             .
 2     0000000002       D       2017-03-22    2017-12-31
 3     0000000003       E       2018-01-01             .
 4     0000000003       G       2017-12-01    2018-01-01

(SessionCode values have been chosen to serve as identifiers for observations in dataset2.)

 

  • Matching observations A and B were correctly not selected, because 2017-03-22 was before 2017-05-10.
  • Matching observation C was correctly selected, because 2017-05-01 was after 2017-03-22 and SessionEndDate is missing.
  • Matching observation D was correctly selected, because 2017-05-01 was between 2017-03-22 and 2017-12-31.
  • Matching observation E was correctly selected, because 2018-01-01 was on SessionBeginDate.
  • Matching observation F was correctly not selected, because 2018-01-01 was after 2017-12-31.
  • Matching observation G was correctly selected, because 2018-01-01 was on SessionEndDate.
  • Non-matching observations H and I were correctly not selected.

 

 

Please note that duplicate IDs in dataset1 (if any) satisfying the WHERE condition would cause duplicates in dataset WANT, but not incorrectly selected observations.

View solution in original post


All Replies
Valued Guide
Posts: 591

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

What is your issue with the code your using? Are those dates in Character or Numeric?

 

You can change that to LEFT Joins.

 

Proc SQL;
Create Table want as 
select y.*
	from dataset2 y 
		left join dataset1 x 
		 on x.ID=y.ID
Where (x.Date between y.SessionBeginDate and y.SessionEndDate) 
	or (x.Date GE y.SessionEndDate);

quit;
Thanks,
Suryakiran
Contributor
Posts: 38

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Posted in reply to SuryaKiran

In my code where condition is not working, basically it is returning me all observations for a matched ID.

Trusted Advisor
Posts: 1,250

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset


@d0816 wrote:

In my code where condition is not working, basically it is returning me all observations for a matched ID.


... which is not surprising, as the condition

x.Date GE y.SessionEndDate

will always be met if y.SessionEndDate is missing and x.Date is non-missing (and even in many cases with missing x.Date).

Contributor
Posts: 38

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Posted in reply to FreelanceReinhard

My mistake I meant to write x.Date GE y.SessionBeginDate which is what I actually used.

Proc SQL;

Create Table want as select y.*

from dataset1 x, dataset2 y

where x.ID=y.ID and

((x.Date between y.SessionBeginDate and y.SessionEndDate) /* Do this when both Begin date and End date is available*/

or

x.Date GE y.SessionBeginDate);/* Do this When only Begin date is available but End date is blank because "End" event has not occurred*/

quit;

Trusted Advisor
Posts: 1,250

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

[ Edited ]

Good. This makes more sense.

 

Then I think, all it takes is to include the criteria about missing values in the WHERE condition, e.g.

where x.id=y.id and
(.z<y.Sessionbegindate<=x.date<=y.Sessionenddate
or
x.date>=y.Sessionbegindate>.z & y.Sessionenddate is missing);

(using the special missing value .z)

Contributor
Posts: 38

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Posted in reply to FreelanceReinhard

This code is giving me all observations for unmatched ID as well.

 

And also where condition did not work. For example one matched ID has "Date" as 3/22/2017 in Dataset x. The resulting dataset should not have highlighted red rows below, which is not according to the condition specified (3/22/2017 is before 5/10/2017).

IDSessionBeginDateSessionEndDate
00000002002/22/20172/12/2018
00000002002/22/20173/22/2017
00000002002/22/20172/22/2017
00000002002/22/20172/12/2018
00000002002/22/20173/22/2017
00000002002/22/20172/12/2018
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002005/10/2017 
00000002002/22/20172/12/2018
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002005/10/2017 
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002002/22/20172/12/2018
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002005/10/2017 
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002003/21/2018 
Trusted Advisor
Posts: 1,250

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

[ Edited ]

@d0816 wrote:

This code is giving me all observations for unmatched ID as well.

 

And also where condition did not work. For example one matched ID has "Date" as 3/22/2017 in Dataset x. The resulting dataset should not have highlighted red rows below, which is not according to the condition specified (3/22/2017 is before 5/10/2017).



To avoid any misunderstandings, let me post the complete PROC SQL step that my suggested WHERE condition was meant to be part of:

proc sql;
create table want as select y.*
from dataset1 x, dataset2 y
where x.id=y.id and
(.z<y.Sessionbegindate<=x.date<=y.Sessionenddate
or
x.date>=y.Sessionbegindate>.z & y.Sessionenddate is missing);
quit;

Do you get 

  • "observations for unmatched ID" or
  • observations where (non-missing) x.date is before y.SessionBeginDate

with the above code? I don't think so.

Contributor
Posts: 38

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Posted in reply to FreelanceReinhard

In the resulting dataset what I want to do is:

First, in Dataset 2 (y), keep only those IDs that are in Dataset 1 (x). and remove all IDs that are not in Dataset 1.

Then Second, apply the date condition for a matched/common IDs.

 

To your question, Do you get 

  • "observations for unmatched ID" or
  • observations where (non-missing) x.date is before y.SessionBeginDate

I am getting both.

Trusted Advisor
Posts: 1,250

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset


@d0816 wrote:

First, in Dataset 2 (y), keep only those IDs that are in Dataset 1 (x). and remove all IDs that are not in Dataset 1.

 


This is covered by the first part of the WHERE condition, x.id=y.id.


@d0816 wrote:

Then Second, apply the date condition for a matched/common IDs.

 


This is covered by the second part of the WHERE condition.

 


@d0816 wrote:

To your question, Do you get 

  • "observations for unmatched ID" or
  • observations where (non-missing) x.date is before y.SessionBeginDate

I am getting both.


I'm still not convinced. Can you please provide some evidence to support this claim? Test datasets with a few observations should be sufficient.

Contributor
Posts: 38

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Posted in reply to FreelanceReinhard
Let me work on the test dataset that I can share.


Solution
2 weeks ago
Trusted Advisor
Posts: 1,250

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Posted in reply to FreelanceReinhard

Example providing evidence that the code does work as intended:

data dataset1;
length ID $10;
input ID Date :yymmdd10.;
format Date yymmdd10.;
cards;
0000000001 2017-03-22
0000000002 2017-05-01
0000000003 2018-01-01
0000000004 2018-05-01
;

data dataset2;
length ID $10;
infile cards missover;
input ID SessionCode $ (SessionBeginDate SessionEndDate) (:yymmdd10.);
format _numeric_ yymmdd10.;
cards;
0000000001 A 2017-05-10
0000000001 B 2017-05-10 2017-12-31
0000000002 C 2017-03-22
0000000002 D 2017-03-22 2017-12-31
0000000003 E 2018-01-01
0000000003 F 2017-03-22 2017-12-31
0000000003 G 2017-12-01 2018-01-01
0000000005 H 2017-03-22
0000000005 I 2017-03-22 2017-12-31
;

proc sql;
create table want as select y.*
from dataset1 x, dataset2 y
where x.ID=y.ID and
(.z<y.SessionBeginDate<=x.date<=y.SessionEndDate
or
x.Date>=y.SessionBeginDate>.z & y.SessionEndDate is missing);
quit;

Result:

                     Session     Session         Session
Obs        ID         Code      BeginDate        EndDate

 1     0000000002       C       2017-03-22             .
 2     0000000002       D       2017-03-22    2017-12-31
 3     0000000003       E       2018-01-01             .
 4     0000000003       G       2017-12-01    2018-01-01

(SessionCode values have been chosen to serve as identifiers for observations in dataset2.)

 

  • Matching observations A and B were correctly not selected, because 2017-03-22 was before 2017-05-10.
  • Matching observation C was correctly selected, because 2017-05-01 was after 2017-03-22 and SessionEndDate is missing.
  • Matching observation D was correctly selected, because 2017-05-01 was between 2017-03-22 and 2017-12-31.
  • Matching observation E was correctly selected, because 2018-01-01 was on SessionBeginDate.
  • Matching observation F was correctly not selected, because 2018-01-01 was after 2017-12-31.
  • Matching observation G was correctly selected, because 2018-01-01 was on SessionEndDate.
  • Non-matching observations H and I were correctly not selected.

 

 

Please note that duplicate IDs in dataset1 (if any) satisfying the WHERE condition would cause duplicates in dataset WANT, but not incorrectly selected observations.

Contributor
Posts: 38

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Posted in reply to FreelanceReinhard

Thanks so much for providing this explanation.

 

I looked further into my dataset 1 (x), the list of unique IDs and Date. I found that there was one ID (JUST ONE) that was duplicated. In the sample dataset, I had not included all the variables in the dataset 1. But the duplication of the ID occurred due to these other variables. This duplication may have caused the code to not work for me, which worked in the explanation you gave. This duplication may be presenting difficulty for the subsetting of dataset 2 with the code.

IDDateCompanyExam type
00000003332017-06-05AIVA
00000003332017-10-10BGA
Contributor
Posts: 38

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

I ran code after removing that one duplicated ID and the code worked for unduplicated IDs and Date in Dataset 1.

 

Thanks so very much.

Trusted Advisor
Posts: 1,250

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset


@d0816 wrote:

Thanks so much for providing this explanation.

 

I looked further into my dataset 1 (x), the list of unique IDs and Date. I found that there was one ID (JUST ONE) that was duplicated. In the sample dataset, I had not included all the variables in the dataset 1. But the duplication of the ID occurred due to these other variables. This duplication may have caused the code to not work for me, which worked in the explanation you gave. This duplication may be presenting difficulty for the subsetting of dataset 2 with the code.

ID Date Company Exam type
0000000333 2017-06-05 A IVA
0000000333 2017-10-10 B GA

I still think that duplicate IDs in dataset1 have the effect described in my previous post.

 

Let's take a look at an example:

data dataset1;
length ID $10;
input ID Date :yymmdd10.;
format Date yymmdd10.;
cards;
0000000333 2017-06-05
0000000333 2017-10-10
;

data dataset2;
length ID $10;
infile cards missover;
input ID SessionCode $ (SessionBeginDate SessionEndDate) (:yymmdd10.);
format _numeric_ yymmdd10.;
cards;
0000000333 J 2017-07-08
0000000333 K 2017-10-10
0000000333 L 2017-03-22 2017-09-30
0000000333 M 2017-09-22 2017-10-30
0000000333 N 2017-03-22 2017-12-31
0000000333 O 2017-01-01 2017-03-31
0000000333 P 2018-01-01 2018-03-31
;

(PROC SQL step unchanged.)

 

Result:

                     Session     Session         Session
Obs        ID         Code      BeginDate        EndDate

 1     0000000333       J       2017-07-08             .
 2     0000000333       K       2017-10-10             .
 3     0000000333       L       2017-03-22    2017-09-30
 4     0000000333       M       2017-09-22    2017-10-30
 5     0000000333       N       2017-03-22    2017-12-31
 6     0000000333       N       2017-03-22    2017-12-31

 

  • Observation J was correctly selected once, because 2017-10-10 (but not 2017-06-05) was after 2017-07-08 and SessionEndDate is missing.
  • Observation K was correctly selected once, because 2017-10-10 was on SessionBeginDate (whereas 2017-06-05 was before this date).
  • Observation L was correctly selected once, because 2017-06-05 (but not 2017-10-10) was between 2017-03-22 and 2017-09-30.
  • Observation M was correctly selected once, because 2017-10-10 (but not 2017-06-05) was between 2017-09-22 and 2017-10-30.
  • Observation N was "correctly" selected twice, because both 2017-06-05 and 2017-10-10 were between 2017-03-22 and 2017-12-31.
  • Observation O was correctly not selected, because neither 2017-06-05 nor 2017-10-10 was between 2017-01-01 and 2017-03-31.
  • Observation P was correctly not selected, because neither 2017-06-05 nor 2017-10-10 was between 2018-01-01 and 2018-03-31.

So, I don't see any incorrect selections. The duplicate record(s) could be suppressed easily by using the distinct keyword after "select". Additional variables in dataset1 should not interfere with the selection, as they are disregarded.

 

(Sorry, I have to call it a day now. It's close to midnight in my time zone.)

 

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 20 replies
  • 151 views
  • 0 likes
  • 3 in conversation