Solved: Re: Keep obs in one dataset based on ID match and other variables matc...

d0816 · Posted 07-02-2018 11:56 AM

I have two datasets x and y.

I want to subset dataset 2 (y) based on the ID match and condition that if Date in dataset 1 (x) is at or in between Session Begin and End Date of the dataset y then keep those observation in Dataset y. Where End date is blank (because "End" event has not occurred)

, then apply condition that if Date in dataset x is at or after End Date in dataset y then also keep these observation in Dataset y. Any suggestion to correct the code or any other method to do this?

Code I used:

Proc SQL;

Create Table want as select y.*

from dataset1 x, dataset2 y

where x.ID=y.ID and

((x.Date between y.SessionBeginDate and y.SessionEndDate) /* Do this when both Begin date and End date is available*/

or

x.Date GE y.SessionBeginDate);/* Do this When only Begin date is available but End date is blank because "End" event has not occurred*/

quit;

Dataset I have:

Dataset 1: x
ID	Date
0000000101	2017-03-22
0000000104	2017-06-20
0000000105	2017-05-17
0000000106	2017-06-20
0000000108	2017-10-12
0000000109	2017-11-01
0000000110	2017-11-08
0000000111	2017-08-01
0000000112	2017-06-14
0000000113	2017-09-19
0000000114	2017-07-23

Dataset 2: y
ID	SessionCode	SessionBeginDate	SessionEndDate
0000000101	E	2018-03-26	2018-04-09
0000000101	W	2018-04-09
0000000102	C	2018-01-31	2018-05-30
0000000102	M	2018-03-05	2018-05-30
0000000102	S	2018-01-31	2018-05-30
0000000103	U	2016-10-18
0000000103	W	2018-03-23
0000000103	C	2016-10-19
0000000103	M	2016-10-19
0000000103	S	2016-10-26
0000000104	J	2017-04-28
0000000104	F	2017-04-28
0000000105	J	2017-12-11	2018-04-30
0000000105	F	2017-12-11	2018-04-30
0000000106	M	2018-03-13
0000000106	S	2015-05-14
0000000106	W	2018-03-19	2018-05-22
0000000107	M	2017-01-18
0000000107	S	2017-01-17
0000000108	M	2017-12-18
0000000108	S	2017-10-03
0000000109	J	2017-09-22
0000000109	S	2016-01-04
0000000109	F	2017-09-22
0000000110	C	2017-08-28
0000000110	M	2017-08-28
0000000110	S	2013-02-14
0000000111	C	2017-11-13	2018-04-17
0000000111	C	2018-04-18
0000000111	M	2017-11-13	2018-04-17
0000000111	M	2018-04-18
0000000111	S	2017-11-13	2018-04-17
0000000111	S	2018-04-18
0000000112	M	2017-08-09	2018-04-05
0000000112	S	2017-07-31	2018-04-05
0000000113	A	2018-04-17
0000000113	M	2018-04-17
0000000113	C	2017-09-19	2018-04-16
0000000113	C	2017-09-19	2018-04-16
0000000113	M	2017-09-19	2018-04-16
0000000113	M	2018-04-17
0000000113	S	2016-09-20
0000000114	M	2018-02-19
0000000114	S	2017-07-07
0000000115	E	2018-03-29	2018-05-15
0000000115	J	2018-03-30
0000000115	P	2018-03-30	2018-05-14
0000000115	E	2018-03-29	2018-05-15
0000000115	J	2018-03-30
0000000115	P	2018-03-30	2018-05-14

Result: for example lets say for matched ID 0000000106, 0000000110 and 0000000113, obs I want to keep in Dataset 2 (Y) looks like:

ID	SessionCode	SessionBeginDate	SessionEndDate
0000000106	S	2015-05-14
0000000110	C	2017-08-28
0000000110	M	2017-08-28
0000000110	S	2013-02-14
0000000113	C	2017-09-19	2018-04-16
0000000113	C	2017-09-19	2018-04-16
0000000113	M	2017-09-19	2018-04-16
0000000113	S	2016-09-20

FreelanceReinh · Posted 07-02-2018 04:34 PM

Example providing evidence that the code does work as intended:

data dataset1;
length ID $10;
input ID Date :yymmdd10.;
format Date yymmdd10.;
cards;
0000000001 2017-03-22
0000000002 2017-05-01
0000000003 2018-01-01
0000000004 2018-05-01
;

data dataset2;
length ID $10;
infile cards missover;
input ID SessionCode $ (SessionBeginDate SessionEndDate) (:yymmdd10.);
format _numeric_ yymmdd10.;
cards;
0000000001 A 2017-05-10
0000000001 B 2017-05-10 2017-12-31
0000000002 C 2017-03-22
0000000002 D 2017-03-22 2017-12-31
0000000003 E 2018-01-01
0000000003 F 2017-03-22 2017-12-31
0000000003 G 2017-12-01 2018-01-01
0000000005 H 2017-03-22
0000000005 I 2017-03-22 2017-12-31
;

proc sql;
create table want as select y.*
from dataset1 x, dataset2 y
where x.ID=y.ID and
(.z<y.SessionBeginDate<=x.date<=y.SessionEndDate
or
x.Date>=y.SessionBeginDate>.z & y.SessionEndDate is missing);
quit;

Result:

                     Session     Session         Session
Obs        ID         Code      BeginDate        EndDate

 1     0000000002       C       2017-03-22             .
 2     0000000002       D       2017-03-22    2017-12-31
 3     0000000003       E       2018-01-01             .
 4     0000000003       G       2017-12-01    2018-01-01

(SessionCode values have been chosen to serve as identifiers for observations in dataset2.)

Matching observations A and B were correctly not selected, because 2017-03-22 was before 2017-05-10.
Matching observation C was correctly selected, because 2017-05-01 was after 2017-03-22 and SessionEndDate is missing.
Matching observation D was correctly selected, because 2017-05-01 was between 2017-03-22 and 2017-12-31.
Matching observation E was correctly selected, because 2018-01-01 was on SessionBeginDate.
Matching observation F was correctly not selected, because 2018-01-01 was after 2017-12-31.
Matching observation G was correctly selected, because 2018-01-01 was on SessionEndDate.
Non-matching observations H and I were correctly not selected.

Please note that duplicate IDs in dataset1 (if any) satisfying the WHERE condition would cause duplicates in dataset WANT, but not incorrectly selected observations.

View solution in original post

SuryaKiran · Posted 07-02-2018 12:09 PM

What is your issue with the code your using? Are those dates in Character or Numeric?

You can change that to LEFT Joins.

Proc SQL;
Create Table want as 
select y.*
	from dataset2 y 
		left join dataset1 x 
		 on x.ID=y.ID
Where (x.Date between y.SessionBeginDate and y.SessionEndDate) 
	or (x.Date GE y.SessionEndDate);

quit;

Thanks,
Suryakiran

d0816 · Posted 07-02-2018 12:19 PM

In my code where condition is not working, basically it is returning me all observations for a matched ID.

FreelanceReinh · Posted 07-02-2018 12:34 PM

@d0816 wrote:

In my code where condition is not working, basically it is returning me all observations for a matched ID.

... which is not surprising, as the condition

x.Date GE y.SessionEndDate

will always be met if y.SessionEndDate is missing and x.Date is non-missing (and even in many cases with missing x.Date).

d0816 · Posted 07-02-2018 12:39 PM

My mistake I meant to write x.Date GE y.SessionBeginDate which is what I actually used.

Proc SQL;

Create Table want as select y.*

from dataset1 x, dataset2 y

where x.ID=y.ID and

((x.Date between y.SessionBeginDate and y.SessionEndDate) /* Do this when both Begin date and End date is available*/

or

x.Date GE y.SessionBeginDate);/* Do this When only Begin date is available but End date is blank because "End" event has not occurred*/

quit;

FreelanceReinh · Posted 07-02-2018 12:53 PM

Good. This makes more sense.

Then I think, all it takes is to include the criteria about missing values in the WHERE condition, e.g.

where x.id=y.id and
(.z<y.Sessionbegindate<=x.date<=y.Sessionenddate
or
x.date>=y.Sessionbegindate>.z & y.Sessionenddate is missing);

(using the special missing value .z)

d0816 · Posted 07-02-2018 02:07 PM

This code is giving me all observations for unmatched ID as well.

And also where condition did not work. For example one matched ID has "Date" as 3/22/2017 in Dataset x. The resulting dataset should not have highlighted red rows below, which is not according to the condition specified (3/22/2017 is before 5/10/2017).

ID	SessionBeginDate	SessionEndDate
0000000200	2/22/2017	2/12/2018
0000000200	2/22/2017	3/22/2017
0000000200	2/22/2017	2/22/2017
0000000200	2/22/2017	2/12/2018
0000000200	2/22/2017	3/22/2017
0000000200	2/22/2017	2/12/2018
0000000200	2/22/2017	2/12/2018
0000000200	5/10/2017
0000000200	5/10/2017
0000000200	2/22/2017	2/12/2018
0000000200	2/22/2017	2/12/2018
0000000200	5/10/2017
0000000200	5/10/2017
0000000200	2/22/2017	2/12/2018
0000000200	5/10/2017
0000000200	2/22/2017	2/12/2018
0000000200	5/10/2017
0000000200	2/22/2017	2/12/2018
0000000200	2/22/2017	2/12/2018
0000000200	5/10/2017
0000000200	5/10/2017
0000000200	2/22/2017	2/12/2018
0000000200	5/10/2017
0000000200	2/22/2017	2/12/2018
0000000200	5/10/2017
0000000200	2/22/2017	2/12/2018
0000000200	5/10/2017
0000000200	3/21/2018

FreelanceReinh · Posted 07-02-2018 02:48 PM

@d0816 wrote:

This code is giving me all observations for unmatched ID as well.

And also where condition did not work. For example one matched ID has "Date" as 3/22/2017 in Dataset x. The resulting dataset should not have highlighted red rows below, which is not according to the condition specified (3/22/2017 is before 5/10/2017).

To avoid any misunderstandings, let me post the complete PROC SQL step that my suggested WHERE condition was meant to be part of:

proc sql;
create table want as select y.*
from dataset1 x, dataset2 y
where x.id=y.id and
(.z<y.Sessionbegindate<=x.date<=y.Sessionenddate
or
x.date>=y.Sessionbegindate>.z & y.Sessionenddate is missing);
quit;

Do you get

"observations for unmatched ID" or
observations where (non-missing) x.date is before y.SessionBeginDate

with the above code? I don't think so.

d0816 · Posted 07-02-2018 03:13 PM

In the resulting dataset what I want to do is:

First, in Dataset 2 (y), keep only those IDs that are in Dataset 1 (x). and remove all IDs that are not in Dataset 1.

Then Second, apply the date condition for a matched/common IDs.

To your question, Do you get

"observations for unmatched ID" or
observations where (non-missing) x.date is before y.SessionBeginDate

I am getting both.

FreelanceReinh · Posted 07-02-2018 03:38 PM

@d0816 wrote:

First, in Dataset 2 (y), keep only those IDs that are in Dataset 1 (x). and remove all IDs that are not in Dataset 1.

This is covered by the first part of the WHERE condition, x.id=y.id.

@d0816 wrote:

Then Second, apply the date condition for a matched/common IDs.

This is covered by the second part of the WHERE condition.

@d0816 wrote:

To your question, Do you get

"observations for unmatched ID" or

observations where (non-missing) x.date is before y.SessionBeginDate

I am getting both.

I'm still not convinced. Can you please provide some evidence to support this claim? Test datasets with a few observations should be sufficient.

d0816 · Posted 07-02-2018 03:42 PM

Let me work on the test dataset that I can share.

FreelanceReinh · Posted 07-02-2018 04:34 PM

Example providing evidence that the code does work as intended:

data dataset1;
length ID $10;
input ID Date :yymmdd10.;
format Date yymmdd10.;
cards;
0000000001 2017-03-22
0000000002 2017-05-01
0000000003 2018-01-01
0000000004 2018-05-01
;

data dataset2;
length ID $10;
infile cards missover;
input ID SessionCode $ (SessionBeginDate SessionEndDate) (:yymmdd10.);
format _numeric_ yymmdd10.;
cards;
0000000001 A 2017-05-10
0000000001 B 2017-05-10 2017-12-31
0000000002 C 2017-03-22
0000000002 D 2017-03-22 2017-12-31
0000000003 E 2018-01-01
0000000003 F 2017-03-22 2017-12-31
0000000003 G 2017-12-01 2018-01-01
0000000005 H 2017-03-22
0000000005 I 2017-03-22 2017-12-31
;

proc sql;
create table want as select y.*
from dataset1 x, dataset2 y
where x.ID=y.ID and
(.z<y.SessionBeginDate<=x.date<=y.SessionEndDate
or
x.Date>=y.SessionBeginDate>.z & y.SessionEndDate is missing);
quit;

Result:

                     Session     Session         Session
Obs        ID         Code      BeginDate        EndDate

 1     0000000002       C       2017-03-22             .
 2     0000000002       D       2017-03-22    2017-12-31
 3     0000000003       E       2018-01-01             .
 4     0000000003       G       2017-12-01    2018-01-01

(SessionCode values have been chosen to serve as identifiers for observations in dataset2.)

Matching observations A and B were correctly not selected, because 2017-03-22 was before 2017-05-10.
Matching observation C was correctly selected, because 2017-05-01 was after 2017-03-22 and SessionEndDate is missing.
Matching observation D was correctly selected, because 2017-05-01 was between 2017-03-22 and 2017-12-31.
Matching observation E was correctly selected, because 2018-01-01 was on SessionBeginDate.
Matching observation F was correctly not selected, because 2018-01-01 was after 2017-12-31.
Matching observation G was correctly selected, because 2018-01-01 was on SessionEndDate.
Non-matching observations H and I were correctly not selected.

Please note that duplicate IDs in dataset1 (if any) satisfying the WHERE condition would cause duplicates in dataset WANT, but not incorrectly selected observations.

d0816 · Posted 07-02-2018 04:59 PM

Thanks so much for providing this explanation.

I looked further into my dataset 1 (x), the list of unique IDs and Date. I found that there was one ID (JUST ONE) that was duplicated. In the sample dataset, I had not included all the variables in the dataset 1. But the duplication of the ID occurred due to these other variables. This duplication may have caused the code to not work for me, which worked in the explanation you gave. This duplication may be presenting difficulty for the subsetting of dataset 2 with the code.

ID	Date	Company	Exam type
0000000333	2017-06-05	A	IVA
0000000333	2017-10-10	B	GA

d0816 · Posted 07-02-2018 05:17 PM

I ran code after removing that one duplicated ID and the code worked for unduplicated IDs and Date in Dataset 1.

Thanks so very much.

FreelanceReinh · Posted 07-02-2018 05:49 PM

@d0816 wrote:

Thanks so much for providing this explanation.

I looked further into my dataset 1 (x), the list of unique IDs and Date. I found that there was one ID (JUST ONE) that was duplicated. In the sample dataset, I had not included all the variables in the dataset 1. But the duplication of the ID occurred due to these other variables. This duplication may have caused the code to not work for me, which worked in the explanation you gave. This duplication may be presenting difficulty for the subsetting of dataset 2 with the code.

ID Date Company Exam type

0000000333 2017-06-05 A IVA

0000000333 2017-10-10 B GA

I still think that duplicate IDs in dataset1 have the effect described in my previous post.

Let's take a look at an example:

data dataset1;
length ID $10;
input ID Date :yymmdd10.;
format Date yymmdd10.;
cards;
0000000333 2017-06-05
0000000333 2017-10-10
;

data dataset2;
length ID $10;
infile cards missover;
input ID SessionCode $ (SessionBeginDate SessionEndDate) (:yymmdd10.);
format _numeric_ yymmdd10.;
cards;
0000000333 J 2017-07-08
0000000333 K 2017-10-10
0000000333 L 2017-03-22 2017-09-30
0000000333 M 2017-09-22 2017-10-30
0000000333 N 2017-03-22 2017-12-31
0000000333 O 2017-01-01 2017-03-31
0000000333 P 2018-01-01 2018-03-31
;

(PROC SQL step unchanged.)

Result:

                     Session     Session         Session
Obs        ID         Code      BeginDate        EndDate

 1     0000000333       J       2017-07-08             .
 2     0000000333       K       2017-10-10             .
 3     0000000333       L       2017-03-22    2017-09-30
 4     0000000333       M       2017-09-22    2017-10-30
 5     0000000333       N       2017-03-22    2017-12-31
 6     0000000333       N       2017-03-22    2017-12-31

Observation J was correctly selected once, because 2017-10-10 (but not 2017-06-05) was after 2017-07-08 and SessionEndDate is missing.
Observation K was correctly selected once, because 2017-10-10 was on SessionBeginDate (whereas 2017-06-05 was before this date).
Observation L was correctly selected once, because 2017-06-05 (but not 2017-10-10) was between 2017-03-22 and 2017-09-30.
Observation M was correctly selected once, because 2017-10-10 (but not 2017-06-05) was between 2017-09-22 and 2017-10-30.
Observation N was "correctly" selected twice, because both 2017-06-05 and 2017-10-10 were between 2017-03-22 and 2017-12-31.
Observation O was correctly not selected, because neither 2017-06-05 nor 2017-10-10 was between 2017-01-01 and 2017-03-31.
Observation P was correctly not selected, because neither 2017-06-05 nor 2017-10-10 was between 2018-01-01 and 2018-03-31.

So, I don't see any incorrect selections. The duplicate record(s) could be suppressed easily by using the distinct keyword after "select". Additional variables in dataset1 should not interfere with the selection, as they are disregarded.

(Sorry, I have to call it a day now. It's close to midnight in my time zone.)

Keep obs in one dataset based on ID match and other variables match condition in another dataset

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Re: Keep obs in one dataset based on ID match and other variables match condition in another dataset

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away