I have two datasets x and y.
I want to subset dataset 2 (y) based on the ID match and condition that if Date in dataset 1 (x) is at or in between Session Begin and End Date of the dataset y then keep those observation in Dataset y. Where End date is blank (because "End" event has not occurred)
, then apply condition that if Date in dataset x is at or after End Date in dataset y then also keep these observation in Dataset y. Any suggestion to correct the code or any other method to do this?
Code I used:
Proc SQL;
Create Table want as select y.*
from dataset1 x, dataset2 y
where x.ID=y.ID and
((x.Date between y.SessionBeginDate and y.SessionEndDate) /* Do this when both Begin date and End date is available*/
or
x.Date GE y.SessionBeginDate);/* Do this When only Begin date is available but End date is blank because "End" event has not occurred*/
quit;
Dataset I have:
Dataset 1: x | |
ID | Date |
0000000101 | 2017-03-22 |
0000000104 | 2017-06-20 |
0000000105 | 2017-05-17 |
0000000106 | 2017-06-20 |
0000000108 | 2017-10-12 |
0000000109 | 2017-11-01 |
0000000110 | 2017-11-08 |
0000000111 | 2017-08-01 |
0000000112 | 2017-06-14 |
0000000113 | 2017-09-19 |
0000000114 | 2017-07-23 |
Dataset 2: y | |||
ID | SessionCode | SessionBeginDate | SessionEndDate |
0000000101 | E | 2018-03-26 | 2018-04-09 |
0000000101 | W | 2018-04-09 | |
0000000102 | C | 2018-01-31 | 2018-05-30 |
0000000102 | M | 2018-03-05 | 2018-05-30 |
0000000102 | S | 2018-01-31 | 2018-05-30 |
0000000103 | U | 2016-10-18 | |
0000000103 | W | 2018-03-23 | |
0000000103 | C | 2016-10-19 | |
0000000103 | M | 2016-10-19 | |
0000000103 | S | 2016-10-26 | |
0000000104 | J | 2017-04-28 | |
0000000104 | F | 2017-04-28 | |
0000000105 | J | 2017-12-11 | 2018-04-30 |
0000000105 | F | 2017-12-11 | 2018-04-30 |
0000000106 | M | 2018-03-13 | |
0000000106 | S | 2015-05-14 | |
0000000106 | W | 2018-03-19 | 2018-05-22 |
0000000107 | M | 2017-01-18 | |
0000000107 | S | 2017-01-17 | |
0000000108 | M | 2017-12-18 | |
0000000108 | S | 2017-10-03 | |
0000000109 | J | 2017-09-22 | |
0000000109 | S | 2016-01-04 | |
0000000109 | F | 2017-09-22 | |
0000000110 | C | 2017-08-28 | |
0000000110 | M | 2017-08-28 | |
0000000110 | S | 2013-02-14 | |
0000000111 | C | 2017-11-13 | 2018-04-17 |
0000000111 | C | 2018-04-18 | |
0000000111 | M | 2017-11-13 | 2018-04-17 |
0000000111 | M | 2018-04-18 | |
0000000111 | S | 2017-11-13 | 2018-04-17 |
0000000111 | S | 2018-04-18 | |
0000000112 | M | 2017-08-09 | 2018-04-05 |
0000000112 | S | 2017-07-31 | 2018-04-05 |
0000000113 | A | 2018-04-17 | |
0000000113 | M | 2018-04-17 | |
0000000113 | C | 2017-09-19 | 2018-04-16 |
0000000113 | C | 2017-09-19 | 2018-04-16 |
0000000113 | M | 2017-09-19 | 2018-04-16 |
0000000113 | M | 2018-04-17 | |
0000000113 | S | 2016-09-20 | |
0000000114 | M | 2018-02-19 | |
0000000114 | S | 2017-07-07 | |
0000000115 | E | 2018-03-29 | 2018-05-15 |
0000000115 | J | 2018-03-30 | |
0000000115 | P | 2018-03-30 | 2018-05-14 |
0000000115 | E | 2018-03-29 | 2018-05-15 |
0000000115 | J | 2018-03-30 | |
0000000115 | P | 2018-03-30 | 2018-05-14 |
Result: for example lets say for matched ID 0000000106, 0000000110 and 0000000113, obs I want to keep in Dataset 2 (Y) looks like:
ID | SessionCode | SessionBeginDate | SessionEndDate |
0000000106 | S | 2015-05-14 | |
0000000110 | C | 2017-08-28 | |
0000000110 | M | 2017-08-28 | |
0000000110 | S | 2013-02-14 | |
0000000113 | C | 2017-09-19 | 2018-04-16 |
0000000113 | C | 2017-09-19 | 2018-04-16 |
0000000113 | M | 2017-09-19 | 2018-04-16 |
0000000113 | S | 2016-09-20 |
Example providing evidence that the code does work as intended:
data dataset1;
length ID $10;
input ID Date :yymmdd10.;
format Date yymmdd10.;
cards;
0000000001 2017-03-22
0000000002 2017-05-01
0000000003 2018-01-01
0000000004 2018-05-01
;
data dataset2;
length ID $10;
infile cards missover;
input ID SessionCode $ (SessionBeginDate SessionEndDate) (:yymmdd10.);
format _numeric_ yymmdd10.;
cards;
0000000001 A 2017-05-10
0000000001 B 2017-05-10 2017-12-31
0000000002 C 2017-03-22
0000000002 D 2017-03-22 2017-12-31
0000000003 E 2018-01-01
0000000003 F 2017-03-22 2017-12-31
0000000003 G 2017-12-01 2018-01-01
0000000005 H 2017-03-22
0000000005 I 2017-03-22 2017-12-31
;
proc sql;
create table want as select y.*
from dataset1 x, dataset2 y
where x.ID=y.ID and
(.z<y.SessionBeginDate<=x.date<=y.SessionEndDate
or
x.Date>=y.SessionBeginDate>.z & y.SessionEndDate is missing);
quit;
Result:
Session Session Session Obs ID Code BeginDate EndDate 1 0000000002 C 2017-03-22 . 2 0000000002 D 2017-03-22 2017-12-31 3 0000000003 E 2018-01-01 . 4 0000000003 G 2017-12-01 2018-01-01
(SessionCode values have been chosen to serve as identifiers for observations in dataset2.)
Please note that duplicate IDs in dataset1 (if any) satisfying the WHERE condition would cause duplicates in dataset WANT, but not incorrectly selected observations.
What is your issue with the code your using? Are those dates in Character or Numeric?
You can change that to LEFT Joins.
Proc SQL;
Create Table want as
select y.*
from dataset2 y
left join dataset1 x
on x.ID=y.ID
Where (x.Date between y.SessionBeginDate and y.SessionEndDate)
or (x.Date GE y.SessionEndDate);
quit;
In my code where condition is not working, basically it is returning me all observations for a matched ID.
@d0816 wrote:
In my code where condition is not working, basically it is returning me all observations for a matched ID.
... which is not surprising, as the condition
x.Date GE y.SessionEndDate
will always be met if y.SessionEndDate is missing and x.Date is non-missing (and even in many cases with missing x.Date).
My mistake I meant to write x.Date GE y.SessionBeginDate which is what I actually used.
Proc SQL;
Create Table want as select y.*
from dataset1 x, dataset2 y
where x.ID=y.ID and
((x.Date between y.SessionBeginDate and y.SessionEndDate) /* Do this when both Begin date and End date is available*/
or
x.Date GE y.SessionBeginDate);/* Do this When only Begin date is available but End date is blank because "End" event has not occurred*/
quit;
Good. This makes more sense.
Then I think, all it takes is to include the criteria about missing values in the WHERE condition, e.g.
where x.id=y.id and
(.z<y.Sessionbegindate<=x.date<=y.Sessionenddate
or
x.date>=y.Sessionbegindate>.z & y.Sessionenddate is missing);
(using the special missing value .z)
This code is giving me all observations for unmatched ID as well.
And also where condition did not work. For example one matched ID has "Date" as 3/22/2017 in Dataset x. The resulting dataset should not have highlighted red rows below, which is not according to the condition specified (3/22/2017 is before 5/10/2017).
ID | SessionBeginDate | SessionEndDate |
0000000200 | 2/22/2017 | 2/12/2018 |
0000000200 | 2/22/2017 | 3/22/2017 |
0000000200 | 2/22/2017 | 2/22/2017 |
0000000200 | 2/22/2017 | 2/12/2018 |
0000000200 | 2/22/2017 | 3/22/2017 |
0000000200 | 2/22/2017 | 2/12/2018 |
0000000200 | 2/22/2017 | 2/12/2018 |
0000000200 | 5/10/2017 | |
0000000200 | 5/10/2017 | |
0000000200 | 2/22/2017 | 2/12/2018 |
0000000200 | 2/22/2017 | 2/12/2018 |
0000000200 | 5/10/2017 | |
0000000200 | 5/10/2017 | |
0000000200 | 2/22/2017 | 2/12/2018 |
0000000200 | 5/10/2017 | |
0000000200 | 2/22/2017 | 2/12/2018 |
0000000200 | 5/10/2017 | |
0000000200 | 2/22/2017 | 2/12/2018 |
0000000200 | 2/22/2017 | 2/12/2018 |
0000000200 | 5/10/2017 | |
0000000200 | 5/10/2017 | |
0000000200 | 2/22/2017 | 2/12/2018 |
0000000200 | 5/10/2017 | |
0000000200 | 2/22/2017 | 2/12/2018 |
0000000200 | 5/10/2017 | |
0000000200 | 2/22/2017 | 2/12/2018 |
0000000200 | 5/10/2017 | |
0000000200 | 3/21/2018 |
@d0816 wrote:
This code is giving me all observations for unmatched ID as well.
And also where condition did not work. For example one matched ID has "Date" as 3/22/2017 in Dataset x. The resulting dataset should not have highlighted red rows below, which is not according to the condition specified (3/22/2017 is before 5/10/2017).
To avoid any misunderstandings, let me post the complete PROC SQL step that my suggested WHERE condition was meant to be part of:
proc sql;
create table want as select y.*
from dataset1 x, dataset2 y
where x.id=y.id and
(.z<y.Sessionbegindate<=x.date<=y.Sessionenddate
or
x.date>=y.Sessionbegindate>.z & y.Sessionenddate is missing);
quit;
Do you get
with the above code? I don't think so.
In the resulting dataset what I want to do is:
First, in Dataset 2 (y), keep only those IDs that are in Dataset 1 (x). and remove all IDs that are not in Dataset 1.
Then Second, apply the date condition for a matched/common IDs.
To your question, Do you get
I am getting both.
@d0816 wrote:
First, in Dataset 2 (y), keep only those IDs that are in Dataset 1 (x). and remove all IDs that are not in Dataset 1.
This is covered by the first part of the WHERE condition, x.id=y.id.
@d0816 wrote:
Then Second, apply the date condition for a matched/common IDs.
This is covered by the second part of the WHERE condition.
@d0816 wrote:
To your question, Do you get
- "observations for unmatched ID" or
- observations where (non-missing) x.date is before y.SessionBeginDate
I am getting both.
I'm still not convinced. Can you please provide some evidence to support this claim? Test datasets with a few observations should be sufficient.
Example providing evidence that the code does work as intended:
data dataset1;
length ID $10;
input ID Date :yymmdd10.;
format Date yymmdd10.;
cards;
0000000001 2017-03-22
0000000002 2017-05-01
0000000003 2018-01-01
0000000004 2018-05-01
;
data dataset2;
length ID $10;
infile cards missover;
input ID SessionCode $ (SessionBeginDate SessionEndDate) (:yymmdd10.);
format _numeric_ yymmdd10.;
cards;
0000000001 A 2017-05-10
0000000001 B 2017-05-10 2017-12-31
0000000002 C 2017-03-22
0000000002 D 2017-03-22 2017-12-31
0000000003 E 2018-01-01
0000000003 F 2017-03-22 2017-12-31
0000000003 G 2017-12-01 2018-01-01
0000000005 H 2017-03-22
0000000005 I 2017-03-22 2017-12-31
;
proc sql;
create table want as select y.*
from dataset1 x, dataset2 y
where x.ID=y.ID and
(.z<y.SessionBeginDate<=x.date<=y.SessionEndDate
or
x.Date>=y.SessionBeginDate>.z & y.SessionEndDate is missing);
quit;
Result:
Session Session Session Obs ID Code BeginDate EndDate 1 0000000002 C 2017-03-22 . 2 0000000002 D 2017-03-22 2017-12-31 3 0000000003 E 2018-01-01 . 4 0000000003 G 2017-12-01 2018-01-01
(SessionCode values have been chosen to serve as identifiers for observations in dataset2.)
Please note that duplicate IDs in dataset1 (if any) satisfying the WHERE condition would cause duplicates in dataset WANT, but not incorrectly selected observations.
Thanks so much for providing this explanation.
I looked further into my dataset 1 (x), the list of unique IDs and Date. I found that there was one ID (JUST ONE) that was duplicated. In the sample dataset, I had not included all the variables in the dataset 1. But the duplication of the ID occurred due to these other variables. This duplication may have caused the code to not work for me, which worked in the explanation you gave. This duplication may be presenting difficulty for the subsetting of dataset 2 with the code.
ID | Date | Company | Exam type |
0000000333 | 2017-06-05 | A | IVA |
0000000333 | 2017-10-10 | B | GA |
I ran code after removing that one duplicated ID and the code worked for unduplicated IDs and Date in Dataset 1.
Thanks so very much.
@d0816 wrote:
Thanks so much for providing this explanation.
I looked further into my dataset 1 (x), the list of unique IDs and Date. I found that there was one ID (JUST ONE) that was duplicated. In the sample dataset, I had not included all the variables in the dataset 1. But the duplication of the ID occurred due to these other variables. This duplication may have caused the code to not work for me, which worked in the explanation you gave. This duplication may be presenting difficulty for the subsetting of dataset 2 with the code.
ID Date Company Exam type 0000000333 2017-06-05 A IVA 0000000333 2017-10-10 B GA
I still think that duplicate IDs in dataset1 have the effect described in my previous post.
Let's take a look at an example:
data dataset1;
length ID $10;
input ID Date :yymmdd10.;
format Date yymmdd10.;
cards;
0000000333 2017-06-05
0000000333 2017-10-10
;
data dataset2;
length ID $10;
infile cards missover;
input ID SessionCode $ (SessionBeginDate SessionEndDate) (:yymmdd10.);
format _numeric_ yymmdd10.;
cards;
0000000333 J 2017-07-08
0000000333 K 2017-10-10
0000000333 L 2017-03-22 2017-09-30
0000000333 M 2017-09-22 2017-10-30
0000000333 N 2017-03-22 2017-12-31
0000000333 O 2017-01-01 2017-03-31
0000000333 P 2018-01-01 2018-03-31
;
(PROC SQL step unchanged.)
Result:
Session Session Session Obs ID Code BeginDate EndDate 1 0000000333 J 2017-07-08 . 2 0000000333 K 2017-10-10 . 3 0000000333 L 2017-03-22 2017-09-30 4 0000000333 M 2017-09-22 2017-10-30 5 0000000333 N 2017-03-22 2017-12-31 6 0000000333 N 2017-03-22 2017-12-31
So, I don't see any incorrect selections. The duplicate record(s) could be suppressed easily by using the distinct keyword after "select". Additional variables in dataset1 should not interfere with the selection, as they are disregarded.
(Sorry, I have to call it a day now. It's close to midnight in my time zone.)
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.