BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
d0816
Quartz | Level 8

I have two datasets x and y.

 

I want to subset dataset 2 (y) based on the ID match and condition that if Date in dataset 1 (x) is at or in between Session Begin and End Date of the dataset y then keep those observation in Dataset y. Where End date is blank (because "End" event has not occurred)

, then apply condition that if Date in dataset x is at or after End Date in dataset y then also keep these observation in Dataset y. Any suggestion to correct the code or any other method to do this?

 

Code I used:

Proc SQL;

Create Table want as select y.*

from dataset1 x, dataset2 y

where x.ID=y.ID and

((x.Date between y.SessionBeginDate and y.SessionEndDate) /* Do this when both Begin date and End date is available*/

or

x.Date GE y.SessionBeginDate);/* Do this When only Begin date is available but End date is blank because "End" event has not occurred*/

quit;

 

Dataset I have:

Dataset 1: x 
IDDate
00000001012017-03-22
00000001042017-06-20
00000001052017-05-17
00000001062017-06-20
00000001082017-10-12
00000001092017-11-01
00000001102017-11-08
00000001112017-08-01
00000001122017-06-14
00000001132017-09-19
00000001142017-07-23

 

 

Dataset 2: y   
IDSessionCodeSessionBeginDateSessionEndDate
0000000101E2018-03-262018-04-09
0000000101W2018-04-09 
0000000102C2018-01-312018-05-30
0000000102M2018-03-052018-05-30
0000000102S2018-01-312018-05-30
0000000103U2016-10-18 
0000000103W2018-03-23 
0000000103C2016-10-19 
0000000103M2016-10-19 
0000000103S2016-10-26 
0000000104J2017-04-28 
0000000104F2017-04-28 
0000000105J2017-12-112018-04-30
0000000105F2017-12-112018-04-30
0000000106M2018-03-13 
0000000106S2015-05-14 
0000000106W2018-03-192018-05-22
0000000107M2017-01-18 
0000000107S2017-01-17 
0000000108M2017-12-18 
0000000108S2017-10-03 
0000000109J2017-09-22 
0000000109S2016-01-04 
0000000109F2017-09-22 
0000000110C2017-08-28 
0000000110M2017-08-28 
0000000110S2013-02-14 
0000000111C2017-11-132018-04-17
0000000111C2018-04-18 
0000000111M2017-11-132018-04-17
0000000111M2018-04-18 
0000000111S2017-11-132018-04-17
0000000111S2018-04-18 
0000000112M2017-08-092018-04-05
0000000112S2017-07-312018-04-05
0000000113A2018-04-17 
0000000113M2018-04-17 
0000000113C2017-09-192018-04-16
0000000113C2017-09-192018-04-16
0000000113M2017-09-192018-04-16
0000000113M2018-04-17 
0000000113S2016-09-20 
0000000114M2018-02-19 
0000000114S2017-07-07 
0000000115E2018-03-292018-05-15
0000000115J2018-03-30 
0000000115P2018-03-302018-05-14
0000000115E2018-03-292018-05-15
0000000115J2018-03-30 
0000000115P2018-03-302018-05-14

 

 

Result: for example lets say for matched ID  0000000106, 0000000110 and 0000000113, obs I want to keep in Dataset 2 (Y) looks like:

IDSessionCodeSessionBeginDateSessionEndDate
0000000106S2015-05-14 
0000000110C2017-08-28 
0000000110M2017-08-28 
0000000110S2013-02-14 
0000000113C2017-09-192018-04-16
0000000113C2017-09-192018-04-16
0000000113M2017-09-192018-04-16
0000000113S2016-09-20 
1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Example providing evidence that the code does work as intended:

data dataset1;
length ID $10;
input ID Date :yymmdd10.;
format Date yymmdd10.;
cards;
0000000001 2017-03-22
0000000002 2017-05-01
0000000003 2018-01-01
0000000004 2018-05-01
;

data dataset2;
length ID $10;
infile cards missover;
input ID SessionCode $ (SessionBeginDate SessionEndDate) (:yymmdd10.);
format _numeric_ yymmdd10.;
cards;
0000000001 A 2017-05-10
0000000001 B 2017-05-10 2017-12-31
0000000002 C 2017-03-22
0000000002 D 2017-03-22 2017-12-31
0000000003 E 2018-01-01
0000000003 F 2017-03-22 2017-12-31
0000000003 G 2017-12-01 2018-01-01
0000000005 H 2017-03-22
0000000005 I 2017-03-22 2017-12-31
;

proc sql;
create table want as select y.*
from dataset1 x, dataset2 y
where x.ID=y.ID and
(.z<y.SessionBeginDate<=x.date<=y.SessionEndDate
or
x.Date>=y.SessionBeginDate>.z & y.SessionEndDate is missing);
quit;

Result:

                     Session     Session         Session
Obs        ID         Code      BeginDate        EndDate

 1     0000000002       C       2017-03-22             .
 2     0000000002       D       2017-03-22    2017-12-31
 3     0000000003       E       2018-01-01             .
 4     0000000003       G       2017-12-01    2018-01-01

(SessionCode values have been chosen to serve as identifiers for observations in dataset2.)

 

  • Matching observations A and B were correctly not selected, because 2017-03-22 was before 2017-05-10.
  • Matching observation C was correctly selected, because 2017-05-01 was after 2017-03-22 and SessionEndDate is missing.
  • Matching observation D was correctly selected, because 2017-05-01 was between 2017-03-22 and 2017-12-31.
  • Matching observation E was correctly selected, because 2018-01-01 was on SessionBeginDate.
  • Matching observation F was correctly not selected, because 2018-01-01 was after 2017-12-31.
  • Matching observation G was correctly selected, because 2018-01-01 was on SessionEndDate.
  • Non-matching observations H and I were correctly not selected.

 

 

Please note that duplicate IDs in dataset1 (if any) satisfying the WHERE condition would cause duplicates in dataset WANT, but not incorrectly selected observations.

View solution in original post

20 REPLIES 20
SuryaKiran
Meteorite | Level 14

What is your issue with the code your using? Are those dates in Character or Numeric?

 

You can change that to LEFT Joins.

 

Proc SQL;
Create Table want as 
select y.*
	from dataset2 y 
		left join dataset1 x 
		 on x.ID=y.ID
Where (x.Date between y.SessionBeginDate and y.SessionEndDate) 
	or (x.Date GE y.SessionEndDate);

quit;
Thanks,
Suryakiran
d0816
Quartz | Level 8

In my code where condition is not working, basically it is returning me all observations for a matched ID.

FreelanceReinh
Jade | Level 19

@d0816 wrote:

In my code where condition is not working, basically it is returning me all observations for a matched ID.


... which is not surprising, as the condition

x.Date GE y.SessionEndDate

will always be met if y.SessionEndDate is missing and x.Date is non-missing (and even in many cases with missing x.Date).

d0816
Quartz | Level 8

My mistake I meant to write x.Date GE y.SessionBeginDate which is what I actually used.

Proc SQL;

Create Table want as select y.*

from dataset1 x, dataset2 y

where x.ID=y.ID and

((x.Date between y.SessionBeginDate and y.SessionEndDate) /* Do this when both Begin date and End date is available*/

or

x.Date GE y.SessionBeginDate);/* Do this When only Begin date is available but End date is blank because "End" event has not occurred*/

quit;

FreelanceReinh
Jade | Level 19

Good. This makes more sense.

 

Then I think, all it takes is to include the criteria about missing values in the WHERE condition, e.g.

where x.id=y.id and
(.z<y.Sessionbegindate<=x.date<=y.Sessionenddate
or
x.date>=y.Sessionbegindate>.z & y.Sessionenddate is missing);

(using the special missing value .z)

d0816
Quartz | Level 8

This code is giving me all observations for unmatched ID as well.

 

And also where condition did not work. For example one matched ID has "Date" as 3/22/2017 in Dataset x. The resulting dataset should not have highlighted red rows below, which is not according to the condition specified (3/22/2017 is before 5/10/2017).

IDSessionBeginDateSessionEndDate
00000002002/22/20172/12/2018
00000002002/22/20173/22/2017
00000002002/22/20172/22/2017
00000002002/22/20172/12/2018
00000002002/22/20173/22/2017
00000002002/22/20172/12/2018
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002005/10/2017 
00000002002/22/20172/12/2018
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002005/10/2017 
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002002/22/20172/12/2018
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002005/10/2017 
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002002/22/20172/12/2018
00000002005/10/2017 
00000002003/21/2018 
FreelanceReinh
Jade | Level 19

@d0816 wrote:

This code is giving me all observations for unmatched ID as well.

 

And also where condition did not work. For example one matched ID has "Date" as 3/22/2017 in Dataset x. The resulting dataset should not have highlighted red rows below, which is not according to the condition specified (3/22/2017 is before 5/10/2017).



To avoid any misunderstandings, let me post the complete PROC SQL step that my suggested WHERE condition was meant to be part of:

proc sql;
create table want as select y.*
from dataset1 x, dataset2 y
where x.id=y.id and
(.z<y.Sessionbegindate<=x.date<=y.Sessionenddate
or
x.date>=y.Sessionbegindate>.z & y.Sessionenddate is missing);
quit;

Do you get 

  • "observations for unmatched ID" or
  • observations where (non-missing) x.date is before y.SessionBeginDate

with the above code? I don't think so.

d0816
Quartz | Level 8

In the resulting dataset what I want to do is:

First, in Dataset 2 (y), keep only those IDs that are in Dataset 1 (x). and remove all IDs that are not in Dataset 1.

Then Second, apply the date condition for a matched/common IDs.

 

To your question, Do you get 

  • "observations for unmatched ID" or
  • observations where (non-missing) x.date is before y.SessionBeginDate

I am getting both.

FreelanceReinh
Jade | Level 19

@d0816 wrote:

First, in Dataset 2 (y), keep only those IDs that are in Dataset 1 (x). and remove all IDs that are not in Dataset 1.

 


This is covered by the first part of the WHERE condition, x.id=y.id.


@d0816 wrote:

Then Second, apply the date condition for a matched/common IDs.

 


This is covered by the second part of the WHERE condition.

 


@d0816 wrote:

To your question, Do you get 

  • "observations for unmatched ID" or
  • observations where (non-missing) x.date is before y.SessionBeginDate

I am getting both.


I'm still not convinced. Can you please provide some evidence to support this claim? Test datasets with a few observations should be sufficient.

d0816
Quartz | Level 8
Let me work on the test dataset that I can share.


FreelanceReinh
Jade | Level 19

Example providing evidence that the code does work as intended:

data dataset1;
length ID $10;
input ID Date :yymmdd10.;
format Date yymmdd10.;
cards;
0000000001 2017-03-22
0000000002 2017-05-01
0000000003 2018-01-01
0000000004 2018-05-01
;

data dataset2;
length ID $10;
infile cards missover;
input ID SessionCode $ (SessionBeginDate SessionEndDate) (:yymmdd10.);
format _numeric_ yymmdd10.;
cards;
0000000001 A 2017-05-10
0000000001 B 2017-05-10 2017-12-31
0000000002 C 2017-03-22
0000000002 D 2017-03-22 2017-12-31
0000000003 E 2018-01-01
0000000003 F 2017-03-22 2017-12-31
0000000003 G 2017-12-01 2018-01-01
0000000005 H 2017-03-22
0000000005 I 2017-03-22 2017-12-31
;

proc sql;
create table want as select y.*
from dataset1 x, dataset2 y
where x.ID=y.ID and
(.z<y.SessionBeginDate<=x.date<=y.SessionEndDate
or
x.Date>=y.SessionBeginDate>.z & y.SessionEndDate is missing);
quit;

Result:

                     Session     Session         Session
Obs        ID         Code      BeginDate        EndDate

 1     0000000002       C       2017-03-22             .
 2     0000000002       D       2017-03-22    2017-12-31
 3     0000000003       E       2018-01-01             .
 4     0000000003       G       2017-12-01    2018-01-01

(SessionCode values have been chosen to serve as identifiers for observations in dataset2.)

 

  • Matching observations A and B were correctly not selected, because 2017-03-22 was before 2017-05-10.
  • Matching observation C was correctly selected, because 2017-05-01 was after 2017-03-22 and SessionEndDate is missing.
  • Matching observation D was correctly selected, because 2017-05-01 was between 2017-03-22 and 2017-12-31.
  • Matching observation E was correctly selected, because 2018-01-01 was on SessionBeginDate.
  • Matching observation F was correctly not selected, because 2018-01-01 was after 2017-12-31.
  • Matching observation G was correctly selected, because 2018-01-01 was on SessionEndDate.
  • Non-matching observations H and I were correctly not selected.

 

 

Please note that duplicate IDs in dataset1 (if any) satisfying the WHERE condition would cause duplicates in dataset WANT, but not incorrectly selected observations.

d0816
Quartz | Level 8

Thanks so much for providing this explanation.

 

I looked further into my dataset 1 (x), the list of unique IDs and Date. I found that there was one ID (JUST ONE) that was duplicated. In the sample dataset, I had not included all the variables in the dataset 1. But the duplication of the ID occurred due to these other variables. This duplication may have caused the code to not work for me, which worked in the explanation you gave. This duplication may be presenting difficulty for the subsetting of dataset 2 with the code.

IDDateCompanyExam type
00000003332017-06-05AIVA
00000003332017-10-10BGA
d0816
Quartz | Level 8

I ran code after removing that one duplicated ID and the code worked for unduplicated IDs and Date in Dataset 1.

 

Thanks so very much.

FreelanceReinh
Jade | Level 19

@d0816 wrote:

Thanks so much for providing this explanation.

 

I looked further into my dataset 1 (x), the list of unique IDs and Date. I found that there was one ID (JUST ONE) that was duplicated. In the sample dataset, I had not included all the variables in the dataset 1. But the duplication of the ID occurred due to these other variables. This duplication may have caused the code to not work for me, which worked in the explanation you gave. This duplication may be presenting difficulty for the subsetting of dataset 2 with the code.

ID Date Company Exam type
0000000333 2017-06-05 A IVA
0000000333 2017-10-10 B GA

I still think that duplicate IDs in dataset1 have the effect described in my previous post.

 

Let's take a look at an example:

data dataset1;
length ID $10;
input ID Date :yymmdd10.;
format Date yymmdd10.;
cards;
0000000333 2017-06-05
0000000333 2017-10-10
;

data dataset2;
length ID $10;
infile cards missover;
input ID SessionCode $ (SessionBeginDate SessionEndDate) (:yymmdd10.);
format _numeric_ yymmdd10.;
cards;
0000000333 J 2017-07-08
0000000333 K 2017-10-10
0000000333 L 2017-03-22 2017-09-30
0000000333 M 2017-09-22 2017-10-30
0000000333 N 2017-03-22 2017-12-31
0000000333 O 2017-01-01 2017-03-31
0000000333 P 2018-01-01 2018-03-31
;

(PROC SQL step unchanged.)

 

Result:

                     Session     Session         Session
Obs        ID         Code      BeginDate        EndDate

 1     0000000333       J       2017-07-08             .
 2     0000000333       K       2017-10-10             .
 3     0000000333       L       2017-03-22    2017-09-30
 4     0000000333       M       2017-09-22    2017-10-30
 5     0000000333       N       2017-03-22    2017-12-31
 6     0000000333       N       2017-03-22    2017-12-31

 

  • Observation J was correctly selected once, because 2017-10-10 (but not 2017-06-05) was after 2017-07-08 and SessionEndDate is missing.
  • Observation K was correctly selected once, because 2017-10-10 was on SessionBeginDate (whereas 2017-06-05 was before this date).
  • Observation L was correctly selected once, because 2017-06-05 (but not 2017-10-10) was between 2017-03-22 and 2017-09-30.
  • Observation M was correctly selected once, because 2017-10-10 (but not 2017-06-05) was between 2017-09-22 and 2017-10-30.
  • Observation N was "correctly" selected twice, because both 2017-06-05 and 2017-10-10 were between 2017-03-22 and 2017-12-31.
  • Observation O was correctly not selected, because neither 2017-06-05 nor 2017-10-10 was between 2017-01-01 and 2017-03-31.
  • Observation P was correctly not selected, because neither 2017-06-05 nor 2017-10-10 was between 2018-01-01 and 2018-03-31.

So, I don't see any incorrect selections. The duplicate record(s) could be suppressed easily by using the distinct keyword after "select". Additional variables in dataset1 should not interfere with the selection, as they are disregarded.

 

(Sorry, I have to call it a day now. It's close to midnight in my time zone.)

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 20 replies
  • 5294 views
  • 0 likes
  • 3 in conversation