BookmarkSubscribeRSS Feed
Batman
Quartz | Level 8

I'm trying to write a program that will select observations with a group by variable and dates that are close to each other.   I can figure out a way to do it in several data steps but I suspect it can be done more efficiently in sql.   Some sample code is below (where the date is just a number), if I want to pull all the observations where each y is within 5 for each value of x.   In this case it should pull just the first two observations.

 

data a;
input x 1. y 1.;
datalines;
12
15
21
29
34
;
quit;

10 REPLIES 10
PaigeMiller
Diamond | Level 26

@Batman wrote:

I'm trying to write a program that will select observations with a group by variable and dates that are close to each other.   I can figure out a way to do it in several data steps but I suspect it can be done more efficiently in sql.   Some sample code is below (where the date is just a number), if I want to pull all the observations where each y is within 5 for each value of x.   In this case it should pull just the first two observations.

 

data a;
input x 1. y 1.;
datalines;
12
15
21
29
34
;
quit;


Why wouldn't your rule of "each y is within 5 for each value of x" select row 3 and row 5 as well?

 

Forgetting that issue, subtraction seems like it would work to find which rows you want.

 

--
Paige Miller
ballardw
Super User

First: very confusing to not bother to show a space between values.

 

I think you need to show your work or provide more details on what "rules" you are implementing as the description as stated doesn't match the supposed result.

Your description of the y within 5 of x means that the Y on the fifth row is within 5 of the x on all rows, the y of 5 on the second row is within 5 of the X on rows 1 , 2, 3, 4 and 5. Other pairs meet your requirement as well. The only thing that doesn't actually is the y on the 4th row.

So this doesn't make sense:

 I want to pull all the observations where each y is within 5 for each value of x.   In this case it should pull just the first two observations.

Worse, you mention that you want this work for dates?? If the question is to "group dates by week/month/quarter" or similar then you should be using actual date values and a format or the INTNX function is likely to get what you want.

Examples for date related questions should show dates and the actual result.

Batman
Quartz | Level 8

Ok, see if this is clearer, I put date values in y.

 

I'd like to select observations that have the same value of X and their date values (Y) are within 5 days of each other.   The first and second observations would be selected because have the same value of X and are within 3 days of each other.   The third and fourth observations have the same value of X but would not be selected because their date values after 8 days apart.

 

28 /*Create dataset*/
29 data a;
30 input x 1. y mmddyy10.;
31 format y mmddyy10.;
32 list;
33 datalines;

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
34 111-02-23
35 111-05-23
36 211-01-23
37 211-09-23
38 311-05-23
NOTE: The data set WORK.A has 5 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

39 ;

40 run;
41
42 /*print contents in log*/
43 data null;
44 set a;
45 put x= y=;
46 run;

x=1 y=11/02/2023
x=1 y=11/05/2023
x=2 y=11/01/2023
x=2 y=11/09/2023
x=3 y=11/05/2023
NOTE: There were 5 observations read from the data set WORK.A.
NOTE: The data set WORK.NULL has 5 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

PaigeMiller
Diamond | Level 26

It's still very difficult, if not impossible, to take the description you gave and then generalize the situation and then write code, because it is still not clear what you want as output in certain situations.

 

The third and fourth observations have the same value of X but would not be selected because their date values after 8 days apart.

 

Are we always looking at pairs? Are there always at most 2 records per value of X? What if there are three records per value of X, then what?

--
Paige Miller
Batman
Quartz | Level 8

No, not just looking at pairs, I added a few more rows to illustrate.  In this dataset, the 1st and 2nd obs would qualify, so would the 7th, 8th, and 9th obs

 

27 /*Create dataset*/
28 data a;
29 input x 1. y mmddyy10.;
30 format y mmddyy10.;
31 list;
32 datalines;

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
33 111-02-23
34 111-05-23
35 112-01-23
36 112-25-23
37 211-01-23
38 211-09-23
39 311-05-23
40 311-07-23
41 311-11-23
NOTE: The data set WORK.A has 9 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

42 ;

43 run;
44
45 /*print contents in log*/
46 data null;
47 set a;
48 put x= y=;
49 run;

x=1 y=11/02/2023
x=1 y=11/05/2023
x=1 y=12/01/2023
x=1 y=12/25/2023
x=2 y=11/01/2023
x=2 y=11/09/2023
x=3 y=11/05/2023
x=3 y=11/07/2023
x=3 y=11/11/2023
NOTE: There were 9 observations read from the data set WORK.A.
NOTE: The data set WORK.NULL has 9 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds

PaigeMiller
Diamond | Level 26

So, when there are three or more, are we comparing the records to all of the other records in the group? Or are we comparing just sequentially?

 

What if there results are not transitive? What records do we extract in the case below, where the second row is not within five days of the fourth row, but the second row is within five days of the other rows?

 

38 211-09-23
39 311-05-23
40 311-07-23

41 311-11-23

--
Paige Miller
Batman
Quartz | Level 8

Only comparing as part of the same group so only the 2nd and 3rd rows in your example would qualify.

PaigeMiller
Diamond | Level 26

Well, @Batman after many questions, I'm still not sure what you want here. In the case of the four dates that I gave, you say "only the 2nd and 3rd rows in your example would qualify" and you provide no explanation whatsoever why this is the case.


Can you please explain the logic that selects 2nd and 3rd rows but not the 4th row?

--
Paige Miller
Batman
Quartz | Level 8

Ok, let's try it this way.   This code gives me the desired result but I have to use a data step and then sql.   I think I should be able to do it all in sql.  However, if this method is the only way to do it, than I can stick with it.

 

/*Create dataset*/
data a;
input x 1. y mmddyy10.;
format y mmddyy10.;
list;
datalines;
111-02-23
111-05-23
112-01-23
112-25-23
211-01-23
211-09-23
311-05-23
311-07-23
311-11-23
;
run;
 
/*group observations with dates between 0 and 5 days apart*/
data b(drop=y_lag);
set a;
by x y;
y_lag=lag(y);
if first.x or y gt (y_lag + 5) then cnt+1;
run;
 
/*This gives me the desired result*/
proc sql;
create table c as
select x, y
from b
group by x, cnt
having count(*) gt 1;
quit;
PaigeMiller
Diamond | Level 26

But I asked for an explanation. I didn't ask for code.

--
Paige Miller

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 892 views
  • 0 likes
  • 3 in conversation