Re: selecting records with a values close to each other

Batman · Posted 11-08-2023 10:14 AM

I'm trying to write a program that will select observations with a group by variable and dates that are close to each other. I can figure out a way to do it in several data steps but I suspect it can be done more efficiently in sql. Some sample code is below (where the date is just a number), if I want to pull all the observations where each y is within 5 for each value of x. In this case it should pull just the first two observations.

data a;
input x 1. y 1.;
datalines;
12
15
21
29
34
;
quit;

PaigeMiller · Posted 11-08-2023 10:22 AM

@Batman wrote:

I'm trying to write a program that will select observations with a group by variable and dates that are close to each other. I can figure out a way to do it in several data steps but I suspect it can be done more efficiently in sql. Some sample code is below (where the date is just a number), if I want to pull all the observations where each y is within 5 for each value of x. In this case it should pull just the first two observations.

data a;
input x 1. y 1.;
datalines;
12
15
21
29
34
;
quit;

Why wouldn't your rule of "each y is within 5 for each value of x" select row 3 and row 5 as well?

Forgetting that issue, subtraction seems like it would work to find which rows you want.

--
Paige Miller

ballardw · Posted 11-08-2023 10:30 AM

First: very confusing to not bother to show a space between values.

I think you need to show your work or provide more details on what "rules" you are implementing as the description as stated doesn't match the supposed result.

Your description of the y within 5 of x means that the Y on the fifth row is within 5 of the x on all rows, the y of 5 on the second row is within 5 of the X on rows 1 , 2, 3, 4 and 5. Other pairs meet your requirement as well. The only thing that doesn't actually is the y on the 4th row.

So this doesn't make sense:

 I want to pull all the observations where each y is within 5 for each value of x.   In this case it should pull just the first two observations.

Worse, you mention that you want this work for dates?? If the question is to "group dates by week/month/quarter" or similar then you should be using actual date values and a format or the INTNX function is likely to get what you want.

Examples for date related questions should show dates and the actual result.

Batman · Posted 11-08-2023 11:44 AM

Ok, see if this is clearer, I put date values in y.

I'd like to select observations that have the same value of X and their date values (Y) are within 5 days of each other. The first and second observations would be selected because have the same value of X and are within 3 days of each other. The third and fourth observations have the same value of X but would not be selected because their date values after 8 days apart.

28 /*Create dataset*/
29 data a;
30 input x 1. y mmddyy10.;
31 format y mmddyy10.;
32 list;
33 datalines;

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
34 111-02-23
35 111-05-23
36 211-01-23
37 211-09-23
38 311-05-23
NOTE: The data set WORK.A has 5 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

39 ;

40 run;
41
42 /*print contents in log*/
43 data null;
44 set a;
45 put x= y=;
46 run;

x=1 y=11/02/2023
x=1 y=11/05/2023
x=2 y=11/01/2023
x=2 y=11/09/2023
x=3 y=11/05/2023
NOTE: There were 5 observations read from the data set WORK.A.
NOTE: The data set WORK.NULL has 5 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

PaigeMiller · Posted 11-08-2023 12:25 PM

It's still very difficult, if not impossible, to take the description you gave and then generalize the situation and then write code, because it is still not clear what you want as output in certain situations.

The third and fourth observations have the same value of X but would not be selected because their date values after 8 days apart.

Are we always looking at pairs? Are there always at most 2 records per value of X? What if there are three records per value of X, then what?

--
Paige Miller

Batman · Posted 11-08-2023 01:19 PM

No, not just looking at pairs, I added a few more rows to illustrate. In this dataset, the 1st and 2nd obs would qualify, so would the 7th, 8th, and 9th obs

27 /*Create dataset*/
28 data a;
29 input x 1. y mmddyy10.;
30 format y mmddyy10.;
31 list;
32 datalines;

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
33 111-02-23
34 111-05-23
35 112-01-23
36 112-25-23
37 211-01-23
38 211-09-23
39 311-05-23
40 311-07-23
41 311-11-23
NOTE: The data set WORK.A has 9 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

42 ;

43 run;
44
45 /*print contents in log*/
46 data null;
47 set a;
48 put x= y=;
49 run;

x=1 y=11/02/2023
x=1 y=11/05/2023
x=1 y=12/01/2023
x=1 y=12/25/2023
x=2 y=11/01/2023
x=2 y=11/09/2023
x=3 y=11/05/2023
x=3 y=11/07/2023
x=3 y=11/11/2023
NOTE: There were 9 observations read from the data set WORK.A.
NOTE: The data set WORK.NULL has 9 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds

PaigeMiller · Posted 11-08-2023 01:45 PM

So, when there are three or more, are we comparing the records to all of the other records in the group? Or are we comparing just sequentially?

What if there results are not transitive? What records do we extract in the case below, where the second row is not within five days of the fourth row, but the second row is within five days of the other rows?

38 211-09-23
39 311-05-23
40 311-07-23

41 311-11-23

--
Paige Miller

Batman · Posted 11-08-2023 02:48 PM

Only comparing as part of the same group so only the 2nd and 3rd rows in your example would qualify.

PaigeMiller · Posted 11-08-2023 02:59 PM

Well, @Batman after many questions, I'm still not sure what you want here. In the case of the four dates that I gave, you say "only the 2nd and 3rd rows in your example would qualify" and you provide no explanation whatsoever why this is the case.

Can you please explain the logic that selects 2nd and 3rd rows but not the 4th row?

--
Paige Miller

Batman · Posted 11-08-2023 03:20 PM

Ok, let's try it this way. This code gives me the desired result but I have to use a data step and then sql. I think I should be able to do it all in sql. However, if this method is the only way to do it, than I can stick with it.

/*Create dataset*/

data a;

input x 1. y mmddyy10.;

format y mmddyy10.;

list;

datalines;

111-02-23

111-05-23

112-01-23

112-25-23

211-01-23

211-09-23

311-05-23

311-07-23

311-11-23

;

run;

/*group observations with dates between 0 and 5 days apart*/

data b(drop=y_lag);

set a;

by x y;

y_lag=lag(y);

if first.x or y gt (y_lag + 5) then cnt+1;

run;

/*This gives me the desired result*/

proc sql;

create table c as

select x, y

from b

group by x, cnt

having count(*) gt 1;

quit;

PaigeMiller · Posted 11-08-2023 03:27 PM

But I asked for an explanation. I didn't ask for code.

--
Paige Miller

Classroom Training Available!