Solved: Subsetting based on intervals

genemroz · Posted 12-13-2024 11:28 AM

Esteemed Advisers,

I need a solution to the following exemplar problem of subsetting one dataset (Have) contingent on start/stop values from a second dataset (Interval) to create dataset Want. Thanks in advance for any insights you can offer.

Data Intervals;
Input Interval $ Start $ Stop $;
1	B	-
1	-	D
2	F	-
2	-	J
3	L	-
3	-	N
;

Data Have;
Input ID $;
A
B
C
D
E
F
G
H
J
K
L
M
N
;

Data Want;
Input Interval $ ID $;
1	B
1	C
1	D
2	G
2	H
2	J
2	K
3	L
3	M
3	N
;

ballardw · Posted 12-13-2024 12:06 PM

Before going to code some questions are in order.

Is this supposed to be a surrogate for more complex data, such as words or phrases phrases for the start/stop values of your "real" project? This may be very critical as "interval" defined on character values may be very hard to work with in general as the < or > operators do not always provide the result you expect when values are of different lengths. Not to mention the complications that would arise for mixed case of letters.

How many observations will exist in your real Have and Intervals data sets? One obvious way to do this could create an extreme number of intermediate values leading to long run times.

Why does the Interval data have start and stop on separate observations? The natural way to demonstrate and interval is with both values on a single observation (row).

Are the - in the Interval set supposed to indicate missing or blank values or does the dash have some meaning? In a data step reading values placing a period in the position to indicate a missing value will do so but a dash is a value.

Why is 2 F not in the Want set? The start of Interval 2 is F.

Correcting the Interval and Have data steps so they will run and create data (missing the datalines statement) and making the Interval set an actual interval, this works for your example data if 2 F is supposed to be in the output.

Data Intervals;
Input Interval $ Start $ Stop $;
datalines;
1	B	D
2	F	J
3	L	N
;

Data Have;
Input ID $;
datalines;
A
B
C
D
E
F
G
H
J
K
L
M
N
;

proc sql;
   create table want as 
   select a.interval, b.Id
   from intervals as a,have as b
   where a.start le b.id le a.stop
   ;
quit;

View solution in original post

ballardw · Posted 12-13-2024 12:06 PM

Before going to code some questions are in order.

Is this supposed to be a surrogate for more complex data, such as words or phrases phrases for the start/stop values of your "real" project? This may be very critical as "interval" defined on character values may be very hard to work with in general as the < or > operators do not always provide the result you expect when values are of different lengths. Not to mention the complications that would arise for mixed case of letters.

How many observations will exist in your real Have and Intervals data sets? One obvious way to do this could create an extreme number of intermediate values leading to long run times.

Why does the Interval data have start and stop on separate observations? The natural way to demonstrate and interval is with both values on a single observation (row).

Are the - in the Interval set supposed to indicate missing or blank values or does the dash have some meaning? In a data step reading values placing a period in the position to indicate a missing value will do so but a dash is a value.

Why is 2 F not in the Want set? The start of Interval 2 is F.

Correcting the Interval and Have data steps so they will run and create data (missing the datalines statement) and making the Interval set an actual interval, this works for your example data if 2 F is supposed to be in the output.

Data Intervals;
Input Interval $ Start $ Stop $;
datalines;
1	B	D
2	F	J
3	L	N
;

Data Have;
Input ID $;
datalines;
A
B
C
D
E
F
G
H
J
K
L
M
N
;

proc sql;
   create table want as 
   select a.interval, b.Id
   from intervals as a,have as b
   where a.start le b.id le a.stop
   ;
quit;

genemroz · Posted 12-13-2024 01:55 PM

Thanks for the prompt response. It was very helpful. The start/stop times came to me in a dataset as separate observations. Once I figured out how to get them into a single observation your solution works perfectly. And I apologize for my typo that left 2 F out of the Want dataset.

Ksharp · Posted 12-14-2024 02:05 AM

Just for having some fun.

Data Intervals;
Input Interval $ Start $ Stop $;
datalines;
1	B	D
2	F	J
3	L	N
;

data want;
 set Intervals;
 do i=rank(Start) to rank(Stop);
  want=byte(i);output;
 end;
 drop i;
run;

Subsetting based on intervals

Re: Subsetting based on intervals

Re: Subsetting based on intervals

Re: Subsetting based on intervals

Re: Subsetting based on intervals

Registration is open

Subsetting based on intervals

Re: Subsetting based on intervals

Re: Subsetting based on intervals

Re: Subsetting based on intervals

Re: Subsetting based on intervals

Registration is open

SAS Training: Just a Click Away