Solved: Re: SET dataset WHERE X filters strange

SaSoDude · Posted 08-15-2023 11:48 AM

Hi,

I'm new to SAS and started to program and tests some things. Now I found a strange behavior and maybe someone is nice and explains it to me. This is my code:

DATA one;
 INPUT id v3 v4;
 DATALINES;
 1 10 100
 2 15 150
 3 20 200
 ;

DATA two;
 INPUT id v3 v4;
 DATALINES;
 4 1000 10000
 5 1500 15000
 5 2000 20000
 7  800 30000
 ;

DATA theData;
    SET one;
    OUTPUT;
    IF _N_ THEN DO;
    IF id = 2 THEN DO;
    SET two (WHERE=(id = 5));
    OUTPUT;
    END;
    END;
RUN;

PROC PRINT DATA=one;
TITLE "one";
RUN;
PROC PRINT DATA=two;
TITLE "two";
RUN;
PROC PRINT DATA=theData;
TITLE "theData";
RUN;

The result is:

# id v3 v41234

1	10	100
2	15	150
5	1500	15000
3	20	200

I expected one more row, "5 2000 20000"

between

5

1500

15000

and

3

20

200

Am I'm wrong?

Best,

Markus

Quentin · Posted 08-15-2023 12:46 PM

@SaSoDude wrote:

Thanks a lot for your kind post @Tom!

I will take a closer look and try to understand it, in the meantime:

Actually, I would like to have this dataset as result:
Obs    id     v3       v4

 1      1      10      100
 2      2      15      150
 3      5    1500    15000
 4      5    2000    20000
 4      3      20      200
In words:

I want to search in dataset one for a specific id e.g. 2 and put then the rows from dataset two where the id is equal to 5 between the rows with id=2 and id=3.

You've got good answers already. And this is good stuff to play with while learning!

As has been pointed out, to get what you want you need to read multiple records from work.two, which means that SET statement needs to execute multiple times. One way to do that is with a DO UNTIL loop which will continue to iterate until the last record is read. If you're familiar with using the END= option on a SET statement, you can use it like:

data want ;
  set one ;
  put _N_= ID= V3= V4= ;
  output ;

  if id=2 then do until(done) ;
    set two (where=(ID=5)) end=done ;
    put _N_= "Read a record from work.two " ID= V3= V4= ;
    output ;
  end ;
run ;

That said, I agree with Tom, it looks like you're just interleaving two datasets. Other than the sort order, your WANT is the same as:

data want2 ;
  set one two (where=(ID=5)) ;
  by id ;
run ;

So I see your code as an interesting learning exercise in how the DATA step works, but I wouldn't actually use this code for real work.

The Boston Area SAS Users Group is hosting free webinars!
Next up: Joe Madden & Joseph Henry present Putting Power into the Hands of the Programmer with SAS Viya Workbench on Wednesday Nov 6.
Register now at https://www.basug.org/events.

View solution in original post

Tom · Posted 08-15-2023 12:06 PM

What are you trying to do? Please explain.

In the meantime let's look at your data step and see what it is saying to do.

First let's fix the indentation to make it easier for human's to read.

DATA theData;
  SET one;
  OUTPUT;
  IF _N_ THEN DO;
    IF id = 2 THEN DO;
      SET two (WHERE=(id = 5));
      OUTPUT;
    END;
  END;
RUN;

The first thing to notice is the value of _N_ (unless you explicit modify it) will also be some positive number. So we can remove the first IF block as it will always execute.

DATA theData;
  SET one;
  OUTPUT;
  IF id = 2 THEN DO;
    SET two (WHERE=(id = 5));
    OUTPUT;
  END;
RUN;

So the first two statements mean that it will read observations from ONE and then immediately write them out to THEDATA.

When the value of ID is 2 then it will attempt to read an observation from TWO, but only the subset of TWO where the value of ID is 5. When it finds one it will write a second observation out for this observation from ONE.

One thing you need to learn is that most data steps do not terminate at the RUN statement. Instead they terminate at a SET/MERGE/UPDATE statement or an INPUT statement when they attempt to read past the end of the inputs. So in this data step since there are TWO separate SET statements there are two places where you might read past the end of the input. So the normal one where you read past the end of the primary input dataset ONE. But also the secondary one where you have more observations in ONE with ID=2 than there are observations in TWO with ID=5. In that case the data step will end there, potentially before it has finished reading all of the data from ONE.

So let's try it. First let's make the test data. NOTE: Do not indent the DATALINES; statement or the lines of data themselves. That will just cause confusion about what the lines contain.

DATA one;
 INPUT id v3 v4;
DATALINES;
1 10 100
2 15 150
3 20 200
;

DATA two;
 INPUT id v3 v4;
DATALINES;
4 1000 10000
5 1500 15000
5 2000 20000
7  800 30000
;

And run it

3591  DATA theData;
3592    SET one;
3593    OUTPUT;
3594    IF id = 2 THEN DO;
3595      SET two (WHERE=(id = 5));
3596      OUTPUT;
3597    END;
3598  RUN;

NOTE: There were 3 observations read from the data set WORK.ONE.
NOTE: There were 1 observations read from the data set WORK.TWO.
      WHERE id=5;
NOTE: The data set WORK.THEDATA has 4 observations and 3 variables.

So it read read all three observations from ONE, so the extra SET statement did not cause it to end early.

It read in only one observation from TWO since there was only one observation from ONE that had ID=2 so it could go into that branch of the code.

And it wrote out 4 observations. The ones read from ONE and the extra written inside the IF/THEN block.

So let's see what we get.

proc print;
run;

Results:

Obs    id     v3       v4

 1      1      10      100
 2      2      15      150
 3      5    1500    15000
 4      3      20      200

Looks like it did what you asked. It read the first observation from ONE and wrote it. It read the second observation from ONE and wrote it. And then read the first observation in TWO that had ID=5 and which caused it to overwrite all three variables with the values from TWO and then wrote out another observation. Then it read the third observation from ONE and wrote it out. Then it tried to read a fourth observation from ONE and stopped.

SaSoDude · Posted 08-15-2023 12:14 PM

Thanks a lot for your kind post @Tom!

I will take a closer look and try to understand it, in the meantime:

Actually, I would like to have this dataset as result:

Obs    id     v3       v4

 1      1      10      100
 2      2      15      150
 3      5    1500    15000
 4      5    2000    20000
 4      3      20      200

In words:

I want to search in dataset one for a specific id e.g. 2 and put then the rows from dataset two where the id is equal to 5 between the rows with id=2 and id=3.

Quentin · Posted 08-15-2023 12:46 PM

@SaSoDude wrote:

Thanks a lot for your kind post @Tom!

I will take a closer look and try to understand it, in the meantime:

Actually, I would like to have this dataset as result:
Obs    id     v3       v4

 1      1      10      100
 2      2      15      150
 3      5    1500    15000
 4      5    2000    20000
 4      3      20      200
In words:

I want to search in dataset one for a specific id e.g. 2 and put then the rows from dataset two where the id is equal to 5 between the rows with id=2 and id=3.

You've got good answers already. And this is good stuff to play with while learning!

As has been pointed out, to get what you want you need to read multiple records from work.two, which means that SET statement needs to execute multiple times. One way to do that is with a DO UNTIL loop which will continue to iterate until the last record is read. If you're familiar with using the END= option on a SET statement, you can use it like:

data want ;
  set one ;
  put _N_= ID= V3= V4= ;
  output ;

  if id=2 then do until(done) ;
    set two (where=(ID=5)) end=done ;
    put _N_= "Read a record from work.two " ID= V3= V4= ;
    output ;
  end ;
run ;

That said, I agree with Tom, it looks like you're just interleaving two datasets. Other than the sort order, your WANT is the same as:

data want2 ;
  set one two (where=(ID=5)) ;
  by id ;
run ;

So I see your code as an interesting learning exercise in how the DATA step works, but I wouldn't actually use this code for real work.

The Boston Area SAS Users Group is hosting free webinars!
Next up: Joe Madden & Joseph Henry present Putting Power into the Hands of the Programmer with SAS Viya Workbench on Wednesday Nov 6.
Register now at https://www.basug.org/events.

SaSoDude · Posted 08-15-2023 12:57 PM

Thank you @Quentin! This is the result, I was trying to code. Huh... On one hand, it is pretty nice, how this result is achieved, on the other hand the way SAS "is thinking" is total different to me. 😅 But I will learn! 😀

Thank you all guys!

Tom · Posted 08-15-2023 01:51 PM

To avoid end the data step too early when there are no matching observations it is better to use DO WHILE(NOT done) instead of DO UNTIL(done).

That way the SET statement will not try to read past the end of the input.

Tom · Posted 08-15-2023 01:45 PM

The BEST way to do something like that is to have some extra ORDER variable you could use to INTERLEAVE the observations. Then you could use normal SET level operations on the data and not have deal with every single thing yourself as if your were writing assembly code.

We could make-up a fake one to show how it would work in that case:

DATA one;
 INPUT id v3 v4;
 order=id;
DATALINES;
1 10 100
2 15 150
3 20 200
;

DATA two;
 INPUT id v3 v4;
 order=id-3+0.5;
DATALINES;
4 1000 10000
5 1500 15000
5 2000 20000
7  800 30000
;


data want;
  set one two(where=(id=5));
  by order;
run;

proc print;
run;

Result

Obs    id     v3       v4     order

 1      1      10      100     1.0
 2      2      15      150     2.0
 3      5    1500    15000     2.5
 4      5    2000    20000     2.5
 5      3      20      200     3.0

If you want to pretend you are an assembly level programmer instead of a data analysts you could force SAS to step through the observations from TWO by adding another DO loop.

data want;
  set one ;
  output;
  if id=2 then do while(not eof);
     set two(where=(id=5)) end=eof;
     output;
  end;
run;

Or you could build the data step by step.

data want;
  set one ;
  where id <=2 ;
run;

proc append data=two(where=(id=5)) base=want;
run;

proc append data=one(where=(id>2)) base=want;
run;

SaSoDude · Posted 08-15-2023 01:51 PM

I can live with being called an assembler programmer! 🤣 And yes, I usually write in other programming languages, so it's quite a change for me to go the "SAS" way. Every programming language has "its own flow" and I will discover the "SAS Flow".
In any case: Many thanks for all the input!

SaSoDude · Posted 08-15-2023 01:54 PM

@Tom May I ask you, what this is about?

order=id-3+0.5;

The dataset two will be ordered by the id - but why did you add the calculation behind it?

Tom · Posted 08-15-2023 02:02 PM

@SaSoDude wrote:

@Tom May I ask you, what this is about?
order=id-3+0.5;
The dataset two will be ordered by the id - but why did you add the calculation behind it?

Because I was too lazy to type in the ORDER values myself. So I just wrote a formula that could make some ORDER values worked by using the values of ID that you had already typed. Since you wanted 5 to follow 2 I subtracted 3 and then added a half. The goal was just to make a variable that would work to insure the values from TWO were inserted where wanted them.

Note I could have not added the 0.5 and it still would have worked. The SET statement will process the datasets in the order they appear. So the ORDER=2 observations from ONE would be processed before the ORDER=2 observations from TWO .

SaSoDude · Posted 08-15-2023 12:25 PM

Uh! I get it. The "id=2" misleads me (and the script 😅) into only getting one observation instead of two as I expected.

Tom · Posted 08-15-2023 12:09 PM

Did you just mean to do:

data want;
  set one two(where=(id=5));
  by id;
run;

Which will produce:

Obs    id     v3       v4

 1      1      10      100
 2      2      15      150
 3      3      20      200
 4      5    1500    15000
 5      5    2000    20000

SaSoDude · Posted 08-15-2023 12:16 PM

Almost. The rows with the id=5 should be between 2 and 3.
id v3 v4
1 ...
2 ...
5 ...
5 ...
3 ...

ballardw · Posted 08-15-2023 12:11 PM

Set can be a sort of executable statement returning one observation each time executed.

So with your "if id=2" the Set only gets executed once because there is only one value of 2 for id the "where" on set two doesn't control the number of records brought in. And the behavior with a second value of 2 is going to "interleave" with id=2 then id=5 then id=2 (the second from one) then id=5 (the second from two ).

Even without conditions you get a sort of "pairing" between the number of observations in multiple data sets when they appear on separate Set statements because of the one observation at a time and quits when one set stops providing observations.

Think of what this might do. Then run the code and see if it matches expectations.

data done;
  input x;
datalines;
1
2
;

data dtwo;
  input y;
datalines;
11
22
33
;

data example;
   set done;
   set dtwo;
run;

Then change the order of the data sets on the set and do the same "guess" as to contents.

data example2;
   set dtwo;
   set done;
run;

SaSoDude · Posted 08-15-2023 12:29 PM

Thanks @ballardw ! Yes, the first id=2 mixed me up. I'm just afraid that exactly this "Then change the order of the data sets on the set and do the same "guess" as to contents." right now is my problem. 😅

SAS Innovate 2025: Save the Date