Solved: Re: Flagging if condition if ever met

stats_auca · Posted 08-19-2023 04:39 PM

I have been trying for hours so I decided to reach out for help.

Below is an example of my data. I created a flag if the Code is A by using if then do. However, I also need to flag all rows of the same VID if the code was ever A. I have tried using a data step but I cannot seem to figure out anything that works. Any help is appreciated.

Example Have:

ID VID Code CodeA

1 5 A 1

1 5 B

2 6 A 1

2 6 B

2 3 C

3 4 B

3 4 C

Example Want:

ID VID Code CodeA

1 5 A 1

1 5 B 1

2 6 A 1

2 6 B 1

2 3 C

3 4 B

3 4 C

or Example want:

ID VID Code CodeA VID_CODEA

1 5 A 1 1

1 5 B 1

2 6 A 1 1

2 6 B 1

2 3 C

3 4 B

3 4 C

Kurt_Bremser · Posted 08-20-2023 03:43 AM

Double DO loop:

data want;
do until (last.vid);
  set have;
  by id vid notsorted;
  if code = "A" then codea = 1;
end;
do until (last.vid);
  set have;
  by id vid notsorted;
  output;
end;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

maguiremq · Posted 08-19-2023 06:02 PM

This one is always a bit tricky. If you ever need to find this again, you would call it last observation carried forward.

data have;
infile datalines delimiter = "," dsd missover;
input ID $  VID  Code $ CodeA;
datalines;
1,5,A,1
1,5,B, 
2,6,A,1
2,6,B,
2,3,C,
3,4,B,
3,4,C,
;
run;

proc sort data = have;
	by id vid code;
run;

data want;
set have;
length vid_codea 3.;
retain vid_codea;
by id vid;
	if code = 'A' then codeA = 1;
	if first.vid then call missing(vid_codeA);
	if not missing(codeA) then vid_codeA = codeA;
run;

ID	VID	Code	CodeA	vid_codea
1	5	A	1	1
1	5	B	.	1
2	3	C	.	.
2	6	A	1	1
2	6	B	.	1
3	4	B	.	.
3	4	C	.	.

stats_auca · Posted 08-19-2023 08:21 PM

@maguiremq Thank you so much. To be clear, column vid_codea is what I am looking for, correct? If so, it seemed to work perfectly!

stats_auca · Posted 08-19-2023 11:35 PM

@maguiremq I spoke too soon. Below is the dataset and there should be a '1' for the bolded lines with the same VID when Code=1. Any ideas what may be the issue?

ID	VID	 Code	CodeA	vid_codea
1	5	  A	      1	       1
1	5	  B	      .	       .
2	3	  C	      .        .
2	6	  A	      1        1
2	6	  B		  .        .
3	4	  B		  .        .
3	4	  C		  .        .

mkeintz · Posted 08-19-2023 06:42 PM

As long as your data are already sorted by ID, it doesn't matter where the code "A" occurs. You can always do a MERGE of the code="A" obs with all the obs. No further sorting is needed.

Code below is edited, and tested:

data have;
infile datalines delimiter = "," dsd missover;
input ID $  VID  Code $ CodeA;
datalines;
1,5,A,1
1,5,B, 
2,6,A,1
2,6,B,
2,3,C,
3,4,B,
3,4,C,
run;
data want;
  merge have  (where=(code='A') in=found_a_qualifier)
        have (in=allobs);
  by id;
  retain vid_codea;
  if first.id then vid_codea=found_a_qualifier; /*Changed first.have to first.id */
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

stats_auca · Posted 08-19-2023 11:43 PM

@mkeintz thank you! for some reason when I run the code, both the vid_codea and found_a_qualifier are empty columns?

mkeintz · Posted 08-20-2023 09:02 AM

@stats_auca wrote:

@mkeintz thank you! for some reason when I run the code, both the vid_codea and found_a_qualifier are empty columns?

Yes. I have corrected the code to use first.id instead of first.have. This is the result of my failure to test the code against your sample data. I've edited the response, with tested code.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

vidhipatel01 · Posted 08-20-2023 12:57 AM

You can subset your first data:
Data Data_A;
Set Data_Have;
Where code = “A”;
Run;
Example: Data_A
ID VID Code CodeA
1 5 A 1
2 6 A 1

After that use proc sql to assign a new flag as you want:
Proc sql noprint;
Create table Data_A_want as
Select *, case when VID in (select distinct VID from Data_A) then 1 end else 0 as VID_ CODEA
From Data_Have;
Quit;

The above code should give you the flag you want.

Kurt_Bremser · Posted 08-20-2023 03:43 AM

Double DO loop:

data want;
do until (last.vid);
  set have;
  by id vid notsorted;
  if code = "A" then codea = 1;
end;
do until (last.vid);
  set have;
  by id vid notsorted;
  output;
end;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

stats_auca · Posted 08-22-2023 02:50 PM

This code seemed to do the trick. Thank you so much!

Tom · Posted 08-20-2023 10:11 AM

To test if a Boolean is ever TRUE just take the MAX(). This will work with your 1/. coded variable just as well as with a normal 1/0 coded Boolean variable.

data have;
  input ID $  VID  Code $ CodeA;
datalines;
1 5 A 1
1 5 B . 
2 6 A 1
2 6 B .
2 3 C .
3 4 B .
3 4 C .
;

proc sql ;
create table want as
  select *
       , max(codeA) as any_codeA
  from have
  group by id
;
quit;

proc print;
run;

Result

                            Code     any_
Obs    ID    VID    Code      A     codeA

 1     1      5      A        1       1
 2     1      5      B        .       1
 3     2      6      A        1       1
 4     2      3      C        .       1
 5     2      6      B        .       1
 6     3      4      C        .       .
 7     3      4      B        .       .

But to do other aggregations like are ALL of the observations TRUE you need to have real Boolean flags where the FALSE results are coded as zero instead of missing.

data have;
  input ID $  VID  Code $ ;
datalines;
1 5 A
1 5 B
2 6 A
2 6 B
2 3 C
3 4 B
3 4 C
4 7 A
4 7 A
;

proc sql ;
create table want as
  select *
       , (code='A') as codeA
       , max(calculated codeA) as any_codeA
       , min(calculated codeA) as all_codeA
  from have
  group by id
;
quit;

                            code     any_     all_
Obs    ID    VID    Code      A     codeA    codeA

 1     1      5      B        0       1        0
 2     1      5      A        1       1        0
 3     2      6      A        1       1        0
 4     2      3      C        0       1        0
 5     2      6      B        0       1        0
 6     3      4      C        0       0        0
 7     3      4      B        0       0        0
 8     4      7      A        1       1        1
 9     4      7      A        1       1        1

SaSoDude · Posted 08-20-2023 03:48 PM

If I may drop in?

 input ID $  VID  Code $ ;

Is there really a sense for the "$" after ID?

Tom · Posted 08-20-2023 05:05 PM

The $ means read it as character. Since I was lazy and did not DEFINE the variables before they appeared in the INPUT statement I need to have that or else it will make ID as a NUMERIC variable.

You should not store identifiers as numbers as there is no reason to do arithmetic with them. What would the MEAN value of ID mean?

mkeintz · Posted 08-20-2023 06:31 PM

@Tom wrote:

You should not store identifiers as numbers as there is no reason to do arithmetic with them. What would the MEAN value of ID mean?

Agree.

But there may be value in knowing the MINIMUM and MAXIMUM ID values (or maybe the range), if ID's were historically assigned in numeric sequence. That's not enough in my mind to justify numeric ID's, when character vars could also be sequenced. But I suspect a lot of ID systems have started out with such numeric sequences. I've certainly seen a lot of surveys do exactly that.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SaSoDude · Posted 08-22-2023 11:52 AM

Thank you @Tom - I can follow your explanation, but still have a different view on declaring IDs.

In this case the values of this ID take only numeric values and for this reason alone I see no advantage to declare the ID as a character.

I also know from other BI tools that especially with IDs numbers can be processed much faster than characters. Whether this is also the case in SAS, I do not know - I am still too new.

Best regards!

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!