I have been trying for hours so I decided to reach out for help.
Below is an example of my data. I created a flag if the Code is A by using if then do. However, I also need to flag all rows of the same VID if the code was ever A. I have tried using a data step but I cannot seem to figure out anything that works. Any help is appreciated.
Example Have:
ID VID Code CodeA
1 5 A 1
1 5 B
2 6 A 1
2 6 B
2 3 C
3 4 B
3 4 C
Example Want:
ID VID Code CodeA
1 5 A 1
1 5 B 1
2 6 A 1
2 6 B 1
2 3 C
3 4 B
3 4 C
or Example want:
ID VID Code CodeA VID_CODEA
1 5 A 1 1
1 5 B 1
2 6 A 1 1
2 6 B 1
2 3 C
3 4 B
3 4 C
Double DO loop:
data want;
do until (last.vid);
set have;
by id vid notsorted;
if code = "A" then codea = 1;
end;
do until (last.vid);
set have;
by id vid notsorted;
output;
end;
run;
This one is always a bit tricky. If you ever need to find this again, you would call it last observation carried forward.
data have;
infile datalines delimiter = "," dsd missover;
input ID $ VID Code $ CodeA;
datalines;
1,5,A,1
1,5,B,
2,6,A,1
2,6,B,
2,3,C,
3,4,B,
3,4,C,
;
run;
proc sort data = have;
by id vid code;
run;
data want;
set have;
length vid_codea 3.;
retain vid_codea;
by id vid;
if code = 'A' then codeA = 1;
if first.vid then call missing(vid_codeA);
if not missing(codeA) then vid_codeA = codeA;
run;
ID VID Code CodeA vid_codea 1 5 A 1 1 1 5 B . 1 2 3 C . . 2 6 A 1 1 2 6 B . 1 3 4 B . . 3 4 C . .
@maguiremq Thank you so much. To be clear, column vid_codea is what I am looking for, correct? If so, it seemed to work perfectly!
@maguiremq I spoke too soon. Below is the dataset and there should be a '1' for the bolded lines with the same VID when Code=1. Any ideas what may be the issue?
ID VID Code CodeA vid_codea
1 5 A 1 1
1 5 B . .
2 3 C . .
2 6 A 1 1
2 6 B . .
3 4 B . .
3 4 C . .
As long as your data are already sorted by ID, it doesn't matter where the code "A" occurs. You can always do a MERGE of the code="A" obs with all the obs. No further sorting is needed.
Code below is edited, and tested:
data have;
infile datalines delimiter = "," dsd missover;
input ID $ VID Code $ CodeA;
datalines;
1,5,A,1
1,5,B,
2,6,A,1
2,6,B,
2,3,C,
3,4,B,
3,4,C,
run;
data want;
merge have (where=(code='A') in=found_a_qualifier)
have (in=allobs);
by id;
retain vid_codea;
if first.id then vid_codea=found_a_qualifier; /*Changed first.have to first.id */
run;
@mkeintz thank you! for some reason when I run the code, both the vid_codea and found_a_qualifier are empty columns?
@stats_auca wrote:
@mkeintz thank you! for some reason when I run the code, both the vid_codea and found_a_qualifier are empty columns?
Yes. I have corrected the code to use first.id instead of first.have. This is the result of my failure to test the code against your sample data. I've edited the response, with tested code.
Double DO loop:
data want;
do until (last.vid);
set have;
by id vid notsorted;
if code = "A" then codea = 1;
end;
do until (last.vid);
set have;
by id vid notsorted;
output;
end;
run;
To test if a Boolean is ever TRUE just take the MAX(). This will work with your 1/. coded variable just as well as with a normal 1/0 coded Boolean variable.
data have;
input ID $ VID Code $ CodeA;
datalines;
1 5 A 1
1 5 B .
2 6 A 1
2 6 B .
2 3 C .
3 4 B .
3 4 C .
;
proc sql ;
create table want as
select *
, max(codeA) as any_codeA
from have
group by id
;
quit;
proc print;
run;
Result
Code any_ Obs ID VID Code A codeA 1 1 5 A 1 1 2 1 5 B . 1 3 2 6 A 1 1 4 2 3 C . 1 5 2 6 B . 1 6 3 4 C . . 7 3 4 B . .
But to do other aggregations like are ALL of the observations TRUE you need to have real Boolean flags where the FALSE results are coded as zero instead of missing.
data have;
input ID $ VID Code $ ;
datalines;
1 5 A
1 5 B
2 6 A
2 6 B
2 3 C
3 4 B
3 4 C
4 7 A
4 7 A
;
proc sql ;
create table want as
select *
, (code='A') as codeA
, max(calculated codeA) as any_codeA
, min(calculated codeA) as all_codeA
from have
group by id
;
quit;
code any_ all_ Obs ID VID Code A codeA codeA 1 1 5 B 0 1 0 2 1 5 A 1 1 0 3 2 6 A 1 1 0 4 2 3 C 0 1 0 5 2 6 B 0 1 0 6 3 4 C 0 0 0 7 3 4 B 0 0 0 8 4 7 A 1 1 1 9 4 7 A 1 1 1
If I may drop in?
input ID $ VID Code $ ;
Is there really a sense for the "$" after ID?
The $ means read it as character. Since I was lazy and did not DEFINE the variables before they appeared in the INPUT statement I need to have that or else it will make ID as a NUMERIC variable.
You should not store identifiers as numbers as there is no reason to do arithmetic with them. What would the MEAN value of ID mean?
@Tom wrote:
You should not store identifiers as numbers as there is no reason to do arithmetic with them. What would the MEAN value of ID mean?
Agree.
But there may be value in knowing the MINIMUM and MAXIMUM ID values (or maybe the range), if ID's were historically assigned in numeric sequence. That's not enough in my mind to justify numeric ID's, when character vars could also be sequenced. But I suspect a lot of ID systems have started out with such numeric sequences. I've certainly seen a lot of surveys do exactly that.
Thank you @Tom - I can follow your explanation, but still have a different view on declaring IDs.
In this case the values of this ID take only numeric values and for this reason alone I see no advantage to declare the ID as a character.
I also know from other BI tools that especially with IDs numbers can be processed much faster than characters. Whether this is also the case in SAS, I do not know - I am still too new.
Best regards!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.