- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have been trying for hours so I decided to reach out for help.
Below is an example of my data. I created a flag if the Code is A by using if then do. However, I also need to flag all rows of the same VID if the code was ever A. I have tried using a data step but I cannot seem to figure out anything that works. Any help is appreciated.
Example Have:
ID VID Code CodeA
1 5 A 1
1 5 B
2 6 A 1
2 6 B
2 3 C
3 4 B
3 4 C
Example Want:
ID VID Code CodeA
1 5 A 1
1 5 B 1
2 6 A 1
2 6 B 1
2 3 C
3 4 B
3 4 C
or Example want:
ID VID Code CodeA VID_CODEA
1 5 A 1 1
1 5 B 1
2 6 A 1 1
2 6 B 1
2 3 C
3 4 B
3 4 C
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Double DO loop:
data want;
do until (last.vid);
set have;
by id vid notsorted;
if code = "A" then codea = 1;
end;
do until (last.vid);
set have;
by id vid notsorted;
output;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This one is always a bit tricky. If you ever need to find this again, you would call it last observation carried forward.
data have;
infile datalines delimiter = "," dsd missover;
input ID $ VID Code $ CodeA;
datalines;
1,5,A,1
1,5,B,
2,6,A,1
2,6,B,
2,3,C,
3,4,B,
3,4,C,
;
run;
proc sort data = have;
by id vid code;
run;
data want;
set have;
length vid_codea 3.;
retain vid_codea;
by id vid;
if code = 'A' then codeA = 1;
if first.vid then call missing(vid_codeA);
if not missing(codeA) then vid_codeA = codeA;
run;
ID VID Code CodeA vid_codea 1 5 A 1 1 1 5 B . 1 2 3 C . . 2 6 A 1 1 2 6 B . 1 3 4 B . . 3 4 C . .
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@maguiremq Thank you so much. To be clear, column vid_codea is what I am looking for, correct? If so, it seemed to work perfectly!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@maguiremq I spoke too soon. Below is the dataset and there should be a '1' for the bolded lines with the same VID when Code=1. Any ideas what may be the issue?
ID VID Code CodeA vid_codea
1 5 A 1 1
1 5 B . .
2 3 C . .
2 6 A 1 1
2 6 B . .
3 4 B . .
3 4 C . .
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
As long as your data are already sorted by ID, it doesn't matter where the code "A" occurs. You can always do a MERGE of the code="A" obs with all the obs. No further sorting is needed.
Code below is edited, and tested:
data have;
infile datalines delimiter = "," dsd missover;
input ID $ VID Code $ CodeA;
datalines;
1,5,A,1
1,5,B,
2,6,A,1
2,6,B,
2,3,C,
3,4,B,
3,4,C,
run;
data want;
merge have (where=(code='A') in=found_a_qualifier)
have (in=allobs);
by id;
retain vid_codea;
if first.id then vid_codea=found_a_qualifier; /*Changed first.have to first.id */
run;
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@mkeintz thank you! for some reason when I run the code, both the vid_codea and found_a_qualifier are empty columns?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@stats_auca wrote:
@mkeintz thank you! for some reason when I run the code, both the vid_codea and found_a_qualifier are empty columns?
Yes. I have corrected the code to use first.id instead of first.have. This is the result of my failure to test the code against your sample data. I've edited the response, with tested code.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Data Data_A;
Set Data_Have;
Where code = “A”;
Run;
Example: Data_A
ID VID Code CodeA
1 5 A 1
2 6 A 1
After that use proc sql to assign a new flag as you want:
Proc sql noprint;
Create table Data_A_want as
Select *, case when VID in (select distinct VID from Data_A) then 1 end else 0 as VID_ CODEA
From Data_Have;
Quit;
The above code should give you the flag you want.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Double DO loop:
data want;
do until (last.vid);
set have;
by id vid notsorted;
if code = "A" then codea = 1;
end;
do until (last.vid);
set have;
by id vid notsorted;
output;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
To test if a Boolean is ever TRUE just take the MAX(). This will work with your 1/. coded variable just as well as with a normal 1/0 coded Boolean variable.
data have;
input ID $ VID Code $ CodeA;
datalines;
1 5 A 1
1 5 B .
2 6 A 1
2 6 B .
2 3 C .
3 4 B .
3 4 C .
;
proc sql ;
create table want as
select *
, max(codeA) as any_codeA
from have
group by id
;
quit;
proc print;
run;
Result
Code any_ Obs ID VID Code A codeA 1 1 5 A 1 1 2 1 5 B . 1 3 2 6 A 1 1 4 2 3 C . 1 5 2 6 B . 1 6 3 4 C . . 7 3 4 B . .
But to do other aggregations like are ALL of the observations TRUE you need to have real Boolean flags where the FALSE results are coded as zero instead of missing.
data have;
input ID $ VID Code $ ;
datalines;
1 5 A
1 5 B
2 6 A
2 6 B
2 3 C
3 4 B
3 4 C
4 7 A
4 7 A
;
proc sql ;
create table want as
select *
, (code='A') as codeA
, max(calculated codeA) as any_codeA
, min(calculated codeA) as all_codeA
from have
group by id
;
quit;
code any_ all_ Obs ID VID Code A codeA codeA 1 1 5 B 0 1 0 2 1 5 A 1 1 0 3 2 6 A 1 1 0 4 2 3 C 0 1 0 5 2 6 B 0 1 0 6 3 4 C 0 0 0 7 3 4 B 0 0 0 8 4 7 A 1 1 1 9 4 7 A 1 1 1
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If I may drop in?
input ID $ VID Code $ ;
Is there really a sense for the "$" after ID?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The $ means read it as character. Since I was lazy and did not DEFINE the variables before they appeared in the INPUT statement I need to have that or else it will make ID as a NUMERIC variable.
You should not store identifiers as numbers as there is no reason to do arithmetic with them. What would the MEAN value of ID mean?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Tom wrote:
You should not store identifiers as numbers as there is no reason to do arithmetic with them. What would the MEAN value of ID mean?
Agree.
But there may be value in knowing the MINIMUM and MAXIMUM ID values (or maybe the range), if ID's were historically assigned in numeric sequence. That's not enough in my mind to justify numeric ID's, when character vars could also be sequenced. But I suspect a lot of ID systems have started out with such numeric sequences. I've certainly seen a lot of surveys do exactly that.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you @Tom - I can follow your explanation, but still have a different view on declaring IDs.
In this case the values of this ID take only numeric values and for this reason alone I see no advantage to declare the ID as a character.
I also know from other BI tools that especially with IDs numbers can be processed much faster than characters. Whether this is also the case in SAS, I do not know - I am still too new.
Best regards!