Hi,
I have a dataset as below. There are different number of responses nested within each cluster, each response came from a subject with same or different job levels, I would like to aggregate the information of V1 and V2 to the cluster level using the following criteria:
1. Use the values of V1 and V2 from the subject with the highest job level (1>2>3).
2. If there are more than one responses from the highest job level, the cluster level variable will be 1 if any response from the highest job level is 1.
3. If there are more than one responses from the highest job level and some of them are missing, the cluster level variable will use the response from the nonmissing values if it is 0.
4. If V1 or V2 from the highest job level are all missing, then move to the next job level.
data a;
input clusterid joblevel v1 v2;
datalines;
1 1 . 1
1 1 . 1
1 3 0 0
1 3 0 .
1 3 1 0
2 1 1 1
2 2 0 .
2 3 . 1
2 3 0 0
2 3 1 .
3 1 . 1
3 1 0 0
3 2 1 1
3 2 1 0
3 3 0 0
3 3 . .
3 3 0 1
;
run;
I am hoping to get a data like this:
clusterid V1 V2
1 1 1
2 1 1
3 0 1
Thanks A lot~
Im not sure I understand your logic completely 🙂
You write
3. If there are more than one responses from the highest job level and some of them are missing, the cluster level variable will use the response from the nonmissing values if it is 0.
But isn't that exactly the case in the first case where clusterid = 1? then you want to look only at responses from joblevel 1, where you have missing values, so you want to use the response from the nonmissing if it is zero? but the nonmissing value there is not zero, it is 1? 🙂
Sorry it is confusing. I should delete 'if it is 0'. It should be:
3. If there are more than one responses from the highest job level and some of them are missing, the cluster level variable will use the response from the nonmissing values.
Cluster 3 is the scenario for the criteria. There are two responses from job level 1, The first record of joblevel 1 has missing on V1, but the second has value 0. So for cluster3 V1, we use the value 0.
Among Cluster 1, the two records from the highest joblevel (1) are all missing on V1, therefore we need to use records from joblevel 3 (next level).
Does this make sense?
If often helps to show what the actual output for your example input data would be.
Question: Is the program supposed to tread V1 and V2 independently?
mk
If appears that within each clusterid, you data is sorted by joblevel. So if the desired result for V1 is produced within a given job level, then V1 is finished, and subsequent job levels should not be used. Same for V2. The program below nests of loop over each joblevel within its clusterid. It
data want (drop=result: finished: I);
do until (last.clusterid);
do until (last.joblevel);
set a;
by clusterid joblevel;
array v{2} v1 v2;
array result{2};
array finished{2};
do I=1 to 2;
if finished{I}=1 then continue;
if v{I}=1 then result{I}=1;
else if result{I}=. then result{I}=v{I};
end;
end;
do I=1 to 2;
if not(missing(result{I})) then finished{I}=1;
end;
end;
do I=1 to 2;
v{I}=result{I};
end;
put (clusterid joblevel v1 v2) (:);
run;
Regards,
Mark
Your question is kind of hard to understand. data a; input clusterid joblevel v1 v2; datalines; 1 1 . 1 1 1 . 1 1 3 0 0 1 3 0 . 1 3 1 0 2 1 1 1 2 2 0 . 2 3 . 1 2 3 0 0 2 3 1 . 3 1 . 1 3 1 0 0 3 2 1 1 3 2 1 0 3 3 0 0 3 3 . . 3 3 0 1 ; run; data want; do until(last.joblevel); set a; by clusterid joblevel notsorted; if first.clusterid then first=1; new_v1=coalesce(new_v1,v1); new_v2=coalesce(new_v2,v2); end; new_v1=coalesce(new_v1,1); new_v2=coalesce(new_v2,1); if first then output; drop first; run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.