BookmarkSubscribeRSS Feed
yqsmm
Calcite | Level 5

Hi,

 

I have a dataset as below. There are different number of responses nested within each cluster, each response came from a subject with same or different job levels,  I would like to aggregate the information of V1 and V2 to the cluster level using the following criteria:

1. Use the values of V1 and V2 from the subject with the highest job level (1>2>3).

2. If there are more than one responses from the highest job level, the cluster level variable will be 1 if any response from the highest job level is 1.

3. If there are more than one responses from the highest job level and some of them are missing, the cluster level variable will use the response from the nonmissing values if it is 0.

4. If V1 or V2 from the highest job level are all missing, then move to the next job level.  


data a;
input clusterid joblevel v1 v2;
datalines;
1 1 . 1
1 1 . 1
1 3 0 0
1 3 0 .
1 3 1 0
2 1 1 1
2 2 0 .
2 3 . 1
2 3 0 0
2 3 1 .
3 1 . 1
3 1 0 0
3 2 1 1
3 2 1 0
3 3 0 0
3 3 . .
3 3 0 1
;
run;

 

I am hoping to get a data like this:

clusterid V1 V2

1 1 1

2 1 1

3 0 1

 

Thanks A lot~

8 REPLIES 8
PeterClemmensen
Tourmaline | Level 20

Im not sure I understand your logic completely 🙂

 

You write

 

3. If there are more than one responses from the highest job level and some of them are missing, the cluster level variable will use the response from the nonmissing values if it is 0.

 

But isn't that exactly the case in the first case where clusterid = 1? then you want to look only at responses from joblevel 1, where you have missing values, so you want to use the response from the nonmissing if it is zero? but the nonmissing value there is not zero, it is 1? 🙂

 

yqsmm
Calcite | Level 5

Sorry it is confusing.  I should delete 'if it is 0'. It should be:

3. If there are more than one responses from the highest job level and some of them are missing, the cluster level variable will use the response from the nonmissing values. 

 

Cluster 3 is the scenario for the criteria. There are two responses from job level 1, The first record of joblevel 1 has missing on V1, but the second has value 0. So for cluster3 V1, we use the value 0.

 

Among Cluster 1, the two records from the highest joblevel (1) are all missing on V1, therefore we need to use records from joblevel 3 (next level).   

 

Does this make sense? 

ballardw
Super User

If often helps to show what the actual output for your example input data would be.

yqsmm
Calcite | Level 5
Yes. The data I want is at the end of the original post.
mkeintz
PROC Star

Question:  Is the program supposed to tread V1 and V2 independently?

 

mk

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
yqsmm
Calcite | Level 5
Yes. So there will be two cluster level variable V1 and V2.
mkeintz
PROC Star

If appears that within each clusterid, you data is sorted by joblevel.  So if the desired result for V1 is produced within a given job level, then V1 is finished, and subsequent job levels should not be used.  Same for V2.  The program below nests of loop over each joblevel within its clusterid.  It

  1. At the end of processing the clusterid, it copies result values back to the original vars, ready for output.
  2. At the end of each joblevel, if the result is non-missing it declare that variable as finished.
  3. For every record in every subsequent joblevel, if a variable is marked finished, it is not processed further
  4. The CONTINUE statement tells the loop to skip directly to the next iteration, so that one var may be marked finished, but the other var is still processed. 
data want (drop=result: finished: I);
  
  do until (last.clusterid);
    do until (last.joblevel);
      set a;
      by clusterid joblevel;

      array v{2} v1 v2; 
      array result{2};  
      array finished{2}; 

      do I=1 to 2;
        if finished{I}=1 then continue;
        if v{I}=1 then result{I}=1;
        else if result{I}=. then result{I}=v{I};
      end; 
    end;   

    do I=1 to 2;
      if not(missing(result{I})) then finished{I}=1;
    end;
  end;     
  do I=1 to 2;
    v{I}=result{I};
  end;
  put (clusterid joblevel v1 v2) (:);
run;

 

Regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User
Your question is kind of hard to understand.



data a;
input clusterid joblevel v1 v2;
datalines;
1 1 . 1
1 1 . 1
1 3 0 0
1 3 0 .
1 3 1 0
2 1 1 1
2 2 0 .
2 3 . 1
2 3 0 0
2 3 1 .
3 1 . 1
3 1 0 0
3 2 1 1
3 2 1 0
3 3 0 0
3 3 . .
3 3 0 1
;
run;
data want;
do until(last.joblevel);
 set a;
 by clusterid joblevel notsorted;
 if first.clusterid then first=1;
 new_v1=coalesce(new_v1,v1);
 new_v2=coalesce(new_v2,v2);
end;
 new_v1=coalesce(new_v1,1);
 new_v2=coalesce(new_v2,1);
if first then output;
drop first;
run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 1571 views
  • 0 likes
  • 5 in conversation