Hi Tom,
I am jumping back and forth between two processes. As you said, "Sounds like you want an indicator of whether every observation for a variable is zero (that is was it ever non-zero)." However, the indicator variable is within each value of the variable ITEM. So, for ITEM = 1 there are 9 rows. You will note that at some point, the probability of selecting response option _0 goes from 0 to 0.5 and then 1. This tells me that at least one subject selected this response option. Therefore, I want to leave the values for _0 alone. The same holds true for _1 to _3. However, for _4, the values are all zero for ITEM = 1. This tells me that no subject selected response option _4 for this item.
The second process is to set to missing all the zeros for a specific item number and response category based on the results for the indicator variable in the prior step. For ITEM=1, this means leaving the values for _0 to _3 alone since there is at least one row with a value >0. However, for _4, I want to set the zeros to missing for ITEM=1 because subjects never used this response option. Basically, only a 4-point response scale was needed for this item.
In response to your question:
Are you saying you want to eliminate all observations for that level of ITEM? Only for the _# where all the observations were zero. So, for ITEM=1 and _4, all the observations are zero. I want to set them to missing and repeat this process for the remaining items.
The code produced by Patrick using the HASH language produces exactly what I am looking for. I am trying to read up on HASH now as I have never seen it used before and the code appears very strange to me. I have to document it as one of my colleagues will need to peer review the code. I also need to understand it because I am trying to program a generic solution that can be applied to different datasets that may have a different number of items and response options.
Cristian
This is probably doing the same thing as the HASH but might be easier to explain.
proc sql ;
create table want as
select
item
, row
, case when(max(cat0)=0) then . else cat0 end as cat0
, case when(max(cat1)=0) then . else cat1 end as cat1
, case when(max(cat2)=0) then . else cat2 end as cat2
, case when(max(cat3)=0) then . else cat3 end as cat3
, case when(max(cat4)=0) then . else cat4 end as cat4
from have
group by item
order by item, row
;
quit;
The only tricky thing is that PROC SQL will automatically re-merge the values of the aggregate function (MAX(cat0) etc) back with the original data for you. If you where doing this in some SQL implementation you would need to calculate the max(cat...) per ITEM in a subquery and join it back to the original dataset yourself.
Data used:
data have;
input ITEM CAT0-CAT4 ;
row+1;
if item ne lag(item) then row=1;
datalines4;
1 0 0 0 1 0
1 0 0 0.5 0 0
1 0 0 0.5 0.5 0
1 0 0 0.5 1 0
1 0 0.5 0.5 0 0
1 0.5 0 0 0 0
1 0.5 0.5 0 0 0
1 0.5 0.5 0.5 0 0
1 1 0 0 0 0
2 0 0 0 1 0
2 0 0 0.5 0 0
2 0 0 0.5 0.5 0
2 0 0 0.5 1 0
2 0 0 1 0 0
2 0 0.5 0.5 0 0
2 0.5 0 0 0 0
2 0.5 0.5 0 0 0
2 0.5 0.5 0.5 0 0
2 1 0 0 0 0
3 0 0 0 1 0
3 0 0 0.5 0 0
3 0 0 0.5 0.5 0
3 0 0 0.5 1 0
3 0 0.5 0 0 0
3 0 0.5 0.5 0 0
3 0.5 0 0 0 0
3 0.5 0.5 0 0 0
3 1 0 0 0 0
4 0 0 0 0.5 0
4 0 0 0 1 0
4 0 0 0.5 0.5 0
4 0 0.5 0 0 0
4 0 0.5 0.5 0 0
4 0 0.5 0.5 0.5 0
4 0 1 0 0 0
4 0.5 0.5 0 0 0
4 1 0 0 0 0
4 1 0.5 0 0 0
5 0 0 0 0.5 0
5 0 0 0 1 0
5 0 0 0.5 0 0
5 0 0 0.5 0.5 0
5 0 0.5 0 0 0
5 0 0.5 0.5 0 0
5 0.5 0.5 0 0 0
5 1 0 0 0 0
5 1 0.5 0 0 0
6 0 0 0 0.5 0
6 0 0 0 1 0
6 0 0 0.5 0.5 0
6 0 0.5 0 0 0
6 0 0.5 0.5 0 0
6 0 0.5 0.5 0.5 0
6 0 1 0 0 0
6 0.5 0.5 0 0 0
6 1 0 0 0 0
6 1 0.5 0 0 0
7 0 0 0 1 0
7 0 0 0.5 0 0
7 0 0 0.5 0.5 0
7 0 0 0.5 1 0
7 0 0.5 0 0 0
7 0 0.5 0.5 0 0
7 0.5 0.5 0 0 0
7 1 0 0 0 0
7 1 0.5 0 0 0
8 0 0 0 0.5 0
8 0 0 0 1 0
8 0 0 0.5 0.5 0
8 0 0.5 0 0 0
8 0 0.5 0.5 0 0
8 0 0.5 0.5 0.5 0
8 0 1 0 0 0
8 0.5 0.5 0 0 0
8 0.5 1 0 0 0
8 1 0 0 0 0
8 1 0.5 0 0 0
9 0 0 0 1 0
9 0 0 0.5 0 0
9 0 0 0.5 0.5 0
9 0 0 0.5 1 0
9 0 0.5 0 0 0
9 0 0.5 0.5 0 0
9 0.5 0 0 0 0
9 0.5 0.5 0 0 0
9 1 0 0 0 0
10 0 0 0 0 0.5
10 0 0 0 0 1
10 0 0 0 0.5 0.5
10 0 0 0.5 0 0
10 0 0 0.5 0.5 0
10 0 0 0.5 0.5 0.5
10 0 0.5 0.5 0 0
10 0.5 0 0 0 0
10 0.5 0.5 0 0 0
10 0.5 0.5 0.5 0 0
10 1 0 0 0 0
;;;;
To see the results let's just print the first row for each ITEM value.
proc print data=want ;
where row=1;
run;
And you can see that CAT4 is set to missing for all items except the last one.
Obs ITEM row cat0 cat1 cat2 cat3 cat4 1 1 1 0 0 0 1.0 . 10 2 1 0 0 0 1.0 . 20 3 1 0 0 0 1.0 . 29 4 1 0 0 0 0.5 . 39 5 1 0 0 0 0.5 . 48 6 1 0 0 0 0.5 . 58 7 1 0 0 0 1.0 . 67 8 1 0 0 0 0.5 . 78 9 1 0 0 0 1.0 . 87 10 1 0 0 0 0.0 0.5
Tom,
This code works quite nicely and I was able to follow it much better than the HASH approach. I like both approaches. The HASH ran slightly faster 0.06 seconds versus 0.08 seconds and may be easier to macro as I would only need to replace _0 and _4 with the category labels for any new dataset. That said, the SQL approach is much easier to understand.
Thank you both.
Cristian
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.