- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Trying to figure out a way where i can summarize all observations to a unique line corresponding to a distinct YEAR and MEMB_ID. I am able to use a proc summary function to get the needed results but it takes a long time and is very messy.
Table Name = INPUT_TEST
MEMB_ID | DATA_YEAR | DIABETES | CEREBROVASC | SKIN_ULCER | CORONARY_A_DISEASE | HISTORY_OF_SMOKING | HYPERTENSION | CLAUDICATION | DISCOLORATION_LOWER | MYOCARDIAL_INFRAC |
123A | 2018 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
123A | 2018 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
123A | 2018 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
123A | 2019 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
123A | 2019 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
My current data step is working but its not catching the Hypertension flag=1 for 2018, not sure why its not finding this observation.
data TEST123 ;
update INPUT_TEST (obs=0) INPUT_TEST;
by MEMB_ID DATA_YEAR;
run;
Results:
MEMB_ID | DATA_YEAR | DIABETES | CEREBROVASC | SKIN_ULCER | CORONARY_ARTERY_DISEASE | HISTORY_OF_SMOKING | HYPERTENSION | CLAUDICATION | DISCOLORATION_LOWER | MYOCARDIAL_INFRAC |
123A | 2018 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
123A | 2019 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Results I Want:
MEMB_ID | DATA_YEAR | DIABETES | CEREBROVASC | SKIN_ULCER | CORONARY_ARTERY_DISEASE | HISTORY_OF_SMOKING | HYPERTENSION | CLAUDICATION | DISCOLORATION_LOWER | MYOCARDIAL_INFRAC |
123A | 2018 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
123A | 2019 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
I performed another data step summary by renaming the field values and implementing a RETAIN and an IF function. This worked on my small example but found when performed on the over 800k memb_id's it was inserting values of 1 where there were only 0's so it did not work. Still wanted to provide my code.
data SUMMARY_TEST (keep=MEMB_ID DATA_YEAR HYPERTENSION1 cerebrovasc1 claudication1 coronary_artery_disease1 diabetes1 discoloration_lower1 history_of_smoking1 myocardial_infrac1 skin_ulcer1 ) ;
set INPUT_TEST;
by MEMB_ID DATA_YEAR;
length HYPERTENSION1
cerebrovasc1
claudication1
coronary_artery_disease1
diabetes1
discoloration_lower1
history_of_smoking1
myocardial_infrac1
skin_ulcer1 8.;
retain HYPERTENSION1
cerebrovasc1
claudication1
coronary_artery_disease1
diabetes1
discoloration_lower1
history_of_smoking1
myocardial_infrac1
skin_ulcer1 ;
if HYPERTENSION > 0 then HYPERTENSION1=HYPERTENSION;
if cerebrovasc > 0 then cerebrovasc1=cerebrovasc;
if claudication > 0 then claudication1=claudication;
if coronary_artery_disease > 0 then coronary_artery_disease1=coronary_artery_disease;
if diabetes > 0 then diabetes1=diabetes;
if discoloration_lower > 0 then discoloration_lower1=discoloration_lower;
if history_of_smoking > 0 then history_of_smoking1=history_of_smoking;
if myocardial_infrac > 0 then myocardial_infrac1=myocardial_infrac;
if skin_ulcer > 0 then skin_ulcer1=skin_ulcer;
if last.data_year then output;
run;
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I like the idea about using the Update Statement. However, this would work only if the zeros were missing as below. If you can set your zeros to missing, then you're all good 🙂
data have;
input MEMB_ID $ DATA_YEAR DIABETES CEREBROVASC SKIN_ULCER
CORONARY_A_DISEASE HISTORY_OF_SMOKING HYPERTENSION CLAUDICATION DISCOLORATION_LOWER MYOCARDIAL_INFRAC;
datalines;
123A 2.18 . . . . . . . . .
123A 2.18 . . . . . 1 . . .
123A 2.18 1 . . . . . . . .
123A 2.19 . . . . . . . . .
123A 2.19 . . . . . 1 . . .
;
data want;
update have(obs=0) have;
by MEMB_ID DATA_YEAR;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
How big is your data set? Can't see why this would take long time using PROC SUMMARY?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@PeterClemmensen wrote:
How big is your data set? Can't see why this would take long time using PROC SUMMARY?
Agree.
Unless there are unstated requirements this would be my approach:
proc summary data=input_test nway; class memb_id data_year; var HYPERTENSION cerebrovasc claudication coronary_artery_disease diabetes discoloration_lower history_of_smoking myocardial_infrac skin_ulcer ; output out= work.summary (drop=_:) max=; run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Note that if the data is already sorted then it is better to use BY instead of CLASS. Then do not have to worry about having more class levels (by groups) than will fit in memory.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Tom wrote:
Note that if the data is already sorted then it is better to use BY instead of CLASS. Then do not have to worry about having more class levels (by groups) than will fit in memory.
True if you want the NWAY option from PROC SUMMARY, then if sorted BY will work faster than CLASS. If you want to turn off NWAY, then you have to use CLASS.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Have no idea what anyone could find wrong with this approach - MEANS/SUMMARY is designed for that. The only thing I'd change to save my sanity (and also since I'm a lousy typist) is limit the VAR spec to:
var hypertension--skin_ulcer ;
And of course, if the real input file is sorted as shown, would use BY instead of CLASS (as @Tom has suggested) and get rid of NWAY (as in this case it's not needed).
Kind regards
Paul D.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
yes I concur with draycut as I feel intrigued to find out what makes proc summary messy. That's rather strange
Anyways, here is an array based solution
data have;
input MEMB_ID $ DATA_YEAR DIABETES CEREBROVASC SKIN_ULCER CORONARY_A_DISEASE HISTORY_OF_SMOKING HYPERTENSION CLAUDICATION DISCOLORATION_LOWER MYOCARDIAL_INFRAC;
cards;
123A 2018 0 0 0 0 0 0 0 0 0
123A 2018 0 0 0 0 0 1 0 0 0
123A 2018 1 0 0 0 0 0 0 0 0
123A 2019 0 0 0 0 0 0 0 0 0
123A 2019 0 0 0 0 0 1 0 0 0
;
data _null_;
set have;
array t DIABETES--MYOCARDIAL_INFRAC;
call symputx('n',dim(t));
stop;
run;
data want;
do until(last.DATA_YEAR);
set have;
by MEMB_ID DATA_YEAR;
array t DIABETES--MYOCARDIAL_INFRAC;
array j(&n) _temporary_;
do over t;
j(_i_)=j(_i_) max t(_i_);
end;
end;
call pokelong(peekclong(addrlong(j(1)),&n*8),addrlong(t(1)),&n*8);
call missing(of j(*));
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Well, since the data are binary, instead of scanning the array in every record, I'd rather take advantage of the bitwise functions to scan it after each BY group:
data have ;
input memb_id $ data_year
diabetes
cerebrovasc
skin_ulcer
coronary_a_disease
history_of_smoking
hypertension
claudication
discoloration_lower
myocardial_infrac
;
cards ;
123a 2018 0 0 0 0 0 0 0 0 0
123a 2018 0 0 0 0 0 1 0 0 0
123a 2018 1 0 0 0 0 0 0 0 0
123a 2019 0 0 0 0 0 0 0 0 0
123a 2019 0 0 0 0 0 1 0 0 0
;
run ;
data _null_ ;
set have ;
array t DIABETES--MYOCARDIAL_INFRAC ;
call symputx ('n', dim (t)) ;
stop ;
run ;
data want (drop = _:) ;
do _b = 0 by 0 until (last.data_year) ;
set have ;
by memb_id data_year ;
array vv diabetes -- myocardial_infrac ;
_b = BOR (_b, input (cats (of vv[*]), binary&n..)) ;
end ;
do over vv ;
vv = input (char (put (_b, binary&n..), _i_), 1.) ;
end ;
run ;
After all, why not use the bitwise function, albeit rarely, if they are there .
Kind regards
Paul D.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I like the idea about using the Update Statement. However, this would work only if the zeros were missing as below. If you can set your zeros to missing, then you're all good 🙂
data have;
input MEMB_ID $ DATA_YEAR DIABETES CEREBROVASC SKIN_ULCER
CORONARY_A_DISEASE HISTORY_OF_SMOKING HYPERTENSION CLAUDICATION DISCOLORATION_LOWER MYOCARDIAL_INFRAC;
datalines;
123A 2.18 . . . . . . . . .
123A 2.18 . . . . . 1 . . .
123A 2.18 1 . . . . . . . .
123A 2.19 . . . . . . . . .
123A 2.19 . . . . . 1 . . .
;
data want;
update have(obs=0) have;
by MEMB_ID DATA_YEAR;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This worked just the way i needed it to. Thank you for the insight on dropping the 0's to Null's.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Anytime, glad you found your answer 🙂