SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
bknitch
Quartz | Level 8

Trying to figure out a way where i can summarize all observations to a unique line corresponding to a distinct YEAR and MEMB_ID. I am able to use a proc summary function to get the needed results but it takes a long time and is very messy. 

 

Table Name = INPUT_TEST

MEMB_IDDATA_YEARDIABETESCEREBROVASCSKIN_ULCERCORONARY_A_DISEASEHISTORY_OF_SMOKINGHYPERTENSIONCLAUDICATIONDISCOLORATION_LOWERMYOCARDIAL_INFRAC
123A2018000000000
123A2018000001000
123A2018100000000
123A2019000000000
123A2019000001000

 

My current data step is working but its not catching the Hypertension flag=1 for 2018, not sure why its not finding this observation. 

 

data TEST123 ;
update INPUT_TEST (obs=0) INPUT_TEST;
by MEMB_ID DATA_YEAR;
run;

 

Results: 

MEMB_IDDATA_YEARDIABETESCEREBROVASCSKIN_ULCERCORONARY_ARTERY_DISEASEHISTORY_OF_SMOKINGHYPERTENSIONCLAUDICATIONDISCOLORATION_LOWERMYOCARDIAL_INFRAC
123A2018100000000
123A2019000001000

 

 

Results I Want: 

MEMB_IDDATA_YEARDIABETESCEREBROVASCSKIN_ULCERCORONARY_ARTERY_DISEASEHISTORY_OF_SMOKINGHYPERTENSIONCLAUDICATIONDISCOLORATION_LOWERMYOCARDIAL_INFRAC
123A2018100001000
123A2019000001000

 

I performed another data step summary by renaming the field values and implementing a RETAIN and an IF function. This worked on my small example but found when performed on the over 800k memb_id's it was inserting values of 1 where there were only 0's so it did not work. Still wanted to provide my code. 

 

 

data SUMMARY_TEST (keep=MEMB_ID DATA_YEAR HYPERTENSION1 cerebrovasc1 claudication1 coronary_artery_disease1 diabetes1 discoloration_lower1 history_of_smoking1 myocardial_infrac1 skin_ulcer1 ) ;
set INPUT_TEST; 
by MEMB_ID DATA_YEAR;
length HYPERTENSION1 
cerebrovasc1 
claudication1 
coronary_artery_disease1 
diabetes1 
discoloration_lower1 
history_of_smoking1 
myocardial_infrac1 
skin_ulcer1 8.;
retain HYPERTENSION1 
cerebrovasc1 
claudication1 
coronary_artery_disease1 
diabetes1 
discoloration_lower1 
history_of_smoking1 
myocardial_infrac1 
skin_ulcer1 ;
if HYPERTENSION > 0 then HYPERTENSION1=HYPERTENSION;
if cerebrovasc > 0 then cerebrovasc1=cerebrovasc;
if claudication > 0 then claudication1=claudication;
if coronary_artery_disease > 0 then coronary_artery_disease1=coronary_artery_disease;
if diabetes > 0 then diabetes1=diabetes;
if discoloration_lower > 0 then discoloration_lower1=discoloration_lower;
if history_of_smoking > 0 then history_of_smoking1=history_of_smoking;
if myocardial_infrac > 0 then myocardial_infrac1=myocardial_infrac;
if skin_ulcer > 0 then skin_ulcer1=skin_ulcer;
if last.data_year then output; 
run;

 

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

I like the idea about using the Update Statement. However, this would work only if the zeros were missing as below. If you can set your zeros to missing, then you're all good 🙂

 

data have;
input MEMB_ID $ DATA_YEAR DIABETES CEREBROVASC SKIN_ULCER 
	  CORONARY_A_DISEASE HISTORY_OF_SMOKING HYPERTENSION CLAUDICATION DISCOLORATION_LOWER MYOCARDIAL_INFRAC;
datalines;
123A 2.18 . . . . . . . . .
123A 2.18 . . . . . 1 . . .
123A 2.18 1 . . . . . . . .
123A 2.19 . . . . . . . . .
123A 2.19 . . . . . 1 . . .
;

data want;
	update have(obs=0) have;
	by MEMB_ID DATA_YEAR;
run;

View solution in original post

10 REPLIES 10
PeterClemmensen
Tourmaline | Level 20

How big is your data set? Can't see why this would take long time using PROC SUMMARY?

ballardw
Super User

@PeterClemmensen wrote:

How big is your data set? Can't see why this would take long time using PROC SUMMARY?


Agree.

 

Unless there are unstated requirements this would be my approach:

proc summary data=input_test nway;
   class memb_id data_year;
   var HYPERTENSION cerebrovasc claudication coronary_artery_disease 
         diabetes discoloration_lower history_of_smoking myocardial_infrac 
         skin_ulcer ;
   output out= work.summary (drop=_:) max=;
run;

 

Tom
Super User Tom
Super User

Note that if the data is already sorted then it is better to use BY instead of CLASS.  Then do not have to worry about having more class levels (by groups) than will fit in memory.

PaigeMiller
Diamond | Level 26

@Tom wrote:

Note that if the data is already sorted then it is better to use BY instead of CLASS.  Then do not have to worry about having more class levels (by groups) than will fit in memory.


True if you want the NWAY option from PROC SUMMARY, then if sorted BY will work faster than CLASS. If you want to turn off NWAY, then you have to use CLASS.

--
Paige Miller
hashman
Ammonite | Level 13

@ballardw :

Have no idea what anyone could find wrong with this approach - MEANS/SUMMARY is designed for that. The only thing I'd change to save my sanity (and also since I'm a lousy typist) is limit the VAR spec to:

 

  var hypertension--skin_ulcer ;

 

And of course, if the real input file is sorted as shown, would use BY instead of CLASS (as @Tom has suggested) and get rid of NWAY (as in this case it's not needed). 

 

Kind regards

Paul D.

novinosrin
Tourmaline | Level 20

yes I concur with draycut  as I feel intrigued to find out what makes proc summary messy. That's rather strange

 

Anyways, here is an array based solution

 

data have;
input MEMB_ID $	DATA_YEAR	DIABETES	CEREBROVASC	SKIN_ULCER	CORONARY_A_DISEASE	HISTORY_OF_SMOKING	HYPERTENSION	CLAUDICATION	DISCOLORATION_LOWER	MYOCARDIAL_INFRAC;
cards;
123A	2018	0	0	0	0	0	0	0	0	0
123A	2018	0	0	0	0	0	1	0	0	0
123A	2018	1	0	0	0	0	0	0	0	0
123A	2019	0	0	0	0	0	0	0	0	0
123A	2019	0	0	0	0	0	1	0	0	0
;

data _null_;
 set have;
 array t DIABETES--MYOCARDIAL_INFRAC;
 call symputx('n',dim(t));
 stop;
run;

data want;
 do until(last.DATA_YEAR);
  set have;
  by MEMB_ID DATA_YEAR;
  array t DIABETES--MYOCARDIAL_INFRAC;
  array j(&n) _temporary_;
  do over t;
   j(_i_)=j(_i_) max t(_i_);
  end;
 end;
call pokelong(peekclong(addrlong(j(1)),&n*8),addrlong(t(1)),&n*8);
call missing(of j(*));
run;

 

hashman
Ammonite | Level 13

@novinosrin :

Well, since the data are binary, instead of scanning the array in every record, I'd rather take advantage of the bitwise functions to scan it after each BY group:

data have ;                                             
  input memb_id $ data_year                             
        diabetes                                        
        cerebrovasc                                     
        skin_ulcer                                      
        coronary_a_disease                              
        history_of_smoking                              
        hypertension                                    
        claudication                                    
        discoloration_lower                             
        myocardial_infrac                               
  ;                                                     
  cards ;                                               
123a 2018 0 0 0 0 0 0 0 0 0                             
123a 2018 0 0 0 0 0 1 0 0 0                             
123a 2018 1 0 0 0 0 0 0 0 0                             
123a 2019 0 0 0 0 0 0 0 0 0                             
123a 2019 0 0 0 0 0 1 0 0 0                             
;                                                       
run ;                                                   
                                                        
data _null_ ;                                           
  set have ;                                            
  array t DIABETES--MYOCARDIAL_INFRAC ;                 
  call symputx ('n', dim (t)) ;                         
  stop ;                                                
run ;                                                   
                                                        
data want (drop = _:) ;                                 
  do _b = 0 by 0 until (last.data_year) ;               
    set have ;                                          
    by memb_id data_year ;                              
    array vv diabetes -- myocardial_infrac ;            
    _b = BOR (_b, input (cats (of vv[*]), binary&n..)) ;
  end ;                                                 
  do over vv ;                                          
    vv = input (char (put (_b, binary&n..), _i_), 1.) ; 
  end ;                                                 
run ;                                                   

After all, why not use the bitwise function, albeit rarely, if they are there Smiley Happy.

 

Kind regards

Paul D.

PeterClemmensen
Tourmaline | Level 20

I like the idea about using the Update Statement. However, this would work only if the zeros were missing as below. If you can set your zeros to missing, then you're all good 🙂

 

data have;
input MEMB_ID $ DATA_YEAR DIABETES CEREBROVASC SKIN_ULCER 
	  CORONARY_A_DISEASE HISTORY_OF_SMOKING HYPERTENSION CLAUDICATION DISCOLORATION_LOWER MYOCARDIAL_INFRAC;
datalines;
123A 2.18 . . . . . . . . .
123A 2.18 . . . . . 1 . . .
123A 2.18 1 . . . . . . . .
123A 2.19 . . . . . . . . .
123A 2.19 . . . . . 1 . . .
;

data want;
	update have(obs=0) have;
	by MEMB_ID DATA_YEAR;
run;
bknitch
Quartz | Level 8

This worked just the way i needed it to. Thank you for the insight on dropping the 0's to Null's. 

PeterClemmensen
Tourmaline | Level 20

Anytime, glad you found your answer 🙂

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 2196 views
  • 12 likes
  • 7 in conversation