SAS Programming

bknitch · Posted 08-16-2019 12:06 PM

Trying to figure out a way where i can summarize all observations to a unique line corresponding to a distinct YEAR and MEMB_ID. I am able to use a proc summary function to get the needed results but it takes a long time and is very messy.

Table Name = INPUT_TEST

MEMB_ID	DATA_YEAR	DIABETES	CEREBROVASC	SKIN_ULCER	CORONARY_A_DISEASE	HISTORY_OF_SMOKING	HYPERTENSION	CLAUDICATION	DISCOLORATION_LOWER	MYOCARDIAL_INFRAC
123A	2018	0	0	0	0	0	0	0	0	0
123A	2018	0	0	0	0	0	1	0	0	0
123A	2018	1	0	0	0	0	0	0	0	0
123A	2019	0	0	0	0	0	0	0	0	0
123A	2019	0	0	0	0	0	1	0	0	0

My current data step is working but its not catching the Hypertension flag=1 for 2018, not sure why its not finding this observation.

data TEST123 ;
update INPUT_TEST (obs=0) INPUT_TEST;
by MEMB_ID DATA_YEAR;
run;

Results:

MEMB_ID	DATA_YEAR	DIABETES	CEREBROVASC	SKIN_ULCER	CORONARY_ARTERY_DISEASE	HISTORY_OF_SMOKING	HYPERTENSION	CLAUDICATION	DISCOLORATION_LOWER	MYOCARDIAL_INFRAC
123A	2018	1	0	0	0	0	0	0	0	0
123A	2019	0	0	0	0	0	1	0	0	0

Results I Want:

MEMB_ID	DATA_YEAR	DIABETES	CEREBROVASC	SKIN_ULCER	CORONARY_ARTERY_DISEASE	HISTORY_OF_SMOKING	HYPERTENSION	CLAUDICATION	DISCOLORATION_LOWER	MYOCARDIAL_INFRAC
123A	2018	1	0	0	0	0	1	0	0	0
123A	2019	0	0	0	0	0	1	0	0	0

I performed another data step summary by renaming the field values and implementing a RETAIN and an IF function. This worked on my small example but found when performed on the over 800k memb_id's it was inserting values of 1 where there were only 0's so it did not work. Still wanted to provide my code.

data SUMMARY_TEST (keep=MEMB_ID DATA_YEAR HYPERTENSION1 cerebrovasc1 claudication1 coronary_artery_disease1 diabetes1 discoloration_lower1 history_of_smoking1 myocardial_infrac1 skin_ulcer1 ) ;
set INPUT_TEST; 
by MEMB_ID DATA_YEAR;
length HYPERTENSION1 
cerebrovasc1 
claudication1 
coronary_artery_disease1 
diabetes1 
discoloration_lower1 
history_of_smoking1 
myocardial_infrac1 
skin_ulcer1 8.;
retain HYPERTENSION1 
cerebrovasc1 
claudication1 
coronary_artery_disease1 
diabetes1 
discoloration_lower1 
history_of_smoking1 
myocardial_infrac1 
skin_ulcer1 ;
if HYPERTENSION > 0 then HYPERTENSION1=HYPERTENSION;
if cerebrovasc > 0 then cerebrovasc1=cerebrovasc;
if claudication > 0 then claudication1=claudication;
if coronary_artery_disease > 0 then coronary_artery_disease1=coronary_artery_disease;
if diabetes > 0 then diabetes1=diabetes;
if discoloration_lower > 0 then discoloration_lower1=discoloration_lower;
if history_of_smoking > 0 then history_of_smoking1=history_of_smoking;
if myocardial_infrac > 0 then myocardial_infrac1=myocardial_infrac;
if skin_ulcer > 0 then skin_ulcer1=skin_ulcer;
if last.data_year then output; 
run;

PeterClemmensen · Posted 08-16-2019 12:26 PM

I like the idea about using the Update Statement. However, this would work only if the zeros were missing as below. If you can set your zeros to missing, then you're all good 🙂

data have;
input MEMB_ID $ DATA_YEAR DIABETES CEREBROVASC SKIN_ULCER 
	  CORONARY_A_DISEASE HISTORY_OF_SMOKING HYPERTENSION CLAUDICATION DISCOLORATION_LOWER MYOCARDIAL_INFRAC;
datalines;
123A 2.18 . . . . . . . . .
123A 2.18 . . . . . 1 . . .
123A 2.18 1 . . . . . . . .
123A 2.19 . . . . . . . . .
123A 2.19 . . . . . 1 . . .
;

data want;
	update have(obs=0) have;
	by MEMB_ID DATA_YEAR;
run;

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

PeterClemmensen · Posted 08-16-2019 12:13 PM

How big is your data set? Can't see why this would take long time using PROC SUMMARY?

The DATA to DATA Step Macro
Blog: SASnrd

ballardw · Posted 08-16-2019 01:02 PM

@PeterClemmensen wrote:

How big is your data set? Can't see why this would take long time using PROC SUMMARY?

Agree.

Unless there are unstated requirements this would be my approach:

proc summary data=input_test nway;
   class memb_id data_year;
   var HYPERTENSION cerebrovasc claudication coronary_artery_disease 
         diabetes discoloration_lower history_of_smoking myocardial_infrac 
         skin_ulcer ;
   output out= work.summary (drop=_:) max=;
run;

Tom · Posted 08-16-2019 03:27 PM

Note that if the data is already sorted then it is better to use BY instead of CLASS. Then do not have to worry about having more class levels (by groups) than will fit in memory.

PaigeMiller · Posted 08-17-2019 06:48 AM

@Tom wrote:

Note that if the data is already sorted then it is better to use BY instead of CLASS. Then do not have to worry about having more class levels (by groups) than will fit in memory.

True if you want the NWAY option from PROC SUMMARY, then if sorted BY will work faster than CLASS. If you want to turn off NWAY, then you have to use CLASS.

--
Paige Miller

hashman · Posted 08-17-2019 01:16 AM

@ballardw :

Have no idea what anyone could find wrong with this approach - MEANS/SUMMARY is designed for that. The only thing I'd change to save my sanity (and also since I'm a lousy typist) is limit the VAR spec to:

var hypertension--skin_ulcer ;

And of course, if the real input file is sorted as shown, would use BY instead of CLASS (as @Tom has suggested) and get rid of NWAY (as in this case it's not needed).

Kind regards

Paul D.

novinosrin · Posted 08-16-2019 12:24 PM

yes I concur with draycut as I feel intrigued to find out what makes proc summary messy. That's rather strange

Anyways, here is an array based solution

data have;
input MEMB_ID $	DATA_YEAR	DIABETES	CEREBROVASC	SKIN_ULCER	CORONARY_A_DISEASE	HISTORY_OF_SMOKING	HYPERTENSION	CLAUDICATION	DISCOLORATION_LOWER	MYOCARDIAL_INFRAC;
cards;
123A	2018	0	0	0	0	0	0	0	0	0
123A	2018	0	0	0	0	0	1	0	0	0
123A	2018	1	0	0	0	0	0	0	0	0
123A	2019	0	0	0	0	0	0	0	0	0
123A	2019	0	0	0	0	0	1	0	0	0
;

data _null_;
 set have;
 array t DIABETES--MYOCARDIAL_INFRAC;
 call symputx('n',dim(t));
 stop;
run;

data want;
 do until(last.DATA_YEAR);
  set have;
  by MEMB_ID DATA_YEAR;
  array t DIABETES--MYOCARDIAL_INFRAC;
  array j(&n) _temporary_;
  do over t;
   j(_i_)=j(_i_) max t(_i_);
  end;
 end;
call pokelong(peekclong(addrlong(j(1)),&n*8),addrlong(t(1)),&n*8);
call missing(of j(*));
run;

hashman · Posted 08-17-2019 02:08 AM

@novinosrin :

Well, since the data are binary, instead of scanning the array in every record, I'd rather take advantage of the bitwise functions to scan it after each BY group:

data have ;                                             
  input memb_id $ data_year                             
        diabetes                                        
        cerebrovasc                                     
        skin_ulcer                                      
        coronary_a_disease                              
        history_of_smoking                              
        hypertension                                    
        claudication                                    
        discoloration_lower                             
        myocardial_infrac                               
  ;                                                     
  cards ;                                               
123a 2018 0 0 0 0 0 0 0 0 0                             
123a 2018 0 0 0 0 0 1 0 0 0                             
123a 2018 1 0 0 0 0 0 0 0 0                             
123a 2019 0 0 0 0 0 0 0 0 0                             
123a 2019 0 0 0 0 0 1 0 0 0                             
;                                                       
run ;                                                   
                                                        
data _null_ ;                                           
  set have ;                                            
  array t DIABETES--MYOCARDIAL_INFRAC ;                 
  call symputx ('n', dim (t)) ;                         
  stop ;                                                
run ;                                                   
                                                        
data want (drop = _:) ;                                 
  do _b = 0 by 0 until (last.data_year) ;               
    set have ;                                          
    by memb_id data_year ;                              
    array vv diabetes -- myocardial_infrac ;            
    _b = BOR (_b, input (cats (of vv[*]), binary&n..)) ;
  end ;                                                 
  do over vv ;                                          
    vv = input (char (put (_b, binary&n..), _i_), 1.) ; 
  end ;                                                 
run ;

After all, why not use the bitwise function, albeit rarely, if they are there .

Kind regards

Paul D.

PeterClemmensen · Posted 08-16-2019 12:26 PM

I like the idea about using the Update Statement. However, this would work only if the zeros were missing as below. If you can set your zeros to missing, then you're all good 🙂

data have;
input MEMB_ID $ DATA_YEAR DIABETES CEREBROVASC SKIN_ULCER 
	  CORONARY_A_DISEASE HISTORY_OF_SMOKING HYPERTENSION CLAUDICATION DISCOLORATION_LOWER MYOCARDIAL_INFRAC;
datalines;
123A 2.18 . . . . . . . . .
123A 2.18 . . . . . 1 . . .
123A 2.18 1 . . . . . . . .
123A 2.19 . . . . . . . . .
123A 2.19 . . . . . 1 . . .
;

data want;
	update have(obs=0) have;
	by MEMB_ID DATA_YEAR;
run;

The DATA to DATA Step Macro
Blog: SASnrd

bknitch · Posted 08-16-2019 01:57 PM

This worked just the way i needed it to. Thank you for the insight on dropping the 0's to Null's.

PeterClemmensen · Posted 08-17-2019 01:27 AM

Anytime, glad you found your answer 🙂

The DATA to DATA Step Macro
Blog: SASnrd

SAS Programming

Rollup observations in a DATASTEP

Re: Rollup observations in a DATASTEP

Re: Rollup observations in a DATASTEP

Re: Rollup observations in a DATASTEP

Re: Rollup observations in a DATASTEP

Re: Rollup observations in a DATASTEP

Re: Rollup observations in a DATASTEP

Re: Rollup observations in a DATASTEP

Re: Rollup observations in a DATASTEP

Re: Rollup observations in a DATASTEP

Re: Rollup observations in a DATASTEP

Re: Rollup observations in a DATASTEP

Data Rollup

Proc SQL RollUps

log window full and datastep stops

Count Unique Observations

Missing Observations

Follow Us

What is...

SAS Programming

Register Today!

SAS Training: Just a Click Away

Follow Us

What is...