Hi all,
I have two dataset that I merged and proceed to array and do loop to create new variables but each step is too long to run and I am running out of space on my computer. I posted a sample code and would appreciate any input.
Thanks!
proc sort data= BL.data;
by pid;
run;
proc sort data= BL.fudata;
by pid;
run;
/*Merging both data*/
data merged_data;
merge BL.data BL.fudata;
by pid;
run;
/*creating new variables*/
data mergedclean1;
set mergeddata_fmt;
array ablpromis {4} promis_pa_scale1 promis_pa_scale3 promis_pa_scale4 promis_pa_scale5;
array afupromis {4} promis_pa_scale1_fu promis_pa_scale3_fu promis_pa_scale4_fu promis_pa_scale5_fu;
do i=1 to 4;
blpromis= ablpromis {i};
fupromis= afupromis {i};
if blpromis= 99 then blpromis= .;
if blpromis= '' then blpromis= .;
if fupromis= 99 then fupromis= .;
if fupromis= '' then fupromis= .;
output;
end;
drop i promis_pa_scale1 promis_pa_scale3 promis_pa_scale4 promis_pa_scale5 promis_pa_scale1_fu promis_pa_scale3_fu promis_pa_scale4_fu promis_pa_scale5_fu;
run;
proc print data= mergedclean1;
run;
data mergedclean2;
set mergedclean1;
array ablefficacy {8} c_paselfefficacy1 c_paselfefficacy2 c_paselfefficacy3 c_paselfefficacy4 c_paselfefficacy5 c_paselfefficacy6 c_paselfefficacy7 c_paselfefficacy8;
array afuefficacy {8} c_paselfefficacy1_fu c_paselfefficacy2_fu c_paselfefficacy3_fu c_paselfefficacy4_fu c_paselfefficacy5_fu c_paselfefficacy6_fu c_paselfefficacy7_fu c_paselfefficacy8_fu;
do i=1 to 8;
blefficacy= ablefficacy {i};
fueffecicay= afuefficacy {i};
if blefficacy= 99 then blefficacy= .;
if blefficacy= '' then blefficacy= .;
if fuefficacy= 99 then fuefficacy= .;
if fuefficacy= '' then fuefficacy= .;
output;
end;
drop i c_paselfefficacy1 c_paselfefficacy2 c_paselfefficacy3 c_paselfefficacy4 c_paselfefficacy5 c_paselfefficacy6 c_paselfefficacy7 c_paselfefficacy8 c_paselfefficacy1_fu c_paselfefficacy2_fu c_paselfefficacy3_fu c_paselfefficacy4_fu c_paselfefficacy5_fu c_paselfefficacy6_fu c_paselfefficacy7_fu c_paselfefficacy8_fu;
run;
A comment about this type of coding that may be impacting your performance slightly:
if blpromis= 99 then blpromis= .; if blpromis= '' then blpromis= .; if fupromis= 99 then fupromis= .; if fupromis= '' then fupromis= .;
The variables blpromis and fupromis are either character or numeric. If the value is character then SAS has to do a conversion here to make 99 character for comparison with blpromis or fupromis. Minor hit but you are doing this repeatedly. Plus the result will not be a numeric missing, i.e. the dot but an actual character.
351 data example; 352 /* character valued variable*/ 353 x='99'; 354 if x=99 then x=.; 355 run; NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column). 354:7 NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column). 354:19
When you see those notes about "converted to" then SAS is wasting cycles. Also the conversions can use values you do not expect because of the default rules for the conversions.
In any case you are likely doing extra comparisons if the first one is true. If the value were 99 then there is no reason to test for missing because you just set it to missing. In a generic sense to reduce wasted time you would code something more like
if blpromis= 99 then blpromis= .; ELSE if blpromis= <a different value goes here> then blpromis= .;
In this case the second comparison is not executed when the first one is true.
If the value is actually numeric then this forces a different numeric/character conversion: if blpromis= '' AND wasting a comparison as the value of a missing numeric is already . So the comparison would not even be needed.
Make sure you know your variable type. Also instead of "if blpromis= '' " you can use " If missing(blpromis) " which will do the correct comparison for missing regardless of character type.
So in the first merge you are likely doing 4(array size)*2 (number of arrays) =8 for each record doing unneeded comparisons for missing, and similarly 16 in the second.
Your MergedClean1 data set will make 4 copies of every single variable except for the variables in the arrays and i. So the data set goes get larger.
And then the MergedClean2 makes 8 copies of every single variable except the ones in the arrays and i. So if you have other variables you have made 32 copies of them.
You don't show how Mergeddata_fmt is made.
What do the NOTEs in the SAS log tell you? How long is each step taking?
How many observations are in each dataset? How many variables.
The flow of you program does not seem to make much sense.
You sort and merge two datasets to create a third.
Then your next step is referencing some other previously never mentioned fourth dataset to create a fifth. Why? If you are not going to use the results of the merge why bother doing it?
Why do you have PROC PRINT step? How large is this data? Are you really going to look at potentially hundreds of pages of printout?
A comment about this type of coding that may be impacting your performance slightly:
if blpromis= 99 then blpromis= .; if blpromis= '' then blpromis= .; if fupromis= 99 then fupromis= .; if fupromis= '' then fupromis= .;
The variables blpromis and fupromis are either character or numeric. If the value is character then SAS has to do a conversion here to make 99 character for comparison with blpromis or fupromis. Minor hit but you are doing this repeatedly. Plus the result will not be a numeric missing, i.e. the dot but an actual character.
351 data example; 352 /* character valued variable*/ 353 x='99'; 354 if x=99 then x=.; 355 run; NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column). 354:7 NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column). 354:19
When you see those notes about "converted to" then SAS is wasting cycles. Also the conversions can use values you do not expect because of the default rules for the conversions.
In any case you are likely doing extra comparisons if the first one is true. If the value were 99 then there is no reason to test for missing because you just set it to missing. In a generic sense to reduce wasted time you would code something more like
if blpromis= 99 then blpromis= .; ELSE if blpromis= <a different value goes here> then blpromis= .;
In this case the second comparison is not executed when the first one is true.
If the value is actually numeric then this forces a different numeric/character conversion: if blpromis= '' AND wasting a comparison as the value of a missing numeric is already . So the comparison would not even be needed.
Make sure you know your variable type. Also instead of "if blpromis= '' " you can use " If missing(blpromis) " which will do the correct comparison for missing regardless of character type.
So in the first merge you are likely doing 4(array size)*2 (number of arrays) =8 for each record doing unneeded comparisons for missing, and similarly 16 in the second.
Your MergedClean1 data set will make 4 copies of every single variable except for the variables in the arrays and i. So the data set goes get larger.
And then the MergedClean2 makes 8 copies of every single variable except the ones in the arrays and i. So if you have other variables you have made 32 copies of them.
You don't show how Mergeddata_fmt is made.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.