I have a large number of edits/rules that are provided by the federal reserve that I run against our mortgage data. Several (38) sub totals are required to get the correct percentage. In order to label these different threshold I added a new column for each of the 38 like so:
if bankruptcy_flag = 'Y' then do;
_df4 = "df4";
'_54_Bankruptcy Loans Only'n = bankruptcy_chapter not in('7','9','11','12','13','U');
end;
if investor_type = '7' then do;
_df5="df5";
if com_loan_flg ne 1 and paid_in_full_coding = 0 and missing(baselII_pd) or
com_loan_flg in (1,.) and paid_in_full_coding = 0 and not missing(baselII_pd) and baselII_pd<.0003 or
baselII_pd>1 then
'_125.1_Basel II BHC Portfolio'n = 1;
Later in the process I join each subtotal like this:
select _df1 as _df,count(loan_number) as Count
from First_Lien_Rules_ETL_&propdate
where _df1 = "df1"
union
select _df2 as _df,count(loan_number) as count
from First_Lien_Rules_ETL_&propdate
where _df2 = 'df2'
union
select _df3 as _df,count(loan_number) as count
from First_Lien_Rules_ETL_&propdate
where _df3 = 'df3'
union
the _df field will give me the correct total, so I can look at the amount of fails/subtotal = percentage fail.
Within my experience this is the best solution I came up with. The entire process takes under 30 minutes but this step takes over 10. I'd love to get that time down. Can anyone give me suggestions for another solution to the same output?
Thank You,
It is not clear from your description whether your _df condtions are mutually exclusive (one account can only have one _df) but from the way you have coded it one account can have multiple _dfs.
Using an array you could have running totals in the same step where you define _dfs and avoid the following SQL step altogether. You would just need an array for all _df counts then have logic like this:
array _df_counts (*) _df_count1 - _df_count38;
if bankruptcy_flag = 'Y' then do;
_df4 = 'df4';
_df_counts(4) +1;
......
end;
It is not clear from your description whether your _df condtions are mutually exclusive (one account can only have one _df) but from the way you have coded it one account can have multiple _dfs.
Using an array you could have running totals in the same step where you define _dfs and avoid the following SQL step altogether. You would just need an array for all _df counts then have logic like this:
array _df_counts (*) _df_count1 - _df_count38;
if bankruptcy_flag = 'Y' then do;
_df4 = 'df4';
_df_counts(4) +1;
......
end;
If you could provide complete details for maybe 4 or 5 of the rules and a dataset (data step please) that would exercise all of the rules and the desired output we might be able to give you more detailed coding support.
Often when I see "one datset per variable" types of solutions often a data re-arrangement can help.
For instance, this code:
select _df2 as _df,count(loan_number) as count
from First_Lien_Rules_ETL_&propdate
where _df2 = 'df2'
looks like if you had a numeric variable that had values of 0 and 1 with 1 the condition of interest then SUMing it would give the desired count. And you can usually create that type of flag with _df = (a > b) or some other logical.
And then a procedure like Proc Means, Summary, Report or Tabulate can go a long way getting other things. (Mean of a 0/1 variable is percent as decimal in many cases if defined correctly).
Here is a very simple example but the principal may work for your. Note the difference in how OnlyYoungFemale is defined and the effect on the percentage as the conditional assignment changes the denominator.
data junk;
set sashelp.class;
IsFemale = (Sex='F');
IsYoungFemale= (IsFemale and (Age le 14));
If IsFemale then OnlyYoungFemale = (Age le 14);
IsTallYoungFemale = (IsYoungFemale and (Height ge 64));
run;
Proc Tabulate data=junk;
var IsFemale IsYoungFemale OnlyYoungFemale IsTallYoungFemale;
table IsFemale IsYoungFemale IsTallYoungFemale OnlyYoungFemale,
Sum='Count'*f=best4. Mean='%'*f=percent8.1;
label
IsFemale ='Females'
IsYoungFemale ="Young Females of pop"
OnlyYoungFemale= "Young of Females"
IsTallYoungFemale="Young Tall Females of Pop"
;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.