Help using Base SAS procedures

possible hash/array solution

Accepted Solution Solved
Reply
Valued Guide
Posts: 858
Accepted Solution

possible hash/array solution

I have a large number of edits/rules that are provided by the federal reserve that I run against our mortgage data.  Several (38) sub totals are required to get the correct percentage.  In order to label these different threshold I added a new column for each of the 38 like so:

 

if bankruptcy_flag = 'Y' then do;
_df4 = "df4";
'_54_Bankruptcy Loans Only'n = bankruptcy_chapter not in('7','9','11','12','13','U');
end;

if investor_type = '7' then do;
_df5="df5";
if com_loan_flg ne 1 and paid_in_full_coding = 0 and missing(baselII_pd) or
   com_loan_flg in (1,.) and paid_in_full_coding = 0 and not missing(baselII_pd) and baselII_pd<.0003 or
   baselII_pd>1 then
'_125.1_Basel II BHC Portfolio'n = 1;

 

Later in the process I join each subtotal like this:

 

select _df1 as _df,count(loan_number) as Count
from First_Lien_Rules_ETL_&propdate
where _df1 = "df1"
union
select _df2 as _df,count(loan_number) as count
from First_Lien_Rules_ETL_&propdate
where _df2 = 'df2'
union
select _df3 as _df,count(loan_number) as count
from First_Lien_Rules_ETL_&propdate
where _df3 = 'df3'
union

 

the _df field will give me the correct total, so I can look at the amount of fails/subtotal = percentage fail. 

 

Within my experience this is the best solution I came up with.  The entire process takes under 30 minutes but this step takes over 10.  I'd love to get that time down.  Can anyone give me suggestions for another solution to the same output?

 

Thank You,

 

 


Accepted Solutions
Solution
‎06-23-2016 01:56 PM
Super User
Posts: 3,114

Re: possible hash/array solution

It is not clear from your description whether your _df condtions are mutually exclusive (one account can only have one _df) but from the way you have coded it one account can have multiple _dfs.

 

Using an array you could have running totals in the same step where you define _dfs and avoid the following SQL step altogether. You would just need an array for all _df counts then have logic like this:

 

array _df_counts (*) _df_count1 - _df_count38;

if bankruptcy_flag = 'Y' then do;
  _df4 = 'df4';
  _df_counts(4) +1;
 ......
end;

  

View solution in original post


All Replies
Solution
‎06-23-2016 01:56 PM
Super User
Posts: 3,114

Re: possible hash/array solution

It is not clear from your description whether your _df condtions are mutually exclusive (one account can only have one _df) but from the way you have coded it one account can have multiple _dfs.

 

Using an array you could have running totals in the same step where you define _dfs and avoid the following SQL step altogether. You would just need an array for all _df counts then have logic like this:

 

array _df_counts (*) _df_count1 - _df_count38;

if bankruptcy_flag = 'Y' then do;
  _df4 = 'df4';
  _df_counts(4) +1;
 ......
end;

  

Super User
Posts: 10,538

Re: possible hash/array solution

[ Edited ]

If you could provide complete details for maybe 4 or 5 of the rules and a dataset (data step please) that would exercise all of the rules and the desired output we might be able to give you more detailed coding support.

 

Often when I see "one datset per variable" types of solutions often a data re-arrangement can help.

 

For instance, this code:

select _df2 as _df,count(loan_number) as count
from First_Lien_Rules_ETL_&propdate
where _df2 = 'df2'

 

looks like if you had a numeric variable that had values of 0 and 1 with 1 the condition of interest then SUMing it would give the desired count. And you can usually create that type of flag with _df = (a > b) or some other logical.

And then a procedure like Proc Means, Summary, Report or Tabulate can go a long way getting other things. (Mean of a 0/1 variable is percent as decimal in many cases if defined correctly).

 

Here is a very simple example but the principal may work for your. Note the difference in how OnlyYoungFemale is defined and the effect on the percentage as the conditional assignment changes the denominator.

data junk;
   set sashelp.class;
   IsFemale = (Sex='F');
   IsYoungFemale= (IsFemale and (Age le 14));
   If IsFemale then OnlyYoungFemale = (Age le 14);
   IsTallYoungFemale = (IsYoungFemale and (Height ge 64));
run;

Proc Tabulate data=junk;
   var IsFemale IsYoungFemale OnlyYoungFemale IsTallYoungFemale;
   table IsFemale IsYoungFemale IsTallYoungFemale OnlyYoungFemale,
         Sum='Count'*f=best4. Mean='%'*f=percent8.1;
   label
      IsFemale ='Females'
      IsYoungFemale ="Young Females of pop"
      OnlyYoungFemale= "Young of Females"
      IsTallYoungFemale="Young Tall Females of Pop"
   ;
run;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 2 replies
  • 233 views
  • 0 likes
  • 3 in conversation