Hi SAS Discussion Forum, Attached longitudinal data set_1 shows how 5 accounts in a single bank (bank_number=10) is swinging from one delinquency bucket to the other (original data set has over million records and 9 banks). Alpey has generously created the hash code below which cleverly returns the sum of variable "balance" for each arrears_band grouped by current_date subject to a few conditions (conditions are presented at the end to avoid the reader getting sick). Hash code (by Alpey, 2012) data temp; if _n_ = 1 then do; declare hash h(ordered:'a'); h.defineKey('Current_Date'); h.defineData('Current_Date','Bank_Number','Account_Number','Balance','Product','Arrears_Band','NPNA_Start_Month'); h.defineDone(); declare hiter hi('h'); end; NPNA_Start_Month = .; do until(last.Account_Number); set b; /*inputting our data set*/ where Product IN ('Personal OD','Personal Loan','Res. Mortgage') and Bank_number = 10; /*when have more than one bank, we can remove this condition */ by Account_Number; if Arrears_Band = 'NPNA' and NPNA_Start_Month = . then NPNA_Start_Month = Current_Date; else if Arrears_Band ne 'NPNA' then NPNA_Start_Month = .; h.replace(); end; Last_NPNA_Start_Month = NPNA_Start_Month; /* Added 6/18/2012 */ rc = hi.first(); do while(rc=0); Current = 0; One_to_30 = 0; Thirty_to_60 = 0; Sixty_to_90 = 0; Ninety_plus = 0; NPNA = .; if NPNA_Start_Month = . or ( NPNA_Start_Month = Current_Date and /* NPNA_Start_Month > '28Feb2010'd */ Last_NPNA_Start_Month ne '28Feb2010'd) then do; select (Arrears_Band); when ('Current') do;Current = Balance;end; when ('1 - 30') do;One_to_30 = Balance;end; when ('30 - 60') do;Thirty_to_60 = Balance;end; when ('60 - 90') do;Sixty_to_90 = Balance;end; when ('90 +') do;Ninety_plus = Balance;end; when ('NPNA') do;NPNA = Balance;end; end; end; output; rc = hi.next(); end; h.clear(); run; proc summary data=temp nway missing; class Current_Date; var Current One_to_30 Thirty_to_60 Sixty_to_90 Ninety_plus NPNA; output out=want(drop=_:) sum=; run; Applying above code Step 1: I sorted the set_1 data set: proc sort data=set_1 out = b; by account_number current_date; run; Step 2: Then ran the hash code. Step 3: It generated the following results which is very correct given my conditions. These are the results: Obs Current_date Current 1_to_30 30_to_60 60_to_90 90+ NPNA (these are vairable headings) 1 28FEB2010 24 0 0 0 0 25 2 31MAR2010 25 23 0 0 0 . 3 30APR2010 25 0 23 0 0 . 4 31MAY2010 25 0 0 23 0 . 5 30JUN2010 0 0 0 0 23 . 6 31JUL2010 0 0 0 0 0 85 7 31DEC2010 0 0 0 0 0 . 8 31JAN2011 25 18 0 0 0 . 9 28FEB2011 0 0 0 0 0 43 10 31MAR2011 0 0 0 0 0 . 11 30APR2011 0 0 0 0 0 . 12 31MAY2011 85 0 0 0 0 . New problem: I added another bank (bank_number=20) to the set_1 and populated the same data values found in bank=10 in set_1 for the new bank too, and created a new data set (see attached data set_2). This time I should do the sorting taking "bank_number account_number current_date" as sorting variables. Why? Because these are the 3 variables that would uniquely identify a record (this is sure!). proc sort data=set_2 out = c; by bank_number account_number current_date; run; Question: When I now try to apply the hash code for data set "c", log gives this error message. "ERROR: BY variables are not properly sorted on data set WORK.C." Could an expert please help me to tweak this hash code to avoid this problem? (may be we need to incorporate my new sorting order into hash code which is "bank_number account_number current_date" One more thing, if I just sort the data set_2 by "account_number current_date" and apply the hash code, then the replicated bank_number 20's data are ignored from summing up. (Optional reading) These are the conditions Hash code cleverly captures Condition 1: "Balances corresponding to NPNA Arrears_Band have to be summed up for only those accounts that have fallen into NPNA status after 28FEB2010. NPNA balances in 28FEB2010 and in all subsequent months of those accounts that were already in NPNA status in 28FEB2010 should be omitted from summing up". E.g. In bank number 10, Account 1111111111 's NPNA balance should be excluded from summing up in 28FEB2010 and in all subsequent months because it was already in NPNA status when we begin our modeling period in 28FEB2010 and this account continued to be NPNA throughout its life. Condition 2: However, if an account has started with "NPNA" status from 28FEB2010 but it has not continued its NPNA status throughout its life, then these type of accounts should be excluded from the condition 1 above although these accounts also have started with "NPNA" status from 28FEB2010 itself. E.g. A/C 4444444444 has started its life with "arreres_band=NPNA in 28FEB2010 but then improved in 31Jan2011 into “1-30” arrears bucket. Again fallen into NPNA delq. bucket in 28FEB2011. Then again improved to "current" in 31May2011. So, this accounts NPNA balance in 28Feb2010 has to be taken for summing up. And its NPNA balance in 28Feb2011 has to be taken for summing up. Your help is greatly apprecaited! Mirisage
... View more