Solved: How to group account IDs with multiple account names

buddha_d · Posted 10-04-2022 10:55 AM

I have the following sample data.

data have;
input AcctID $4. ACCTNAME $12.;
cards;
1234 checking
1234 saving
1234 HSA-loan
5678 checking
8907 saving
7658 HSA-loan
4356 checking
4356 saving
9874 checking
9874 HSA-loan
5437 creditcard
5437 checking
5461 checking
5461 heloc-loan
2543 saving
2543 HSA-loan

7891 HSA-loan

7111 saving
;
run;

I want to get the account IDs which has these conditions

1. Get all account IDs which has only 'checking' account name (no other account name is associated eg: 5678 checking)

2. If there is 'checking' account name along with 'saving' and 'HSA-loan' then I want to get those account IDs (eg: 1234 checking , 1234 saving & 1234 HSA-loan).

3. If there is 'checking' account name along with 'saving' then get those account IDs (eg: 4356 checking & 4356 saving).

4. If there is 'checking' account name along with 'HSA-loan' then I want to get those account IDs (eg: 9874 checking & 9874 HSA-loan)

I don't want to get any other account IDs especially if they are associated with account name of 'checking' and 'creditcard' etc or any Individual Account IDs which has account name as 'saving' or 'HSA-loan' individually or together.

The result should have

AcctID ACCOUNT

1234 checking
1234 saving
1234 HSA-loan
5678 checking

4356 checking
4356 saving
9874 checking
9874 HSA-loan

I used the code

proc sql;
create table want as
select acctID,ACCTNAME
from HAVE
group by acctID,ACCTNAME
having (count(distinct ACCTNAME)=1 and UPCASE(ACCTNAME)='CHECKING')
;
QUIT;

This is yielding all the account IDs with checking and other account names are there. Could someone help me to get the result I want.

Thanks in advance.

FreelanceReinh · Posted 10-04-2022 04:15 PM

As a Boolean expression,

upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN')

is either 1 (TRUE) or 0 (FALSE) for an observation. Hence, the summary statistic

sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))

is the number of observations in an AcctID "BY group" (note the GROUP BY clause) where that Boolean expression is TRUE. We are interested in the case that this number is zero because this means that no observation with an UPCASE(ACCTNAME) value other than 'CHECKING', 'SAVING' or 'HSA-LOAN' exists in the BY group. The possible values of that sum are 0, 1, 2, etc. Therefore, the condition

sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))=0

can also be written as

~sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))

because 0 -- interpreted as a Boolean value -- is FALSE, whereas all other non-missing numeric values are interpreted as TRUE. Feel free to replace ~sum(...) with sum(...)=0 if it's easier to understand.

The ORDER BY clause is not necessary if the (virtually unpredictable) default order of observations produced by PROC SQL is good enough. My idea was to sort dataset WANT by AcctID and within an AcctID "BY group" by Account -- yet not alphabetically, but so that observations with UPCASE(Account)='CHECKING' come first, followed by observations with UPCASE(Account)='SAVING', followed by observations with UPCASE(Account)='HSA-LOAN'. The FINDC function searches the constant string "csH" and if it finds the first character of variable Account (i.e., the result of first(account)), irrespective of case (see modifier 'i'), it returns the position of that character in the string: 1 for a "C", 2 for an "S" and 3 for an "H" -- so that sorting by this position number yields the intended order of the three possible values of Account.

View solution in original post

FreelanceReinh · Posted 10-04-2022 11:54 AM

Hello @buddha_d,

So the condition is: All values of upcase(acctname) for an AcctId must be from the set {'CHECKING', 'SAVING', 'HSA-LOAN'} and 'CHECKING' must occur. Correct?

If so, this should select the observations in question:

proc sql;
create table want as
select AcctID, acctname as ACCOUNT
from have
group by AcctID
having max(upcase(acctname)='CHECKING') & ~sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))
order by AcctID, findc('csH',first(account),'i');
quit;

buddha_d · Posted 10-04-2022 02:49 PM

I am running the query in DB. Once I pull the records, let me try this code. It works for me with the dummy data provided. Thank you so much, FreelanceReinhard.

buddha_d · Posted 10-04-2022 03:27 PM

Could you please explain me what this code doing here

Thanks

 ~sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))
order by AcctID, findc('csH',first(account),'i')

FreelanceReinh · Posted 10-04-2022 04:15 PM

As a Boolean expression,

upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN')

is either 1 (TRUE) or 0 (FALSE) for an observation. Hence, the summary statistic

sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))

is the number of observations in an AcctID "BY group" (note the GROUP BY clause) where that Boolean expression is TRUE. We are interested in the case that this number is zero because this means that no observation with an UPCASE(ACCTNAME) value other than 'CHECKING', 'SAVING' or 'HSA-LOAN' exists in the BY group. The possible values of that sum are 0, 1, 2, etc. Therefore, the condition

sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))=0

can also be written as

~sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))

because 0 -- interpreted as a Boolean value -- is FALSE, whereas all other non-missing numeric values are interpreted as TRUE. Feel free to replace ~sum(...) with sum(...)=0 if it's easier to understand.

The ORDER BY clause is not necessary if the (virtually unpredictable) default order of observations produced by PROC SQL is good enough. My idea was to sort dataset WANT by AcctID and within an AcctID "BY group" by Account -- yet not alphabetically, but so that observations with UPCASE(Account)='CHECKING' come first, followed by observations with UPCASE(Account)='SAVING', followed by observations with UPCASE(Account)='HSA-LOAN'. The FINDC function searches the constant string "csH" and if it finds the first character of variable Account (i.e., the result of first(account)), irrespective of case (see modifier 'i'), it returns the position of that character in the string: 1 for a "C", 2 for an "S" and 3 for an "H" -- so that sorting by this position number yields the intended order of the three possible values of Account.

How to group account IDs with multiple account names

Re: How to group account IDs with multiple account names

Re: How to group account IDs with multiple account names

Re: How to group account IDs with multiple account names

Re: How to group account IDs with multiple account names

Re: How to group account IDs with multiple account names

How to group account IDs with multiple account names

Re: How to group account IDs with multiple account names

Re: How to group account IDs with multiple account names

Re: How to group account IDs with multiple account names

Re: How to group account IDs with multiple account names

Re: How to group account IDs with multiple account names

Registration is open

SAS Training: Just a Click Away