I have the following sample data.
data have;
input AcctID $4. ACCTNAME $12.;
cards;
1234 checking
1234 saving
1234 HSA-loan
5678 checking
8907 saving
7658 HSA-loan
4356 checking
4356 saving
9874 checking
9874 HSA-loan
5437 creditcard
5437 checking
5461 checking
5461 heloc-loan
2543 saving
2543 HSA-loan
7891 HSA-loan
7111 saving
;
run;
I want to get the account IDs which has these conditions
1. Get all account IDs which has only 'checking' account name (no other account name is associated eg: 5678 checking)
2. If there is 'checking' account name along with 'saving' and 'HSA-loan' then I want to get those account IDs (eg: 1234 checking , 1234 saving & 1234 HSA-loan).
3. If there is 'checking' account name along with 'saving' then get those account IDs (eg: 4356 checking & 4356 saving).
4. If there is 'checking' account name along with 'HSA-loan' then I want to get those account IDs (eg: 9874 checking & 9874 HSA-loan)
I don't want to get any other account IDs especially if they are associated with account name of 'checking' and 'creditcard' etc or any Individual Account IDs which has account name as 'saving' or 'HSA-loan' individually or together.
The result should have
AcctID ACCOUNT
1234 checking
1234 saving
1234 HSA-loan
5678 checking
4356 checking
4356 saving
9874 checking
9874 HSA-loan
I used the code
proc sql;
create table want as
select acctID,ACCTNAME
from HAVE
group by acctID,ACCTNAME
having (count(distinct ACCTNAME)=1 and UPCASE(ACCTNAME)='CHECKING')
;
QUIT;
This is yielding all the account IDs with checking and other account names are there. Could someone help me to get the result I want.
Thanks in advance.
As a Boolean expression,
upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN')
is either 1 (TRUE) or 0 (FALSE) for an observation. Hence, the summary statistic
sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))
is the number of observations in an AcctID "BY group" (note the GROUP BY clause) where that Boolean expression is TRUE. We are interested in the case that this number is zero because this means that no observation with an UPCASE(ACCTNAME) value other than 'CHECKING', 'SAVING' or 'HSA-LOAN' exists in the BY group. The possible values of that sum are 0, 1, 2, etc. Therefore, the condition
sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))=0
can also be written as
~sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))
because 0 -- interpreted as a Boolean value -- is FALSE, whereas all other non-missing numeric values are interpreted as TRUE. Feel free to replace ~sum(...) with sum(...)=0 if it's easier to understand.
The ORDER BY clause is not necessary if the (virtually unpredictable) default order of observations produced by PROC SQL is good enough. My idea was to sort dataset WANT by AcctID and within an AcctID "BY group" by Account -- yet not alphabetically, but so that observations with UPCASE(Account)='CHECKING' come first, followed by observations with UPCASE(Account)='SAVING', followed by observations with UPCASE(Account)='HSA-LOAN'. The FINDC function searches the constant string "csH" and if it finds the first character of variable Account (i.e., the result of first(account)), irrespective of case (see modifier 'i'), it returns the position of that character in the string: 1 for a "C", 2 for an "S" and 3 for an "H" -- so that sorting by this position number yields the intended order of the three possible values of Account.
Hello @buddha_d,
So the condition is: All values of upcase(acctname) for an AcctId must be from the set {'CHECKING', 'SAVING', 'HSA-LOAN'} and 'CHECKING' must occur. Correct?
If so, this should select the observations in question:
proc sql;
create table want as
select AcctID, acctname as ACCOUNT
from have
group by AcctID
having max(upcase(acctname)='CHECKING') & ~sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))
order by AcctID, findc('csH',first(account),'i');
quit;
I am running the query in DB. Once I pull the records, let me try this code. It works for me with the dummy data provided. Thank you so much, FreelanceReinhard.
Could you please explain me what this code doing here
Thanks
~sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))
order by AcctID, findc('csH',first(account),'i')
As a Boolean expression,
upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN')
is either 1 (TRUE) or 0 (FALSE) for an observation. Hence, the summary statistic
sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))
is the number of observations in an AcctID "BY group" (note the GROUP BY clause) where that Boolean expression is TRUE. We are interested in the case that this number is zero because this means that no observation with an UPCASE(ACCTNAME) value other than 'CHECKING', 'SAVING' or 'HSA-LOAN' exists in the BY group. The possible values of that sum are 0, 1, 2, etc. Therefore, the condition
sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))=0
can also be written as
~sum(upcase(acctname) ~in ('CHECKING', 'SAVING', 'HSA-LOAN'))
because 0 -- interpreted as a Boolean value -- is FALSE, whereas all other non-missing numeric values are interpreted as TRUE. Feel free to replace ~sum(...) with sum(...)=0 if it's easier to understand.
The ORDER BY clause is not necessary if the (virtually unpredictable) default order of observations produced by PROC SQL is good enough. My idea was to sort dataset WANT by AcctID and within an AcctID "BY group" by Account -- yet not alphabetically, but so that observations with UPCASE(Account)='CHECKING' come first, followed by observations with UPCASE(Account)='SAVING', followed by observations with UPCASE(Account)='HSA-LOAN'. The FINDC function searches the constant string "csH" and if it finds the first character of variable Account (i.e., the result of first(account)), irrespective of case (see modifier 'i'), it returns the position of that character in the string: 1 for a "C", 2 for an "S" and 3 for an "H" -- so that sorting by this position number yields the intended order of the three possible values of Account.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.