I have a dataset which has account IDs (acct_id) and Account Names (acct_name). I have a pattern to figure out if the account exactly belongs to a customer. There are suffixes like JR or SR or I or V or MD which I don't need to take it to account for account name. I made up the data that almost matches my original data.
For example:
MICHAEL D KUELKAR JR Account Name has account id of KUELK2M6D.
The account ID is calculated like below
first 5 letters of last name (KUELK)+how many more letters are left in the last name (2 more letters, AR in this case)+
first letter of the first name(M as in Michael)+how many letters are left after the first letter of the first name (ICHAEL, 6 Letters)+first letter of middle name (D in this case)
If there are any last name are lower than 5 bytes (length less than 5), there will be periods (like .). Since the length of the last name is less than 5 letter, there will not be any letters left so we have 0 after the last name and periods.
If the middle name is missing then you would have . as well.
Like acct_name (PING-PONG LI/2121/) acct_id is LI...0P3P
I need to find the mismatch acct_name and acct_id.
Here is the code I tried, Please suggest me how to go forward from here. Thanks in advance.
PROC SQL;
create table WORK.HAVE ( bufsize=65536 )
(
acct_id char(27),
acct_name char(240)
);
QUIT;
PROC SQL;
INSERT INTO have
VALUES ("KING.0R6K","RICHARD K KING JR/1111/")
VALUES ("KO...0C4.","CAMEY JOE/2282/")
VALUES ("KUELK2M6D","MICHAEL D KUELKAR JR/1110/")
VALUES ("FUSS.0J4.","JACOB JOHN FUSS VI/8889/")
VALUES ("LI...0P3P", "PING-PONG LI/2121/")
VALUES ("LO...0T6Z","TSARA YEU ZEE LO/6666/")
VALUES ("LOYD.0W5.","WESLEY BRUCE LOYD SR/1181/")
VALUES ("LUCER7M4.","MARIA K LUCERA SR/3333/")
VALUES ("BUCKE2W0K","WK BUCKETT JR/1151/")
VALUES ("LUM..0R5H","ROBERTS H S/1141/")
VALUES ("M....0V4.","VUING DO/5595/")
VALUES ("MILLS0S4C","STEVE MILLS IV/9898/")
VALUES ("SUTTE1D3.","DARK N SUTTER M D/3331/")
VALUES ("ATKIN1R5D","ROBERT ATKINS SR/5656/")
VALUES ("ATKIN1R5D","ROBERT D ATKINS SR/1212/")
VALUES ("ANGUI9G7.","GAMELIEL S ANGUIAND SR/1897/")
VALUES ("ARNIT1J5M","JESSIE M ARNITT JR/4564/")
VALUES ("DAXTE1F4L","FRANK LEVIS DAXTER V/6781/")
VALUES ("BEASO1R2S","RON SHARE BEASON II/9872/")
VALUES ("CASTI6A8Z","ALEXA Z CASTILLEMOS SR/9861/")
VALUES ("CESTR1G6C","GERARDI CAMACHO-CESTRO IV/5671/")
VALUES ("BROWN0J5.","JACKIE BROWN M D/5671/")
VALUES ("CRAPS2D5J","CRAPSEN DIANIA JOE/0981/")
VALUES ("DEBAR6M4H","MERCY H DEE/0001/")
VALUES ("DEDIO1F4R","FELEX V DEDIOS JR/0001/")
VALUES ("KRIEK6S3T","SEAN THOMAS KREIKEMAIAR/0001/")
VALUES ("CASTR1G6C","GERARDI CAMACHI-CASTRO IV/0001/")
;
QUIT;
DATA HAVE;
SET HAVE;
NEW_VAR=SCAN(acct_name,1,'/');
RUN;
data want ;
set have;
do i=1 to countw(new_var,' ');
var1=scan(new_var,i,' ');
output;
end;
drop i new_var ;
run;
proc sql;
create table want1 as
select *
from want
where acct_id not in
(select distinct acct_id
from want
where strip(compress(substr(acct_id,1,5),,'p')) in (select var1 from want) )
;
quit;
Thanks
Thank you for providing usable data, even if it seems imperfect.
This should get you started:
data WANT ;
set HAVE;
STR=scan(ACCT_NAME,1,'/');
STR=prxchange('s/\b(I|V|MD|JR|SR|M|II)\b//',-1,STR);
FN =scan(STR,1);
N2 =scan(STR,2);
N3 =scan(STR,3);
if N3 ne ' ' then do;
LN=N3;
MN=N2;
end;
else LN=N2;
STR2=cats(substr(LN,1,5)
,substrn('....',1,5-length(LN))
,max(length(LN)-5,0)
,first(FN)
,length(FN)-1
,first(MN||'.')
);
if STR2 ne ACCT_ID;
run;
Obs | acct_id | acct_name | STR2 |
---|---|---|---|
1 | KO...0C4. | CAMEY JOE/2282/ | JOE..0C4 |
2 | FUSS.0J4. | JACOB JOHN FUSS VI/8889/ | FUSS.0J4J |
3 | LO...0T6Z | TSARA YEU ZEE LO/6666/ | ZEE..0T4Y |
4 | LOYD.0W5. | WESLEY BRUCE LOYD SR/1181/ | LOYD.0W5B |
5 | LUCER7M4. | MARIA K LUCERA SR/3333/ | LUCER1M4K |
6 | BUCKE2W0K | WK BUCKETT JR/1151/ | BUCKE2W1 |
7 | LUM..0R5H | ROBERTS H S/1141/ | S....0R6H |
8 | M....0V4. | VUING DO/5595/ | DO...0V4 |
9 | MILLS0S4C | STEVE MILLS IV/9898/ | IV...0S4M |
10 | SUTTE1D3. | DARK N SUTTER M D/3331/ | SUTTE1D3N |
11 | ATKIN1R5D | ROBERT ATKINS SR/5656/ | ATKIN1R5 |
12 | ANGUI9G7. | GAMELIEL S ANGUIAND SR/1897/ | ANGUI3G7S |
13 | ARNIT1J5M | JESSIE M ARNITT JR/4564/ | ARNIT1J5 |
14 | CASTI6A8Z | ALEXA Z CASTILLEMOS SR/9861/ | CASTI6A4Z |
15 | BROWN0J5. | JACKIE BROWN M D/5671/ | D....0J5B |
16 | CRAPS2D5J | CRAPSEN DIANIA JOE/0981/ | JOE..0C6D |
17 | DEBAR6M4H | MERCY H DEE/0001/ | DEE..0M4H |
18 | DEDIO1F4R | FELEX V DEDIOS JR/0001/ | DEDIO1F4 |
19 | KRIEK6S3T | SEAN THOMAS KREIKEMAIAR/0001/ | KREIK6S3T |
The output should have the below records only.
acct_id acct_name
KO...0C4. CAMEY JOE/2282/
LO...0T6Z TSARA YEU ZEE LO/6666/
LOYD.0W5. WESLEY BRUCE LOYD SR/1181/
LUCER7M4. MARIA K LUCERA SR/3333/
LUM..0R5H ROBERTS H S/1141/
M....0V4. VUING DO/5595/
MILLS0S4C STEVE MILLS IV/9898/
SUTTE1D3. DARK N SUTTER M D/3331/
ATKIN1R5D ROBERT ATKINS SR/5656/
ANGUI9G7. GAMELIEL S ANGUIAND SR/1897/
CASTI6A8Z ALEXA Z CASTILLEMOS SR/9861/
DEBAR6M4H MERCY H DEE/0001/
DEDIO1F4R FELEX V DEDIOS JR/0001/
Thank you for providing usable data, even if it seems imperfect.
This should get you started:
data WANT ;
set HAVE;
STR=scan(ACCT_NAME,1,'/');
STR=prxchange('s/\b(I|V|MD|JR|SR|M|II)\b//',-1,STR);
FN =scan(STR,1);
N2 =scan(STR,2);
N3 =scan(STR,3);
if N3 ne ' ' then do;
LN=N3;
MN=N2;
end;
else LN=N2;
STR2=cats(substr(LN,1,5)
,substrn('....',1,5-length(LN))
,max(length(LN)-5,0)
,first(FN)
,length(FN)-1
,first(MN||'.')
);
if STR2 ne ACCT_ID;
run;
Obs | acct_id | acct_name | STR2 |
---|---|---|---|
1 | KO...0C4. | CAMEY JOE/2282/ | JOE..0C4 |
2 | FUSS.0J4. | JACOB JOHN FUSS VI/8889/ | FUSS.0J4J |
3 | LO...0T6Z | TSARA YEU ZEE LO/6666/ | ZEE..0T4Y |
4 | LOYD.0W5. | WESLEY BRUCE LOYD SR/1181/ | LOYD.0W5B |
5 | LUCER7M4. | MARIA K LUCERA SR/3333/ | LUCER1M4K |
6 | BUCKE2W0K | WK BUCKETT JR/1151/ | BUCKE2W1 |
7 | LUM..0R5H | ROBERTS H S/1141/ | S....0R6H |
8 | M....0V4. | VUING DO/5595/ | DO...0V4 |
9 | MILLS0S4C | STEVE MILLS IV/9898/ | IV...0S4M |
10 | SUTTE1D3. | DARK N SUTTER M D/3331/ | SUTTE1D3N |
11 | ATKIN1R5D | ROBERT ATKINS SR/5656/ | ATKIN1R5 |
12 | ANGUI9G7. | GAMELIEL S ANGUIAND SR/1897/ | ANGUI3G7S |
13 | ARNIT1J5M | JESSIE M ARNITT JR/4564/ | ARNIT1J5 |
14 | CASTI6A8Z | ALEXA Z CASTILLEMOS SR/9861/ | CASTI6A4Z |
15 | BROWN0J5. | JACKIE BROWN M D/5671/ | D....0J5B |
16 | CRAPS2D5J | CRAPSEN DIANIA JOE/0981/ | JOE..0C6D |
17 | DEBAR6M4H | MERCY H DEE/0001/ | DEE..0M4H |
18 | DEDIO1F4R | FELEX V DEDIOS JR/0001/ | DEDIO1F4 |
19 | KRIEK6S3T | SEAN THOMAS KREIKEMAIAR/0001/ | KREIK6S3T |
ChrisNZ,
It is working for the records that have middle name, but if the middle name is missing then it is not picking the period in the end. For example:
Below example doesn't have middle name. So the str2 should have DO...0V4. but it shows as DO...0V4. How can we fix it please?
Thanks,
M....0V4. | VUING DO/5595/ | DO...0V4 |
Replace
,first(MN||'.')
with
,coalecsec(first(MN),'.')
?
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.