Solved: Filtering data based on two columns with conditions

buddha_d · Posted 05-27-2022 01:41 PM

I have a dataset which has account IDs (acct_id) and Account Names (acct_name). I have a pattern to figure out if the account exactly belongs to a customer. There are suffixes like JR or SR or I or V or MD which I don't need to take it to account for account name. I made up the data that almost matches my original data.
For example:
MICHAEL D KUELKAR JR Account Name has account id of KUELK2M6D.
The account ID is calculated like below
first 5 letters of last name (KUELK)+how many more letters are left in the last name (2 more letters, AR in this case)+
first letter of the first name(M as in Michael)+how many letters are left after the first letter of the first name (ICHAEL, 6 Letters)+first letter of middle name (D in this case)

If there are any last name are lower than 5 bytes (length less than 5), there will be periods (like .). Since the length of the last name is less than 5 letter, there will not be any letters left so we have 0 after the last name and periods.
If the middle name is missing then you would have . as well.
Like acct_name (PING-PONG LI/2121/) acct_id is LI...0P3P

I need to find the mismatch acct_name and acct_id.

Here is the code I tried, Please suggest me how to go forward from here. Thanks in advance.

PROC SQL;
create table WORK.HAVE ( bufsize=65536 )
  (
   acct_id char(27),
   acct_name char(240)
  );
QUIT;
PROC SQL;
INSERT INTO have
VALUES ("KING.0R6K","RICHARD K KING JR/1111/")
VALUES ("KO...0C4.","CAMEY JOE/2282/")
VALUES ("KUELK2M6D","MICHAEL D KUELKAR JR/1110/")
VALUES ("FUSS.0J4.","JACOB JOHN FUSS VI/8889/")
VALUES ("LI...0P3P", "PING-PONG LI/2121/")
VALUES ("LO...0T6Z","TSARA YEU ZEE LO/6666/")
VALUES ("LOYD.0W5.","WESLEY BRUCE LOYD SR/1181/")
VALUES ("LUCER7M4.","MARIA K LUCERA SR/3333/")
VALUES ("BUCKE2W0K","WK BUCKETT JR/1151/")
VALUES ("LUM..0R5H","ROBERTS H S/1141/")
VALUES ("M....0V4.","VUING DO/5595/")
VALUES ("MILLS0S4C","STEVE MILLS IV/9898/")
VALUES ("SUTTE1D3.","DARK N SUTTER M D/3331/")
VALUES ("ATKIN1R5D","ROBERT ATKINS SR/5656/")
VALUES ("ATKIN1R5D","ROBERT D ATKINS SR/1212/")
VALUES ("ANGUI9G7.","GAMELIEL S ANGUIAND SR/1897/")
VALUES ("ARNIT1J5M","JESSIE M ARNITT JR/4564/")
VALUES ("DAXTE1F4L","FRANK LEVIS DAXTER V/6781/")
VALUES ("BEASO1R2S","RON SHARE BEASON II/9872/")
VALUES ("CASTI6A8Z","ALEXA Z CASTILLEMOS SR/9861/")
VALUES ("CESTR1G6C","GERARDI CAMACHO-CESTRO IV/5671/")
VALUES ("BROWN0J5.","JACKIE BROWN M D/5671/")
VALUES ("CRAPS2D5J","CRAPSEN DIANIA JOE/0981/")
VALUES ("DEBAR6M4H","MERCY H DEE/0001/")
VALUES ("DEDIO1F4R","FELEX V DEDIOS JR/0001/")
VALUES ("KRIEK6S3T","SEAN THOMAS KREIKEMAIAR/0001/")
VALUES ("CASTR1G6C","GERARDI CAMACHI-CASTRO IV/0001/")
;
QUIT;


	
DATA HAVE;
	SET HAVE;
	NEW_VAR=SCAN(acct_name,1,'/');
RUN;


data want ;
  set have;
  do i=1 to countw(new_var,' ');
     var1=scan(new_var,i,' ');
     output;
  end;
  drop i new_var ;
run;

proc sql;
	create table want1 as
	select *
	from want 
	where acct_id not in  
	(select distinct acct_id 
	from want 
	where strip(compress(substr(acct_id,1,5),,'p')) in (select var1 from want) )

;
quit;

Thanks

ChrisNZ · Posted 05-30-2022 04:57 AM

Thank you for providing usable data, even if it seems imperfect.

This should get you started:

data WANT ;
 set HAVE;
  STR=scan(ACCT_NAME,1,'/'); 
  STR=prxchange('s/\b(I|V|MD|JR|SR|M|II)\b//',-1,STR); 
  FN =scan(STR,1);
  N2 =scan(STR,2);
  N3 =scan(STR,3);
  if N3 ne ' ' then do;
    LN=N3;
    MN=N2;
  end;
  else LN=N2;
  STR2=cats(substr(LN,1,5)
           ,substrn('....',1,5-length(LN))
           ,max(length(LN)-5,0)
           ,first(FN)
           ,length(FN)-1
           ,first(MN||'.')
           ); 
  if STR2 ne ACCT_ID;
run;


Obs	acct_id	acct_name	STR2
1	KO...0C4.	CAMEY JOE/2282/	JOE..0C4
2	FUSS.0J4.	JACOB JOHN FUSS VI/8889/	FUSS.0J4J
3	LO...0T6Z	TSARA YEU ZEE LO/6666/	ZEE..0T4Y
4	LOYD.0W5.	WESLEY BRUCE LOYD SR/1181/	LOYD.0W5B
5	LUCER7M4.	MARIA K LUCERA SR/3333/	LUCER1M4K
6	BUCKE2W0K	WK BUCKETT JR/1151/	BUCKE2W1
7	LUM..0R5H	ROBERTS H S/1141/	S....0R6H
8	M....0V4.	VUING DO/5595/	DO...0V4
9	MILLS0S4C	STEVE MILLS IV/9898/	IV...0S4M
10	SUTTE1D3.	DARK N SUTTER M D/3331/	SUTTE1D3N
11	ATKIN1R5D	ROBERT ATKINS SR/5656/	ATKIN1R5
12	ANGUI9G7.	GAMELIEL S ANGUIAND SR/1897/	ANGUI3G7S
13	ARNIT1J5M	JESSIE M ARNITT JR/4564/	ARNIT1J5
14	CASTI6A8Z	ALEXA Z CASTILLEMOS SR/9861/	CASTI6A4Z
15	BROWN0J5.	JACKIE BROWN M D/5671/	D....0J5B
16	CRAPS2D5J	CRAPSEN DIANIA JOE/0981/	JOE..0C6D
17	DEBAR6M4H	MERCY H DEE/0001/	DEE..0M4H
18	DEDIO1F4R	FELEX V DEDIOS JR/0001/	DEDIO1F4
19	KRIEK6S3T	SEAN THOMAS KREIKEMAIAR/0001/	KREIK6S3T

High-Performance SAS Coding - Third Edition

View solution in original post

buddha_d · Posted 05-27-2022 04:49 PM

The output should have the below records only.

acct_id acct_name

KO...0C4. CAMEY JOE/2282/
LO...0T6Z TSARA YEU ZEE LO/6666/
LOYD.0W5. WESLEY BRUCE LOYD SR/1181/
LUCER7M4. MARIA K LUCERA SR/3333/
LUM..0R5H ROBERTS H S/1141/
M....0V4. VUING DO/5595/
MILLS0S4C STEVE MILLS IV/9898/
SUTTE1D3. DARK N SUTTER M D/3331/
ATKIN1R5D ROBERT ATKINS SR/5656/
ANGUI9G7. GAMELIEL S ANGUIAND SR/1897/
CASTI6A8Z ALEXA Z CASTILLEMOS SR/9861/
DEBAR6M4H MERCY H DEE/0001/
DEDIO1F4R FELEX V DEDIOS JR/0001/

ChrisNZ · Posted 05-30-2022 04:57 AM

Thank you for providing usable data, even if it seems imperfect.

This should get you started:

data WANT ;
 set HAVE;
  STR=scan(ACCT_NAME,1,'/'); 
  STR=prxchange('s/\b(I|V|MD|JR|SR|M|II)\b//',-1,STR); 
  FN =scan(STR,1);
  N2 =scan(STR,2);
  N3 =scan(STR,3);
  if N3 ne ' ' then do;
    LN=N3;
    MN=N2;
  end;
  else LN=N2;
  STR2=cats(substr(LN,1,5)
           ,substrn('....',1,5-length(LN))
           ,max(length(LN)-5,0)
           ,first(FN)
           ,length(FN)-1
           ,first(MN||'.')
           ); 
  if STR2 ne ACCT_ID;
run;


Obs	acct_id	acct_name	STR2
1	KO...0C4.	CAMEY JOE/2282/	JOE..0C4
2	FUSS.0J4.	JACOB JOHN FUSS VI/8889/	FUSS.0J4J
3	LO...0T6Z	TSARA YEU ZEE LO/6666/	ZEE..0T4Y
4	LOYD.0W5.	WESLEY BRUCE LOYD SR/1181/	LOYD.0W5B
5	LUCER7M4.	MARIA K LUCERA SR/3333/	LUCER1M4K
6	BUCKE2W0K	WK BUCKETT JR/1151/	BUCKE2W1
7	LUM..0R5H	ROBERTS H S/1141/	S....0R6H
8	M....0V4.	VUING DO/5595/	DO...0V4
9	MILLS0S4C	STEVE MILLS IV/9898/	IV...0S4M
10	SUTTE1D3.	DARK N SUTTER M D/3331/	SUTTE1D3N
11	ATKIN1R5D	ROBERT ATKINS SR/5656/	ATKIN1R5
12	ANGUI9G7.	GAMELIEL S ANGUIAND SR/1897/	ANGUI3G7S
13	ARNIT1J5M	JESSIE M ARNITT JR/4564/	ARNIT1J5
14	CASTI6A8Z	ALEXA Z CASTILLEMOS SR/9861/	CASTI6A4Z
15	BROWN0J5.	JACKIE BROWN M D/5671/	D....0J5B
16	CRAPS2D5J	CRAPSEN DIANIA JOE/0981/	JOE..0C6D
17	DEBAR6M4H	MERCY H DEE/0001/	DEE..0M4H
18	DEDIO1F4R	FELEX V DEDIOS JR/0001/	DEDIO1F4
19	KRIEK6S3T	SEAN THOMAS KREIKEMAIAR/0001/	KREIK6S3T

High-Performance SAS Coding - Third Edition

buddha_d · Posted 05-30-2022 12:15 PM

ChrisNZ,
This would definitely a good starting point for my project. Thanks for your help.

buddha_d · Posted 06-01-2022 12:39 PM

ChrisNZ,

It is working for the records that have middle name, but if the middle name is missing then it is not picking the period in the end. For example:

Below example doesn't have middle name. So the str2 should have DO...0V4. but it shows as DO...0V4. How can we fix it please?

Thanks,

M....0V4.

VUING DO/5595/

DO...0V4

ChrisNZ · Posted 06-04-2022 02:51 AM

Replace

,first(MN||'.')

with

,coalecsec(first(MN),'.')

?

High-Performance SAS Coding - Third Edition

buddha_d · Posted 06-06-2022 11:31 AM

ChrisNZ,

Thanks for the idea. It works as per the logic. I appreciate your help.
Thanks

Filtering data based on two columns with conditions

Re: Filtering data based on two columns with conditions

Re: Filtering data based on two columns with conditions

Re: Filtering data based on two columns with conditions

Re: Filtering data based on two columns with conditions

Re: Filtering data based on two columns with conditions

Re: Filtering data based on two columns with conditions

Re: Filtering data based on two columns with conditions

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away