BookmarkSubscribeRSS Feed
anu1999
Obsidian | Level 7

Hi All,

I am trying to do a data quality check on similar datasets like below. I would appreciate any help on implementing this logic:

 

There are two data sets with account no, start date, end date, year of transaction, month of transaction. Both datasets have different account no. but due to some reason accounts which used to be same no. are appearing in both tables with different account no. For example below in data "one", account_id 1 and account_id 8 from data "two" are same by looking at their similar start dt, end_dt AND continuation of months. I would like to combine these two accounts together.

 

I want to aggregate  such account numbers (append in continuation) in to separate dataset with all 5 variables.

 

Appreciate your help.

 

data one;
format start_dt end_dt date9.;
input acc_id start_dt date9. @13 end_dt date9. yr mth;@13 end_dt date9. yr mth;
datalines;
1 01Aug2015 21JUN2021 2009 01
1 01Aug2015 21JUN2021 2009 02
1 01Aug2015 21Jun2021 2009 03
1 01Aug2015 21Jun2021 2009 04
1 01Aug2015 21Jun2021 2009 05
1 01Aug2015 21Jun2021 2009 06
2 19SEP2005 16JAN2019 2005 10
2 19SEP2005 16JAN2019 2005 11
2 19SEP2005 16JAN2019 2005 12
3 27JAN2007 14JAN2021 2007 02
3 27JAN2007 14JAN2021 2007 03
3 27JAN2007 14JAN2021 2007 04
4 23FEB2009 05JAN2032 2009 03
4 23FEB2009 05JAN2032 2009 04
4 23FEB2009 05JAN2032 2009 05
5 23AUG2009 29JAN2027 2009 09
5 23AUG2009 29JAN2027 2009 10
5 23AUG2009 29JAN2027 2009 11
;
run;

data two;
format start_dt end_dt date9.;
input acc_id start_dt date9. @13 end_dt date9. yr mth;@13 end_dt date9. yr mth;
datalines;
8 01Aug2015 21JUN2021 2009 07
8 01Aug2015 21JUN2021 2009 08
8 01Aug2015 21Jun2021 2009 09
8 01Aug2015 21Jun2021 2009 10
8 01Aug2015 21Jun2021 2009 11
8 01Aug2015 21Jun2021 2009 12
4 09SEP2005 16JAN2019 2005 10
4 09SEP2005 16JAN2019 2005 11
4 09SEP2005 16JAN2019 2005 12
7 07JAN2007 14JAN2021 2007 02
7 07JAN2007 14JAN2021 2007 03
7 07JAN2007 14JAN2021 2007 04
6 13FEB2009 05JAN2032 2009 03
6 13FEB2009 05JAN2032 2009 04
6 13FEB2009 05JAN2032 2009 05
9 23AUG2009 29JAN2027 2009 12
9 23AUG2009 29JAN2027 2010 01
9 23AUG2009 29JAN2027 2010 01
;
run;

7 REPLIES 7
ballardw
Super User

What do you expect the results to look like? Are there other variables that might exist in each set? Do want only one of those other variables in the result or both? If only one value, what is the rule for selecting it?

anu1999
Obsidian | Level 7

I want to aggregate  such account numbers (append in continuation) in to separate dataset with all 5 variables.

Astounding
PROC Star

Seriously???

 

If earlier programs assigned the wrong account ID some of the time (maybe all of the time?), go back and fix the earlier programs.  It's not just that the account IDs are wrong.  Maybe all the variables are wrong.  How would you know?  Who would ever trust the results?

anu1999
Obsidian | Level 7

I want to produce a separate dataset for investigation purpose and not to produce any result for using it. Wrong account number is not a programmatic error, its a situation where bank ABC underwent acquisition and account number got changed. So idea is to investigate such pattern if they are same accounts. Thanks for your comments.

Astounding
PROC Star

If you want to investigate cases where the dates match but the account numbers don't match, you are better off leaving the data as is. 

 

Is there any other objective here, other than locating those cases to investigate?  If not, you could try:

 

proc sql;

create table mismatch as

   select one.*, two.acct_id as acct_2

   where (spell out that the dates match and the IDs do not)

   ;

quit;

 

It's untested code (and SQL isn't my strong point), but it's likely a step in the right direction.

Reeza
Super User

Yeah, don't do this. 

 

This is where some leg work and telephone conversations are more important. Look into if there was a system to changing the account numbers. I would be shocked if there wasn't an account look up map. At the very least determine their process for changing account numbers to see if you can replicate it. 

 

And if not, which bank(s) is this, I want to make sure I never, ever, ever, bank with them. 

PoornimaRavishankar
Quartz | Level 8

Ignoring all kinds of adverse results which may be caused by linking the accounts blindly by just dates, here's one approach (Code attached)

1. I have assumed you want to replace the acc_id in two with acc_id in one. If it is the other way round, the code will change slightly in the PROC SQL part. (acct_id_map.sas)

 

There is a problem with the results I have attached and I will let you figure out why. Check acc_id = 4. If you aren't able to figure this out, reply to this message. This is precisely why this kind of approach is wrong as all the others have pointed out.

 

 

2. If you are simply attempting to cluster the records and not mapping of accounts, use cluster.sas.

 

 

Good luck!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 2171 views
  • 1 like
  • 5 in conversation