I have a 'have' and 'want' table below. The have represents two separate accounts, one has single digits, the other double. Looking at the single digit account: Notice the cust_ref field associates the first with the second, the second with the third, and the third back to the second, but the third has no link direclty back to the first. I have a dataset that is over 5 million and growing. I'm interested in a solution that will link all of these accounts and get me a good orig_account. If it is the first one based on the date field that's great but no necessary.
data HAVE;
infile cards dsd;
input ACCTNUM cust_ref date;
cards;
1, 1, 1
1, 1, 1
1, 2, 1
2, 1, 1
2, 2, 2
2, 3, 2
3, 2, 2
3, 2, 3
3, 3, 3
10, 10, 1
10, 10, 1
10, 11, 1
11, 11, 1
11, 11, 2
11, 12, 2
12, 11, 2
12, 11, 3
12, 12, 3
;run;
data want;
infile cards dsd;
input ACCTNUM cust_ref date orig_acct;
cards;
1, 1, 1, 1
1, 1, 1, 1
1, 2, 1, 1
2, 1, 1, 1
2, 2, 2, 1
2, 3, 2, 1
3, 2, 2, 1
3, 2, 3, 1
3, 3, 3, 1
10, 10, 1, 10
10, 10, 1, 10
10, 11, 1, 10
11, 11, 1, 10
11, 11, 2, 10
11, 12, 2, 10
12, 11, 2, 10
12, 11, 3, 10
12, 12, 3, 10
;run;
Thank you for any help
In your actual data is the "date" an actual date where order can actually be determined? One might think that orig_acct should be the one with the earliest date in the chain but the value you present appears to be something else. If the orig_acct should be the earliest dated one then that provides a lot more information as otherwise I am not sure how you know to use acctnum=1 as orig_acct for cust_ref=1 picking between acctnum 1 and 2 since the "date" is the same value.
And for identification purposes there really should not be any duplicate data.
I would start with something like:
Proc summary data=have nway;
class acctnum cust_ref;
var date;
output out=reduced(drop= _: ) min=;
run;
And then start analyzing chains.
Note this post may help: https://communities.sas.com/t5/SAS-Communities-Library/Finding-all-single-linked-chains-the-allChain...
In your actual data is the "date" an actual date where order can actually be determined? One might think that orig_acct should be the one with the earliest date in the chain but the value you present appears to be something else. If the orig_acct should be the earliest dated one then that provides a lot more information as otherwise I am not sure how you know to use acctnum=1 as orig_acct for cust_ref=1 picking between acctnum 1 and 2 since the "date" is the same value.
And for identification purposes there really should not be any duplicate data.
I would start with something like:
Proc summary data=have nway;
class acctnum cust_ref;
var date;
output out=reduced(drop= _: ) min=;
run;
And then start analyzing chains.
Note this post may help: https://communities.sas.com/t5/SAS-Communities-Library/Finding-all-single-linked-chains-the-allChain...
Yes the dates are real dates, I through this table together and didn't intend to have duplicate records. The current table I'm working with has distinct records.
I'm taking a look at the macro you reference. This is s transactional dataset, there will be many account number per reference number. If I understand how I'm supposed to set this up with from = account number I'm getting the error that there is a duplicate key.
Did you start with your raw data or a summary dataset such as I created above with one record per customer account?
I tried the raw data first, then the summary, getting the same error for both. Maybe a better example of the dummy data is this:
Account | Ref | Date |
260 | 260 | 070731 |
260 | 099 | 080603 |
260 | 104 | 080603 |
104 | 260 | 080603 |
104 | 099 | 080603 |
104 | 104 | 080603 |
The are unique rows but the account number is still duplicated, even with out the date it would still have a duplicate account.
So you have multiple Customers in each account? That isn't what you showed in the original data. Or is your "ref" a transaction identifier and not a customer identfier?
No, it's one customer. This represents a credit card, the account number is the number on the plastic. The ref number is any number associated with that customer. Could be old plastics, or a cross refrence number from an acquired account. There is a field called original_account but I have found it is not accurate and customers who lose their cards and get replacements over the years don't have a good account number to capture the entire history in one shot.
I ran this again with a small population, I used sql to get a distinct row of only the columns needed. I still get the same error but there is output. It looks like the '_to_' field populates correctly but the '_from_' field doesn't. If you have any ideas I'd love to hear them. I'm going to keep working on this to try to debug, also run it with my entire population. The _from_ field would be a nice to have, but _to_ suits me just fine. This is a great solution!!!
Cheers,
Are you very familiar with the 'all chains' macro? Would you expect it to work with a table of 33 million rows? I'm getting the following error pop up, with nothing in the log:
An error occured executing the workspace job "All Chains".
The server is disconnected.
The macro has limits such as the non-duplicating and non-cycle nature of chains and a depth of 100 (which would be 100 "links" a->b, b->c etc.
If the macro works for subsets of the data then it may mean it would work for more links but the program would need to be modified. I suggested it as a somewhat more developed bit of code to start from then I would have time to write in my time without pay.
I have to admit that from some of your data, realizing full well that you can't share all the details, I would have a hard time figuring out what some of those paths would look like in the final result.
When you posted this data;
Account | Ref | Date |
260 | 260 | 070731 |
260 | 099 | 080603 |
260 | 104 | 080603 |
104 | 260 | 080603 |
104 | 099 | 080603 |
104 | 104 | 080603 |
my feeling was without some external value such as a tax identification number that doesn't change this isn't going to be easily tractable. The comment "This represents a credit card, the account number is the number on the plastic. The ref number is any number associated with that customer" is very disconcerting.
When you have 260-> 104 and 104->260 it makes me question if there is actually a difference since the dates are the same.
Have considered a trial where you order each pair such that the 260->104 on 080603 is made into a 104-260. Then reduce the set to only the earliest version of the 104->260 pair.
That would reduce the "cycle" issue and possibly significantly reduce the size of the issue. A brief example:
data example; informat Account Ref $5. Date yymmdd6.; format date mmddyy10.; input Account Ref Date ; datalines; 260 260 070731 260 099 080603 260 104 080603 104 260 080603 104 099 080603 104 104 080603 ; run; data swapped; set example; if account<ref then do; low=account; high=ref; end; else do; low=ref; high=account; end; run; proc summary data=swapped nway; class low high; var date; output out=swapsum(drop= _:) min=; run;
The dates aren't extremely important. When I used your macro with a small test sample I removed the date field. It worked great, the date would be ideal to pull the acct number with the lowest date.
The numbers aren't sequential either, which makes another problem. So the min(account) doesn't necessarily have to be the earliest one issued. I'm not sure how to QC this after the fact. I have my own solution that isn't great and very slow, I'm taking known accounts that are good examples: Say someone was acquired from a different company, so they have a reference number that ties it to the account but was never a real account number, then loses their card every weekend at a bar, so they have 20 different card numbers.
Using an example like that I can track down all the individual cross reference numbers manually and see if my process works, but that doesn't lend a great deal of confidence to the population.
I have resolved an issue like this (to get the original id's of collateral default obligations) using PROC OPTNET. Do you have SAS/OR?
I have seen some good examples using that proc as well. A good terrorist example with code is available online. Unfortunately I do not have that license. Do you know if that license generally comes with Enterprise Data Miner, or is it something totally different? I will be getting miner and can put a rush on the license if it will help with this.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.