BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Steelers_In_DC
Barite | Level 11

I have a 'have' and 'want' table below.  The have represents two separate accounts, one has single digits, the other double.  Looking at the single digit account:  Notice the cust_ref field associates the first with the second, the second with the third, and the third back to the second, but the third has no link direclty back to the first.  I have a dataset that is over 5 million and growing.  I'm interested in a solution that will link all of these accounts and get me a good orig_account.  If it is the first one based on the date field that's great but no necessary.

 

data HAVE;
infile cards dsd;
input ACCTNUM    cust_ref date;
cards;
1,    1,    1
1,    1,    1
1,    2,    1
2,    1,    1
2,    2,    2
2,    3,    2
3,    2,    2
3,    2,    3
3,    3,    3
10,    10,    1
10,    10,    1
10,    11,    1
11,    11,    1
11,    11,    2
11,    12,    2
12,    11,    2
12,    11,    3
12,    12,    3

;run;

data want;
infile cards dsd;
input ACCTNUM    cust_ref date orig_acct;
cards;
1,    1,    1,    1
1,    1,    1,    1
1,    2,    1,    1
2,    1,    1,    1
2,    2,    2,    1
2,    3,    2,    1
3,    2,    2,    1
3,    2,    3,    1
3,    3,    3,    1
10,    10,    1,    10
10,    10,    1,    10
10,    11,    1,    10
11,    11,    1,    10
11,    11,    2,    10
11,    12,    2,    10
12,    11,    2,    10
12,    11,    3,    10
12,    12,    3,    10
;run;

 

Thank you for any help

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

In your actual data is the "date" an actual date where order can actually be determined? One might think that orig_acct should be the one with the earliest date in the chain but the value you present appears to be something else. If the orig_acct should be the earliest dated one then that provides a lot more information as otherwise I am not sure how you know to use acctnum=1 as orig_acct for cust_ref=1 picking between acctnum 1 and 2 since the "date" is the same value.

 

And for identification purposes there really should not be any duplicate data.

I would start with something like:

Proc summary data=have nway;

   class acctnum cust_ref;

   var date;

   output out=reduced(drop= _: ) min=;

run;

 

And then start analyzing chains.

Note this post may help: https://communities.sas.com/t5/SAS-Communities-Library/Finding-all-single-linked-chains-the-allChain...

View solution in original post

13 REPLIES 13
ballardw
Super User

In your actual data is the "date" an actual date where order can actually be determined? One might think that orig_acct should be the one with the earliest date in the chain but the value you present appears to be something else. If the orig_acct should be the earliest dated one then that provides a lot more information as otherwise I am not sure how you know to use acctnum=1 as orig_acct for cust_ref=1 picking between acctnum 1 and 2 since the "date" is the same value.

 

And for identification purposes there really should not be any duplicate data.

I would start with something like:

Proc summary data=have nway;

   class acctnum cust_ref;

   var date;

   output out=reduced(drop= _: ) min=;

run;

 

And then start analyzing chains.

Note this post may help: https://communities.sas.com/t5/SAS-Communities-Library/Finding-all-single-linked-chains-the-allChain...

Steelers_In_DC
Barite | Level 11

Yes the dates are real dates, I through this table together and didn't intend to have duplicate records.  The current table I'm working with has distinct records. 

Steelers_In_DC
Barite | Level 11

I'm taking a look at the macro you reference.  This is s transactional dataset, there will be many account number per reference number.  If I understand how I'm supposed to set this up with from = account number I'm getting the error that there is a duplicate key. 

ballardw
Super User

Did you start with your raw data or a summary dataset such as I created above with one record per customer account?

Steelers_In_DC
Barite | Level 11

I tried the raw data first, then the summary, getting the same error for both.  Maybe a better example of the dummy data is this:

 

Account Ref Date
260 260 070731
260 099 080603
260 104 080603
104 260 080603
104 099 080603
104 104 080603

 

The are unique rows but the account number is still duplicated, even with out the date it would still have a duplicate account.

ballardw
Super User

So you have multiple Customers in each account? That isn't what you showed in the original data. Or is your "ref" a transaction identifier and not a customer identfier?

Steelers_In_DC
Barite | Level 11

No, it's one customer.  This represents a credit card, the account number is the number on the plastic.  The ref number is any number associated with that customer.  Could be old plastics, or a cross refrence number from an acquired account.  There is a field called original_account but I have found it is not accurate and customers who lose their cards and get replacements over the years don't have a good account number to capture the entire history in one shot. 

Steelers_In_DC
Barite | Level 11

I ran this again with a small population, I used sql to get a distinct row of only the columns needed.  I still get the same error but there is output.  It looks like the '_to_' field populates correctly but the '_from_' field doesn't.  If you have any ideas I'd love to hear them.  I'm going to keep working on this to try to debug, also run it with my entire population.  The _from_ field would be a nice to have, but _to_ suits me just fine.  This is a great solution!!!

 

Cheers,

Steelers_In_DC
Barite | Level 11

Are you very familiar with the 'all chains' macro?  Would you expect it to work with a table of 33 million rows? I'm getting the following error pop up, with nothing in the log:

 

An error occured executing the workspace job "All Chains".
The server is disconnected.

ballardw
Super User

The macro has limits such as the non-duplicating and non-cycle nature of chains and a depth of 100 (which would be 100 "links" a->b, b->c etc.

If the macro works for subsets of the data then it may mean it would work for more links but the program would need to be modified. I suggested it as a somewhat more developed bit of code to start from then I would have time to write in my time without pay.

 

I have to admit that from some of your data, realizing full well that you can't share all the details, I would have a hard time figuring out what some of those paths would look like in the final result.

 

When you posted this data;

Account Ref Date
260 260 070731
260 099 080603
260 104 080603
104 260 080603
104 099 080603
104 104 080603

 my feeling was without some external value such as a tax identification number that doesn't change this isn't going to be easily tractable. The comment "This represents a credit card, the account number is the number on the plastic.  The ref number is any number associated with that customer" is very disconcerting.

When you have 260-> 104 and 104->260 it makes me question if there is actually a difference since the dates are the same.

 

Have considered a trial where you order each pair such that the 260->104 on 080603 is made into a 104-260. Then reduce the set to only the earliest version of the 104->260 pair.

That would reduce the "cycle" issue and possibly significantly reduce the size of the issue. A brief example:

data example;
   informat  Account Ref $5. Date yymmdd6.;
   format date mmddyy10.;
   input Account Ref Date ;
datalines;
260 260 070731 
260 099 080603 
260 104 080603 
104 260 080603 
104 099 080603 
104 104 080603 
;
run;

data swapped;
   set example;
   if account<ref then do; 
      low=account;
      high=ref;
   end;
   else do;
      low=ref;
      high=account;
   end;
run;

proc summary data=swapped nway;
   class low high;
   var date;
   output out=swapsum(drop= _:) min=;
run;

 

 

Which lets us identfy an ordered chain of 99-104-260 (possibly more amenable to the chain macro). If the scope is reduced enough you may not need to pull out the "same" data. Otherwise you reduce the same to earliest same pare. If the dates were more realistic (read not all the same) then sorting by date would have the order that the pairs appeared. If that is the actual data then "first" or "orginal" seems to be a bit moot. Identify the chain using the low-high pairs, set "chain value" that is the same for each record in the chaing, find the earliest pair which really hopefully should indicate the "original" record pairt and then pull in the original account and ref data. I think this account should be the original. Match on the "chain value" to update the records to include the "original account". This would require having the low-high pair as a start for the merge.
 
Not simple, I'm not sure how robust and if I misunderstand the data and especially the role that date plays then all bets are off.
Steelers_In_DC
Barite | Level 11

The dates aren't extremely important.  When I used your macro with a small test sample I removed the date field.  It worked great, the date would be ideal to pull the acct number with the lowest date. 

 

The numbers aren't sequential either, which makes another problem.  So the min(account) doesn't necessarily have to be the earliest one issued.  I'm not sure how to QC this after the fact.  I have my own solution that isn't great and very slow, I'm taking known accounts that are good examples:  Say someone was acquired from a different company, so they have a reference number that ties it to the account but was never a real account number, then loses their card every weekend at a bar, so they have 20 different card numbers. 

 

Using an example like that I can track down all the individual cross reference numbers manually and see if my process works, but that doesn't lend a great deal of confidence to the population. 

mkeintz
PROC Star

I have resolved an issue like this (to get the original id's of collateral default obligations) using PROC OPTNET.  Do you have SAS/OR?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Steelers_In_DC
Barite | Level 11

I have seen some good examples using that proc as well.  A good terrorist example with code is available online.  Unfortunately I do not have that license.  Do you know if that license generally comes with Enterprise Data Miner, or is it something totally different?  I will be getting miner and can put a rush on the license if it will help with this.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 1262 views
  • 0 likes
  • 3 in conversation