I have the below dataset. Acctnbr is the customer current number. Vintage_id is a derived field pulling from several different possible cross reference account ids accross different systems. I want to get a common 'cust_id' from this table. I tried to do min or max(vintage_id) group by acctnbr but it is not consistent. In this example I would want the vintage_id ending in 2532 to be the customer id for each acctnbr because it is the only one that goes with all three acctnbr's.
Have:
ACCTNBR | VINTAGE_ID |
4311963130842532 | 4311963130842532 |
4311963138028183 | 4311963130842532 |
4311963138028183 | 4311963138028183 |
4311963138028183 | 4311963139332832 |
4311963139332832 | 4311963130842532 |
4311963139332832 | 4311963138028183 |
4311963139332832 | 4311963139332832 |
Want:
ACCTNBR | cust_id |
4311963130842532 | 4311963130842532 |
4311963138028183 | 4311963130842532 |
4311963139332832 | 4311963130842532 |
If there are multiple vintage_id's that match each acctnbr it doesn't matter which one is selected. I suppose I would prefer a min to stay consistent but it is arbitrary.
Any thoughts or suggestions on how to do this?
edit: This comes from a very large file with 100 million+ of acctnbrs, the number of acctnbrs and vintage_id's per acctnbr will not be consistent
@Steelers_In_DC wrote:
If there are multiple vintage_id's that match each acctnbr it doesn't matter which one is selected. I suppose I would prefer a min to stay consistent but it is arbitrary.
By "min" do you mean the minimum numeric value of the vintage_id or the minimum count?
When I wrote it I was thinking min numeric value, but if min/max count is more efficient I would just as soon go with that.
I am trying this on my own and ran into another situation. I apologize for not having thought of this before. If there are two options for cust_id I would need to select the one with the min(opendate) to get the earliest record.
ACCTNBR | VINTAGE_ID | OPENDATE |
4311963130842532 | 4311963130842532 | 07Jun2013 |
4311963138028183 | 4311963130842532 | 25Nov2015 |
4311963138028183 | 4311963138028183 | 25Nov2015 |
4311963138028183 | 4311963139332832 | 25Nov2015 |
4311963139332832 | 4311963130842532 | 07Jun2013 |
4311963139332832 | 4311963138028183 | 07Jun2013 |
4311963139332832 | 4311963139332832 | 07Jun2013 |
The solution I found:
proc sql;
create table vintage_id as
select distinct a.acctnbr,a.vintage_id,b.opendate
from start a left join
start b on
a.vintage_id = b.acctnbr
order by acctnbr,opendate;
/*Transpose step will pull the most recent opendate into col1, this will be used for cust_id*/
proc transpose data=vintage_id out=tran_vintage(rename=(col1=cust_id)drop=_NAME_ );by acctnbr;var vintage_id;
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.