Hi
I have a data set of card history as below. For each customer, they may have applied for one or multiple cards on the same day. However, due to various reasons, their cards get replaced. Card issue date is when a card is issued. New Card ID is the ID for the replaced card. For example, for customer A, his card was firstly issue on 2/1/2017 and the card ID is 1234. 3 days later, he lost his card and a new card (1235) is issued on 5/2/2017.
Customer ID | First Issue Date | Card Issue Date | Card ID | New Card ID |
A | 2/1/2017 | 2/1/2017 | 1234 | 1235 |
A | 2/1/2017 | 5/2/2017 | 1235 | |
B | 5/2/2017 | 5/2/2017 | 1245 | 1248 |
B | 5/2/2017 | 5/2/2017 | 1236 | 1249 |
B | 5/2/2017 | 10/3/2017 | 1248 | 1250 |
B | 5/2/2017 | 5/3/2017 | 1249 | 1251 |
B | 5/2/2017 | 10/4/2017 | 1250 | |
B | 5/2/2017 | 5/4/2017 | 1251 |
What I want is to group the original card and all the replacement together. For example, customer B applied for two cards on 5/2/217. Card ID 1245, 1248 and 1250 are in the same group (Seq No 1) and Card ID 1236, 1249 and 1251 are in the same group (Seq No 2).
Customer ID | Open Date | Card Issue Date | Card ID | Seq No |
A | 2/1/2017 | 2/1/2017 | 1234 | 1 |
A | 2/1/2017 | 5/2/2017 | 1235 | 1 |
B | 5/2/2017 | 5/2/2017 | 1245 | 1 |
B | 5/2/2017 | 10/3/2017 | 1248 | 1 |
B | 5/2/2017 | 10/4/2017 | 1250 | 1 |
B | 5/2/2017 | 5/2/2017 | 1236 | 2 |
B | 5/2/2017 | 5/3/2017 | 1249 | 2 |
B | 5/2/2017 | 5/4/2017 | 1251 | 2 |
Please help me with this data transformation.
Here is the data step for the input file
data test;
infile datalines dsd truncover ;
input Customer:$1.
First_Issue_Date: ddmmyy10.
Card_Issue_Date: ddmmyy10.
Card_ID: $4.
New_Card_ID: $4. ;
format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
;
I guess this will do it:
data have;
infile datalines dsd truncover ;
input Customer:$1.
First_Issue_Date: ddmmyy10.
Card_Issue_Date: ddmmyy10.
Card_ID: $4.
New_Card_ID: $4. ;
format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
;
run;
proc sort data=have;
by customer card_issue_date card_id;
run;
data int;
set have;
by customer;
array card_chain{5} $100 cc1-cc5;
retain cc1-cc5;
if first.customer
then do;
do i = 2 to 5;
card_chain{i} = ' ';
end;
card_chain{1} = catx(',',card_id,new_card_id);
end;
else do;
i = 1;
do while (i <= 5);
if card_chain{i} = ' '
then do;
card_chain{i} = catx(',',card_id,new_card_id);
i = 6;
end;
else if index(card_chain{i},card_id) > 0
then do;
card_chain{i} = catx(',',card_chain{i},new_card_id);
i = 6;
end;
else i + 1;
end;
end;
if last.customer then do seq_no = 1 to 5;
if card_chain{seq_no} > ' '
then do i = 1 to countw(card_chain{seq_no},',');
card_id = scan(card_chain{seq_no},i,',');
output;
end;
end;
keep customer card_id seq_no;
run;
proc sort data=have;
by customer card_id;
run;
proc sort data=int;
by customer card_id;
run;
data want;
merge
have (drop=new_card_id)
int
;
by customer card_id;
run;
proc sort data=want;
by customer seq_no card_id;
run;
Note that the strings in the array need to be long enough to accomodate your maximum size of a sequence of cards.
The first sort makes sure that the first card issued will end up in seq_no = 1, and that the seq_no will follow the timeline.
Edit: removed put statements used during debugging.
Sorry, I can't see any logic here, why is "Card ID 1245, 1248 and 1250 are in the same group (Seq No 1)"? I can see no logical reasoning here. For me, cardid 1236 would be the one in group 1, it is the lowest card number found in group:
5/2/2017 | 5/2/2017 |
First step would be, and you need to provide test data in the form of a datastep, follow this post if unsure:
To get all the Ids in one column:
data want; set have (keep=id first date id) have (keep=id first date id rename=(newid=id)); run;
Next sort by your grouping, so the two dates. Then do a datastep with a by group:
data want; set want; by first date; retain seq; if first.date then seq=1; else seq1=seq+1; run;
Hi RW9, thanks for your reply. I have added the data step to generate input file in the post.
First of all, I need to clarify that the Card ID are generated randomly and they are not in sequence.
Secondly, the sequence of card id is described by information in the "New Card ID" column. For example, customer B got his first card 1245 on 5/2/2017, which is row 4. On 10/3/2017, his card got replaced and the new card id is 1248. Meanwhile, a recorded is generated in the "New Card ID" column on row 4 to indicate the new id is 1248. Therefore, 1245 and 1248 are in the same group. The same process is repeated one more time and 1250 replaced 1248. Therefore, 1245,1248 and 1250 are in the same group.
Still note sure why 1251 is seq 2.
Anyways, this is someway there:
data test; infile datalines dsd truncover ; input Customer:$1. First_Issue_Date: ddmmyy10. Card_Issue_Date: ddmmyy10. Card_ID: $4. New_Card_ID: $4. ; format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.; datalines; A,02/01/2017,02/01/2017,1234,1235, A,02/01/2017,05/02/2017,1235,, B,05/02/2017,05/02/2017,1245,1248, B,05/02/2017,05/02/2017,1236,1249, B,05/02/2017,10/03/2017,1248,1250, B,05/02/2017,05/02/2017,1249,1251, B,05/02/2017,10/04/2017,1250,, B,05/02/2017,05/03/2017,1251,, ; run; proc sort data=test out=want (drop=new_card_id); by customer card_issue_date; run; data want; set want; by customer card_issue_date; if first.card_issue_date then seq=1; else seq=2; run;
Hi RW9.
I have updated the input dateset a bit to make it clear.
New Card ID couldn't be simply dropped as it tells you the sequence of cards.
1251 is in seq 2 because customer B got card 1236 on 5/2/2017. He replaced the card with 1249 on 5/3/2017. On 5/4/2017, he replaced 1249 with 1251.
I guess this will do it:
data have;
infile datalines dsd truncover ;
input Customer:$1.
First_Issue_Date: ddmmyy10.
Card_Issue_Date: ddmmyy10.
Card_ID: $4.
New_Card_ID: $4. ;
format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
;
run;
proc sort data=have;
by customer card_issue_date card_id;
run;
data int;
set have;
by customer;
array card_chain{5} $100 cc1-cc5;
retain cc1-cc5;
if first.customer
then do;
do i = 2 to 5;
card_chain{i} = ' ';
end;
card_chain{1} = catx(',',card_id,new_card_id);
end;
else do;
i = 1;
do while (i <= 5);
if card_chain{i} = ' '
then do;
card_chain{i} = catx(',',card_id,new_card_id);
i = 6;
end;
else if index(card_chain{i},card_id) > 0
then do;
card_chain{i} = catx(',',card_chain{i},new_card_id);
i = 6;
end;
else i + 1;
end;
end;
if last.customer then do seq_no = 1 to 5;
if card_chain{seq_no} > ' '
then do i = 1 to countw(card_chain{seq_no},',');
card_id = scan(card_chain{seq_no},i,',');
output;
end;
end;
keep customer card_id seq_no;
run;
proc sort data=have;
by customer card_id;
run;
proc sort data=int;
by customer card_id;
run;
data want;
merge
have (drop=new_card_id)
int
;
by customer card_id;
run;
proc sort data=want;
by customer seq_no card_id;
run;
Note that the strings in the array need to be long enough to accomodate your maximum size of a sequence of cards.
The first sort makes sure that the first card issued will end up in seq_no = 1, and that the seq_no will follow the timeline.
Edit: removed put statements used during debugging.
You are trying to establish groups of "connected components", similar to a problem presented to the forum last week, and is described and solved in How to check duplicate customer IDs using SAS.
If your card id's can all be converted to integers, then the primary difference between your problem and that one is the presence of "by groups", i.e.you have to solve the problem for each customer, not just the entire dataset:
data test;
infile datalines dsd truncover ;
input Customer:$1.
First_Issue_Date: ddmmyy10.
Card_Issue_Date: ddmmyy10.
Card_ID: 4.
New_Card_ID: 4. ;
format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
run;
data want;
set test (in=firstpass where=(new_card_id^=.))
test (in=secondpass);
by customer;
array src {1200:1300} _temporary_;
array seq {1200:1300} _temporary_;
if first.customer then call missing(of src{*},of seq{*},sequence);
if firstpass then src{new_card_id}=card_id;
if secondpass;
source=card_id;
do while (src{source}^=.);
source=src{source};
end;
if seq{source}=. then do;
sequence+1;
seq{source}=sequence;
end;
else sequence=seq{source};
drop source new_card_id;
run;
proc sort; by customer sequence card_id;run;
Notes:
Thank you for providing with the solution. However, the actual card id is 16 digits long and with a wide range. I don't have sufficient memory to implement it.
Sorry for my late entry, slept off If my understanding of your requirement is right, it seems rather straight forward with hashes:
/* I took the Data have from @Kurt_Bremser Thanks to him*/
data have;
infile datalines dsd truncover ;
input Customer:$1.
First_Issue_Date: ddmmyy10.
Card_Issue_Date: ddmmyy10.
Card_ID: $4.
New_Card_ID: $4. ;
format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
;
run;
data want;
if _N_ = 1 then do;
if 0 then set have;
declare hash h(dataset: "have", multidata:'y',ordered:'y');
h.defineKey('card_id');
h.defineData(all:'yes');
h.defineDone();
declare hash h1();
h1.defineKey('card_id');
h1.defineDone();
end;
set have;
by customer;
if h1.check()=0 then delete;
if first.customer then call missing(seq);
seq+1;
output;
h.add();
if not missing(new_card_id) then
do;
do while(not missing(new_card_id));
rc=h.find(key:new_card_id);
if rc=0 then do;
output;
h1.add();
end;
end;
end;
drop rc new_card_id;
run;
Regards,
Naveen Srinivasan
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.