BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
TJwang
Calcite | Level 5

Hi 

I have a data set of card history as below. For each customer, they may have applied for one or multiple cards on the same day. However, due to various reasons, their cards get replaced. Card issue date is when a card is issued. New Card ID is the ID for the replaced card. For example, for customer A, his card was firstly issue on 2/1/2017 and the card ID is 1234. 3 days later, he lost his card and a new card (1235) is issued on 5/2/2017. 

Customer IDFirst Issue DateCard Issue DateCard IDNew Card ID
A2/1/20172/1/201712341235
A2/1/20175/2/20171235 
B5/2/20175/2/201712451248
B5/2/20175/2/201712361249
B5/2/201710/3/201712481250
B5/2/20175/3/201712491251
B5/2/201710/4/20171250 
B5/2/20175/4/20171251 

 

What I want is to group the original card and all the replacement together. For example, customer B applied for two cards on 5/2/217. Card ID 1245, 1248 and 1250 are in the same group (Seq No 1) and Card ID 1236, 1249 and 1251 are in the same group (Seq No 2).

Customer IDOpen DateCard Issue DateCard IDSeq No
A2/1/20172/1/201712341
A2/1/20175/2/201712351
B5/2/20175/2/201712451
B5/2/201710/3/201712481
B5/2/201710/4/201712501
B5/2/20175/2/201712362
B5/2/20175/3/201712492
B5/2/20175/4/201712512

 

Please help me with this data transformation. 

 

Here is the data step for the input file

data test;
infile datalines dsd truncover ;
input Customer:$1.
First_Issue_Date: ddmmyy10.
Card_Issue_Date: ddmmyy10.
Card_ID: $4.
New_Card_ID: $4. ;
format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
;

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

I guess this will do it:

data have;
infile datalines dsd truncover ;
input Customer:$1.
First_Issue_Date: ddmmyy10.
Card_Issue_Date: ddmmyy10.
Card_ID: $4.
New_Card_ID: $4. ;
format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
;
run;

proc sort data=have;
by customer card_issue_date card_id;
run;

data int;
set have;
by customer;
array card_chain{5} $100 cc1-cc5;
retain cc1-cc5;
if first.customer
then do;
  do i = 2 to 5;
    card_chain{i} = ' ';
  end;
  card_chain{1} = catx(',',card_id,new_card_id);
end;
else do;
  i = 1;
  do while (i <= 5);
    if card_chain{i} = ' '
    then do;
      card_chain{i} = catx(',',card_id,new_card_id);
      i = 6;
    end;
    else if index(card_chain{i},card_id) > 0
    then do;
      card_chain{i} = catx(',',card_chain{i},new_card_id);
      i = 6;
    end;
    else i + 1;
  end;
end;
if last.customer then do seq_no = 1 to 5;
  if card_chain{seq_no} > ' '
  then do i = 1 to countw(card_chain{seq_no},',');
    card_id = scan(card_chain{seq_no},i,',');
    output;
  end;
end;
keep customer card_id seq_no;
run;

proc sort data=have;
by customer card_id;
run;

proc sort data=int;
by customer card_id;
run;

data want;
merge
  have (drop=new_card_id)
  int
;
by customer card_id;
run;

proc sort data=want;
by customer seq_no card_id;
run;

Note that the strings in the array need to be long enough to accomodate your maximum size of a sequence of cards.

The first sort makes sure that the first card issued will end up in seq_no = 1, and that the seq_no will follow the timeline.

 

Edit: removed put statements used during debugging.

View solution in original post

8 REPLIES 8
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Sorry, I can't see any logic here, why is "Card ID 1245, 1248 and 1250 are in the same group (Seq No 1)"?  I can see no logical reasoning here.  For me, cardid 1236 would be the one in group 1, it is the lowest card number found in group: 

5/2/2017 5/2/2017

 

First step would be, and you need to provide test data in the form of a datastep, follow this post if unsure:

https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

 

To get all the Ids in one column:

data want;
  set have (keep=id first date id)
        have (keep=id first date id rename=(newid=id));
run;

Next sort by your grouping, so the two dates.  Then do a datastep with a by group:

data want;  
  set want;
  by first date;
  retain seq;
  if first.date then seq=1;
  else seq1=seq+1;
run;
TJwang
Calcite | Level 5

Hi RW9, thanks for your reply. I have added the data step to generate input file in the post.

First of all, I need to clarify that the Card ID are generated randomly and they are not in sequence. 

Secondly, the sequence of card id is described by information in the "New Card ID" column. For example,  customer B got his first card 1245 on 5/2/2017, which is row 4. On 10/3/2017, his card got replaced and the new card id is 1248. Meanwhile, a recorded is generated in the "New Card ID" column on row 4 to indicate the new id is 1248. Therefore, 1245 and 1248 are in the same group. The same process is repeated one more time and 1250 replaced 1248. Therefore, 1245,1248 and 1250 are in the same group.  

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Still note sure why 1251 is seq 2.

Anyways, this is someway there:

data test;
  infile datalines dsd truncover ;
  input Customer:$1.
  First_Issue_Date: ddmmyy10.
  Card_Issue_Date: ddmmyy10.
  Card_ID: $4.
  New_Card_ID: $4. ;
  format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/02/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/03/2017,1251,,
;
run;

proc sort data=test out=want (drop=new_card_id);
  by customer card_issue_date;
run;

data want;
  set want;
  by customer card_issue_date;
  if first.card_issue_date then seq=1;
  else seq=2;
run;
TJwang
Calcite | Level 5

Hi RW9.

I have updated the input dateset a bit to make it clear. 

New Card ID couldn't be simply dropped as it tells you the sequence of cards.

1251 is in seq 2 because customer B got card 1236 on 5/2/2017. He replaced the card with 1249 on 5/3/2017. On 5/4/2017, he replaced 1249 with 1251. 

Kurt_Bremser
Super User

I guess this will do it:

data have;
infile datalines dsd truncover ;
input Customer:$1.
First_Issue_Date: ddmmyy10.
Card_Issue_Date: ddmmyy10.
Card_ID: $4.
New_Card_ID: $4. ;
format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
;
run;

proc sort data=have;
by customer card_issue_date card_id;
run;

data int;
set have;
by customer;
array card_chain{5} $100 cc1-cc5;
retain cc1-cc5;
if first.customer
then do;
  do i = 2 to 5;
    card_chain{i} = ' ';
  end;
  card_chain{1} = catx(',',card_id,new_card_id);
end;
else do;
  i = 1;
  do while (i <= 5);
    if card_chain{i} = ' '
    then do;
      card_chain{i} = catx(',',card_id,new_card_id);
      i = 6;
    end;
    else if index(card_chain{i},card_id) > 0
    then do;
      card_chain{i} = catx(',',card_chain{i},new_card_id);
      i = 6;
    end;
    else i + 1;
  end;
end;
if last.customer then do seq_no = 1 to 5;
  if card_chain{seq_no} > ' '
  then do i = 1 to countw(card_chain{seq_no},',');
    card_id = scan(card_chain{seq_no},i,',');
    output;
  end;
end;
keep customer card_id seq_no;
run;

proc sort data=have;
by customer card_id;
run;

proc sort data=int;
by customer card_id;
run;

data want;
merge
  have (drop=new_card_id)
  int
;
by customer card_id;
run;

proc sort data=want;
by customer seq_no card_id;
run;

Note that the strings in the array need to be long enough to accomodate your maximum size of a sequence of cards.

The first sort makes sure that the first card issued will end up in seq_no = 1, and that the seq_no will follow the timeline.

 

Edit: removed put statements used during debugging.

mkeintz
PROC Star

You are trying to establish groups of "connected components", similar to a problem presented to the forum last week, and is described and solved in How to check duplicate customer IDs using SAS.

 

If your card id's can all be converted to integers, then the primary difference between your problem and that one is the presence of "by groups", i.e.you have to solve the problem for each customer, not just the entire dataset:

 

data test; 
  infile datalines dsd truncover ; 
  input Customer:$1.
        First_Issue_Date: ddmmyy10.
        Card_Issue_Date: ddmmyy10.
        Card_ID: 4.
        New_Card_ID: 4. ;
  format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
run;

data want;
  set test (in=firstpass  where=(new_card_id^=.))  
      test (in=secondpass);
  by customer;

  array src {1200:1300} _temporary_;
  array seq {1200:1300} _temporary_;

  if first.customer then call missing(of src{*},of seq{*},sequence);

  if firstpass then src{new_card_id}=card_id;

  if secondpass;
  source=card_id;
  do while (src{source}^=.);
    source=src{source};
  end;
  if seq{source}=. then do;
    sequence+1;
    seq{source}=sequence;
  end;
  else sequence=seq{source};
  drop source new_card_id;
run;

proc sort; by customer sequence card_id;run;

 

Notes:

  1. The set statement interleaves data set test with itself on an id by id basis.
  2. The SRC array links each new_card_id to card_id (i.e. sink to srouce).  Iteratively following these links until no link is available allows grouping of id's.  The array lower and upper bounds are set by the known integer ranges of the card id's.
  3. The SEQ array assign a sequence number to each final source.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
TJwang
Calcite | Level 5

Thank you for providing with the solution. However, the actual card id is 16 digits long and with a wide range. I don't have sufficient memory to implement it.

 

novinosrin
Tourmaline | Level 20

Sorry for my late entry, slept off   Smiley Sad If my understanding of your requirement is right, it seems rather straight forward with hashes:

/* I took the Data have from @Kurt_Bremser  Thanks to him*/

data have;

infile datalines dsd truncover ;

input Customer:$1.

First_Issue_Date: ddmmyy10.

Card_Issue_Date: ddmmyy10.

Card_ID: $4.

New_Card_ID: $4. ;

format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;

datalines;

A,02/01/2017,02/01/2017,1234,1235,

A,02/01/2017,05/02/2017,1235,,

B,05/02/2017,05/02/2017,1245,1248,

B,05/02/2017,05/02/2017,1236,1249,

B,05/02/2017,10/03/2017,1248,1250,

B,05/02/2017,05/03/2017,1249,1251,

B,05/02/2017,10/04/2017,1250,,

B,05/02/2017,05/04/2017,1251,,

;

run;

 

 

data want;

 if _N_ = 1 then do;

       if 0 then set have;

      declare hash h(dataset: "have", multidata:'y',ordered:'y');    

      h.defineKey('card_id');

      h.defineData(all:'yes');

      h.defineDone();

       declare hash h1();    

      h1.defineKey('card_id');

      h1.defineDone();

       end;

set  have;

     by customer;

     if h1.check()=0 then delete;

     if first.customer then  call missing(seq);

           seq+1;

           output;

           h.add();

     if not missing(new_card_id) then

           do;

                do while(not missing(new_card_id));

                rc=h.find(key:new_card_id);

                if rc=0 then do;

                     output;

                     h1.add();

                     end;

                end;

           end;

drop rc new_card_id;

run;

 

Regards,

Naveen Srinivasan

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 900 views
  • 0 likes
  • 5 in conversation