DATA Step, Macro, Functions and more

Complex data restructure issue

Accepted Solution Solved
Reply
New Contributor
Posts: 4
Accepted Solution

Complex data restructure issue

[ Edited ]

Hi 

I have a data set of card history as below. For each customer, they may have applied for one or multiple cards on the same day. However, due to various reasons, their cards get replaced. Card issue date is when a card is issued. New Card ID is the ID for the replaced card. For example, for customer A, his card was firstly issue on 2/1/2017 and the card ID is 1234. 3 days later, he lost his card and a new card (1235) is issued on 5/2/2017. 

Customer IDFirst Issue DateCard Issue DateCard IDNew Card ID
A2/1/20172/1/201712341235
A2/1/20175/2/20171235 
B5/2/20175/2/201712451248
B5/2/20175/2/201712361249
B5/2/201710/3/201712481250
B5/2/20175/3/201712491251
B5/2/201710/4/20171250 
B5/2/20175/4/20171251 

 

What I want is to group the original card and all the replacement together. For example, customer B applied for two cards on 5/2/217. Card ID 1245, 1248 and 1250 are in the same group (Seq No 1) and Card ID 1236, 1249 and 1251 are in the same group (Seq No 2).

Customer IDOpen DateCard Issue DateCard IDSeq No
A2/1/20172/1/201712341
A2/1/20175/2/201712351
B5/2/20175/2/201712451
B5/2/201710/3/201712481
B5/2/201710/4/201712501
B5/2/20175/2/201712362
B5/2/20175/3/201712492
B5/2/20175/4/201712512

 

Please help me with this data transformation. 

 

Here is the data step for the input file

data test;
infile datalines dsd truncover ;
input Customer:$1.
First_Issue_Date: ddmmyy10.
Card_Issue_Date: ddmmyy10.
Card_ID: $4.
New_Card_ID: $4. ;
format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
;


Accepted Solutions
Solution
‎12-18-2017 08:19 PM
Super User
Posts: 10,570

Re: Complex data restructure issue

[ Edited ]

I guess this will do it:

data have;
infile datalines dsd truncover ;
input Customer:$1.
First_Issue_Date: ddmmyy10.
Card_Issue_Date: ddmmyy10.
Card_ID: $4.
New_Card_ID: $4. ;
format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
;
run;

proc sort data=have;
by customer card_issue_date card_id;
run;

data int;
set have;
by customer;
array card_chain{5} $100 cc1-cc5;
retain cc1-cc5;
if first.customer
then do;
  do i = 2 to 5;
    card_chain{i} = ' ';
  end;
  card_chain{1} = catx(',',card_id,new_card_id);
end;
else do;
  i = 1;
  do while (i <= 5);
    if card_chain{i} = ' '
    then do;
      card_chain{i} = catx(',',card_id,new_card_id);
      i = 6;
    end;
    else if index(card_chain{i},card_id) > 0
    then do;
      card_chain{i} = catx(',',card_chain{i},new_card_id);
      i = 6;
    end;
    else i + 1;
  end;
end;
if last.customer then do seq_no = 1 to 5;
  if card_chain{seq_no} > ' '
  then do i = 1 to countw(card_chain{seq_no},',');
    card_id = scan(card_chain{seq_no},i,',');
    output;
  end;
end;
keep customer card_id seq_no;
run;

proc sort data=have;
by customer card_id;
run;

proc sort data=int;
by customer card_id;
run;

data want;
merge
  have (drop=new_card_id)
  int
;
by customer card_id;
run;

proc sort data=want;
by customer seq_no card_id;
run;

Note that the strings in the array need to be long enough to accomodate your maximum size of a sequence of cards.

The first sort makes sure that the first card issued will end up in seq_no = 1, and that the seq_no will follow the timeline.

 

Edit: removed put statements used during debugging.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code

View solution in original post


All Replies
Super User
Super User
Posts: 9,840

Re: Complex data restructure issue

Sorry, I can't see any logic here, why is "Card ID 1245, 1248 and 1250 are in the same group (Seq No 1)"?  I can see no logical reasoning here.  For me, cardid 1236 would be the one in group 1, it is the lowest card number found in group: 

5/2/2017 5/2/2017

 

First step would be, and you need to provide test data in the form of a datastep, follow this post if unsure:

https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

 

To get all the Ids in one column:

data want;
  set have (keep=id first date id)
        have (keep=id first date id rename=(newid=id));
run;

Next sort by your grouping, so the two dates.  Then do a datastep with a by group:

data want;  
  set want;
  by first date;
  retain seq;
  if first.date then seq=1;
  else seq1=seq+1;
run;
New Contributor
Posts: 4

Re: Complex data restructure issue

Hi RW9, thanks for your reply. I have added the data step to generate input file in the post.

First of all, I need to clarify that the Card ID are generated randomly and they are not in sequence. 

Secondly, the sequence of card id is described by information in the "New Card ID" column. For example,  customer B got his first card 1245 on 5/2/2017, which is row 4. On 10/3/2017, his card got replaced and the new card id is 1248. Meanwhile, a recorded is generated in the "New Card ID" column on row 4 to indicate the new id is 1248. Therefore, 1245 and 1248 are in the same group. The same process is repeated one more time and 1250 replaced 1248. Therefore, 1245,1248 and 1250 are in the same group.  

Super User
Super User
Posts: 9,840

Re: Complex data restructure issue

Still note sure why 1251 is seq 2.

Anyways, this is someway there:

data test;
  infile datalines dsd truncover ;
  input Customer:$1.
  First_Issue_Date: ddmmyy10.
  Card_Issue_Date: ddmmyy10.
  Card_ID: $4.
  New_Card_ID: $4. ;
  format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/02/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/03/2017,1251,,
;
run;

proc sort data=test out=want (drop=new_card_id);
  by customer card_issue_date;
run;

data want;
  set want;
  by customer card_issue_date;
  if first.card_issue_date then seq=1;
  else seq=2;
run;
New Contributor
Posts: 4

Re: Complex data restructure issue

Hi RW9.

I have updated the input dateset a bit to make it clear. 

New Card ID couldn't be simply dropped as it tells you the sequence of cards.

1251 is in seq 2 because customer B got card 1236 on 5/2/2017. He replaced the card with 1249 on 5/3/2017. On 5/4/2017, he replaced 1249 with 1251. 

Solution
‎12-18-2017 08:19 PM
Super User
Posts: 10,570

Re: Complex data restructure issue

[ Edited ]

I guess this will do it:

data have;
infile datalines dsd truncover ;
input Customer:$1.
First_Issue_Date: ddmmyy10.
Card_Issue_Date: ddmmyy10.
Card_ID: $4.
New_Card_ID: $4. ;
format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
;
run;

proc sort data=have;
by customer card_issue_date card_id;
run;

data int;
set have;
by customer;
array card_chain{5} $100 cc1-cc5;
retain cc1-cc5;
if first.customer
then do;
  do i = 2 to 5;
    card_chain{i} = ' ';
  end;
  card_chain{1} = catx(',',card_id,new_card_id);
end;
else do;
  i = 1;
  do while (i <= 5);
    if card_chain{i} = ' '
    then do;
      card_chain{i} = catx(',',card_id,new_card_id);
      i = 6;
    end;
    else if index(card_chain{i},card_id) > 0
    then do;
      card_chain{i} = catx(',',card_chain{i},new_card_id);
      i = 6;
    end;
    else i + 1;
  end;
end;
if last.customer then do seq_no = 1 to 5;
  if card_chain{seq_no} > ' '
  then do i = 1 to countw(card_chain{seq_no},',');
    card_id = scan(card_chain{seq_no},i,',');
    output;
  end;
end;
keep customer card_id seq_no;
run;

proc sort data=have;
by customer card_id;
run;

proc sort data=int;
by customer card_id;
run;

data want;
merge
  have (drop=new_card_id)
  int
;
by customer card_id;
run;

proc sort data=want;
by customer seq_no card_id;
run;

Note that the strings in the array need to be long enough to accomodate your maximum size of a sequence of cards.

The first sort makes sure that the first card issued will end up in seq_no = 1, and that the seq_no will follow the timeline.

 

Edit: removed put statements used during debugging.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Trusted Advisor
Posts: 1,392

Re: Complex data restructure issue

You are trying to establish groups of "connected components", similar to a problem presented to the forum last week, and is described and solved in How to check duplicate customer IDs using SAS.

 

If your card id's can all be converted to integers, then the primary difference between your problem and that one is the presence of "by groups", i.e.you have to solve the problem for each customer, not just the entire dataset:

 

data test; 
  infile datalines dsd truncover ; 
  input Customer:$1.
        First_Issue_Date: ddmmyy10.
        Card_Issue_Date: ddmmyy10.
        Card_ID: 4.
        New_Card_ID: 4. ;
  format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
run;

data want;
  set test (in=firstpass  where=(new_card_id^=.))  
      test (in=secondpass);
  by customer;

  array src {1200:1300} _temporary_;
  array seq {1200:1300} _temporary_;

  if first.customer then call missing(of src{*},of seq{*},sequence);

  if firstpass then src{new_card_id}=card_id;

  if secondpass;
  source=card_id;
  do while (src{source}^=.);
    source=src{source};
  end;
  if seq{source}=. then do;
    sequence+1;
    seq{source}=sequence;
  end;
  else sequence=seq{source};
  drop source new_card_id;
run;

proc sort; by customer sequence card_id;run;

 

Notes:

  1. The set statement interleaves data set test with itself on an id by id basis.
  2. The SRC array links each new_card_id to card_id (i.e. sink to srouce).  Iteratively following these links until no link is available allows grouping of id's.  The array lower and upper bounds are set by the known integer ranges of the card id's.
  3. The SEQ array assign a sequence number to each final source.
New Contributor
Posts: 4

Re: Complex data restructure issue

Thank you for providing with the solution. However, the actual card id is 16 digits long and with a wide range. I don't have sufficient memory to implement it.

 

Super User
Posts: 2,061

Re: Complex data restructure issue

Sorry for my late entry, slept off   Smiley Sad If my understanding of your requirement is right, it seems rather straight forward with hashes:

/* I took the Data have from @KurtBremser  Thanks to him*/

data have;

infile datalines dsd truncover ;

input Customer:$1.

First_Issue_Date: ddmmyy10.

Card_Issue_Date: ddmmyy10.

Card_ID: $4.

New_Card_ID: $4. ;

format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;

datalines;

A,02/01/2017,02/01/2017,1234,1235,

A,02/01/2017,05/02/2017,1235,,

B,05/02/2017,05/02/2017,1245,1248,

B,05/02/2017,05/02/2017,1236,1249,

B,05/02/2017,10/03/2017,1248,1250,

B,05/02/2017,05/03/2017,1249,1251,

B,05/02/2017,10/04/2017,1250,,

B,05/02/2017,05/04/2017,1251,,

;

run;

 

 

data want;

 if _N_ = 1 then do;

       if 0 then set have;

      declare hash h(dataset: "have", multidata:'y',ordered:'y');    

      h.defineKey('card_id');

      h.defineData(all:'yes');

      h.defineDone();

       declare hash h1();    

      h1.defineKey('card_id');

      h1.defineDone();

       end;

set  have;

     by customer;

     if h1.check()=0 then delete;

     if first.customer then  call missing(seq);

           seq+1;

           output;

           h.add();

     if not missing(new_card_id) then

           do;

                do while(not missing(new_card_id));

                rc=h.find(key:new_card_id);

                if rc=0 then do;

                     output;

                     h1.add();

                     end;

                end;

           end;

drop rc new_card_id;

run;

 

Regards,

Naveen Srinivasan

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 182 views
  • 0 likes
  • 5 in conversation