Hi all,
I have a dataset with three columns (plus the observation number) with this kind of information:
OBS | DATE | CLIENT_ID | OLD_CLIENT_ID |
1 | 27-Sep-16 | 10 | 6 |
2 | 27-Sep-16 | 7 | 5 |
3 | 17-Jun-16 | 6 | 4 |
4 | 10-Sep-16 | 9 | 8 |
5 | 5-Mar-16 | 4 | 2 |
The DATE column is the date on which a client changed his account id, and the other columns hold the new and old id values at that moment.
As you can see in the example dataset, an unknown number of chained relationships can exist in the data history. For example, client 10 was previously client 6 (obs 1), which in turn was client 4 (obs 3), which before that was client 2 (obs 5).
Therefore, what I want is a final dataset with each client_id paired with its original old_client_id, resolving the whole chain, like this:
OBS | CLIENT_ID | OLD_CLIENT_ID |
1 | 10 | 2 |
2 | 7 | 5 |
3 | 9 | 8 |
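For reference, the kind of reflexive join I would like to avoid looks roughly like this (just a sketch; HAVE stands for my dataset, and each pass only resolves one level of the chain, so it would have to be repeated an unknown number of times):
/* one self-join pass: step each old_client_id one level further back */
proc sql;
   create table pass1 as
   select a.client_id,
          coalesce(b.old_client_id, a.old_client_id) as old_client_id
   from have as a
   left join have as b
      on a.old_client_id = b.client_id;
quit;
/* repeat on the result (pass2, pass3, ...) until old_client_id stops changing,
   then keep only the client_ids that never appear as an old_client_id */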
Is there a way to accomplish this without using reflexive joins? As I said, the number of chained relationships is unknown, so repeated reflexive joins could perform poorly.
Thanks in advance!
I have the feeling that some in-memory method is the right means to solve your problem.
Before trying to code the program, it would be good to know how big the dataset is, or more specifically:
- how many observations are in the dataset?
- how many clients are expected in the result dataset?
One more question - is there any other variable that can be used to identify a client uniquely?
Having such a variable would simplify the code.
With a file of up to some thousands of clients, you can use the attached tested program.
Hi Shmuel,
Regarding your questions:
- The dataset has ~4,000,000 obs.
- The expected number of clients is ~95% of the obs in the dataset, as most relationships have only one change.
There is no other variable to identify the client uniquely.
It can be done with indexed random access:
data have;
input OBS DATE :anydtdte. CLIENT_ID OLD_CLIENT_ID;
format date yymmdd10.;
datalines;
1 27-Sep-16 10 6
2 27-Sep-16 7 5
3 17-Jun-16 6 4
4 10-Sep-16 9 8
5 5-Mar-16 4 2
;
proc sql;
create unique index client_id on have(client_id);
/* current clients = ids that never appear as anyone's old_client_id */
create table current_clients as
select client_id as last_id
from have
where client_id not in (select old_client_id from have);
quit;
data want;
set current_clients;
client_id = last_id;
do i = 1 to 100 until(_error_); /* Prevent cycling */
set have key=client_id / unique; /* indexed lookup of the record where this id is the client_id */
if _error_ then do; /* no match: we have reached the first id of the chain */
first_id = client_id;
output;
end;
else client_id = old_client_id; /* otherwise step one level back in the chain */
end;
_error_ = 0; /* reset so the failed lookup is not written to the log */
keep first_id last_id;
run;
Below should work as long as you have enough memory for the hashes AND each parent has only one child (so it's a single chain and not a tree with branches).
data chain;
infile datalines truncover;
input date date9. client_id old_client_id;
format date date9.;
datalines;
27-sep-16 10 6
27-sep-16 7 5
17-jun-16 6 4
10-sep-16 9 8
05-mar-16 4 2
05-mar-16 20
;
run;
data want(keep=date client_id old_client_id);
   set chain(keep=date client_id old_client_id);
   if _n_=1 then
      do;
         if 0 then set chain(keep=client_id old_client_id rename=(client_id=child old_client_id=parent));
         dcl hash root (dataset:'chain(keep=client_id old_client_id rename=(client_id=child old_client_id=parent))');
         _rc=root.defineKey('child');
         _rc=root.defineData('parent');
         _rc=root.defineDone();
         dcl hash leaf (dataset:'chain(keep=client_id old_client_id rename=(client_id=child old_client_id=parent) where=(parent ne .))');
         _rc=leaf.defineKey('parent');
         _rc=leaf.defineData('child');
         _rc=leaf.defineDone();
      end;
   /* walk up to the root of the chain */
   parent=old_client_id;
   do while(root.find(key:parent)=0);
   end;
   /* walk down to the most recent id */
   child=client_id;
   do while(leaf.find(key:child)=0);
   end;
   /* output only the record holding the most recent id, paired with the root */
   if client_id=child then
      do;
         old_client_id=parent;
         output;
      end;
run;
In case your data volumes are high and performance becomes an issue, it would be possible to delete entries from the hashes once the root and leaf of a chain have been found - but that would require additional coding and I've tried to keep things simple.
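To illustrate the idea, below is a rough, untested sketch of how that deletion could look. It adds a third hash DONE so records from chains that have already been written out get skipped, and it removes a chain's entries from the ROOT and LEAF hashes right after its result is output; NODE and _RC are just work variables for this sketch.
data want(keep=date client_id old_client_id);
   set chain(keep=date client_id old_client_id);
   if _n_=1 then
      do;
         if 0 then set chain(keep=client_id old_client_id rename=(client_id=child old_client_id=parent));
         dcl hash root (dataset:'chain(keep=client_id old_client_id rename=(client_id=child old_client_id=parent))');
         _rc=root.defineKey('child');
         _rc=root.defineData('parent');
         _rc=root.defineDone();
         dcl hash leaf (dataset:'chain(keep=client_id old_client_id rename=(client_id=child old_client_id=parent) where=(parent ne .))');
         _rc=leaf.defineKey('parent');
         _rc=leaf.defineData('child');
         _rc=leaf.defineDone();
         /* client_ids belonging to chains that have already been resolved */
         dcl hash done ();
         _rc=done.defineKey('client_id');
         _rc=done.defineData('client_id');
         _rc=done.defineDone();
      end;
   /* skip records of chains we already wrote out */
   if done.check()=0 then delete;
   /* walk up to the root of the chain */
   parent=old_client_id;
   do while(root.find(key:parent)=0);
   end;
   /* walk down to the most recent id */
   child=client_id;
   do while(leaf.find(key:child)=0);
   end;
   if client_id=child then
      do;
         old_client_id=parent;
         output;
         /* walk the chain once more, marking it as done and freeing its hash entries */
         node=client_id;
         do while(root.find(key:node)=0); /* find() sets PARENT to the direct parent of NODE */
            _rc=done.add(key:node, data:node);
            _rc=root.remove(key:node);
            if parent ne . then _rc=leaf.remove(key:parent);
            node=parent;
         end;
      end;
run;
Note that the skipping only pays off when the most recent record of a chain happens to be read before its older records; the hash entries get freed either way.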
As Patrick did. Assuming one client_id corresponds to one old_client_id.
data chain;
   infile datalines truncover;
   input date date9. client_id old_client_id;
   format date date9.;
datalines;
27-sep-16 10 6
27-sep-16 7 5
17-jun-16 6 4
10-sep-16 9 8
05-mar-16 4 2
;
run;
data start;
   if _n_=1 then do;
      if 0 then set chain;
      declare hash h(dataset:'chain');
      h.definekey('old_client_id');
      h.definedone();
   end;
   set chain;
   if h.check(key:client_id) ne 0; /* keep only current ids: those never used as an old_client_id */
   keep client_id date;
run;
data want;
   if _n_=1 then do;
      if 0 then set chain;
      declare hash h(dataset:'chain');
      h.definekey('client_id');
      h.definedata('old_client_id');
      h.definedone();
   end;
   set start;
   rc=h.find(); /* look up the previous id for this client */
   do while(rc=0);
      rc=h.find(key:old_client_id); /* keep stepping back until the start of the chain */
   end;
   drop rc;
run;
After testing the different solutions that were posted, I chose this one for its performance (45% faster processing than using reflexive SQL joins).
Thanks to all