BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.

Hi all,

I have a dataset with three columns with this kind of information:

 

OBS DATE CLIENT_ID OLD_CLIENT_ID
1 27-Sep-16 10 6
2 27-Sep-16 7 5
3 17-Jun-16 6 4
4 10-Sep-16 9 8
5 5-Mar-16 4 2

 

Where "date" column is the date where some client has made a change in his account id, and the other columns are the respective values at that moment.

As you can see in the example dataset posted, many (and unknown) number of relationships could exist in the data history. For example, the client "10" was before client "6" (obs 1) but before was client "4" and before of that was client "2".

Therefore, what I want is a final dataset with the client_id-old_client_id pair taking account of what was said before, which will be like this:

 

OBS

CLIENT_ID

OLD_CLIENT_ID
1 10 2
2 7 5
3 9 8

 

Is there a way to accomplish this without using reflexive joins? As I said, number of relationships are unknown so making reflexive joins could be worst-performing.

 

Thanks in advance!

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
As Patrick did.
Assuming One client_id corresponding to One old_client_id.


data chain;
   infile datalines truncover;
   input date date9. client_id old_client_id;
   format date date9.;
   datalines;
27-sep=16 10 6
27-sep-16 7 5
17-jun-16 6 4
10-sep-16 9 8
05-mar-16 4 2
; 
run;
data start;
 if _n_=1 then do;
  if 0 then set chain;
  declare hash h(dataset:'chain');
  h.definekey('old_client_id');
  h.definedone();
 end;
set chain;
if h.check(key:client_id) ne 0;
keep client_id date;
run;
data want;
 if _n_=1 then do;
  if 0 then set chain;
  declare hash h(dataset:'chain');
  h.definekey('client_id');
  h.definedata('old_client_id');
  h.definedone(); 
 end;
set start;
rc=h.find();
do while(rc=0);
 rc=h.find(key:old_client_id);
end;
drop rc;
run;


View solution in original post

7 REPLIES 7
Shmuel
Garnet | Level 18

I have the feeling that some in memory methods are the right mean to solve your problem.

Before tryning to code the program, it is good to know how big is the dataset?

or more specifically:

- how many observations are in the dataset?

- how many clients are expected in the result dataset ?

 

one more question - is there any other variable that can be used to identify client uniquely ?

having such variable will simplify the code.

Shmuel
Garnet | Level 18

With a file up to some thousands of clients you can use the attached tested program

lmignone
SAS Employee

Hi Shmuel,

 

Regarding your questions:

 

- The dataset has ~ 4.000.000 obs.

- The clients expected are ~95% of the obs of the dataset, as most of the relationships has one change only.

There is no other variable to identify the client uniquely.

PGStats
Opal | Level 21

It can be done with indexed random access:

 

data have;
input OBS	DATE :anydtdte.	CLIENT_ID	OLD_CLIENT_ID;
format date yymmdd10.;
datalines;
1	27-Sep-16	10	6
2	27-Sep-16	7	5
3	17-Jun-16	6	4
4	10-Sep-16	9	8
5	5-Mar-16	4	2
;

proc sql;
create unique index client_id on have(client_id);
create table current_clients as
select client_id as last_id
from have 
where client_id not in (select old_client_id from have);
quit;

data want;
set current_clients;
client_id = last_id;
do i = 1 to 100 until(_error_); /* Prevent cycling */
    set have key=client_id / unique;
    if _error_ then do;
        first_id = client_id;
        output;
        end;
    else client_id = old_client_id;
    end;
_error_ = 0;
keep first_id last_id;
run;
PG
Patrick
Opal | Level 21

Below should work as long as you have enough memory for the hashes AND each parent has only one child (so it's a single chain and not a tree with branches).

 

data chain;
   infile datalines truncover;
   input date date9. client_id old_client_id;
   format date date9.;
   datalines;
27-sep=16 10 6
27-sep-16 7 5
17-jun-16 6 4
10-sep-16 9 8
05-mar-16 4 2
05-mar-16 20 
; 
run;

data want(keep=date client_id old_client_id);
  set chain(keep=date client_id old_client_id);
  if _n_=1 then
    do;
      if 0 then set chain(keep=client_id old_client_id rename=(client_id=child old_client_id=parent));
      dcl hash root (dataset:'chain(keep=client_id old_client_id rename=(client_id=child old_client_id=parent))');
      _rc=root.defineKey('child');
      _rc=root.defineData('parent');
      _rc=root.defineDone();
      dcl hash leaf (dataset:'chain(keep=client_id old_client_id rename=(client_id=child old_client_id=parent) where=(parent ne .))');
      _rc=leaf.defineKey('parent');
      _rc=leaf.defineData('child');
      _rc=leaf.defineDone();

    end;
  
  parent=old_client_id;
  do while(root.find(key:parent)=0);
  end;

  child=client_id;
  do while(leaf.find(key:child)=0);
  end;
  
  if client_id=child then 
    do;
      old_client_id=parent;
      output;
    end;
run; 

 

In case your data volumes are high and performance becomes an issue then it would be possible to delete entries from the hashes once we found root and parent - but it would require additional coding and I've tried to keep things simple.

 

Ksharp
Super User
As Patrick did.
Assuming One client_id corresponding to One old_client_id.


data chain;
   infile datalines truncover;
   input date date9. client_id old_client_id;
   format date date9.;
   datalines;
27-sep=16 10 6
27-sep-16 7 5
17-jun-16 6 4
10-sep-16 9 8
05-mar-16 4 2
; 
run;
data start;
 if _n_=1 then do;
  if 0 then set chain;
  declare hash h(dataset:'chain');
  h.definekey('old_client_id');
  h.definedone();
 end;
set chain;
if h.check(key:client_id) ne 0;
keep client_id date;
run;
data want;
 if _n_=1 then do;
  if 0 then set chain;
  declare hash h(dataset:'chain');
  h.definekey('client_id');
  h.definedata('old_client_id');
  h.definedone(); 
 end;
set start;
rc=h.find();
do while(rc=0);
 rc=h.find(key:old_client_id);
end;
drop rc;
run;


lmignone
SAS Employee

After testing the different solutions that were posted, I have to choose this one due to performance (45% faster processing than using reflexive sql joins).

 

Thanks to all Smiley Happy

sas-innovate-2024.png

 

Time is running out to save with the early bird rate. Register by Friday, March 1 for just $695 - $100 off the standard rate.

 

Check out the agenda and get ready for a jam-packed event featuring workshops, super demos, breakout sessions, roundtables, inspiring keynotes and incredible networking events. 

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1479 views
  • 0 likes
  • 5 in conversation