Hi all,
I have a dataset with three columns (plus the observation number) with this kind of information:
OBS | DATE | CLIENT_ID | OLD_CLIENT_ID |
1 | 27-Sep-16 | 10 | 6 |
2 | 27-Sep-16 | 7 | 5 |
3 | 17-Jun-16 | 6 | 4 |
4 | 10-Sep-16 | 9 | 8 |
5 | 5-Mar-16 | 4 | 2 |
The DATE column is the date on which a client changed his account id, and the other columns hold the new and old id values at that moment.
As you can see in the example dataset, an unknown number of chained relationships can exist in the data history. For example, client 10 was previously client 6 (obs 1), which in turn was client 4 (obs 3), which before that was client 2 (obs 5).
Therefore, what I want is a final dataset with each client_id paired with its original old_client_id, resolving the whole chain, like this:
OBS | CLIENT_ID | OLD_CLIENT_ID |
1 | 10 | 2 |
2 | 7 | 5 |
3 | 9 | 8 |
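For reference, the kind of reflexive join I would like to avoid looks roughly like this (just a sketch; HAVE stands for my dataset, and each pass only resolves one level of the chain, so it would have to be repeated an unknown number of times):
/* one self-join pass: step each old_client_id one level further back */
proc sql;
   create table pass1 as
   select a.client_id,
          coalesce(b.old_client_id, a.old_client_id) as old_client_id
   from have as a
   left join have as b
      on a.old_client_id = b.client_id;
quit;
/* repeat on the result (pass2, pass3, ...) until old_client_id stops changing,
   then keep only the client_ids that never appear as an old_client_id */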
Is there a way to accomplish this without using reflexive joins? As I said, the number of chained relationships is unknown, so repeated reflexive joins could perform poorly.
Thanks in advance!
I have the feeling that some in-memory method is the right means to solve your problem.
Before trying to code the program, it would be good to know how big the dataset is, or more specifically:
- how many observations are in the dataset?
- how many clients are expected in the result dataset?
One more question - is there any other variable that can be used to identify a client uniquely?
Having such a variable would simplify the code.
With a file of up to some thousands of clients, you can use the attached tested program.
Hi Shmuel,
Regarding your questions:
- The dataset has ~4,000,000 obs.
- The expected number of clients is ~95% of the obs in the dataset, as most relationships have only one change.
There is no other variable to identify the client uniquely.
It can be done with indexed random access:
data have;
input OBS DATE :anydtdte. CLIENT_ID OLD_CLIENT_ID;
format date yymmdd10.;
datalines;
1 27-Sep-16 10 6
2 27-Sep-16 7 5
3 17-Jun-16 6 4
4 10-Sep-16 9 8
5 5-Mar-16 4 2
;
proc sql;
create unique index client_id on have(client_id);
/* current clients = ids that never appear as anyone's old_client_id */
create table current_clients as
select client_id as last_id
from have
where client_id not in (select old_client_id from have);
quit;
data want;
set current_clients;
client_id = last_id;
do i = 1 to 100 until(_error_); /* Prevent cycling */
set have key=client_id / unique; /* indexed lookup of the record where this id is the client_id */
if _error_ then do; /* no match: we have reached the first id of the chain */
first_id = client_id;
output;
end;
else client_id = old_client_id; /* otherwise step one level back in the chain */
end;
_error_ = 0; /* reset so the failed lookup is not written to the log */
keep first_id last_id;
run;
Below should work as long as you have enough memory for the hashes AND each parent has only one child (so it's a single chain and not a tree with branches).
data chain;
infile datalines truncover;
input date date9. client_id old_client_id;
format date date9.;
datalines;
27-sep-16 10 6
27-sep-16 7 5
17-jun-16 6 4
10-sep-16 9 8
05-mar-16 4 2
05-mar-16 20
;
run;
data want(keep=date client_id old_client_id);
   set chain(keep=date client_id old_client_id);
   if _n_=1 then
      do;
         if 0 then set chain(keep=client_id old_client_id rename=(client_id=child old_client_id=parent));
         dcl hash root (dataset:'chain(keep=client_id old_client_id rename=(client_id=child old_client_id=parent))');
         _rc=root.defineKey('child');
         _rc=root.defineData('parent');
         _rc=root.defineDone();
         dcl hash leaf (dataset:'chain(keep=client_id old_client_id rename=(client_id=child old_client_id=parent) where=(parent ne .))');
         _rc=leaf.defineKey('parent');
         _rc=leaf.defineData('child');
         _rc=leaf.defineDone();
      end;
   /* walk up to the root of the chain */
   parent=old_client_id;
   do while(root.find(key:parent)=0);
   end;
   /* walk down to the most recent id */
   child=client_id;
   do while(leaf.find(key:child)=0);
   end;
   /* output only the record holding the most recent id, paired with the root */
   if client_id=child then
      do;
         old_client_id=parent;
         output;
      end;
run;
In case your data volumes are high and performance becomes an issue, it would be possible to delete entries from the hashes once the root and leaf of a chain have been found - but that would require additional coding and I've tried to keep things simple.
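To illustrate the idea, below is a rough, untested sketch of how that deletion could look. It adds a third hash DONE so records from chains that have already been written out get skipped, and it removes a chain's entries from the ROOT and LEAF hashes right after its result is output; NODE and _RC are just work variables for this sketch.
data want(keep=date client_id old_client_id);
   set chain(keep=date client_id old_client_id);
   if _n_=1 then
      do;
         if 0 then set chain(keep=client_id old_client_id rename=(client_id=child old_client_id=parent));
         dcl hash root (dataset:'chain(keep=client_id old_client_id rename=(client_id=child old_client_id=parent))');
         _rc=root.defineKey('child');
         _rc=root.defineData('parent');
         _rc=root.defineDone();
         dcl hash leaf (dataset:'chain(keep=client_id old_client_id rename=(client_id=child old_client_id=parent) where=(parent ne .))');
         _rc=leaf.defineKey('parent');
         _rc=leaf.defineData('child');
         _rc=leaf.defineDone();
         /* client_ids belonging to chains that have already been resolved */
         dcl hash done ();
         _rc=done.defineKey('client_id');
         _rc=done.defineData('client_id');
         _rc=done.defineDone();
      end;
   /* skip records of chains we already wrote out */
   if done.check()=0 then delete;
   /* walk up to the root of the chain */
   parent=old_client_id;
   do while(root.find(key:parent)=0);
   end;
   /* walk down to the most recent id */
   child=client_id;
   do while(leaf.find(key:child)=0);
   end;
   if client_id=child then
      do;
         old_client_id=parent;
         output;
         /* walk the chain once more, marking it as done and freeing its hash entries */
         node=client_id;
         do while(root.find(key:node)=0); /* find() sets PARENT to the direct parent of NODE */
            _rc=done.add(key:node, data:node);
            _rc=root.remove(key:node);
            if parent ne . then _rc=leaf.remove(key:parent);
            node=parent;
         end;
      end;
run;
Note that the skipping only pays off when the most recent record of a chain happens to be read before its older records; the hash entries get freed either way.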
As Patrick did. Assuming one client_id corresponds to one old_client_id.
data chain;
   infile datalines truncover;
   input date date9. client_id old_client_id;
   format date date9.;
datalines;
27-sep-16 10 6
27-sep-16 7 5
17-jun-16 6 4
10-sep-16 9 8
05-mar-16 4 2
;
run;
data start;
   if _n_=1 then do;
      if 0 then set chain;
      declare hash h(dataset:'chain');
      h.definekey('old_client_id');
      h.definedone();
   end;
   set chain;
   if h.check(key:client_id) ne 0; /* keep only current ids: those never used as an old_client_id */
   keep client_id date;
run;
data want;
   if _n_=1 then do;
      if 0 then set chain;
      declare hash h(dataset:'chain');
      h.definekey('client_id');
      h.definedata('old_client_id');
      h.definedone();
   end;
   set start;
   rc=h.find(); /* look up the previous id for this client */
   do while(rc=0);
      rc=h.find(key:old_client_id); /* keep stepping back until the start of the chain */
   end;
   drop rc;
run;
After testing the different solutions that were posted, I chose this one for its performance (45% faster processing than using reflexive SQL joins).
Thanks to all