Hi all,
I have a dataset with three columns with this kind of information:
OBS | DATE | CLIENT_ID | OLD_CLIENT_ID |
1 | 27-Sep-16 | 10 | 6 |
2 | 27-Sep-16 | 7 | 5 |
3 | 17-Jun-16 | 6 | 4 |
4 | 10-Sep-16 | 9 | 8 |
5 | 5-Mar-16 | 4 | 2 |
Where "date" column is the date where some client has made a change in his account id, and the other columns are the respective values at that moment.
As you can see in the example dataset posted, many (and unknown) number of relationships could exist in the data history. For example, the client "10" was before client "6" (obs 1) but before was client "4" (obs 3) and before of that was client "2".
Therefore, what I want is a final dataset with the client_id-old_client_id pairs taking account of what was said before, which will be like this:
OBS | DATE | CLIENT_ID | OLD_CLIENT_ID |
1 | 27-Sep-16 | 10 | 6 |
2 | 17-Jun-16 | 10 | 4 |
3 | 5-Mar-16 | 10 | 2 |
4 | 27-Sep-16 | 7 | 5 |
5 | 10-Sep-16 | 9 | 8 |
The dataset has ~ 4.000.000 obs.
There is no other variable to identify the client uniquely.
Is there a way to accomplish this without using reflexive joins? As I said, number of relationships are unknown so making reflexive joins could be worst-performing.
Thanks in advance!
There are a couple of solved questions on here with different HASH solutions to this problem.
This is based on a macro written by PGStats. You didn't show what you want as your final results, but this will get you closer, if its not what you want as an end result. Note the last parameter controls how 'deep' the search goes so you do need some idea or set it to an arbitrarily high number.
%macro SubGraphs(arcs,from=from,to=to,out=Clusters,exp=8);
data _null_;
if 0 then set &arcs(keep=&from rename=(&from=node)); /* get node data type */
length clust 8;
declare hash nodes(hashexp:&exp);
nodes.defineKey('node');
nodes.defineData('node', 'clust');
nodes.defineDone();
declare hiter nodeList('nodes');
do newClust = 1 by 1 while(not endLoop);
set &arcs end=endLoop;
call missing(clust); node = &from;
if 0^=nodes.find() then nodes.add();
fromClust = clust;
call missing(clust); node = &to;
if 0^=nodes.find() then nodes.add();
toClust = clust;
if n(fromClust, toClust) = 0 then do;
nodes.replace(key:&from, data:&from, data:newClust);
nodes.replace(key:&to, data:&to, data:newClust);
end;
else if missing(toClust) then
nodes.replace(key:&to, data:&to, data:fromClust);
else if missing(fromClust) then
nodes.replace(key:&from, data:&from, data:toClust);
else if fromClust ne toClust then do;
rc = nodeList.first();
do while (rc = 0);
if clust = fromClust then
nodes.replace(key:node, data:node, data:toClust);
rc = nodeList.next();
end;
end;
end;
nodes.output(dataset:"&out");
stop;
run;
%mend SubGraphs;
data have;
informat obs 8. date date9.;
format date date9.;
input OBS DATE CLIENT_ID OLD_CLIENT_ID;
cards;
1 27Sep16 10 6
2 17Jun16 10 4
3 5Mar16 10 2
4 27Sep16 7 5
5 10Sep16 9 8
;
run;
%SubGraphs(have, from=old_client_id, to=client_id, out=want,exp=4);
proc sort data=want;
by clust node;
run;
How is this question different from your previous (solved) question
Hi PGStats,
the expected result is not the same, as the output of the solved answer were 3 observations, meanwhile, for this question, the expected output has to got 5 observations.
@lmignone wrote:
Hi PGStats,
the expected result is not the same, as the output of the solved answer were 3 observations, meanwhile, for this question, the expected output has to got 5 observations.
I'm confused. How does the 3/5 change the situation?
Those solutions don't seem dependent on the depth of the linkage. Can you explain with an example of how it doesn't work?
In the previous solution, the expected output was just the oldest client id for each client_id:
OBS |
CLIENT_ID |
OLD_CLIENT_ID |
1 | 10 | 2 |
2 | 7 | 5 |
3 | 9 | 8 |
Meanwhile, in this question I'm asking for all the relationships that, for example, client_id 10 has in the input data. The expected output is this:
OBS | DATE | CLIENT_ID | OLD_CLIENT_ID |
1 | 27-Sep-16 | 10 | 6 |
2 | 17-Jun-16 | 10 | 4 |
3 | 5-Mar-16 | 10 | 2 |
4 | 27-Sep-16 | 7 | 5 |
5 | 10-Sep-16 | 9 | 8 |
Note that I don't need the relationship between 6-4 or between 4-2, as those ones has a newer client ID (10).
So the solution I provided doesn't work for you?
It needs a further transformation in order to reach the desired result, but it works...
Thanks Reeza!
There are a couple of solved questions on here with different HASH solutions to this problem.
This is based on a macro written by PGStats. You didn't show what you want as your final results, but this will get you closer, if its not what you want as an end result. Note the last parameter controls how 'deep' the search goes so you do need some idea or set it to an arbitrarily high number.
%macro SubGraphs(arcs,from=from,to=to,out=Clusters,exp=8);
data _null_;
if 0 then set &arcs(keep=&from rename=(&from=node)); /* get node data type */
length clust 8;
declare hash nodes(hashexp:&exp);
nodes.defineKey('node');
nodes.defineData('node', 'clust');
nodes.defineDone();
declare hiter nodeList('nodes');
do newClust = 1 by 1 while(not endLoop);
set &arcs end=endLoop;
call missing(clust); node = &from;
if 0^=nodes.find() then nodes.add();
fromClust = clust;
call missing(clust); node = &to;
if 0^=nodes.find() then nodes.add();
toClust = clust;
if n(fromClust, toClust) = 0 then do;
nodes.replace(key:&from, data:&from, data:newClust);
nodes.replace(key:&to, data:&to, data:newClust);
end;
else if missing(toClust) then
nodes.replace(key:&to, data:&to, data:fromClust);
else if missing(fromClust) then
nodes.replace(key:&from, data:&from, data:toClust);
else if fromClust ne toClust then do;
rc = nodeList.first();
do while (rc = 0);
if clust = fromClust then
nodes.replace(key:node, data:node, data:toClust);
rc = nodeList.next();
end;
end;
end;
nodes.output(dataset:"&out");
stop;
run;
%mend SubGraphs;
data have;
informat obs 8. date date9.;
format date date9.;
input OBS DATE CLIENT_ID OLD_CLIENT_ID;
cards;
1 27Sep16 10 6
2 17Jun16 10 4
3 5Mar16 10 2
4 27Sep16 7 5
5 10Sep16 9 8
;
run;
%SubGraphs(have, from=old_client_id, to=client_id, out=want,exp=4);
proc sort data=want;
by clust node;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.