BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.

Hi all,

I have a dataset with three columns with this kind of information:

 

OBS DATE CLIENT_ID OLD_CLIENT_ID
1 27-Sep-16 10 6
2 27-Sep-16 7 5
3 17-Jun-16 6 4
4 10-Sep-16 9 8
5 5-Mar-16 4 2

 

Where "date" column is the date where some client has made a change in his account id, and the other columns are the respective values at that moment.

As you can see in the example dataset posted, many (and unknown) number of relationships could exist in the data history. For example, the client "10" was before client "6" (obs 1) but before was client "4" (obs 3) and before of that was client "2".

Therefore, what I want is a final dataset with the client_id-old_client_id pairs taking account of what was said before, which will be like this:

 

OBS DATE CLIENT_ID OLD_CLIENT_ID
1 27-Sep-16 10 6
2 17-Jun-16 10 4
3 5-Mar-16 10 2
4 27-Sep-16 7 5
5 10-Sep-16 9 8

 

The dataset has ~ 4.000.000 obs.

There is no other variable to identify the client uniquely.

 

Is there a way to accomplish this without using reflexive joins? As I said, number of relationships are unknown so making reflexive joins could be worst-performing.

 

Thanks in advance!

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

There are a couple of solved questions on here with different HASH solutions to this problem.

 

This is based on a macro written by PGStats. You didn't show what you want as your final results, but this will get you closer, if its not what you want as an end result. Note the last parameter controls how 'deep' the search goes so you do need some idea or set it to an arbitrarily high number.

 

%macro SubGraphs(arcs,from=from,to=to,out=Clusters,exp=8);
data _null_;
if 0 then set &arcs(keep=&from rename=(&from=node)); /* get node data type */
length clust 8;
declare hash nodes(hashexp:&exp);
nodes.defineKey('node');
nodes.defineData('node', 'clust');
nodes.defineDone();
declare hiter nodeList('nodes');

do newClust = 1 by 1 while(not endLoop);
	set &arcs end=endLoop;
	call missing(clust); node = &from;
	if 0^=nodes.find() then nodes.add(); 
	fromClust = clust; 
	call missing(clust); node = &to;
	if 0^=nodes.find() then nodes.add(); 
	toClust = clust;
	if n(fromClust, toClust) = 0 then do;
		nodes.replace(key:&from, data:&from, data:newClust);
		nodes.replace(key:&to, data:&to, data:newClust);
	end;
	else if missing(toClust) then 
		nodes.replace(key:&to, data:&to, data:fromClust);
	else if missing(fromClust) then 
		nodes.replace(key:&from, data:&from, data:toClust);
	else if fromClust ne toClust then do;
		rc = nodeList.first();
		do while (rc = 0);
			if clust = fromClust then 
				nodes.replace(key:node, data:node, data:toClust);
			rc = nodeList.next();
		end;
	end;
end;
nodes.output(dataset:"&out");
stop;
run;
%mend SubGraphs;



data have;
informat obs 8. date date9.;
format date date9.;
input  OBS	DATE	CLIENT_ID	OLD_CLIENT_ID;
cards;
1	27Sep16	10	6
2	17Jun16	10	4
3	5Mar16	10	2
4	27Sep16	7	5
5	10Sep16	9	8
;
run;

%SubGraphs(have, from=old_client_id, to=client_id, out=want,exp=4);

proc sort data=want;
by clust node;
run;

View solution in original post

7 REPLIES 7
lmignone
SAS Employee

Hi PGStats,

the expected result is not the same, as the output of the solved answer were 3 observations, meanwhile, for this question, the expected output has to got 5 observations.

 

Reeza
Super User

@lmignone wrote:

Hi PGStats,

the expected result is not the same, as the output of the solved answer were 3 observations, meanwhile, for this question, the expected output has to got 5 observations.

 


I'm confused. How does the 3/5 change the situation?

Those solutions don't seem dependent on the depth of the linkage. Can you explain with an example of how it doesn't work?

lmignone
SAS Employee

In the previous solution, the expected output was just the oldest client id for each client_id:

 

OBS

CLIENT_ID

OLD_CLIENT_ID
1 10 2
2 7 5
3 9 8

 

Meanwhile, in this question I'm asking for all the relationships that, for example, client_id 10 has in the input data. The expected output is this:

 

OBS DATE CLIENT_ID OLD_CLIENT_ID
1 27-Sep-16 10 6
2 17-Jun-16 10 4
3 5-Mar-16 10 2
4 27-Sep-16 7 5
5 10-Sep-16 9 8

 

Note that I don't need the relationship between 6-4 or between 4-2, as those ones has a newer client ID (10).

Reeza
Super User

So the solution I provided doesn't work for you?

lmignone
SAS Employee

It needs a further transformation in order to reach the desired result, but it works...

 

Thanks Reeza!

Reeza
Super User

There are a couple of solved questions on here with different HASH solutions to this problem.

 

This is based on a macro written by PGStats. You didn't show what you want as your final results, but this will get you closer, if its not what you want as an end result. Note the last parameter controls how 'deep' the search goes so you do need some idea or set it to an arbitrarily high number.

 

%macro SubGraphs(arcs,from=from,to=to,out=Clusters,exp=8);
data _null_;
if 0 then set &arcs(keep=&from rename=(&from=node)); /* get node data type */
length clust 8;
declare hash nodes(hashexp:&exp);
nodes.defineKey('node');
nodes.defineData('node', 'clust');
nodes.defineDone();
declare hiter nodeList('nodes');

do newClust = 1 by 1 while(not endLoop);
	set &arcs end=endLoop;
	call missing(clust); node = &from;
	if 0^=nodes.find() then nodes.add(); 
	fromClust = clust; 
	call missing(clust); node = &to;
	if 0^=nodes.find() then nodes.add(); 
	toClust = clust;
	if n(fromClust, toClust) = 0 then do;
		nodes.replace(key:&from, data:&from, data:newClust);
		nodes.replace(key:&to, data:&to, data:newClust);
	end;
	else if missing(toClust) then 
		nodes.replace(key:&to, data:&to, data:fromClust);
	else if missing(fromClust) then 
		nodes.replace(key:&from, data:&from, data:toClust);
	else if fromClust ne toClust then do;
		rc = nodeList.first();
		do while (rc = 0);
			if clust = fromClust then 
				nodes.replace(key:node, data:node, data:toClust);
			rc = nodeList.next();
		end;
	end;
end;
nodes.output(dataset:"&out");
stop;
run;
%mend SubGraphs;



data have;
informat obs 8. date date9.;
format date date9.;
input  OBS	DATE	CLIENT_ID	OLD_CLIENT_ID;
cards;
1	27Sep16	10	6
2	17Jun16	10	4
3	5Mar16	10	2
4	27Sep16	7	5
5	10Sep16	9	8
;
run;

%SubGraphs(have, from=old_client_id, to=client_id, out=want,exp=4);

proc sort data=want;
by clust node;
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1808 views
  • 0 likes
  • 3 in conversation