Hello,
I am trying to figure out a way to determine what rows of data are duplicates of each other. It is easiest to explain my problem through an example.
I have a dataset as follows:
Edge To From Flow Exposure
1 1 2 10 10
2 2 1 5 10
3 3 4 5 5
4 4 3 5 5
5 5 6 10 10
6 6 5 5 5
The edge value is a unique identifier for each data row.
I am trying to match the data based on the "to" and "from" column where in the case of this sample data set the matching pairs (by "edge" id) are as follows
1,2
3,4
5,6
The "to" and "from" values between edge 1 and 2 are a reciprocal of each other.
I want to figure out what edges are matches for each other and also sum up the flow and exposure value for the matching edges.
What I would like to do is end up with a table as follows
Edge To From Flow Exposure match flow_sum exposure_sum
1 1 2 10 10 1 15 20
2 2 1 5 10 1 15 20
3 3 4 5 5 2 10 10
4 4 3 5 5 2 10 10
5 5 6 10 10 3 15 15
6 6 5 5 5 3 15 15
In this case the "match" variable is just identifying what edges are pairs. It doesn't have to be a number, it can be a variable and can start at any value.
Is this possible?
The data set I am working with has about 300,000 rows of data with the majority of them having pairs.
Thank you for the help.
Cheers,
Scott
It can be done this way :
data test;
input Edge To From Flow Exposure;
datalines;
1 1 2 10 10
2 2 1 5 10
3 3 4 5 5
4 4 3 5 5
5 5 6 10 10
6 6 5 5 5
;
proc sql;
create table pairs as
select
t1.*,
catx("-",min(t1.Edge,t2.Edge),max(t1.Edge,t2.Edge)) as match length=13,
t1.flow+t2.flow as flow_sum,
t1.exposure+t2.exposure as exposure_sum
from test as t1 inner join test as t2 on
t1.from=t2.to and t1.to=t2.from;
quit;
PG
data have;
input Edge vTO vFrom Flow Exposure;
cards;
1 1 2 10 10
2 2 1 5 10
3 3 4 5 5
4 4 3 5 5
5 5 6 10 10
6 6 5 5 5
;
run;
proc sort in=have out=have;
by vTO;
run;
data matches(keep=matchkey vto vfrom);
length Matchkey 8.;
set have;
by vTo;
retain tmp . MatchKey .;
if _N_=1 then do;
tmp=vFrom;
MatchKey=1;
end;
if vTo eq tmp then do;
output;
MatchKey+1;
end;
tmp=vFrom;
run;
proc sql;
create table totals as
select
t2.matchkey,
t2.vto,
t2.vfrom,
sum(flow) as flow_sum,
sum(exposure) as exposure_sum
from
have t1
inner join matches t2
on (t1.vto = t2.vto and t1.vfrom = t2.vfrom )
or (t1.vto = t2.vfrom and t1.vfrom = t2.vto)
group by t2.matchkey,t2.vto, t2.vfrom;
create table want as
select t1.*, t2.matchkey, t2.flow_Sum, t2.exposure_sum
from
have t1
left outer join totals t2
on (t1.vto = t2.vto and t1.vfrom = t2.vfrom)
or (t1.vto = t2.vfrom and t1.vfrom = t2.vto);
quit;
12:30 Friday, February 1, 2013 1 |
v | exposure_ | ||||||||
Obs | Edge | vTO | From | Flow | Exposure | Matchkey | flow_sum | sum |
1 | 1 | 1 | 2 | 10 | 10 | 1 | 15 | 20 | |
2 | 2 | 2 | 1 | 5 | 10 | 1 | 15 | 20 | |
3 | 3 | 3 | 4 | 5 | 5 | 2 | 10 | 10 | |
4 | 4 | 4 | 3 | 5 | 5 | 2 | 10 | 10 | |
5 | 5 | 5 | 6 | 10 | 10 | 3 | 15 | 15 | |
6 | 6 | 6 | 5 | 5 | 5 | 3 | 15 | 15 |
Thank you for the suggestion. I tried something similar before but was unable to make it work because of the limits of storing the vast number of data in memory.
It would work if I had a smaller dataset
Cheers,
Scott
It can be done this way :
data test;
input Edge To From Flow Exposure;
datalines;
1 1 2 10 10
2 2 1 5 10
3 3 4 5 5
4 4 3 5 5
5 5 6 10 10
6 6 5 5 5
;
proc sql;
create table pairs as
select
t1.*,
catx("-",min(t1.Edge,t2.Edge),max(t1.Edge,t2.Edge)) as match length=13,
t1.flow+t2.flow as flow_sum,
t1.exposure+t2.exposure as exposure_sum
from test as t1 inner join test as t2 on
t1.from=t2.to and t1.to=t2.from;
quit;
PG
Thank you PG for the suggestion. It works and runs very quick.
I need to tweak it a bit as I seem to get some additional output pairs that shouldn't exist (about 1000 extra)
Thank you
Cheers,
Scott
Scott,
If there is a match, must the matching observations be consecutive or could they appear from anywhere within the original data set?
Is it possible that an observation will match more than one additional observation?
If the answers are (1) consecutive, and (2) no, some blazingly fast solutions should be possible.
Hello Astounding,
The matching can appear from anywhere within the original data set and in most cases are not consecutive.
If the data is perfect then one observation should only have match or have no match.
The data is a table representing the typology of a road network so for a two way road there will be a matching set of observations. One for each direction of travel. For one way roads there is only one observation representing the direction of travel.
The big issue is that the duplicate observations do not have a common identifier besides the to and from column.
The code provided by PG stats seems to work well and give me the results I was hoping for.
Cheers,
Scott
Try preceeding my query with something like :
proc sort data=test out=testNoDup nodupkey; by from to; run;
to get rid of duplicate from-to pairs.
PG
Hi PGstats.
Yeah that is what I am working on today to see if I can figure out the error in the data and figure out why the duplicates are present.
Thank you again for the help!
Cheers,
Scott
Scott,
Thanks for the clarification. On the one hand, it's an interesting problem and I want to get to it using another approach. On the other hand, it will take more time than I can devote for a few weeks. I will eventually come up with something (if only for my own learning experience) that will begin with the PGStats suggestion of sorting NODUPKEY. But it will use a hash table to match ... that's the area where I need more experience.
Hi Astounding,
I was originally going to use a hash table to see if I could work out this problem but it came down to the issue that the organization of the data limits that functionality of using a hash table and setting up the unique has tag to represent only one set of matching observations.
Good luck with your project!
Cheers,
Scott
I finally have a little bit of time to come back and program ... here is the version that I came up with. I left the variables intact and separate from one another, but if you examine the results you can easily see how they can be combined.
data have;
input edge to from flow exposure;
cards;
1 1 2 10 10
2 2 1 5 10
3 3 4 5 5
4 4 3 5 5
5 5 6 10 10
6 6 5 5 5
7 7 8 5 5
8 1 5 10 10
;
data matches;
set have;
rename edge=matching_edge
to=from
from=to
flow=matching_flow
exposure=matching_exposure;
run;
data combined;
if _n_=1 then do;
declare hash h (dataset: 'matches');
h.definekey('from', 'to');
h.definedata('matching_edge', 'matching_flow', 'matching_exposure');
h.definedone();
call missing (matching_edge, matching_flow, matching_exposure);
end;
set have;
if h.check()=0 then h.find();
run;
proc print;
run;
While Astounding is busy, and I happen to have some time in a long time, this is my attempt using Hash(), and I hope you have enough RAM to run it:
data test;
input Edge To From Flow Exposure;
datalines;
1 1 2 10 10
2 2 1 5 10
3 3 4 5 5
4 4 3 5 5
5 5 6 10 10
6 6 5 5 5
;
data want;
if _n_=1 then do;
if 0 then set test(rename=(flow=_f exposure=_e to=_to from=_from edge=_edge) keep=edge to from flow exposure);
declare hash h(dataset:'test(rename=(flow=_f exposure=_e to=_to from=_from edge=_edge) keep=edge to from flow exposure)', multidata:'y');
h.definekey('_from');
h.definedata('_f', '_e', '_to','_edge');
h.definedone();
end;
set test;
length group $20.;
if h.find(key:to)=0 and _to=from then do;
flow_sum=sum(flow,_f);
exposure_sum=sum(exposure,_e);
group=ifc(edge<_edge, catx('_',edge, _edge),catx('_',_edge,edge));
end;
drop _:;
run;
Haikuo
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.