BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sbennet
Calcite | Level 5

Hello,

I am trying to figure out a way to determine what rows of data are duplicates of each other.  It is easiest to explain my problem through an example.

I have a dataset as follows:

Edge     To     From     Flow     Exposure

1          1          2          10          10

2          2          1          5            10

3          3          4          5             5

4          4          3          5             5

5          5          6          10          10

6          6          5          5             5

The edge value is a unique identifier for each data row.

I am trying to match the data based on the "to" and "from" column where in the case of this sample data set the matching pairs (by "edge" id) are as follows

1,2

3,4

5,6

The "to" and "from" values between edge 1 and 2 are a reciprocal of each other.

I want to figure out what edges are matches for each other and also sum up the flow and exposure value for the matching edges.

What I would like to do is end up with a table as follows

Edge     To     From     Flow     Exposure      match      flow_sum     exposure_sum

1          1          2          10          10               1               15               20

2          2          1          5            10               1               15               20

3          3          4          5             5                2               10               10

4          4          3          5             5                2               10               10

5          5          6          10          10               3                15               15

6          6          5          5             5               3                 15               15

In this case the "match" variable is just identifying what edges are pairs. It doesn't have to be a number, it can be a variable and can start at any value.

Is this possible?

The data set I am working with has about 300,000 rows of data with the majority of them having pairs.

Thank you for the help.

Cheers,

Scott

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

It can be done this way :

data test;
input Edge To From Flow Exposure;
datalines;
1          1          2          10          10
2          2          1          5            10
3          3          4          5             5
4          4          3          5             5
5          5          6          10          10
6          6          5          5             5
;

proc sql;
create table pairs as
select
     t1.*,
     catx("-",min(t1.Edge,t2.Edge),max(t1.Edge,t2.Edge)) as match length=13,
     t1.flow+t2.flow as flow_sum,
     t1.exposure+t2.exposure as exposure_sum
from test as t1 inner join test as t2 on
     t1.from=t2.to and t1.to=t2.from;
quit;

PG

PG

View solution in original post

12 REPLIES 12
DBailey
Lapis Lazuli | Level 10

data have;

input Edge   vTO     vFrom Flow     Exposure;

cards;

1          1          2          10          10

2          2          1          5            10

3          3          4          5             5

4          4          3          5             5

5          5          6          10          10

6          6          5          5             5

;

run;

proc sort in=have out=have;

by vTO;

run;

data matches(keep=matchkey vto vfrom);

length Matchkey 8.;

set have;

by vTo;

retain tmp . MatchKey .;

if _N_=1 then do;

    tmp=vFrom;

    MatchKey=1;

    end;

if vTo eq tmp then do;

    output;

    MatchKey+1;

    end;

tmp=vFrom;

run;

proc sql;

create table totals as

select

    t2.matchkey,

    t2.vto,

    t2.vfrom,

    sum(flow) as flow_sum,

    sum(exposure) as exposure_sum

from

    have t1

    inner join matches t2

        on (t1.vto = t2.vto and t1.vfrom = t2.vfrom )

            or (t1.vto = t2.vfrom and t1.vfrom = t2.vto)

group by t2.matchkey,t2.vto, t2.vfrom;

create table want as

select t1.*, t2.matchkey, t2.flow_Sum, t2.exposure_sum

from

    have t1

    left outer join totals t2

        on (t1.vto = t2.vto and t1.vfrom = t2.vfrom)

            or (t1.vto = t2.vfrom and t1.vfrom = t2.vto);

quit;

                                                                                              12:30 Friday, February 1, 2013   1

                                            v                                             exposure_
                    ObsEdgevTOFromFlowExposureMatchkeyflow_sum   sum

                     1   1   10    10       1      15       20  
                     2   2   1   5    10       1      15       20  
                     3   3   4   5     5       2      10       10  
                     4   4   3   5     5       2      10       10  
                     5   5   10    10       3      15       15  
                     6   6   5   5     5       3      15       15  
sbennet
Calcite | Level 5

Thank you for the suggestion.  I tried something similar before but was unable to make it work because of the limits of storing the vast number of data in memory.

It would work if I had a smaller dataset

Cheers,

Scott

PGStats
Opal | Level 21

It can be done this way :

data test;
input Edge To From Flow Exposure;
datalines;
1          1          2          10          10
2          2          1          5            10
3          3          4          5             5
4          4          3          5             5
5          5          6          10          10
6          6          5          5             5
;

proc sql;
create table pairs as
select
     t1.*,
     catx("-",min(t1.Edge,t2.Edge),max(t1.Edge,t2.Edge)) as match length=13,
     t1.flow+t2.flow as flow_sum,
     t1.exposure+t2.exposure as exposure_sum
from test as t1 inner join test as t2 on
     t1.from=t2.to and t1.to=t2.from;
quit;

PG

PG
sbennet
Calcite | Level 5

Thank you PG for the suggestion. It works and runs very quick.

I need to tweak it a bit as I seem to get some additional output pairs that shouldn't exist (about 1000 extra)

Thank you

Cheers,

Scott

Astounding
PROC Star

Scott,

If there is a match, must the matching observations be consecutive or could they appear from anywhere within the original data set?

Is it possible that an observation will match more than one additional observation?

If the answers are (1) consecutive, and (2) no, some blazingly fast solutions should be possible.

sbennet
Calcite | Level 5

Hello Astounding,

The matching can appear from anywhere within the original data set and in most cases are not consecutive.

If the data is perfect then one observation should only have match or have no match.

The data is a table representing the typology of a road network so for a two way road there will be a matching set of observations.  One for each direction of travel.  For one way roads there is only one observation representing the direction of travel.

The big issue is that the duplicate observations do not have a common identifier besides the to and from column.

The code provided by PG stats seems to work well and give me the results I was hoping for.

Cheers,

Scott

PGStats
Opal | Level 21

Try preceeding my query with something like :


proc sort data=test out=testNoDup nodupkey; by from to; run;


to get rid of duplicate from-to pairs.

PG

PG
sbennet
Calcite | Level 5

Hi PGstats.

Yeah that is what I am working on today to see if I can figure out the error in the data and figure out why the duplicates are present.

Thank you again for the help!

Cheers,

Scott

Astounding
PROC Star

Scott,

Thanks for the clarification.  On the one hand, it's an interesting problem and I want to get to it using another approach.  On the other hand, it will take more time than I can devote for a few weeks.  I will eventually come up with something (if only for my own learning experience) that will begin with the PGStats suggestion of sorting NODUPKEY.  But it will use a hash table to match ... that's the area where I need more experience.

sbennet
Calcite | Level 5

Hi Astounding,

I was originally going to use a hash table to see if I could work out this problem but it came down to the issue that the organization of the data limits that functionality of using a hash table and setting up the unique has tag to represent only one set of matching observations.

Good luck with your project!

Cheers,

Scott

Astounding
PROC Star

I finally have a little bit of time to come back and program ... here is the version that I came up with.  I left the variables intact and separate from one another, but if you examine the results you can easily see how they can be combined.

data have;
   input edge to from flow exposure;
cards;
1 1 2 10 10
2 2 1  5 10
3 3 4  5  5
4 4 3  5  5
5 5 6 10 10
6 6 5  5  5
7 7 8  5  5
8 1 5 10 10
;

data matches;
   set have;
   rename edge=matching_edge
            to=from
          from=to
          flow=matching_flow
          exposure=matching_exposure;
run;

data combined;
   if _n_=1 then do;
      declare hash h (dataset:  'matches');
      h.definekey('from', 'to');
      h.definedata('matching_edge', 'matching_flow', 'matching_exposure');
      h.definedone();
      call missing (matching_edge, matching_flow, matching_exposure);
   end;
   set have;
   if h.check()=0 then h.find();
run;
proc print;
run;

Haikuo
Onyx | Level 15

While Astounding is busy, and I happen to have some time in a long time, this is my attempt using Hash(), and I hope you have enough RAM to run it:

data test;

input Edge To From Flow Exposure;

datalines;

1 1 2 10 10

2 2 1 5 10

3 3 4 5 5

4 4 3 5 5

5 5 6 10 10

6 6 5 5 5

;

data want;

  if _n_=1 then do;

  if 0 then set test(rename=(flow=_f exposure=_e to=_to from=_from edge=_edge) keep=edge to from flow exposure);

  declare hash h(dataset:'test(rename=(flow=_f exposure=_e to=_to from=_from edge=_edge) keep=edge to from flow exposure)', multidata:'y');

h.definekey('_from');

h.definedata('_f', '_e', '_to','_edge');

h.definedone();

end;

set test;

length group $20.;

  if h.find(key:to)=0 and _to=from then do;

  flow_sum=sum(flow,_f);

  exposure_sum=sum(exposure,_e);

  group=ifc(edge<_edge, catx('_',edge, _edge),catx('_',_edge,edge));

  end;

  drop _:;

run;

Haikuo

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 3620 views
  • 6 likes
  • 5 in conversation