Detect duplicate rows in a dataset

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 15
Accepted Solution

Detect duplicate rows in a dataset

Hello,

I am trying to figure out a way to determine what rows of data are duplicates of each other.  It is easiest to explain my problem through an example.

I have a dataset as follows:

Edge     To     From     Flow     Exposure

1          1          2          10          10

2          2          1          5            10

3          3          4          5             5

4          4          3          5             5

5          5          6          10          10

6          6          5          5             5

The edge value is a unique identifier for each data row.

I am trying to match the data based on the "to" and "from" column where in the case of this sample data set the matching pairs (by "edge" id) are as follows

1,2

3,4

5,6

The "to" and "from" values between edge 1 and 2 are a reciprocal of each other.

I want to figure out what edges are matches for each other and also sum up the flow and exposure value for the matching edges.

What I would like to do is end up with a table as follows

Edge     To     From     Flow     Exposure      match      flow_sum     exposure_sum

1          1          2          10          10               1               15               20

2          2          1          5            10               1               15               20

3          3          4          5             5                2               10               10

4          4          3          5             5                2               10               10

5          5          6          10          10               3                15               15

6          6          5          5             5               3                 15               15

In this case the "match" variable is just identifying what edges are pairs. It doesn't have to be a number, it can be a variable and can start at any value.

Is this possible?

The data set I am working with has about 300,000 rows of data with the majority of them having pairs.

Thank you for the help.

Cheers,

Scott


Accepted Solutions
Solution
‎02-01-2013 02:19 PM
Respected Advisor
Posts: 4,640

Re: Detect duplicate rows in a dataset

It can be done this way :

data test;
input Edge To From Flow Exposure;
datalines;
1          1          2          10          10
2          2          1          5            10
3          3          4          5             5
4          4          3          5             5
5          5          6          10          10
6          6          5          5             5
;

proc sql;
create table pairs as
select
     t1.*,
     catx("-",min(t1.Edge,t2.Edge),max(t1.Edge,t2.Edge)) as match length=13,
     t1.flow+t2.flow as flow_sum,
     t1.exposure+t2.exposure as exposure_sum
from test as t1 inner join test as t2 on
     t1.from=t2.to and t1.to=t2.from;
quit;

PG

PG

View solution in original post


All Replies
Super Contributor
Posts: 578

Re: Detect duplicate rows in a dataset

data have;

input Edge   vTO     vFrom Flow     Exposure;

cards;

1          1          2          10          10

2          2          1          5            10

3          3          4          5             5

4          4          3          5             5

5          5          6          10          10

6          6          5          5             5

;

run;

proc sort in=have out=have;

by vTO;

run;

data matches(keep=matchkey vto vfrom);

length Matchkey 8.;

set have;

by vTo;

retain tmp . MatchKey .;

if _N_=1 then do;

    tmp=vFrom;

    MatchKey=1;

    end;

if vTo eq tmp then do;

    output;

    MatchKey+1;

    end;

tmp=vFrom;

run;

proc sql;

create table totals as

select

    t2.matchkey,

    t2.vto,

    t2.vfrom,

    sum(flow) as flow_sum,

    sum(exposure) as exposure_sum

from

    have t1

    inner join matches t2

        on (t1.vto = t2.vto and t1.vfrom = t2.vfrom )

            or (t1.vto = t2.vfrom and t1.vfrom = t2.vto)

group by t2.matchkey,t2.vto, t2.vfrom;

create table want as

select t1.*, t2.matchkey, t2.flow_Sum, t2.exposure_sum

from

    have t1

    left outer join totals t2

        on (t1.vto = t2.vto and t1.vfrom = t2.vfrom)

            or (t1.vto = t2.vfrom and t1.vfrom = t2.vto);

quit;

                                                                                              12:30 Friday, February 1, 2013   1

                                            v                                             exposure_
                    ObsEdgevTOFromFlowExposureMatchkeyflow_sum   sum

                     1   1   10    10       1      15       20  
                     2   2   1   5    10       1      15       20  
                     3   3   4   5     5       2      10       10  
                     4   4   3   5     5       2      10       10  
                     5   5   10    10       3      15       15  
                     6   6   5   5     5       3      15       15  
Occasional Contributor
Posts: 15

Re: Detect duplicate rows in a dataset

Thank you for the suggestion.  I tried something similar before but was unable to make it work because of the limits of storing the vast number of data in memory.

It would work if I had a smaller dataset

Cheers,

Scott

Solution
‎02-01-2013 02:19 PM
Respected Advisor
Posts: 4,640

Re: Detect duplicate rows in a dataset

It can be done this way :

data test;
input Edge To From Flow Exposure;
datalines;
1          1          2          10          10
2          2          1          5            10
3          3          4          5             5
4          4          3          5             5
5          5          6          10          10
6          6          5          5             5
;

proc sql;
create table pairs as
select
     t1.*,
     catx("-",min(t1.Edge,t2.Edge),max(t1.Edge,t2.Edge)) as match length=13,
     t1.flow+t2.flow as flow_sum,
     t1.exposure+t2.exposure as exposure_sum
from test as t1 inner join test as t2 on
     t1.from=t2.to and t1.to=t2.from;
quit;

PG

PG
Occasional Contributor
Posts: 15

Re: Detect duplicate rows in a dataset

Thank you PG for the suggestion. It works and runs very quick.

I need to tweak it a bit as I seem to get some additional output pairs that shouldn't exist (about 1000 extra)

Thank you

Cheers,

Scott

Super User
Posts: 5,071

Re: Detect duplicate rows in a dataset

Scott,

If there is a match, must the matching observations be consecutive or could they appear from anywhere within the original data set?

Is it possible that an observation will match more than one additional observation?

If the answers are (1) consecutive, and (2) no, some blazingly fast solutions should be possible.

Occasional Contributor
Posts: 15

Re: Detect duplicate rows in a dataset

Hello Astounding,

The matching can appear from anywhere within the original data set and in most cases are not consecutive.

If the data is perfect then one observation should only have match or have no match.

The data is a table representing the typology of a road network so for a two way road there will be a matching set of observations.  One for each direction of travel.  For one way roads there is only one observation representing the direction of travel.

The big issue is that the duplicate observations do not have a common identifier besides the to and from column.

The code provided by PG stats seems to work well and give me the results I was hoping for.

Cheers,

Scott

Respected Advisor
Posts: 4,640

Re: Detect duplicate rows in a dataset

Try preceeding my query with something like :


proc sort data=test out=testNoDup nodupkey; by from to; run;


to get rid of duplicate from-to pairs.

PG

PG
Occasional Contributor
Posts: 15

Re: Detect duplicate rows in a dataset

Hi PGstats.

Yeah that is what I am working on today to see if I can figure out the error in the data and figure out why the duplicates are present.

Thank you again for the help!

Cheers,

Scott

Super User
Posts: 5,071

Re: Detect duplicate rows in a dataset

Scott,

Thanks for the clarification.  On the one hand, it's an interesting problem and I want to get to it using another approach.  On the other hand, it will take more time than I can devote for a few weeks.  I will eventually come up with something (if only for my own learning experience) that will begin with the PGStats suggestion of sorting NODUPKEY.  But it will use a hash table to match ... that's the area where I need more experience.

Occasional Contributor
Posts: 15

Re: Detect duplicate rows in a dataset

Hi Astounding,

I was originally going to use a hash table to see if I could work out this problem but it came down to the issue that the organization of the data limits that functionality of using a hash table and setting up the unique has tag to represent only one set of matching observations.

Good luck with your project!

Cheers,

Scott

Super User
Posts: 5,071

Re: Detect duplicate rows in a dataset

I finally have a little bit of time to come back and program ... here is the version that I came up with.  I left the variables intact and separate from one another, but if you examine the results you can easily see how they can be combined.

data have;
   input edge to from flow exposure;
cards;
1 1 2 10 10
2 2 1  5 10
3 3 4  5  5
4 4 3  5  5
5 5 6 10 10
6 6 5  5  5
7 7 8  5  5
8 1 5 10 10
;

data matches;
   set have;
   rename edge=matching_edge
            to=from
          from=to
          flow=matching_flow
          exposure=matching_exposure;
run;

data combined;
   if _n_=1 then do;
      declare hash h (dataset:  'matches');
      h.definekey('from', 'to');
      h.definedata('matching_edge', 'matching_flow', 'matching_exposure');
      h.definedone();
      call missing (matching_edge, matching_flow, matching_exposure);
   end;
   set have;
   if h.check()=0 then h.find();
run;
proc print;
run;

Respected Advisor
Posts: 3,124

Re: Detect duplicate rows in a dataset

While Astounding is busy, and I happen to have some time in a long time, this is my attempt using Hash(), and I hope you have enough RAM to run it:

data test;

input Edge To From Flow Exposure;

datalines;

1 1 2 10 10

2 2 1 5 10

3 3 4 5 5

4 4 3 5 5

5 5 6 10 10

6 6 5 5 5

;

data want;

  if _n_=1 then do;

  if 0 then set test(rename=(flow=_f exposure=_e to=_to from=_from edge=_edge) keep=edge to from flow exposure);

  declare hash h(dataset:'test(rename=(flow=_f exposure=_e to=_to from=_from edge=_edge) keep=edge to from flow exposure)', multidata:'y');

h.definekey('_from');

h.definedata('_f', '_e', '_to','_edge');

h.definedone();

end;

set test;

length group $20.;

  if h.find(key:to)=0 and _to=from then do;

  flow_sum=sum(flow,_f);

  exposure_sum=sum(exposure,_e);

  group=ifc(edge<_edge, catx('_',edge, _edge),catx('_',_edge,edge));

  end;

  drop _:;

run;

Haikuo

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 403 views
  • 6 likes
  • 5 in conversation