How to check duplicate customer IDs using SAS

Accepted Solution Solved
Reply
Regular Contributor
Posts: 196
Accepted Solution

How to check duplicate customer IDs using SAS

[ Edited ]

I have a dataset in sas like this:

 

ID1 ID2

11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43

 

I want to check for ID1 vs ID2 assignent, and find duplicate IDs, e.g. 11 is assigned to 22 and then 11>>34,22 >>35, 35>>9,9>>65 and 34 >> 43 . So all these IDs should be mapped to single ID, say 11(we can choose any of them). I want to create a 3rd variable, where I will assign all values to a single ID say ID_11

 

so Final output will be like:

ID1   ID2    ID3

11     22     ID_11

11     34     ID_11

22     35     ID_11

35     9       ID_11

41     10     ID_10

52     87     ID_87

9       65     ID_11

34     43     ID_11

 

I found out that, its a disjoint set problem. how to do this using sas?


Accepted Solutions
Solution
‎12-18-2017 06:26 AM
Trusted Advisor
Posts: 1,311

Re: How to check duplicate customer IDs using SAS

Posted in reply to munitech4u

If you know that all your id values are integers in some range (say from 1 through 100), you actually can do this with an array:

 

data have;
  input ID1 ID2;
datalines;
11 22
11 34
22 35
35  9
41 10
52 87
9  65
34 43
run;
data want;
  array src {100} _temporary_;
  if _n_=1 then do until (eod);
    set have end=eod;
    src{id2}=id1;
  end;

  set have ;
  source=id1;
  do Nlinks=1 by 1 while (src{source}^=.);
    source=src{source};
  end;
run;

 

Caveat:  This depends on each sequence of links going to a final source.  I.e. no collections of links forms a repeating cycle.  The program would require some minor changes to deal with cycles.

 

Notes:

  1. The 1-dimensional array SRC has the value of ID1 corresponding to the ID2 element of the array. 

  2. For each pair ID1 is treated as the source, and ID2 as the sink.  The program iteratively finds the source of a source, until an id without a source is encountered.  For example, SRC{11} is a missing value, even though SRC{22}=11  (as well as others).
  3. The variable Nlinks is the number of links required to go from ID2 to the ultimate source.

View solution in original post


All Replies
Respected Advisor
Posts: 3,167

Re: How to check duplicate customer IDs using SAS

[ Edited ]
Posted in reply to munitech4u

This can be solved using Hash, however, make sure you don't have infinite loops, if you do, set a loop limit (say the total number of your obs), when reached, exit the loop regardless.

 

data have;
input ID1 ID2
;
cards;
11     22
22     35
35     9
41     10
52     87
9       65
;

run;

data want;
if _n_=1 then do;
dcl hash h(dataset:'work.have');
h.definekey('id1');
h.definedata(all:'y');
h.definedone();
call missing(id1, id2);
end;

set have(rename=(id1=_id1 id2=_id2));
length id3 $20.;
rc=0;
id1=_id1;
do while (rc=0);
rc=h.find();
id1=id2;
end;
id3=catx('_', 'ID', PUT(id1,BEST.));
KEEP ID3 _id:;
run;
Regular Contributor
Posts: 196

Re: How to check duplicate customer IDs using SAS

[ Edited ]

sorry, There could be duplicate mapping in ID1, I updated the example case

Super User
Posts: 6,629

Re: How to check duplicate customer IDs using SAS

[ Edited ]
Posted in reply to munitech4u

Can we make these assumptions?

 

  1. ID1 and ID2 are character
  2. ID1 maps to a single ID2 value only, never to two different ID2 values

If that's the case, the programming is easy enough:

 

data temp;

set have;

fmtname='$test';

rename id1=start id2=label;

run;

 

proc format cntlin=temp;

run;

 

data want;

set have;

length id3 id4 $ 5;

id3 = put(id1, $test.);

id4 = id3;

do i=1 to 500 until (id3 = id4);

   if id3 ne id4 then id3 = id4;

   id4 = put(id3, $test.);

end;

id3 = 'ID_' || id3;

drop i  id4;

run;

 

EDITED to add the missing lines.

Regular Contributor
Posts: 196

Re: How to check duplicate customer IDs using SAS

Posted in reply to Astounding
sorry, There could be duplicate mapping in ID1, I updated the example case,
Super User
Posts: 10,689

Re: How to check duplicate customer IDs using SAS

Posted in reply to munitech4u

data have;
infile cards ;
input from $  to $ ;
cards;
11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
Regular Contributor
Posts: 196

Re: How to check duplicate customer IDs using SAS

Thanks for your effort. Is there any alternative where we can avoid declaring hash?
Super User
Posts: 10,689

Re: How to check duplicate customer IDs using SAS

Posted in reply to munitech4u

Sorry. I am afraid you have to use Hash Table to search tree.

Or you could write some SAS/OR code, but you need SAS/OR.

And post your question at OR forum, @RobPratt is there .

Regular Contributor
Posts: 196

Re: How to check duplicate customer IDs using SAS

No, we don;t have SAS/OR. Can it be done using arrays? I saw a solution, but not sure how that works:

 

 

data unique_id;
      set x end = last;
      /*Create an array to store mapping and intitialize array*/
      array mapping{&n_obs,2} _temporary_;
      if _N_ = 1 then
      do;
      do i = 1 to &n_obs.;
                  
            mapping[i,1] = i;
            mapping[i,2] = i;
      end;
      end;
      /*Create an array to store pr_row_num and row_num*/
      array lookup{2} row_num pr_row_num;
        orig_val1 = mapping[lookup[1],2];
      orig_val2 = mapping[lookup[2],2];
      max_val = max(row_num,pr_row_num);
      replaced_val = min(lookup[1],lookup[2],mapping[lookup[1],2],mapping[lookup[2],2]);
      /*Loop through the mapping file and change orig_val1 or orig_val2 in the mapping array to replaced_val*/
      do i = 1 to &n_obs.;
            if mapping[i,2] = orig_val1 or mapping[i,2] = orig_val2 then
            mapping[i,2] = replaced_val;
      end;

      if last;
      do i = 1 to &n_obs.;
            key1 = mapping[i,1];
            key2 = mapping[i,2];
            output;
      end;
run;

 

Super User
Posts: 10,689

Re: How to check duplicate customer IDs using SAS

Posted in reply to munitech4u

Sorry. As far as I know, must Hash Table, no choice.

Or you could switch into SAS/OR ,just as @RobPratt said.

SAS Employee
Posts: 552

Re: How to check duplicate customer IDs using SAS

This functionality is called "connected components" and is available in:

  • PROC OPTNET or PROC OPTMODEL in SAS/OR
  • PROC OPTGRAPH in SAS Social Network Analysis
  • PROC NETWORK in SAS Visual Data Mining and Machine Learning (Viya)
  • PROC OPTNETWORK or PROC OPTMODEL in SAS Optimization (Viya)
Solution
‎12-18-2017 06:26 AM
Trusted Advisor
Posts: 1,311

Re: How to check duplicate customer IDs using SAS

Posted in reply to munitech4u

If you know that all your id values are integers in some range (say from 1 through 100), you actually can do this with an array:

 

data have;
  input ID1 ID2;
datalines;
11 22
11 34
22 35
35  9
41 10
52 87
9  65
34 43
run;
data want;
  array src {100} _temporary_;
  if _n_=1 then do until (eod);
    set have end=eod;
    src{id2}=id1;
  end;

  set have ;
  source=id1;
  do Nlinks=1 by 1 while (src{source}^=.);
    source=src{source};
  end;
run;

 

Caveat:  This depends on each sequence of links going to a final source.  I.e. no collections of links forms a repeating cycle.  The program would require some minor changes to deal with cycles.

 

Notes:

  1. The 1-dimensional array SRC has the value of ID1 corresponding to the ID2 element of the array. 

  2. For each pair ID1 is treated as the source, and ID2 as the sink.  The program iteratively finds the source of a source, until an id without a source is encountered.  For example, SRC{11} is a missing value, even though SRC{22}=11  (as well as others).
  3. The variable Nlinks is the number of links required to go from ID2 to the ultimate source.
Regular Contributor
Posts: 196

Re: How to check duplicate customer IDs using SAS

Thanks. The solution is simply elegant and simple
Trusted Advisor
Posts: 1,311

Re: How to check duplicate customer IDs using SAS

[ Edited ]
Posted in reply to munitech4u

Edited on 12/15.  Thanks to @Matthew_Galati_SAS, I've revised the program and my comments.  Please ignore all the text is strikethrough font.

 

If you have SAS/OR, you can get this directly by asking PROC OPTNET for "connected components".  It becomes a trivial task:

 

data have;
  input ID1 ID2;
  retain w 1;
datalines;
11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43
run;


proc optnet data_links=have (rename=(id1=from id2=to w=weight))
      graph_direction=undirected out_nodes=nodes ;
  concomp;
shortpath out_weights=shdist; run; proc sql; create table want as select h.*, n.concomp from have as h left join nodes as n on h.id1=n.node; quit;

 

Notes:

  1. I had to add a "weight" variable (var w) to dataset HAVE, which proc optnet uses as the "distance" between ID1 and ID2 (optnet is commonly used to generate the shortest distances between all pairs of connected nodes (id's in your case).  As long as W is a non-missing numeric value, it works for your problem, because you care about connectivity, not distance.  Of course if your constant W value is 1, then the "distance" is the number of hopes from ID2 to ID2.

    Because, in the absence of a weight variable (thanks @Matthew_Galati_SAS), the default "distance" from ID2 to ID1 is taken as 1.  As a result the variable path_weight in SHDIST is just the number of hops from ID2 to SOURCE.  Here's what the 1st 20 obs in SHDIST data set looks like:

                     path_

    Obs source sink weight

     1    11    22    1

     2    11    34    1

     3    11    35    2

     4    11     9    3

     5    11    65    4

     6    11    43    2

     7    22    11    1

     8    22    34    2

     9    22    35    1

    10    22     9    2

    11    22    65    3

    12    22    43    3

    13    34    11    1

    14    34    22    2

    15    34    35    3

    16    34     9    4

    17    34    65    5

    18    34    43    1

    19    35    11    2

    20    35    22    1

  2. The "concomp" statement asks proc optnet to assign each node (id) to a connected-component identifier.  In your case here's, what the "nodes" dataset looks like:

    Obs node concomp

     1   11    1

     2   22    1

     3   34    1

     4   35    1

     5    9    1

     6   41    2

     7   10    2

     8   52    3

     9   87    3

    10   65    1

    11   43    1

  3. And the resulting dataset WANT:

    Obs    ID1    ID2    w    concomp
    
     1       9     65    1       1
     2      11     22    1       1
     3      11     34    1       1
     4      22     35    1       1
     5      34     43    1       1
     6      35      9    1       1
     7      41     10    1       2
     8      52     87    1       3
    
    
    
    
    
    
    
    
    
    
    
SAS Employee
Posts: 49

Re: How to check duplicate customer IDs using SAS

You do not need to add the weight variable for optnet shortest path to get the 'level'. Unweighted graphs assume unit weights on all edges.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 14 replies
  • 388 views
  • 4 likes
  • 7 in conversation