DATA Step, Macro, Functions and more

identifying duplicates values across variables

Reply
Contributor
Posts: 23

identifying duplicates values across variables

I have a list of 53 variables. And I need to flag if any of them equal each other when they aren't missing. Does anyone know an efficient way to do this? I could do this 1 at time, but this would mean potentially 1431 statements of

if Pathogen_1_961712 not = "" and Pathogen_1_961712 = Pathogen_2_961712 then poscult = 1;

the variable names are listed

Pathogen_1_242396_10
Pathogen_1_242396_11
Pathogen_1_242396_2
Pathogen_1_242396_3
Pathogen_1_242396_4
Pathogen_1_242396_5
Pathogen_1_242396_6
Pathogen_1_242396_7
Pathogen_1_242396_8
Pathogen_1_242396_9
Pathogen_2_242396
Pathogen_2_242396_10
Pathogen_2_242396_11
Pathogen_2_242396_2
Pathogen_2_242396_3
Pathogen_2_242396_4
Pathogen_2_242396_5
Pathogen_2_242396_6
Pathogen_2_242396_7
Pathogen_2_242396_8
Pathogen_2_242396_9
APathogen_1_242396
APathogen_1_242396_10
APathogen_1_242396_11
APathogen_1_242396_2
APathogen_1_242396_3
APathogen_1_242396_4
APathogen_1_242396_5
APathogen_1_242396_6
APathogen_1_242396_7
APathogen_1_242396_8
APathogen_1_242396_9
BPathogen_1_242396
BPathogen_1_242396_10
BPathogen_1_242396_11
BPathogen_1_242396_2
BPathogen_1_242396_3
BPathogen_1_242396_4
BPathogen_1_242396_5
BPathogen_1_242396_6
BPathogen_1_242396_7
BPathogen_1_242396_8
BPathogen_1_242396_9
CPathogen_1_242396
CPathogen_1_242396_10
CPathogen_1_242396_11
CPathogen_1_242396_2
CPathogen_1_242396_3
CPathogen_1_242396_4
CPathogen_1_242396_5
CPathogen_1_242396_6
CPathogen_1_242396_7
CPathogen_1_242396_8
CPathogen_1_242396_9
Respected Advisor
Posts: 3,156

Re: identifying duplicates values across variables

Something like the following may get you started:

data have;

     input (t1-t4) (:$1.);

     cards;

1 2 3 4

1 2 3 2

run;

data want;

     set have;

     dup_flag=0;

     array t(4) t1-t4;

     array temp(4) $ 1 _temporary_;

     do i=1 to 4;

           temp(i)=t(i);

     end;

     do i=1 to 4;

           call missing (temp(i));

           if t(i) in temp then do;

                dup_flag=1;

                leave;

      end;

                

     end;

     drop i;

run;

Update: To end the loop once getting a hit.

Respected Advisor
Posts: 3,156

Re: identifying duplicates values across variables

If you have a big table and troubled by the efficiency, then following code can be considered 'being on steroid':

data have;

     input (t1-t4) (:$2.);

     cards;

1 2 3 4

1 2 3 2

run;

data want_fast;

     set have;

     length _cat $ 100;

     dup_flag=0;

     array t(4) t1-t4;

     _cat=peekclong(addrlong(t(1)), 8);

     do i=1 to 4;

           if count(_cat, t(i), 't')>1 then

                do;

                     dup_flag=1;

                     leave;

                end;

     end;

     drop _cat i;

run;

Super User
Posts: 5,518

Re: identifying duplicates values across variables

This is not so difficult if you use arrays.  But would it be useful?  All you get is a flag that is 0 or 1.  When it is 1 you still don't know where the match occurred.  At any rate, you may have to type all 53 names once:

data want;

   set have;

   array paths {53} list of pathogen names goes here;

   postcult=0;

   do i=1 to 52 until (postcult=1);

      if paths{i} ne ' ' then do j=i+1 to 53;
         if paths{i} = paths{j} then postcult=1;

      end;

   end;

   drop i j;

run;

Good luck.

Super User
Posts: 10,046

Re: identifying duplicates values across variables

Hash Table.

data have;
     input (t1-t4) (:$1.);
     cards;
1 2 3 4
1 2 3 2
. 3 . 6
. 4 4 5
;
run;
data want;
 if _n_ eq 1 then do;
  length k $ 40;
  declare hash h();
  h.definekey('k');
  h.definedone();
 end;
set have;
array x{*} $ t1-t4;
do i=1 to dim(x);
 if not missing(x{i}) then do;k=x{i};rc=h.add();end;
end;
if dim(x)-cmiss(of x{*})=h.num_items then poscult=0;
 else poscult=1;
h.clear();
drop i k rc;
run;

Xia Keshan

Ask a Question
Discussion stats
  • 4 replies
  • 273 views
  • 0 likes
  • 4 in conversation