DATA Step, Macro, Functions and more

de-dup array

Accepted Solution Solved
Reply
Respected Advisor
Posts: 4,173
Accepted Solution

de-dup array

Hi all

I'm having phone numbers per person ('data have'). What I need to do is to get rid of duplicate phone numbers per row (result should be like 'data want').

To have the phone numbers also "left aligned" like in 'data want' would be a "nice to have" but is not crucial.

What is crucial: The de-duping must happen within a single data step (so in below example everything should happen within 'data have') because I need a solution which I then can apply on 40M rows (so using Proc Transpose or the like is not an option).

....and the code must work with SAS9.1.3

data have;
  infile datalines dsd truncover;
  input person $ phone1-phone3;
  datalines;
A,111,222,111
B,,444,444
C,111,,222,
;
run;


data want;
  infile datalines dsd truncover;
  input person $ phone1-phone3;
  datalines;
A,111,222
B,444
C,111,222
;
run;

Thanks

Patrick

Message was edited by: Patrick
@Art: It was a typo. Data fixed.

Just to add one more piece of information which might help: You can work under the assumption that the phone numbers only contain digits


Accepted Solutions
Solution
‎08-16-2012 11:44 PM
Super User
Posts: 10,028

Re: de-dup array

Patrick,

For your question, there is no need to use HashTable yet. Since HashTable is a associated arrays. You also can a simple array to get it . I believe my code is very fast . Have a NICE day!

data have;
  infile datalines dsd truncover;
  input person $ phone1-phone3;
  datalines;
A,111,222,111
B,,444,444
C,111,,222,
;
run;
data want(drop=i j k _:);
 set have;
 array _p{*} phone1-phone3 ;
 array _pp{*} _phone1-_phone3;
 j=0;
 do i=1 to dim(_p);
  if _p{i} not in _pp then do;j+1;_pp{j}=_p{i};end;
 end;
 do k=1 to dim(_p);
  _p{k}=_pp{k};
 end;
run;

Ksharp

View solution in original post


All Replies
PROC Star
Posts: 7,474

Re: de-dup array

Patrick,  Unless that was a typo, for the first record of want, I think we need more explanation of what you want.  I would have expected that you would want:

A,111,222

Respected Advisor
Posts: 3,156

Re: de-dup array

Patrick,

The will be a brute force method, but don't blame me, blame Mark Keintz :smileysilly:. Since you are dealing with phone numbers, so there will be a scope, And if your phone numbers do not have '-', () embedded, you can try the following, otherwise, you will need to convert your data a little bit:

data have;

  infile datalines dsd truncover;

  input person $ phone1-phone3;

array p1(1:1000) _temporary_ (1000*0);

array p2 phone:;

do i=1 to dim(p2);

if not missing(p2(i)) then p1(p2(i))=1;

end;

call missing(of p2(*));

do  i=1 to 1000;

  if p1(i)=1 then do ;

    j+1;

    p2(j)=i;

  end;

end;

call missing(of i,j,of p1(*));

drop i j;

  datalines;

A,111,222,111

B,,444,444

C,111,,222,

;

run;

proc print;run;

Haikuo

Respected Advisor
Posts: 4,173

Re: de-dup array

Hi Haikuo

My phone numbers contain only digits so no problems from this side. The code you've posted is quite clever but - and it's a big BUT - it is very wastefull with memory.

I'm having in real life much longer phone numbers. I've tried your code with 11 digits so I had to create 10**12 temporary array elements. It took only seconds to get an out-of-memory error (and I've got 8GB of RAM on my test laptop).

So: Very interesting solution but not fit-for-purpose unfortunately.

Thanks

Patrick

Respected Advisor
Posts: 3,156

Re: de-dup array

Patrick,

I know mine is hopelessly flawed for your real problem. The good thing is, Art 's solution is much much better.

Haikuo

PROC Star
Posts: 7,474

Re: de-dup array

Here is an alternative approach that was suggested by Paul Dorfman in an older (2005) SAS-L post.  The approach is quite useful as the only thing one has to change in order to deal with character variables is to make the array character.

Anyhow, it lets you sort in either ascending or descending order, as well as gets rid of duplicates:

data have;

  infile datalines dsd truncover;

  input person $ phone1-phone3;

  datalines;

A,111,222,111

B,,444,444

C,111,,222,

;

run;

data want;

  set have;

  array numbers(*) phone1-phone3;

** set sort parameters (hint: would-be macro parms) ;

   seq      = 'A' ; * A = ascending, D = descending ;

   nodupkey =  1  ; * 0 = duplicates allowed, 1 = duplicates not allowed ;

   dcl hash  _hh  (hashexp: 16, ordered: seq) ;

   dcl hiter _hi  ('_hh'       ) ;

   _hh.definekey  ('_k', '_n'  ) ; * _n - extra enumerating key ;

   _hh.definedata ('_k'        ) ; * _k automatically assumes array data type ;

   _hh.definedone (            ) ;

** load composite (_k _n) key on the table   ;

** if duplicates to be retained, set 0 <- _n ;

   do _j = lbound (numbers) to hbound (numbers) ;

      _n = _j * ^ nodupkey ;

      _k = numbers [_j] ;

      _hh.replace() ;

   end ;

** use iterator HI to reload array from HH table, now in order ;

   _n = lbound (numbers) - 1 ;

   do _rc = _hi.first() by 0 while ( _rc = 0 ) ;

      _n = _n + 1 ;

      numbers [_n] = _k ;

      _rc = _hi.next() ;

   end ;

   _q = _n ;

** fill array tail with missing values if duplicates are delete ;

   do _n = _q + 1 to hbound (numbers) ;

      numbers [_n] = . ;

   end ;

   drop _: ; * drop auxiliary variables ;

run;

Super User
Posts: 5,505

Re: de-dup array

Art,

If I understand this correctly (and I'm a ways from that at this point), you are rebuilding the hash table for each observation.  I'm going to have to study this a bit.

Patrick,

Traditionally, the biggest problem in these situations was putting all phone numbers into the same format.  Enhancements to COMPRESS (using kd as the third parameter) have made this step easy.  If you only have three phone numbers, you can always code:

if phone1=phone2 then phone2=' ';

if phone1=phone3 then phone3=' ';

if phone2=phone3 then phone3=' ';

That won't handle the order problem, but that could be tackled as a next step.  If you actually have more phone numbers, you might switch to arrays:

array phones {*} phone:;

do i=1 to dim(phone)-1;

   do j=i+1 to dim(phone);

       if phones{i}=phones{j} then phones{j}=' ';

   end;

end;

I like the hashing approach ... I'm just not up to speed on it yet.

Good luck.

PROC Star
Posts: 7,474

Re: de-dup array

Posted in reply to Astounding

Astounding: What I'd really like to see is a nodupkey option added to call sortn and call sortc, but I don't know if we're ever going to see that.

Respected Advisor
Posts: 3,156

Re: de-dup array

I like it, Art. I believe this is more suitable for Patrick's daunting 40M records. Haikuo

Valued Guide
Posts: 765

Re: de-dup array

hi Art ... I tried the HASH approach with a lot more observations ...

data have;

infile datalines dsd truncover;

input person $ phone1-phone3;

datalines;

A,111,222,111

B,,444,444

C,111,,222,

;

data have;

set have;

do _n_=1 to 1000;

  output;

end;

run;


and on my PC it dies after about 3 seconds with the message ...

ERROR: Hash object added 0 items when memory failure occurred.

FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.

ERROR: The SAS System stopped processing this step because of insufficient memory.

NOTE: There were 2543 observations read from the data set WORK.HAVE.

WARNING: The data set WORK.WANT may be incomplete.  When this step was stopped there were 2542 observations and 6 variables.

WARNING: Data set WORK.WANT was not replaced because this step was stopped.

NOTE: DATA statement used (Total process time):

      real time           3.07 seconds

      cpu time            3.04 seconds

So, what change(s) would make it work with 3,000 observations?

ps  I tried the de-deduping approach that I posted that uses _FILE_ and it took 0.01 seconds with the 3,000 observations. 

With 30,000 observations and 15 phone numbers, the _FILE_ approach took 0.95 seconds

PROC Star
Posts: 7,474

Re: de-dup array

NOTE: Updated code.  I think this works, tested it on 40 mil records with 15 phone numbers, and ran in 2.5 minutes.

data want (drop=i j);

  set have;

  array numbers(*) phone13-phone1;

  do until (lastmiss eq currentmiss);

    lastmiss=nmiss(of numbers(*));

    call sortn (of numbers(*));

    do i=lastmiss+1 to dim(numbers)-1;

      skip=0;

      do j=i+1 to dim(numbers);

        if numbers(i) eq numbers(j) then do;

          call missing(numbers(j));

          skip+1;

        end;

     else leave;

      end;

      i+skip;

    end;

    currentmiss=nmiss(of numbers(*));

  end;

run;

Respected Advisor
Posts: 4,173

Re: de-dup array

Art, Haikuo, Astounding

Thanks a lot for your fast and valuable replies.

@Art

I was actually thinking about using a hash/iter object combination but I couldn't figure out of how to do it in a way that I wouldn't have to create and destroy the objects for every single iteration of the data step. I even was looking for papers from the master of hash (Paul Dorfman) but couldn't find the right one.

Thanks a lot for this. I will still need some time to fully digest this code but considering the author I have no doubt that it's kind of final and can't be bettered.

PROC Star
Posts: 7,474

Re: de-dup array

Patrick: And, since Paul has been known to occasionally read and respond to the forum posts, if he has come up with a more optimal solution, I would guess that we'll hear from him directly.

Respected Advisor
Posts: 3,156

Re: de-dup array

Patrick,

clear() method can been found in  9.2, It clears hash() without delete the object instance. I have no idea its overhead comparing to generating new hash() each record, and I don't know if it is at your disposal for 9.1.3. Here is the example, please note the bold ones.

data want;

  set have;

  array numbers(*) phone1-phone3;

** set sort parameters (hint: would-be macro parms) ;

   seq      = 'A' ; * A = ascending, D = descending ;

   nodupkey =  1  ; * 0 = duplicates allowed, 1 = duplicates not allowed ;

  if _n_=1 then do;

     dcl hash  _hh  (hashexp: 16, ordered: seq) ;

     dcl hiter _hi  ('_hh'       ) ;

     _hh.definekey  ('_k', '_n'  ) ; * _n - extra enumerating key ;

     _hh.definedata ('_k'        ) ; * _k automatically assumes array data type ;

    _hh.definedone (            ) ;

  end;

** load composite (_k _n) key on the table   ;

** if duplicates to be retained, set 0 <- _n ;

   do _j = lbound (numbers) to hbound (numbers) ;

      _n = _j * ^ nodupkey ;

      _k = numbers [_j] ;

      _hh.replace() ;

   end ;

** use iterator HI to reload array from HH table, now in order ;

   _n = lbound (numbers) - 1 ;

   do _rc = _hi.first() by 0 while ( _rc = 0 ) ;

      _n = _n + 1 ;

      numbers [_n] = _k ;

      _rc = _hi.next() ;

   end ;

   _q = _n ;

** fill array tail with missing values if duplicates are delete ;

   do _n = _q + 1 to hbound (numbers) ;

      numbers [_n] = . ;

   end ;

   drop _: ; * drop auxiliary variables ;

   _hh.clear();

run;

Haikuo

Valued Guide
Posts: 765

Re: de-dup array

hi ... another approach (HEX16. maintains values through numeric-character-numeric conversion) ...

data have;

infile datalines dsd truncover;

input person $ phone1-phone3;

datalines;

A,111,222,111

B,,444,444

C,111,,222,

;

options missing = ' ';

filename d dummy;

data want;

file d;

set have;

array phone(3);

do j=3 to 1 by -1;

   put (phoneSmiley Happy (hex16. +1) @;

   if count(_file_ , put(phone(j),hex16.)) ge 2 then phone(j) = .;

   put;

end;

put (phoneSmiley Happy (hex16. +1) @;

do j=1 to 3;

   phone(j) = input(scan(_file_,j,' '),hex16.);

end;

drop j;

run;

options missing = '.';


results ...

person    phone1    phone2    phone3

  A         111       222        .

  B         444         .        .

  C         111       222        .

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 34 replies
  • 887 views
  • 6 likes
  • 10 in conversation