BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Patrick
Opal | Level 21

Hi all

I'm having phone numbers per person ('data have'). What I need to do is to get rid of duplicate phone numbers per row (result should be like 'data want').

To have the phone numbers also "left aligned" like in 'data want' would be a "nice to have" but is not crucial.

What is crucial: The de-duping must happen within a single data step (so in below example everything should happen within 'data have') because I need a solution which I then can apply on 40M rows (so using Proc Transpose or the like is not an option).

....and the code must work with SAS9.1.3

data have;
  infile datalines dsd truncover;
  input person $ phone1-phone3;
  datalines;
A,111,222,111
B,,444,444
C,111,,222,
;
run;


data want;
  infile datalines dsd truncover;
  input person $ phone1-phone3;
  datalines;
A,111,222
B,444
C,111,222
;
run;

Thanks

Patrick

Message was edited by: Patrick
@Art: It was a typo. Data fixed.

Just to add one more piece of information which might help: You can work under the assumption that the phone numbers only contain digits

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

Patrick,

For your question, there is no need to use HashTable yet. Since HashTable is a associated arrays. You also can a simple array to get it . I believe my code is very fast . Have a NICE day!

data have;
  infile datalines dsd truncover;
  input person $ phone1-phone3;
  datalines;
A,111,222,111
B,,444,444
C,111,,222,
;
run;
data want(drop=i j k _:);
 set have;
 array _p{*} phone1-phone3 ;
 array _pp{*} _phone1-_phone3;
 j=0;
 do i=1 to dim(_p);
  if _p{i} not in _pp then do;j+1;_pp{j}=_p{i};end;
 end;
 do k=1 to dim(_p);
  _p{k}=_pp{k};
 end;
run;

Ksharp

View solution in original post

34 REPLIES 34
art297
Opal | Level 21

Patrick,  Unless that was a typo, for the first record of want, I think we need more explanation of what you want.  I would have expected that you would want:

A,111,222

Haikuo
Onyx | Level 15

Patrick,

The will be a brute force method, but don't blame me, blame Mark Keintz :smileysilly:. Since you are dealing with phone numbers, so there will be a scope, And if your phone numbers do not have '-', () embedded, you can try the following, otherwise, you will need to convert your data a little bit:

data have;

  infile datalines dsd truncover;

  input person $ phone1-phone3;

array p1(1:1000) _temporary_ (1000*0);

array p2 phone:;

do i=1 to dim(p2);

if not missing(p2(i)) then p1(p2(i))=1;

end;

call missing(of p2(*));

do  i=1 to 1000;

  if p1(i)=1 then do ;

    j+1;

    p2(j)=i;

  end;

end;

call missing(of i,j,of p1(*));

drop i j;

  datalines;

A,111,222,111

B,,444,444

C,111,,222,

;

run;

proc print;run;

Haikuo

Patrick
Opal | Level 21

Hi Haikuo

My phone numbers contain only digits so no problems from this side. The code you've posted is quite clever but - and it's a big BUT - it is very wastefull with memory.

I'm having in real life much longer phone numbers. I've tried your code with 11 digits so I had to create 10**12 temporary array elements. It took only seconds to get an out-of-memory error (and I've got 8GB of RAM on my test laptop).

So: Very interesting solution but not fit-for-purpose unfortunately.

Thanks

Patrick

Haikuo
Onyx | Level 15

Patrick,

I know mine is hopelessly flawed for your real problem. The good thing is, Art 's solution is much much better.

Haikuo

art297
Opal | Level 21

Here is an alternative approach that was suggested by Paul Dorfman in an older (2005) SAS-L post.  The approach is quite useful as the only thing one has to change in order to deal with character variables is to make the array character.

Anyhow, it lets you sort in either ascending or descending order, as well as gets rid of duplicates:

data have;

  infile datalines dsd truncover;

  input person $ phone1-phone3;

  datalines;

A,111,222,111

B,,444,444

C,111,,222,

;

run;

data want;

  set have;

  array numbers(*) phone1-phone3;

** set sort parameters (hint: would-be macro parms) ;

   seq      = 'A' ; * A = ascending, D = descending ;

   nodupkey =  1  ; * 0 = duplicates allowed, 1 = duplicates not allowed ;

   dcl hash  _hh  (hashexp: 16, ordered: seq) ;

   dcl hiter _hi  ('_hh'       ) ;

   _hh.definekey  ('_k', '_n'  ) ; * _n - extra enumerating key ;

   _hh.definedata ('_k'        ) ; * _k automatically assumes array data type ;

   _hh.definedone (            ) ;

** load composite (_k _n) key on the table   ;

** if duplicates to be retained, set 0 <- _n ;

   do _j = lbound (numbers) to hbound (numbers) ;

      _n = _j * ^ nodupkey ;

      _k = numbers [_j] ;

      _hh.replace() ;

   end ;

** use iterator HI to reload array from HH table, now in order ;

   _n = lbound (numbers) - 1 ;

   do _rc = _hi.first() by 0 while ( _rc = 0 ) ;

      _n = _n + 1 ;

      numbers [_n] = _k ;

      _rc = _hi.next() ;

   end ;

   _q = _n ;

** fill array tail with missing values if duplicates are delete ;

   do _n = _q + 1 to hbound (numbers) ;

      numbers [_n] = . ;

   end ;

   drop _: ; * drop auxiliary variables ;

run;

Astounding
PROC Star

Art,

If I understand this correctly (and I'm a ways from that at this point), you are rebuilding the hash table for each observation.  I'm going to have to study this a bit.

Patrick,

Traditionally, the biggest problem in these situations was putting all phone numbers into the same format.  Enhancements to COMPRESS (using kd as the third parameter) have made this step easy.  If you only have three phone numbers, you can always code:

if phone1=phone2 then phone2=' ';

if phone1=phone3 then phone3=' ';

if phone2=phone3 then phone3=' ';

That won't handle the order problem, but that could be tackled as a next step.  If you actually have more phone numbers, you might switch to arrays:

array phones {*} phone:;

do i=1 to dim(phone)-1;

   do j=i+1 to dim(phone);

       if phones{i}=phones{j} then phones{j}=' ';

   end;

end;

I like the hashing approach ... I'm just not up to speed on it yet.

Good luck.

art297
Opal | Level 21

Astounding: What I'd really like to see is a nodupkey option added to call sortn and call sortc, but I don't know if we're ever going to see that.

Haikuo
Onyx | Level 15

I like it, Art. I believe this is more suitable for Patrick's daunting 40M records. Haikuo

MikeZdeb
Rhodochrosite | Level 12

hi Art ... I tried the HASH approach with a lot more observations ...

data have;

infile datalines dsd truncover;

input person $ phone1-phone3;

datalines;

A,111,222,111

B,,444,444

C,111,,222,

;

data have;

set have;

do _n_=1 to 1000;

  output;

end;

run;


and on my PC it dies after about 3 seconds with the message ...

ERROR: Hash object added 0 items when memory failure occurred.

FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.

ERROR: The SAS System stopped processing this step because of insufficient memory.

NOTE: There were 2543 observations read from the data set WORK.HAVE.

WARNING: The data set WORK.WANT may be incomplete.  When this step was stopped there were 2542 observations and 6 variables.

WARNING: Data set WORK.WANT was not replaced because this step was stopped.

NOTE: DATA statement used (Total process time):

      real time           3.07 seconds

      cpu time            3.04 seconds

So, what change(s) would make it work with 3,000 observations?

ps  I tried the de-deduping approach that I posted that uses _FILE_ and it took 0.01 seconds with the 3,000 observations. 

With 30,000 observations and 15 phone numbers, the _FILE_ approach took 0.95 seconds

art297
Opal | Level 21

NOTE: Updated code.  I think this works, tested it on 40 mil records with 15 phone numbers, and ran in 2.5 minutes.

data want (drop=i j);

  set have;

  array numbers(*) phone13-phone1;

  do until (lastmiss eq currentmiss);

    lastmiss=nmiss(of numbers(*));

    call sortn (of numbers(*));

    do i=lastmiss+1 to dim(numbers)-1;

      skip=0;

      do j=i+1 to dim(numbers);

        if numbers(i) eq numbers(j) then do;

          call missing(numbers(j));

          skip+1;

        end;

     else leave;

      end;

      i+skip;

    end;

    currentmiss=nmiss(of numbers(*));

  end;

run;

Patrick
Opal | Level 21

Art, Haikuo, Astounding

Thanks a lot for your fast and valuable replies.

@Art

I was actually thinking about using a hash/iter object combination but I couldn't figure out of how to do it in a way that I wouldn't have to create and destroy the objects for every single iteration of the data step. I even was looking for papers from the master of hash (Paul Dorfman) but couldn't find the right one.

Thanks a lot for this. I will still need some time to fully digest this code but considering the author I have no doubt that it's kind of final and can't be bettered.

art297
Opal | Level 21

Patrick: And, since Paul has been known to occasionally read and respond to the forum posts, if he has come up with a more optimal solution, I would guess that we'll hear from him directly.

Haikuo
Onyx | Level 15

Patrick,

clear() method can been found in  9.2, It clears hash() without delete the object instance. I have no idea its overhead comparing to generating new hash() each record, and I don't know if it is at your disposal for 9.1.3. Here is the example, please note the bold ones.

data want;

  set have;

  array numbers(*) phone1-phone3;

** set sort parameters (hint: would-be macro parms) ;

   seq      = 'A' ; * A = ascending, D = descending ;

   nodupkey =  1  ; * 0 = duplicates allowed, 1 = duplicates not allowed ;

  if _n_=1 then do;

     dcl hash  _hh  (hashexp: 16, ordered: seq) ;

     dcl hiter _hi  ('_hh'       ) ;

     _hh.definekey  ('_k', '_n'  ) ; * _n - extra enumerating key ;

     _hh.definedata ('_k'        ) ; * _k automatically assumes array data type ;

    _hh.definedone (            ) ;

  end;

** load composite (_k _n) key on the table   ;

** if duplicates to be retained, set 0 <- _n ;

   do _j = lbound (numbers) to hbound (numbers) ;

      _n = _j * ^ nodupkey ;

      _k = numbers [_j] ;

      _hh.replace() ;

   end ;

** use iterator HI to reload array from HH table, now in order ;

   _n = lbound (numbers) - 1 ;

   do _rc = _hi.first() by 0 while ( _rc = 0 ) ;

      _n = _n + 1 ;

      numbers [_n] = _k ;

      _rc = _hi.next() ;

   end ;

   _q = _n ;

** fill array tail with missing values if duplicates are delete ;

   do _n = _q + 1 to hbound (numbers) ;

      numbers [_n] = . ;

   end ;

   drop _: ; * drop auxiliary variables ;

   _hh.clear();

run;

Haikuo

MikeZdeb
Rhodochrosite | Level 12

hi ... another approach (HEX16. maintains values through numeric-character-numeric conversion) ...

data have;

infile datalines dsd truncover;

input person $ phone1-phone3;

datalines;

A,111,222,111

B,,444,444

C,111,,222,

;

options missing = ' ';

filename d dummy;

data want;

file d;

set have;

array phone(3);

do j=3 to 1 by -1;

   put (phone:) (hex16. +1) @;

   if count(_file_ , put(phone(j),hex16.)) ge 2 then phone(j) = .;

   put;

end;

put (phone:) (hex16. +1) @;

do j=1 to 3;

   phone(j) = input(scan(_file_,j,' '),hex16.);

end;

drop j;

run;

options missing = '.';


results ...

person    phone1    phone2    phone3

  A         111       222        .

  B         444         .        .

  C         111       222        .

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 34 replies
  • 5026 views
  • 6 likes
  • 10 in conversation