Hi all
I have phone numbers per person ('data have'). What I need to do is get rid of duplicate phone numbers within each row (the result should look like 'data want').
Having the phone numbers also "left aligned" as in 'data want' would be a nice-to-have but is not crucial.
What is crucial: the de-duplication must happen within a single data step (so in the example below everything should happen within 'data have'), because I need a solution that I can then apply to 40M rows (so using Proc Transpose or the like is not an option).
...and the code must work with SAS 9.1.3.
data have;
infile datalines dsd truncover;
input person $ phone1-phone3;
datalines;
A,111,222,111
B,,444,444
C,111,,222,
;
run;
data want;
infile datalines dsd truncover;
input person $ phone1-phone3;
datalines;
A,111,222
B,444
C,111,222
;
run;
Thanks
Patrick
Message was edited by: Patrick
@Art: It was a typo. Data fixed.
Just to add one more piece of information which might help: you can work under the assumption that the phone numbers contain only digits.
Accepted Solutions
Patrick,
For your question, there is no need to use a hash table yet. Since a hash table is essentially an associative array, you can also use a simple array to get it done. I believe my code is very fast. Have a NICE day!
data have;
infile datalines dsd truncover;
input person $ phone1-phone3;
datalines;
A,111,222,111
B,,444,444
C,111,,222,
;
run;
data want(drop=i j k _:);
set have;
array _p{*} phone1-phone3;
array _pp{*} _phone1-_phone3;
j=0;
do i=1 to dim(_p);
* copy each phone into _pp only the first time it is seen ;
if _p{i} not in _pp then do;
j+1;
_pp{j}=_p{i};
end;
end;
* write the de-duplicated, left-aligned values back to phone1-phone3 ;
do k=1 to dim(_p);
_p{k}=_pp{k};
end;
run;
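If the IN operator with an array operand is not available on your release (the question asks for SAS 9.1.3), the same check can be written with an explicit inner loop. A rough, equivalent sketch (untested, with a hypothetical flag variable named found):
data want(drop=i j k found _:);
set have;
array _p{*} phone1-phone3;
array _pp{*} _phone1-_phone3;
j=0;
do i=1 to dim(_p);
found=0; * has this value already been copied to _pp? ;
do k=1 to j;
if _p{i}=_pp{k} then found=1;
end;
if not found and not missing(_p{i}) then do;
j+1;
_pp{j}=_p{i};
end;
end;
do k=1 to dim(_p);
_p{k}=_pp{k};
end;
run;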
Ksharp
Patrick, unless that was a typo in the first record of want, I think we need more explanation of what you want. I would have expected that you would want:
A,111,222
Patrick,
This will be a brute-force method, but don't blame me, blame Mark Keintz :smileysilly:. Since you are dealing with phone numbers, there is a bounded range of possible values. If your phone numbers do not have '-' or '()' embedded, you can try the following; otherwise, you will need to convert your data a little bit first:
data have;
infile datalines dsd truncover;
input person $ phone1-phone3;
array p1(1:1000) _temporary_ (1000*0); * one flag per possible phone value ;
array p2 phone:;
do i=1 to dim(p2);
if not missing(p2(i)) then p1(p2(i))=1;
end;
call missing(of p2(*));
do i=1 to 1000;
if p1(i)=1 then do ;
j+1;
p2(j)=i;
end;
end;
call missing(of i,j,of p1(*));
drop i j;
datalines;
A,111,222,111
B,,444,444
C,111,,222,
;
run;
proc print;run;
Haikuo
Hi Haikuo
My phone numbers contain only digits, so no problems from that side. The code you've posted is quite clever but - and it's a big BUT - it is very wasteful with memory.
In real life I have much longer phone numbers. I tried your code with 11 digits, so I had to create 10**12 temporary array elements. At 8 bytes per numeric element that is on the order of terabytes, and it took only seconds to get an out-of-memory error (and I've got 8GB of RAM on my test laptop).
So: a very interesting solution but unfortunately not fit for purpose.
Thanks
Patrick
Patrick,
I know mine is hopelessly flawed for your real problem. The good thing is, Art's solution is much, much better.
Haikuo
Here is an alternative approach that was suggested by Paul Dorfman in an older (2005) SAS-L post. The approach is quite useful because the only thing one has to change in order to deal with character variables is to make the array character.
Anyhow, it lets you sort in either ascending or descending order, as well as get rid of duplicates:
data have;
infile datalines dsd truncover;
input person $ phone1-phone3;
datalines;
A,111,222,111
B,,444,444
C,111,,222,
;
run;
data want;
set have;
array numbers(*) phone1-phone3;
** set sort parameters (hint: would-be macro parms) ;
seq = 'A' ; * A = ascending, D = descending ;
nodupkey = 1 ; * 0 = duplicates allowed, 1 = duplicates not allowed ;
dcl hash _hh (hashexp: 16, ordered: seq) ;
dcl hiter _hi ('_hh' ) ;
_hh.definekey ('_k', '_n' ) ; * _n - extra enumerating key ;
_hh.definedata ('_k' ) ; * _k automatically assumes array data type ;
_hh.definedone ( ) ;
** load composite (_k _n) key on the table ;
** when duplicates are to be removed (nodupkey=1), _n is forced to 0 below ;
do _j = lbound (numbers) to hbound (numbers) ;
_n = _j * ^ nodupkey ;
_k = numbers [_j] ;
_hh.replace() ;
end ;
** use iterator HI to reload array from HH table, now in order ;
_n = lbound (numbers) - 1 ;
do _rc = _hi.first() by 0 while ( _rc = 0 ) ;
_n = _n + 1 ;
numbers [_n] = _k ;
_rc = _hi.next() ;
end ;
_q = _n ;
** fill array tail with missing values if duplicates are deleted ;
do _n = _q + 1 to hbound (numbers) ;
numbers [_n] = . ;
end ;
drop _: ; * drop auxiliary variables ;
run;
Art,
If I understand this correctly (and I'm a ways from that at this point), you are rebuilding the hash table for each observation. I'm going to have to study this a bit.
Patrick,
Traditionally, the biggest problem in these situations was getting all the phone numbers into the same format. Enhancements to COMPRESS (using 'kd' as the third argument, which keeps only the digits) have made this step easy; there is a short sketch of that step after the array version below. If you only have three phone numbers, you can always code:
if phone1=phone2 then phone2=' ';
if phone1=phone3 then phone3=' ';
if phone2=phone3 then phone3=' ';
That won't handle the order problem, but that could be tackled as a next step. If you actually have more phone numbers, you might switch to arrays:
array phones {*} phone:;
do i=1 to dim(phones)-1;
do j=i+1 to dim(phones);
if phones{i}=phones{j} then phones{j}=' ';
end;
end;
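On the COMPRESS point above, here is a minimal sketch of that normalization step, assuming hypothetical character variables rawphone1-rawphone3 in a dataset raw_have that may contain dashes, parentheses, or blanks:
data normalized;
set raw_have; * hypothetical input with character phone fields ;
array p(*) $ rawphone1-rawphone3;
do i=1 to dim(p);
p(i) = compress(p(i), , 'kd'); * keep (k) only the digits (d) ;
end;
drop i;
run;
Once everything is digit-only, any of the de-duplication techniques in this thread can be applied to the cleaned values.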
I like the hashing approach ... I'm just not up to speed on it yet.
Good luck.
Astounding: What I'd really like to see is a nodupkey option added to call sortn and call sortc, but I don't know if we're ever going to see that.
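In the meantime, a rough sketch of emulating that for character values - assuming a hypothetical dataset have_c with character variables cphone1-cphone3 - would be to sort, blank out adjacent duplicates, and sort again so the blanks are grouped:
data want_c;
set have_c; * hypothetical dataset with character phone numbers ;
array p(*) $ cphone1-cphone3;
call sortc(of p(*)); * ascending sort - blanks come first ;
do i=dim(p) to 2 by -1;
if p(i)=p(i-1) then p(i)=' '; * blank out the later duplicate ;
end;
call sortc(of p(*)); * regroup so the blanks come first again ;
drop i;
run;
The numeric equivalent with CALL SORTN is essentially what the updated code further down in this thread does.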
I like it, Art. I believe this is more suitable for Patrick's daunting 40M records. Haikuo
hi Art ... I tried the HASH approach with a lot more observations ...
data have;
infile datalines dsd truncover;
input person $ phone1-phone3;
datalines;
A,111,222,111
B,,444,444
C,111,,222,
;
data have;
set have;
do _n_=1 to 1000;
output;
end;
run;
and on my PC it dies after about 3 seconds with the message ...
ERROR: Hash object added 0 items when memory failure occurred.
FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.
ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: There were 2543 observations read from the data set WORK.HAVE.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 2542 observations and 6 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 3.07 seconds
cpu time 3.04 seconds
So, what change(s) would make it work with 3,000 observations?
ps I tried the de-duping approach that I posted that uses _FILE_ and it took 0.01 seconds with the 3,000 observations.
With 30,000 observations and 15 phone numbers, the _FILE_ approach took 0.95 seconds
NOTE: Updated code. I think this works; I tested it on 40 million records with 15 phone numbers, and it ran in 2.5 minutes.
data want (drop=i j skip lastmiss currentmiss);
set have;
array numbers(*) phone1-phone3; * extend the list (e.g. phone1-phone15) if you have more phone variables ;
do until (lastmiss eq currentmiss);
lastmiss=nmiss(of numbers(*));
call sortn (of numbers(*)); * missing values sort to the front ;
do i=lastmiss+1 to dim(numbers)-1;
skip=0;
do j=i+1 to dim(numbers);
if numbers(i) eq numbers(j) then do;
call missing(numbers(j));
skip+1;
end;
else leave;
end;
i+skip;
end;
currentmiss=nmiss(of numbers(*));
end;
run;
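For anyone who wants to reproduce the timing, a rough sketch of a test harness along those lines (my own setup, with made-up random 7-digit values and the names have_big and phone1-phone15 assumed; person is just a numeric counter here):
data have_big;
array phone(15);
do person=1 to 40000000;
do i=1 to dim(phone);
phone(i)=ceil(ranuni(1)*9999999); * random 7-digit phone numbers ;
end;
phone(3)=phone(1); * plant some within-row duplicates ;
phone(9)=phone(5);
output;
end;
drop i;
run;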
Art, Haikuo, Astounding
Thanks a lot for your fast and valuable replies.
@Art
I was actually thinking about using a hash/iterator object combination, but I couldn't figure out how to do it in a way that wouldn't require creating and destroying the objects for every single iteration of the data step. I was even looking for papers from the master of hash (Paul Dorfman) but couldn't find the right one.
Thanks a lot for this. I will still need some time to fully digest this code, but considering the author I have no doubt that it's pretty much final and can't be bettered.
Patrick: And, since Paul has been known to occasionally read and respond to forum posts, if he has come up with a better solution, I would guess that we'll hear from him directly.
Patrick,
The CLEAR() method can be found in 9.2. It clears the hash object without deleting the object instance. I have no idea of its overhead compared to generating a new hash object for each record, and I don't know whether it is at your disposal in 9.1.3. Here is the example; please note the key changes: the IF _N_=1 block around the declarations and the _hh.clear() call at the end.
data want;
set have;
array numbers(*) phone1-phone3;
** set sort parameters (hint: would-be macro parms) ;
seq = 'A' ; * A = ascending, D = descending ;
nodupkey = 1 ; * 0 = duplicates allowed, 1 = duplicates not allowed ;
if _n_=1 then do;
dcl hash _hh (hashexp: 16, ordered: seq) ;
dcl hiter _hi ('_hh' ) ;
_hh.definekey ('_k', '_n' ) ; * _n - extra enumerating key ;
_hh.definedata ('_k' ) ; * _k automatically assumes array data type ;
_hh.definedone ( ) ;
end;
** load composite (_k _n) key on the table ;
** when duplicates are to be removed (nodupkey=1), _n is forced to 0 below ;
do _j = lbound (numbers) to hbound (numbers) ;
_n = _j * ^ nodupkey ;
_k = numbers [_j] ;
_hh.replace() ;
end ;
** use iterator HI to reload array from HH table, now in order ;
_n = lbound (numbers) - 1 ;
do _rc = _hi.first() by 0 while ( _rc = 0 ) ;
_n = _n + 1 ;
numbers [_n] = _k ;
_rc = _hi.next() ;
end ;
_q = _n ;
** fill array tail with missing values if duplicates are deleted ;
do _n = _q + 1 to hbound (numbers) ;
numbers [_n] = . ;
end ;
drop _: ; * drop auxiliary variables ;
_hh.clear();
run;
Haikuo
hi ... another approach (HEX16. maintains values through numeric-character-numeric conversion) ...
data have;
infile datalines dsd truncover;
input person $ phone1-phone3;
datalines;
A,111,222,111
B,,444,444
C,111,,222,
;
options missing = ' ';
filename d dummy;
data want;
file d;
set have;
array phone(3);
do j=3 to 1 by -1;
put (phone:) (hex16. +1) @; * write all phones to the line buffer (_FILE_) as hex ;
if count(_file_ , put(phone(j),hex16.)) ge 2 then phone(j) = .; * value appears more than once? clear this later occurrence ;
put;
end;
put (phone:) (hex16. +1) @;
do j=1 to 3;
phone(j) = input(scan(_file_,j,' '),hex16.); * read the surviving values back - scanning past the blanks left-aligns them ;
end;
drop j;
run;
options missing = '.';
results ...
person phone1 phone2 phone3
A 111 222 .
B 444 . .
C 111 222 .