Solved: Re: de-dup array - Page 2

mkeintz · Posted 08-16-2012 03:35 PM

You have a list of phone numbers. Lets assume they are in character format (each length $10), and that you don't care if the resulting de-duped phone numbers are in any particular order.

Then (this is untested):

data want (drop=_:);

set have;

array ph {*} $ phone1-phone4;

length _list $60;

do _P=1 to dim(ph);

if ph{_P}=' ' then ;

else if _list=' ' then _list=ph{_P};

else if index(cats('!',_list,'!'), cats('!',ph{_P},'!') )=0 then _list=catx('!',_list,ph{_P});

ph{_P}=' ';

end;

if _list ^=' ' then do _P=1 to countw(_list,'!');

ph{_P}=scan(_list,_P,'!');

end;

run;

What this does is scan the list of phone values and lists them, one at a time, in an exclamation-delimited character variable (_LIST). But it only adds a number if it is not already in _LIST (bounding by exclamation mark makes the task of checking for dupes much easier). Then it takes the values in _LIST, and puts them back into the original phone number variables, with dupes removed, and "left-justified".

By the way, if the phone values are numeric, you'd only have to change the following (also untested).

array ph {*} phone1-phone4; **drop the $ sign **;

ph{_P}=input(scan(_list,_P,'!'),best10.); ** instead of ph{_P}=scan(_list,_P,'!') **;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

chang_y_chung_hotmail_com · Posted 08-16-2012 06:07 PM

If the maximum possible number of phones (per person) is small enough, then the simplest approach works fastest.

/* test data set */
data one;
    input id $ phone1-phone3;
cards;
A 111 222 111
B . 444 444
C 111 . 222
;
run;

/* dedup phone numbers within a person -- straight-forward approach */
%macro missout1(max);
    %local i j;
    %do i = &max. %to 2 %by -1;
      %*;if
      %do j = 1 %to %eval(&i - 1);
        %if &j > 1 %then %*; |;
        %*; phone&i = phone&j
      %end;
      %*; then phone&i = .;
    %end;
%mend missout1;
data chang1;
    set one;
    %missout1(3)
run;
proc print data=chang1;
run;
/* on lst
Obs    id    phone1    phone2    phone3
    1    A      111      222        .
    2    B        .      444        .
    3    C      111        .      222
*/

/* an alternative solution using overlapping definition of
    arrays -- inspired by Ksharp^s solution using "in" operator,
    and by Art297^s reversed array definition */
%macro missout2(max);
    %local i j;
    %do i = 1 %to &max;
      array p&i.

phone&

i. - phone1;
    %end;
    %do i = &max %to 2 %by -1;
      %let j = %eval(&i - 1);
      if p&i[1] in p&j. then p&i[1] = .;
    %end;
%mend missout2;
data chang2;
    set one;
    %missout2(3)
run;

/* check */
proc compare base=chang1 compare=chang2;
run;
/* on lst
NOTE: No unequal values were found. All values compared are exactly equal.
*/

art297 · Posted 08-16-2012 06:58 PM

Chang,

I still have to update my original post with what I think is correct code, but thought I should first respond to your post. This thread is definitely getting interesting!

I went a little further than you did. The OP has 40 million records, thus I'm in the middle of running tests on an expansion of a data set that Mike had created offline.

The latest version of my code worked, worked correctly (I think), and only took 2 and 1/2 minutes to run. I'll run the other tests in the morning.

Here is the code I tested:

/*create a test data set*/

data have (keep=person phone:);

array phone(15);

do person=1 to 40e6;

do j=1 to 15;

phone(j) = ceil(10*ranuni(999));

end;

output;

end;

run;

/*Art's method*/

data want (drop=i j);

set have;

array numbers(*) phone15-phone1;

do until (lastmiss eq currentmiss);

lastmiss=nmiss(of numbers(*));

call sortn (of numbers(*));

do i=lastmiss+1 to dim(numbers)-1;

skip=0;

do j=i+1 to dim(numbers);

if numbers(i) eq numbers(j) then do;

call missing(numbers(j));

skip+1;

end;

else leave;

end;

i+skip;

end;

currentmiss=nmiss(of numbers(*));

end;

run;

/*Chang's method*/

ods _all_ close;

ods listing;

options nocenter cmplib=work.func mprint fullstimer;

proc fcmp outlib=work.func.rsort;

subroutine swapn(x, y); /* swaps two numeric vars */

outargs x, y;

k = x;

x = y;

y = k;

endsub;

subroutine revArray(x

); /* reverse a numeric array or */

outargs x; /* swaps x[1] and x[dim(x)] */

d = dim(x); /* x[2] and x[dim(x)-1], */

do i = 1 to d/2 by 1; /* x[3] and x[dim(x)-2], */

m = x; /* and so on */

n = x[d-i+1];

call swapn(m, n);

x = m;

x[d-i+1] = n;

end;

endsub;

quit;

/* a test data each person can have &max phone numbers */

%let n = %sysevalf(40e6); %*-- number of people --*;

%let max = 15; %*-- max number of phones --*;

%let p = 0.06; %*-- prob(phone) except for the 1st --*;

%let dup = 0.05; %*-- probability of a phone to be duplicated --*;

%let seed = 1234567; %*-- random seed --*;

/* dedup phone numbers within a person -- straight-forward approach */

%macro missout(max);

%local i j;

%do i = &max. %to 2 %by -1;

%*;if

%do j = 1 %to %eval(&i - 1);

%if &j > 1 %then %*; |;

%*; phone&i = phone&j

%end;

%*; then phone&i = .;

%end;

%mend missout;

data chang;

set have;

%missout(&max.)

run;

art297 · Posted 08-17-2012 10:53 AM

Chang (and anyone else following this thread),

I haven't tested your latest code, but was finally able to run your original code and was extremely impressed. The reason I was unable to run it last night was that attempting to test everyone's version on a 40 million record test file, with 15 numbers per record, simply required more resources than I had available. Everyone's code ran if I ran them separately.

The results: Chang's code did the entire operation in approximately 30 seconds. Both KSharp's and mine accomplished the task in approximately 2 minutes.

I didn't include Mike's code in the test as I already discovered that it took around six times longer to run (on a smaller test file) than either KSharp's or my code, and should reflect those same differences on a larger dataset.

Art

art297 · Posted 08-16-2012 10:59 PM

I thought a quick note was in order to explain why my original alternate code was wrong. It's also a good question, I think, for the certification exam. After all of these years, it wasn't until this afternoon that I understood what was going on with a backwards array assignment.

For example, take the following one record datastep:

/*create one sample test record*/

data have (keep=person phone:);

input person phone1-phone4;

cards;

1 1 . 3 4

;

Easy enough, but now set it in another datastep, declare an array, and load the phone numbers in reverse order: i.e.,

data want;

set have;

array numbers(*) phone4-phone1;

output;

call sortn (of numbers(*));

output;

run;

proc print data=want;

run;

The datastep produces two records, one before, and one after a call sortn.

Can you correctly guess what the output is going to look like? I couldn't, thus based my original code on a faulty understanding.

The actual results:

Obs person phone1 phone2 phone3 phone4

1 1 1 . 3 4

2 1 4 3 1 .

Ksharp · Posted 08-16-2012 11:44 PM

Patrick,

For your question, there is no need to use HashTable yet. Since HashTable is a associated arrays. You also can a simple array to get it . I believe my code is very fast . Have a NICE day!

data have;
  infile datalines dsd truncover;
  input person $ phone1-phone3;
  datalines;
A,111,222,111
B,,444,444
C,111,,222,
;
run;
data want(drop=i j k _:);
 set have;
 array _p{*} phone1-phone3 ;
 array _pp{*} _phone1-_phone3;
 j=0;
 do i=1 to dim(_p);
  if _p{i} not in _pp then do;j+1;_pp{j}=_p{i};end;
 end;
 do k=1 to dim(_p);
  _p{k}=_pp{k};
 end;
run;

Ksharp

Haikuo · Posted 08-16-2012 11:55 PM

That is it? "if _p{i} not in _pp then do;" , never have thought that you can refer array() like that. I have actually tried something like '_pp(*)', and it did not work, so I thought this must be a dead end. Tell you want Ksharp, I'd like to clone your SAS part of brain, I will trade my wife for it.:smileysilly:

Haikuo

Ksharp · Posted 08-17-2012 12:21 AM

Thanks, Bian Hai Kuo.

That is it. it is a special operator for ARRARY. You can find it in documentation .

I also want to borrow my brain to you, since I can't use it here .

Have a NICE day!

Ksharp

MikeZdeb · Posted 08-20-2012 10:34 AM

Hi ... yes, the use if IN with the arrays was very nice. I had not noticed it in the documentation, but for those who are also looking for it ...

http://support.sas.com/documentation/cdl/en/lrcon/62753/HTML/default/viewer.htm#p00iah2thp63bmn1lt20...

I guess that WHICHN (and WHICHC) would also work in place of IN (though the IN looks more elegant and one doesn't have to think about numeric or character comparisons) ...

do i=1 to dim(_p);

if ^whichn(_p{i}, of _pp{*}) then do;j+1;_pp{j}=_p{i};end;

end;

in place of ...

do i=1 to dim(_p);

if _p{i} not in _pp then do;j+1;_pp{j}=_p{i};end;

end;

In some tests, IN and WHICHN (or WHICHC) take the same amount of time.

Not sure if it's been mentioned here, but a couple months ago on SAS-L there was a posting of an undocumented (unsupported) IN function. Here's an example, but it's also interesting to me how the COLON modifier works in conjunction with the array name for variable IN5 ...

data x;

array stuff(6) $10 _temporary_ ('ART' 'KSHARP' 'CHANG' 'BRIAN' 'TOM' 'MIKE');

input name :$10. @@;

in1 = in(name, of stuff(*));

in2 = in(trim(name), of stuff(*));

in3 = name in stuff;

in4 = trim(name) in stuff;

in5 = trim(name) in :stuff;

datalines;

SARA KATHY ART CHANG MIKE ARTHUR KSH

;

name in1 in2 in3 in4 in5

SARA 0 0 0 0 0

KATHY 0 0 0 0 0

ART 1 1 1 1 1

CHANG 1 1 1 1 1

MIKE 1 1 1 1 1

ARTHUR 0 0 0 0 0

KSH 0 1 0 0 1

To me, this appears useful in some of the data I work with in searching lists of diagnosis codes to see if a given DX is in the list or if it matches the string that

begins any DX in the list ...

data x;

array d(3) $5 _temporary_ ('25012' '11199' '45250');

input diag :$5. @@;

in1 = diag in d;

in2 = trim(diag) in :d;

datalines;

123 250 1119 25012 23456 452

;

diag in1 in2

123 0 0

250 0 1

1119 0 1

25012 1 1

23456 0 0

452 0 1

Ksharp · Posted 08-21-2012 12:18 AM

Thanks . Mike.

Good to know such interesting undocumented operator .

Keith · Posted 08-21-2012 04:39 AM

Yes, the use of IN: is very useful Mike. One point to note that some readers may not be aware of is that this function truncates the longer of the 2 values being compared to the shorter, this becomes important with variable lengths of the values of interest. For example, if you changed the length of diag to 6 in your code and added a digit to 25012 in the datalines statement, it would still match with 25012 in array d because it truncates diag to 5 characters in this instance.

MikeZdeb · Posted 08-21-2012 10:21 AM

hi ... look at this example (I think my conclusion at the bottom is correct) ...

data x;

array d(3) $5 _temporary_ ('25012' '11199' '45250' '12' );

input diag :$5. @@;

in0 = whichc(diag, of d(*));

in1 = diag in d;

in2 = trim(diag) in :d;

in3 = in(trim(diag), of d(*));

in4 = trim(diag) in ('25012' '11199' '45250' '12');

in5 = trim(diag) in : ('25012' '11199' '45250' '12');

datalines;

123 250 1119 25012 23456 452 199

;

diag in0 in1 in2 in3 in4 in5

123 0 0 0 0 0 1

250 0 0 1 1 0 1

1119 0 0 1 1 0 1

25012 1 1 1 1 1 1

23456 0 0 0 0 0 0

452 0 0 1 1 0 1

199 0 0 0 0 0 0

look at the value of IN5 when DIAG is '123' ...

the colon modifier when used with the IN operator and a list of values DID truncate the comparison to the shorter value

look at the value of IN3 when DIAG is '123' ...

appears that the colon modifier when used with the IN operator and an ARRAY DID NOT truncate the comparison to the shorter value

my conclusion is that it's be safer when using the colon modifier to find if a value is in a list if you put the list of values in a temporary array and use the IN operator to search the array (seems to take care of the truncation issue)

Keith · Posted 08-21-2012 10:29 AM

Hi Mike, the reason for this is that array d has a fixed length of 5, so the value '12' stored in it is padded out with 3 trailing blanks.

MikeZdeb · Posted 08-21-2012 10:38 AM

hi ... thanks (a "duh" moment) ... at least that makes colon behavior consistent (ah)

Linlin · Posted 08-21-2012 10:54 AM

Hi Mike,

What is "duh" moment?

Classroom Training Available!