BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mkeintz
PROC Star

You have a list of phone numbers.  Lets assume they are in character format (each length $10), and that you don't care if the resulting de-duped phone numbers are in any particular order.

Then (this is untested):


data want (drop=_:);

  set have;

  array ph {*} $ phone1-phone4;

  length _list $60;

  do _P=1 to dim(ph);

    if ph{_P}=' ' then ;

    else if _list=' ' then _list=ph{_P};

    else if index(cats('!',_list,'!'), cats('!',ph{_P},'!') )=0 then _list=catx('!',_list,ph{_P});

    ph{_P}=' ';

  end;


  if _list ^=' ' then do _P=1 to countw(_list,'!');

    ph{_P}=scan(_list,_P,'!');

  end;

run;

What this does is scan the list of phone values and lists them, one at a time, in an exclamation-delimited character variable (_LIST).  But it only adds a number if it is not already in _LIST (bounding by exclamation mark makes the task of checking for dupes much easier).  Then it takes the values in _LIST, and puts them back into the original phone number variables, with dupes removed, and "left-justified".

By the way, if the phone values are numeric, you'd only have to change the following (also untested).

  array ph {*}  phone1-phone4;    **drop the $ sign **;

  ph{_P}=input(scan(_list,_P,'!'),best10.);   ** instead of ph{_P}=scan(_list,_P,'!') **;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
chang_y_chung_hotmail_com
Obsidian | Level 7

If the maximum possible number of phones (per person) is small enough, then the simplest approach works fastest.

/* test data set */
  data one;
    input id $ phone1-phone3;
  cards;
  A 111 222 111
  B  . 444 444
  C 111  . 222
  ;
  run;
 
  /* dedup phone numbers within a person -- straight-forward approach */
  %macro missout1(max);
    %local i j;
    %do i = &max. %to 2 %by -1;
      %*;if
      %do j = 1 %to %eval(&i - 1);
        %if &j > 1 %then %*; |;
        %*; phone&i = phone&j
      %end;
      %*; then phone&i = .;
    %end;
  %mend  missout1;
  data chang1;
    set one;
    %missout1(3)
  run;
  proc print data=chang1;
  run;
  /* on lst
  Obs    id    phone1    phone2    phone3
    1    A      111      222        .
    2    B        .      444        .
    3    C      111        .      222
  */
 
  /* an alternative solution using overlapping definition of
    arrays -- inspired by Ksharp^s solution using "in" operator,
    and by Art297^s reversed array definition */
  %macro missout2(max);
    %local i j;
    %do i = 1 %to &max;
      array p&i.

  • phone&
  • i. - phone1;
        %end;
        %do i = &max %to 2 %by -1;
          %let j = %eval(&i - 1);
          if p&i[1] in p&j. then p&i[1] = .;
        %end;
      %mend  missout2;
      data chang2;
        set one;
        %missout2(3)
      run;
     
      /* check */
      proc compare base=chang1 compare=chang2;
      run;
      /* on lst
      NOTE: No unequal values were found. All values compared are exactly equal.
      */

    art297
    Opal | Level 21

    Chang,

    I still have to update my original post with what I think is correct code, but thought I should first respond to your post.  This thread is definitely getting interesting!

    I went a little further than you did.  The OP has 40 million records, thus I'm in the middle of running tests on an expansion of a data set that Mike had created offline.

    The latest version of my code worked, worked correctly (I think), and only took 2 and 1/2 minutes to run.  I'll run the other tests in the morning.

    Here is the code I tested:

    /*create a test data set*/

    data have (keep=person phone:);

      array phone(15);

      do person=1 to 40e6;

        do j=1 to 15;

          phone(j) = ceil(10*ranuni(999));

        end;

        output;

      end;

    run;

    /*Art's method*/

    data want (drop=i j);

      set have;

      array numbers(*) phone15-phone1;

      do until (lastmiss eq currentmiss);

        lastmiss=nmiss(of numbers(*));

        call sortn (of numbers(*));

        do i=lastmiss+1 to dim(numbers)-1;

          skip=0;

          do j=i+1 to dim(numbers);

            if numbers(i) eq numbers(j) then do;

              call missing(numbers(j));

              skip+1;

            end;

         else leave;

          end;

          i+skip;

        end;

        currentmiss=nmiss(of numbers(*));

      end;

    run;

    /*Chang's method*/

      ods _all_ close;

      ods listing;

      options nocenter cmplib=work.func mprint fullstimer;

      proc fcmp outlib=work.func.rsort;

        subroutine swapn(x, y); /* swaps two numeric vars */

          outargs x, y;

          k = x;

          x = y;

          y = k;

        endsub;

        subroutine revArray(x

  • ); /* reverse a numeric array or */
  •       outargs x;               /* swaps x[1] and x[dim(x)]   */

          d = dim(x);              /*     x[2] and x[dim(x)-1],  */

          do i = 1 to d/2 by 1;    /*     x[3] and x[dim(x)-2],  */

            m = x;              /* and so on                  */

            n = x[d-i+1];

            call swapn(m, n);

            x = m;

            x[d-i+1] = n;

          end;

        endsub;

      quit;

      /* a test data each person can have &max phone numbers */

      %let n = %sysevalf(40e6);  %*-- number of people                        --*;

      %let max = 15;             %*-- max number of phones                    --*;

      %let p = 0.06;            %*-- prob(phone) except for the 1st          --*;

      %let dup = 0.05;          %*-- probability of a phone to be duplicated --*;

      %let seed = 1234567;      %*-- random seed                             --*;

      /* dedup phone numbers within a person -- straight-forward approach */

      %macro missout(max);

        %local i j;

        %do i = &max. %to 2 %by -1;

          %*;if

          %do j = 1 %to %eval(&i - 1);

            %if &j > 1 %then %*; |;

            %*; phone&i = phone&j

          %end;

          %*; then phone&i = .;

        %end;

      %mend  missout;

      data chang;

        set have;

        %missout(&max.)

      run;

    art297
    Opal | Level 21

    Chang (and anyone else following this thread),

    I haven't tested your latest code, but was finally able to run your original code and was extremely impressed.  The reason I was unable to run it last night was that attempting to test everyone's version on a 40 million record test file, with 15 numbers per record, simply required more resources than I had available.  Everyone's code ran if I ran them separately.

    The results: Chang's code did the entire operation in approximately 30 seconds.  Both KSharp's and mine accomplished the task in approximately 2 minutes.

    I didn't include Mike's code in the test as I already discovered that it took around six times longer to run (on a smaller test file) than either KSharp's or my code, and should reflect those same differences on a larger dataset.

    Art

    art297
    Opal | Level 21

    I thought a quick note was in order to explain why my original alternate code was wrong.  It's also a good question, I think, for the certification exam.  After all of these years, it wasn't until this afternoon that I understood what was going on with a backwards array assignment.

    For example, take the following one record datastep:

    /*create one sample test record*/

    data have (keep=person phone:);

      input person phone1-phone4;

      cards;

    1 1 . 3 4

    ;

    Easy enough, but now set it in another datastep, declare an array, and load the phone numbers in reverse order: i.e.,

    data want;

      set have;

      array numbers(*) phone4-phone1;

      output;

      call sortn (of numbers(*));

      output;

    run;

    proc print data=want;

    run;

    The datastep produces two records, one before, and one after a call sortn.

    Can you correctly guess what the output is going to look like?  I couldn't, thus based my original code on a faulty understanding.

    The actual results:

                         Obs    person    phone1    phone2    phone3    phone4

                          1        1         1         .         3         4

                          2        1         4         3         1         .

    Ksharp
    Super User

    Patrick,

    For your question, there is no need to use HashTable yet. Since HashTable is a associated arrays. You also can a simple array to get it . I believe my code is very fast . Have a NICE day!

    data have;
      infile datalines dsd truncover;
      input person $ phone1-phone3;
      datalines;
    A,111,222,111
    B,,444,444
    C,111,,222,
    ;
    run;
    data want(drop=i j k _:);
     set have;
     array _p{*} phone1-phone3 ;
     array _pp{*} _phone1-_phone3;
     j=0;
     do i=1 to dim(_p);
      if _p{i} not in _pp then do;j+1;_pp{j}=_p{i};end;
     end;
     do k=1 to dim(_p);
      _p{k}=_pp{k};
     end;
    run;
    

    Ksharp

    Haikuo
    Onyx | Level 15

    That is it? "if _p{i} not in _pp then do;" , never have thought that you can refer array() like that. I have actually tried something like '_pp(*)', and it did not work, so I thought this must be a dead end. Tell you want Ksharp, I'd like to clone your SAS part of brain, I will trade my wife for it.:smileysilly:

    Haikuo

    Ksharp
    Super User

    Thanks, Bian Hai Kuo.

    That is it. it is a special operator for ARRARY. You can find it in documentation .

    I also want to borrow my brain to you, since I can't use it here .

    Have a NICE day!

    Ksharp

    MikeZdeb
    Rhodochrosite | Level 12

    Hi ... yes, the use if IN with the arrays was very nice.  I had not noticed it in the documentation, but for those who are also looking for it ...

    http://support.sas.com/documentation/cdl/en/lrcon/62753/HTML/default/viewer.htm#p00iah2thp63bmn1lt20...

    I guess that WHICHN (and WHICHC) would also work in place of IN (though the IN looks more elegant and one doesn't have to think about numeric or character comparisons) ...

    do i=1 to dim(_p);

      if ^whichn(_p{i}, of _pp{*}) then do;j+1;_pp{j}=_p{i};end;

    end;

    in place of ...

    do i=1 to dim(_p);

      if _p{i} not in _pp then do;j+1;_pp{j}=_p{i};end;

    end;

    In some tests, IN and WHICHN (or WHICHC) take the same amount of time.

    Not sure if it's been mentioned here, but a couple months ago on SAS-L there was a posting of an undocumented (unsupported) IN function.  Here's an example, but it's also interesting to me how the COLON modifier works in conjunction with the array name for variable IN5 ...

    data x;

    array stuff(6) $10 _temporary_  ('ART' 'KSHARP' 'CHANG' 'BRIAN' 'TOM' 'MIKE');

    input name :$10. @@;

    in1 = in(name, of stuff(*));

    in2 = in(trim(name), of stuff(*));

    in3 = name in stuff;

    in4 = trim(name) in stuff;

    in5 = trim(name) in :stuff;

    datalines;

    SARA KATHY ART CHANG MIKE ARTHUR KSH

    ;

    name      in1    in2    in3    in4    in5

    SARA       0      0      0      0      0

    KATHY      0      0      0      0      0

    ART        1      1      1      1      1

    CHANG      1      1      1      1      1

    MIKE       1      1      1      1      1

    ARTHUR     0      0      0      0      0

    KSH        0      1      0      0      1

    To me, this appears useful in some of the data I work with in searching lists of diagnosis codes to see if a given DX is in the list or if it matches the string that

    begins any DX in the list ...

    data x;

    array d(3) $5 _temporary_ ('25012' '11199' '45250');

    input diag :$5. @@;

    in1 = diag in d;

    in2 = trim(diag) in :d;

    datalines;

    123 250 1119 25012 23456 452

    ;


    diag     in1    in2

    123       0      0

    250       0      1

    1119      0      1

    25012     1      1

    23456     0      0

    452       0      1

    Ksharp
    Super User

    Thanks . Mike.

    Good to know such interesting undocumented operator .

    Keith
    Obsidian | Level 7

    Yes, the use of IN: is very useful Mike.  One point to note that some readers may not be aware of is that this function truncates the longer of the 2 values being compared to the shorter, this becomes important with variable lengths of the values of interest.  For example, if you changed the length of diag to 6 in your code and added a digit to 25012 in the datalines statement, it would still match with 25012 in array d because it truncates diag to 5 characters in this instance.

    MikeZdeb
    Rhodochrosite | Level 12

    hi ... look at this example (I think my conclusion at the bottom is correct) ...

    data x;

    array d(3) $5 _temporary_ ('25012' '11199' '45250' '12' );

    input diag :$5. @@;

    in0 = whichc(diag, of d(*));

    in1 = diag in d;

    in2 = trim(diag) in :d;

    in3 = in(trim(diag), of d(*));

    in4 = trim(diag) in ('25012' '11199' '45250' '12');

    in5 = trim(diag) in : ('25012' '11199' '45250' '12');

    datalines;

    123 250 1119 25012 23456 452 199

    ;

    diag     in0    in1    in2    in3    in4    in5

    123       0      0      0      0      0      1

    250       0      0      1      1      0      1

    1119      0      0      1      1      0      1

    25012     1      1      1      1      1      1

    23456     0      0      0      0      0      0

    452       0      0      1      1      0      1

    199       0      0      0      0      0      0



    look at the value of IN5 when DIAG is '123' ...

    the colon modifier when used with the IN operator and a list of values DID truncate the comparison to the shorter value

    look at the value of IN3 when DIAG is '123' ...

    appears that  the colon modifier when used with the IN operator and an ARRAY DID NOT truncate the comparison to the shorter value

    my conclusion is that it's be safer when using the colon modifier to find if a value is in a list if you put the list of values in a temporary array and use the IN operator to search the array (seems to take care of the truncation issue)

    Keith
    Obsidian | Level 7

    Hi Mike, the reason for this is that array d has a fixed length of 5, so the value '12' stored in it is padded out with 3 trailing blanks.

    MikeZdeb
    Rhodochrosite | Level 12

    hi ... thanks (a "duh" moment) ... at least that makes colon behavior consistent (ah)

    Linlin
    Lapis Lazuli | Level 10

    Hi Mike,

    What is "duh" moment?

    sas-innovate-2024.png

    Don't miss out on SAS Innovate - Register now for the FREE Livestream!

    Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

     

    Register now!

    How to Concatenate Values

    Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

    Find more tutorials on the SAS Users YouTube channel.

    Click image to register for webinarClick image to register for webinar

    Classroom Training Available!

    Select SAS Training centers are offering in-person courses. View upcoming courses for:

    View all other training opportunities.

    Discussion stats
    • 34 replies
    • 2871 views
    • 6 likes
    • 10 in conversation