Re: Largest with text fields

tomrvincent · Posted 06-30-2017 11:42 AM

It sucks that smallest and largest don't work with strings. Here's how I got around it.

With this data...multiple lines per member, claim and even claim line there are up to 4 'code's:

Member    Claim    Claim_Line    code1    code2    code3    code4
866738350   14003347700   1   95   95   B23   B23
866738350   14003347700   4   95   95   B23   B23
866738350   14003347700   3   95   95   B23   B23
866738350   14003347700   2   95   95   B23   B23
866738350   14003347700   5   95   95   B23   B23
850308450   142801958000   2   B13   B13   96   96
861027250   142802729900   1   95   95   96   45
861027250   142802729900   1   95   95   45   45
855217600   142802814700   1   95   95   45   16
855217600   142802814700   1   95   95   16   16

So, what I do is this:

First, add a row number to uniquely identify each line:

    data WORK.t2;
      set work.t1;
      rownum=_n_;
   run;

then spit out the 4 (could be less) lines per row (one per code):

    data WORK.t3(keep=rownum code);
   format code $ 10.;
      set work.t1;
      rownum=_n_;
      if not missing(code1) then do; code=code1; output; end;
      if not missing(code2) then do; code=code2; output; end;
      if not missing(code3) then do; code=code3; output; end;
      if not missing(code4) then do; code=code4; output; end;
   run;

Then sort, removing duplicate codes:

proc sort data=WORK.t3 nodup; by rownum descending code ;
run;

Then add a row number to that resulting deduped table:

data WORK.t4; set work.t3;
coderow=_n_;

Then rank that table by code descending to handle missing codes:

proc rank data=WORK.t4 out=work.t5 ties=low descending;
   by rownum;
   var coderow;
   ranks codeRank ;
run;

Then get the up-to 4 remaining, deduped codes per original line:

PROC SQL number;
   CREATE TABLE WORK.t6 AS
   SELECT
       rownum,
       max(case when coderank=1 then code end) as code1,
       max(case when coderank=2 then code end) as code2,
       max(case when coderank=3 then code end) as code3,
       max(case when coderank=4 then code end) as code4
      FROM work.t5
      group by rownum
;
QUIT;

Resulting in this table:

rownum   code1    code2    code3    code4
1   95   B23
2   95   B23
3   95   B23
4   95   B23
5   95   B23
6   96   B13
7   45   95   96
8   45   95
9   16   45   95
10   16   95

ballardw · Posted 06-30-2017 12:13 PM

You might explain what the problem you are "solving" actually is. I am not sure I understand.

Since you have a SORT step you may want to look at the options for SORTSEQ and LINGUISTIC which will correctly sort

B23

B3

C18

to

B3

B23

C18.

I also think your example data is poor as the ranges of values don't exercise Codes starting with different letters, more than one letter, any special charaters like - and every single 95 or B23 value is associated with the same 2 variables and paired

fy1 · Posted 06-30-2017 12:36 PM

*******************************************;

data claim_a;
set claim;
ord=_N_;
run;
proc sort data=claim_a out=claim_as;
by ord;
run;
proc transpose data=claim_as out=t_claim;
var code1 code2 code3 code4;
by ord;
run;
proc sort data=t_claim out=t_claim_s nodupkey;
by ord descending COL1;
run;

proc transpose data=t_claim_s out=ranked_claim (drop=_:) prefix=ranked_code;
var COL1;
by ord;
run;

***************;

ChrisNZ · Posted 06-30-2017 06:35 PM

If I understand what you are attempting to do, you could do it on one datastep: store the values in an array, sort the array with call sortc, depuplicate it by looping through the values.

data t;
  CODE1='95 '; CODE2='95 ';CODE3='B23';CODE4='B23';
  array CODE [4] $;
  call sortc(of CODE[*]);
  do I=1 to 3;
    if CODE[I]=CODE[I+1] then  CODE[I+1]='ZZ';
  end;
  call sortc(of CODE[*]);
  do I=1 to 4;
    if CODE[I]='ZZ' then CODE[I]=' ';
  end;
run;

High-Performance SAS Coding - Third Edition

Patrick · Posted 07-01-2017 02:03 AM

@tomrvincent

You could also use a SAS Hash table to both dedupe and sort your array in one go.

data have;
  input (Member Claim) (:$11.) (Claim_Line code1 code2 code3 code4) ($);
datalines;
866738350 14003347700 1 95 95 B23 B23
866738350 14003347700 4 95 95 B23 B23
866738350 14003347700 3 95 95 B23 B23
866738350 14003347700 2 95 95 B23 B23
866738350 14003347700 5 95 95 B23 B23
850308450 142801958000 2 B13 B13 96 96
861027250 142802729900 1 95 95 96 45
861027250 142802729900 1 95 95 45 45
855217600 142802814700 1 95 95 45 16
855217600 142802814700 1 95 95 16 16
;
run;

data want(drop=_:);
  set have;

  if _n_=1 then
    do;
      if 0 then _code=code1;
      dcl hash h1(ordered:'y',hashexp:2);
      h1.defineKey('_code');
      h1.defineData('_code');
      h1.defineDone();
      dcl hiter hh1('h1');
    end;

  array codes{*} code1 code2 code3 code4;
  do _i=1 to dim(codes);
    _code=codes[_i];
    h1.ref();
  end;

  call missing(of codes[*]);

  _i=1;
  _rc = hh1.first();
  do while (_rc = 0);
    codes[_i]=_code;
    _rc = hh1.next();
    _i+1;
  end;

  h1.clear();

run;

art297 · Posted 07-01-2017 05:59 PM

@tomrvincent: I was bored (for a few minutes) this afternoon, and found the problem challenging, so created the following. It solves the problem in one datastep AND accounts for linguistic collation (sort of). It uses the qsort macro that Paul Dorfman posted years ago and included in the code. If one of our hash experts can accomplish the same thing using the hash object, I'd definitely be interested in seeing how it could be done). What it does is create a new sort key that, basically, normalizes the data, sorts by that key, but identifies the largest and smallest based on the non-normalized data values:

%Macro Qsort (
  Arr =                        /* Parallel array name list */
  , By  = %QScan(&Arr,1,%Str( )) /* Key array name           */
  , Seq = A                      /* Seq=D for descending     */
  , LB  = Lbound(&By)            /* Lower bound to sort      */
  , HB  = Hbound(&By)            /* Upper bound to sort      */
  , M   = 9                      /* Tuning range: [1:15]     */
   ) ;

  %Local _ H I J L N P Q S T W ;

  %Macro Swap (I,J) ;
    %Local W ;
     Do ;
       %Do W = 1 %To &N ;
         &T&W      = &&A&W(&I) ;
         &&A&W(&I) = &&A&W(&J) ;
         &&A&W(&J) =  &T&W     ;
       %End ;
    End ;
  %Mend Swap ;

  %If %Upcase(&Seq) = %Upcase(A) %Then %Let Q = G ;
  %Else                                %Let Q = L ;

  %Do %Until (&&A&N EQ ) ;
    %Let N  = %Eval(&N + 1) ;
    %Local A&N ;
    %Let A&N = %Scan(&Arr,&N,%Str( )) ;
  %End ;
  %Let N = %Eval(&N - 1) ;

  %Let _ = %Substr(%Sysfunc(Ranuni(0)),3,
  %Eval(7 - %Length(&N) + 5*(%Substr(&Sysver,1,1) GT 6))) ;

  %Let H = H&_ ; %Let I = I&_ ; %Let J = J&_ ; %Let L = L&_ ;
  %Let P = P&_ ; %Let S = S&_ ; %Let T = T&_ ; %Let Z = Z&_ ;

  Array &Z (0:1, 0:50) _Temporary_ ;

  &L = &LB ; &H = &HB ;

  If &H - &L GT &M Then Do &S=1 By 0 While (&S) ;
    &J = (&H - &L)/3 ; &I = &L + &J ; &J = &I + &J ;
    If &By(&L) &Q.T &By(&I) Then %Swap(&L,&I) ;
    If &By(&I) &Q.T &By(&J) Then %Swap(&I,&J) ;
    If &By(&J) &Q.T &By(&H) Then %Swap(&J,&H) ;
    If &By(&L) &Q.T &By(&I) Then %Swap(&L,&I) ;
    If &By(&I) &Q.T &By(&J) Then %Swap(&I,&J) ;
    If &By(&L) &Q.T &By(&I) Then %Swap(&L,&I) ;

    %If &M LE 3 %Then %Do ;
      If &H - &L LE 3 Then Do ;
        &L = &Z(0,&S) ; &H = &Z(1,&S) ; &S +- 1 ;
        Continue ;
      End ;
    %End ;

    %Swap(&L,&I) ; &P = &By(&L) ; &I = &L ;

    Do &J=&H + 1 By 0 ;
      Do &I=&I + 1 By  + 1 Until(&By(&I) &Q.E &P) ; End ;
      Do &J=&J - 1 By  - 1 Until(&P &Q.E &By(&J)) ; End ;
      If &I GE &J Then Leave ;
      %Swap(&I,&J) ;
    End ;

    %Swap(&L,&J) ;

    If      &H - &J GE &J - &L GT &M Then Do &S = &S + 1 ;
      &Z(0,&S) = &J + 1 ; &Z(1,&S) = &H ; &H = &J - 1 ;
    End ;
    Else If &J - &L GE &H - &J GT &M Then Do &S = &S + 1 ;
      &Z(0,&S) = &L ; &Z(1,&S) = &J - 1 ; &L = &J + 1 ;
    End ;
    Else If &J - &L GT &M GE &H - &J Then &H = &J - 1 ;
    Else If &H - &J GT &M GE &J - &L Then &L = &J + 1 ;
    Else Do ;
      &L = &Z(0,&S) ; &H = &Z(1,&S) ; &S +- 1 ;
    End ;
  End ;

  %If &M = 1 %Then %Goto Exit ;

  Do &J = &LB + 1 To &HB ;
    If &By(&J - 1) &Q.T &By(&J) Then Do ;
      &P = &By(&J) ;
      %Do W = 1 %To &N ;
        %If &&A&W Ne &By %Then &T&W = &&A&W(&J)  ;  ;
      %End ;
      Do &I = &J - 1 To &LB By  - 1 ;
        If &P &Q.E &By(&I) Then Leave ;
        %Do W = 1 %To &N ;
          &&A&W(&I + 1) = &&A&W(&I) ;
        %End ;
      End ;
      &By(&I + 1) = &P ;
      %Do W = 1 %To &N ;
        %If &&A&W Ne &By %Then &&A&W(&I + 1) = &T&W ; ;
      %End ;
    End ;
  End ;

  %Exit: Drop &H &I &J &L &P &S T&_: ;

%Mend Qsort ;

data have;
  informat Member $9.;
  informat Claim $12.;
  informat code1-code4 $8.;
  infile cards truncover;
  input Member Claim Claim_Line code1-code4;
  cards;
866738350    14003347700    1  North3      North21     North12    North35
866738350    14003347700    4  95      95      B23     B23
866738350    14003347700    3  95      95      B12     B2
866738350    14003347700    2  CCCCC2  CCCCC12  CCCCC39
866738350    14003347700    5  B750    B86     B7      B12
850308450    142801958000   2  B13     B13     96      96
861027250    142802729900   1  AA85    AA9     96      45
861027250    142802729900   1  3      33      12      2
855217600    142802814700   1  95      95     45       16
855217600    142802814700   1  95      95     16       16
;

data want (drop=i firstnum);
  set have;
  array v_in(4) $ code1-code4;
  array v_sort(4) $255. _temporary_;
  do i=1 to 4;
    if not missing(v_in(i)) then do;
      firstnum=anydigit(v_in(i));
      if firstnum gt 1 then
        v_sort(i)=catt(substr(v_in(i),1,firstnum-1),
         put(input(substr(v_in(i),firstnum),8.),z8.));
      else v_sort(i)=
         put(input(substr(v_in(i),1),8.),z8.);
    end;
  end;
  %Qsort (Arr=v_sort v_in, By=v_sort, Seq=D);
  i=1;
  do while(i le dim(v_in));
    if not missing(v_in(i)) then do;
      if i eq 1 then largest=v_in(i);
      else smallest=v_in(i);
    end;
    i+1;
  end;
run;

Art, CEO, AnalystFinder.com

Largest with text fields