Re: Hierarchy and Collapsing

A_Halps · Posted 08-22-2023 12:56 PM

I want to collapse the data within each state to get 1 Record Per ID-State. on top of that, there is a hierarchy for who is the most eligible in that BEN_CD per state. The hierarchy of who should be selected: 1>2>3>4>5>6>7>0.

So, I am combining all of them based on the BEN_CD_#.

So for BEN_CD_01: NJ has 1,0,0 so then final result will be 1.

For BEN_CD_03: NJ has 1,0,2 so then final result will be 1 because according to my hierarchy, 1 >2>0

For BEN_CD_04: NJ has 3,1,1 so then final result will be 1 because according to my hierarchy, 1 >3

For BEN_CD_07: NJ has 5,6,0 so then final result will be 5 because according to my hierarchy 5>6>0

The code starts like this:

data demogs;
input id state_cd $ ben_cd_01-ben_cd_12;
datalines;
111 NJ 1 1 1 3 1 0 5 0 0 0 0 0
111 NJ 0 0 0 1 0 1 6 0 0 0 0 0
111 NJ 0 0 2 1 0 1 0 0 0 0 0 0
111 FL 0 0 0 0 0 1 1 1 1 1 1 1
;
run;

Here is some code I have tried to use thus far but am not getting what I want at all:

data DEMOGS_COLLAPSED;
    set DEMOGS;
	    by state_cd;

    array BENS(12) BENS_CD_01-BENS_CD_12;
	    do i = 01 to 12;
		if BENS[i] = 0 then BENS[i] = 99;
	    end;
drop i;
run;

data DEMOGS_COLLAPSED_2;
    set DEMOGS_COLLAPSED;
	    by state_cd;

	array BENS_rst(12) BENS_CD_01-BENS_CD_12;

    do i = 01 to 12;
		if BENS_rst[i] ne . then BENS_rst[i]=min(BENS_rst[i]);
    end;

drop i;
run;

and I want it to end like this:

data final;
input id state_cd $ ben_cd_01-ben_cd_12;
datalines;
111 NJ 1 1 1 1 1 1 5 0 0 0 0 0
111 FL 0 0 0 0 0 1 1 1 1 1 1 1
;
run;

Thanks in advance!

Quentin · Posted 08-22-2023 01:50 PM

Your rules aren't clear. For NJ there are 3 records that have BEN_CD=1. Why did you choose the first record? Is it because it has more 1's than the other records?

Also, please show the code you have tried. It will help people help you.

A_Halps · Posted 08-22-2023 02:08 PM

Changed the original post to answer your questions

Quentin · Posted 08-22-2023 03:20 PM

I see. Are you familiar with PROC SQL? There is almost an easy SQL approach, because you almost want the minimum value. "Almost" because of the pesky problem that 0 lowest in your hierarchy.

You could almost just do:

proc sql ;
  create table want as
  select id,state_cd
        ,min(ben_cd_01) as ben_cd_01
        ,min(ben_cd_02) as ben_cd_02
        ,min(ben_cd_03) as ben_cd_03
        ,min(ben_cd_04) as ben_cd_04
  from demogs
  group by id,state_cd
   ;
quit ;

Then I went down the rathole of trying to deal with the zero values, and decided to treat them as 9s and came up with the horrendous:

proc sql ;
  create table want as
  select id,state_cd
        ,ifn(min(ifn(ben_cd_01=0,9,ben_cd_01))=9,0,min(ifn(ben_cd_01=0,9,ben_cd_01))) as ben_cd_01
        ,ifn(min(ifn(ben_cd_02=0,9,ben_cd_02))=9,0,min(ifn(ben_cd_02=0,9,ben_cd_02))) as ben_cd_02
        ,ifn(min(ifn(ben_cd_03=0,9,ben_cd_03))=9,0,min(ifn(ben_cd_03=0,9,ben_cd_03))) as ben_cd_03
        ,ifn(min(ifn(ben_cd_04=0,9,ben_cd_04))=9,0,min(ifn(ben_cd_04=0,9,ben_cd_04))) as ben_cd_04
  from demogs
  group by id,state_cd
   ;
quit ;

Or similar wallpaper code:

proc sql ;
  create table want as
  select id,state_cd
        ,ifn(ben_cd_01=9,0,ben_cd_01) as ben_cd_01
        ,ifn(ben_cd_02=9,0,ben_cd_02) as ben_cd_02
        ,ifn(ben_cd_03=9,0,ben_cd_03) as ben_cd_03
        ,ifn(ben_cd_04=9,0,ben_cd_04) as ben_cd_04
        ,ifn(ben_cd_05=9,0,ben_cd_05) as ben_cd_05
        ,ifn(ben_cd_06=9,0,ben_cd_06) as ben_cd_06
        ,ifn(ben_cd_07=9,0,ben_cd_07) as ben_cd_07
        ,ifn(ben_cd_08=9,0,ben_cd_08) as ben_cd_08
        ,ifn(ben_cd_09=9,0,ben_cd_09) as ben_cd_09
        ,ifn(ben_cd_10=9,0,ben_cd_10) as ben_cd_10
        ,ifn(ben_cd_11=9,0,ben_cd_11) as ben_cd_11
        ,ifn(ben_cd_12=9,0,ben_cd_12) as ben_cd_12
  from (
        select id,state_cd
        ,min(ifn(ben_cd_01=0,9,ben_cd_01)) as ben_cd_01
        ,min(ifn(ben_cd_02=0,9,ben_cd_02)) as ben_cd_02
        ,min(ifn(ben_cd_03=0,9,ben_cd_03)) as ben_cd_03
        ,min(ifn(ben_cd_04=0,9,ben_cd_04)) as ben_cd_04
        ,min(ifn(ben_cd_05=0,9,ben_cd_05)) as ben_cd_05
        ,min(ifn(ben_cd_06=0,9,ben_cd_06)) as ben_cd_06
        ,min(ifn(ben_cd_07=0,9,ben_cd_07)) as ben_cd_07
        ,min(ifn(ben_cd_08=0,9,ben_cd_08)) as ben_cd_08
        ,min(ifn(ben_cd_09=0,9,ben_cd_09)) as ben_cd_09
        ,min(ifn(ben_cd_10=0,9,ben_cd_10)) as ben_cd_10
        ,min(ifn(ben_cd_11=0,9,ben_cd_11)) as ben_cd_11
        ,min(ifn(ben_cd_12=0,9,ben_cd_12)) as ben_cd_12
        from demogs
        group by id,state_cd
        ) as a
   ;
quit ;

I think instead of that mess I would probably use a DATA step with an array to recode all of the zeros to nines, then run the simple PROC SQL step (or PROC MEANS) to get the minimum values, then another step to recode the nines back to zeros.

My other thought is that generally I don't like this wide format. So if I were working with this, I'd be likely to transpose it into a vertical dataset that had columns ID State Ben_CD_ID Ben_CD. So each record in your dataset would become 12 records in the vertical dataset. That would allow you to use the simple PROC SQL or PROC MEANS approach to grab the minimum value (after dealing with the zero problem). If you'd like to see a transpose approach, drop a note, and I or someone else will post one.

Kurt_Bremser · Posted 08-22-2023 04:50 PM

When you subtract 1 and use MOD by 7, the 0 turns into the highest value.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ballardw · Posted 08-22-2023 08:25 PM

Since your not quite obvious rule is to search across observations then perhaps it means the data should be reshaped so that the rows become the variables and the Ben variables become rows. Then searching is a bit easier.

The code below assumes the values are at least grouped by ID and State. if not, sort by Id and state before the first step.

proc transpose data=demogs out=trans
   ;
by id state_cd notsorted;
var ben: ;
run;

data rankit;
   set trans;
   do i= 1 to 7,0;
      if whichn(i,of col:)>0 then do;
         rank=i;
         leave;
      end;
   end;
run;

proc transpose data=rankit out=want (drop=_name_);
   by id state_cd notsorted;
   var rank;
   id _name_;
run;

The first transpose makes the row/column reshape.

The data step uses the order of the DO loop variable to search in a specific order. If you had really stupid hierarchy rules like 2>7>1>5>4>6 ... then listing them in order on the Do statement would work.

The WHICH function will search for the first value in the parameter list in a list of other values which here uses the variable list Col: as the default variables created by Proc transpose will be Col1, col2 ... The Whichn functions returns the position of the first match or 0. So the test for if the value of I was found. Only when found set a variable I called Rank to match your hierarchy level. The LEAVE instruction terminates the DO loop when executed. So we Leave the Loop when the first match is found.

Then use Proc Transpose to reshape back to your desired layout.

FreelanceReinh · Posted 08-23-2023 06:14 AM

Hello @A_Halps,

I agree with the others that a dataset in a "long" format with only one variable (B) containing the BEN_CD_n values would make the task easier. PROC SQL could then select an expression such as

mod(min(mod(B+7, 8))+1, 8)

(similar to what @Kurt_Bremser suggested) from each "id state_cd n" BY group (in the sense of a GROUP BY clause). This expression would combine the two transformations needed for the special case B=0.

To avoid the two "transpose" steps creating (and reversing) the long format temporarily, you could fix your existing array approach and get the result in a single DATA step:

%macro f(a);
ifn(&a[i] in (0,9), 9-&a[i], &a[i])
%mend f;

data want(drop=i m:);
do until(last.state_cd);
  set demogs;
  by id state_cd notsorted;
  array b[12] b:;
  array m[12];
  do i=1 to dim(b);
    m[i]=min(m[i], %f(b));
  end;
end;
do i=1 to dim(b);
  b[i]=%f(m);
end;
run;

The DOW loop (do until(last.state_cd) ...) facilitates the collapsing of observations. The little macro F is just an abbreviation of the function used in two places to implement the special rule for zeros in array B. Array M (storing the "minimum" values) could also be defined as a _temporary_ array for possibly better performance. This would require a

call missing(of m[*]);

to be inserted, e.g., before the RUN statement (to override the automatic RETAIN of temporary arrays) and the "m:" in the DROP= dataset option to be removed.

mkeintz · Posted 08-23-2023 09:11 AM

If your data are already grouped by ID/STATE_CD:


data want (drop=_: );
  set demogs;
  by id state_cd notsorted;
  retain _pref_ben1-_pref_ben12;
  array pref_ben _pref_ben: ;
  array ben_cd   ben_cd_01-ben_cd_12 ;
  if first.state_cd then do over ben_cd;
    pref_ben=ben_cd;
  end;
  else do over ben_cd;
    if mod(ben_cd+7,8)<mod(pref_ben+7,8) then pref_ben=ben_cd;
  end;

  if last.state_cd;
  do over ben_cd;
    ben_cd=pref_ben;
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Registration is open

SAS Training: Just a Click Away