topic Re: Compile records based on rules in SAS Programming

Compile records based on rules

lichee — Sat, 29 Oct 2022 15:36:30 GMT

Hi all,

I wanted to compile records into the target data set "want" based on some simple rules for the data I have "have" as below. The simple rules include

(1) When multiple records per person have the same state info and the complementary Type_CDs, compile them into one record, for example, person1, person2, person 3, and person4.

(2) When multiple records per person have the different state info and their Type_CDs are different, the records cannot be compiled, for example, person5 and person6. However, if one record has no Type_CDs available at all, then the record with any available Type_CDs should be picked, and Person8 is this case.

(3) Any other Type_CD is subset of Type_CD 1. Because of this, any record with Type_CD=1 can overwrite other records per person if the records have the same state info. for example, Person7.

data have;
infile datalines truncover dsd;
input DummyID $ State $ Type_CD1-Type_CD6
;
datalines;
Person1,NJ,,,,15,1,1
Person1,NJ,15,15,15,,,
Person2,NY,7,7,7,7,7,
Person2,NY,,,,,,7
Person3,NY,,,,,,7
Person3,NY,,,,7,7,
Person4,NY,,,,7,7,7
Person4,NY,7,7,7,,,
Person5,NY,7,7,7,7,7,7
Person5,CA,,,,,,4
Person6,NY,7,7,7,7,7,
Person6,NJ,,,,,,15
Person7,AR,1,1,1,1,1,1
Person7,AR,,,7,7,7,7
Person8,NY,,,,1,1,1
Person8,NJ,,,,,,
;
run;

data want;
infile datalines truncover dsd;
input DummyID $ State $ Type_CD1-Type_CD6
;
datalines;
Person1,NJ,15,15,15,15,1,1
Person2,NY,7,7,7,7,7,7
Person3,NY,,,,7,7,7
Person4,NY,7,7,7,7,7,7
Person7,AR,1,1,1,1,1,1
Person8,NY,,,,1,1,1
;
run;

Any suggestions are greatly appreciated.

Thank you!

Re: Compile records based on rules

Tom — Sat, 29 Oct 2022 16:29:59 GMT

So first get rid of the observations with all missing codes. That is easier with a data step.

data not_missing;
  set have;
  if n(of Type_CD:);
run;

Then eliminate those with conflicting STATE codes. That is easier in PROC SQL.

proc sql;
create table no_dups as
select * from not_missing
group by dummyid
having count(distinct state)=1
;
quit;

Then it looks like you just want to do an UPDATE to collapse the non-missing codes.

data want;
  update no_dups(obs=0) no_dups;
  by dummyid ;
run;

Note that the last non-missing value will be retained by this method. So if the order matters you should have an ordering variable to make sure the de-duped data records are processed in the right order. Perhaps there is DATE or other variable you can use to order?

Implementing the any value of 1 overrides the other values rule will be harder. You could make a set of observations that just have the DUMMYID and a 1 for the particular TYPE code variable and then use that to update the previous results.

data any_one;
  set no_dups(rename=(TYPE_CD1-TYPE_CD6=x1-x6));
  array type_cd [6];
  array x[6];
  do index=1 to 6;
    if x[index]=1 then do;
      type_cd[index]=1;
      output;
      type_cd[index]=.;
    end;
  end;
  keep dummyid type_cd: ;
run;

data want;
  update want any_one;
  by dummyid;
run;

Re: Compile records based on rules

Tom — Sat, 29 Oct 2022 17:03:06 GMT

Here is a method using PROC SUMMARY that works for your example. But I worry it might not do what you want with more complex data. In particular it might not do the duplicate state issue right if there are more than two observations per ID. Also if the any of the type codes are zero then using the minimum to test if the code is 1 will not work. And finally (again) it is not clear what criteria you want to use to pick between conflicting types codes other than 1. This method takes the MAX of the other codes.

proc summary nway 
  data=have(where=(n(type_cd1,type_cd2,type_cd3,type_cd4,type_cd5,type_cd6))
;
  class dummyid;
  var type: ;
  output out=summary idgroup(out[2] (state)=) min=min1-min6 max=max1-max6 ;
run;

data want;
  set summary;

* Eliminate duplicate states ;
  if state_1=state_2 or state_2=' ';

* Copy the state ;
  state=state_1;

* Select the calcualted TYPE code ;
* Select 1 if present otherwise use the maximum ;
  array low min1-min6;
  array high max1-max6;
  array type Type_CD1-Type_CD6;
  do index=1 to 6;
    if low[index]=1 then type[index]=1;
    else type[index]=high[index];
  end;
  keep dummyid state type:;
run;