Re: Compile records based on rules

lichee · Posted 10-28-2022 10:56 PM

Hi all,

I wanted to compile records into the target data set "want" based on some simple rules for the data I have "have" as below. The simple rules include

(1) When multiple records per person have the same state info and the complementary Type_CDs, compile them into one record, for example, person1, person2, person 3, and person4.

(2) When multiple records per person have the different state info and their Type_CDs are different, the records cannot be compiled, for example, person5 and person6. However, if one record has no Type_CDs available at all, then the record with any available Type_CDs should be picked, and Person8 is this case.

(3) Any other Type_CD is subset of Type_CD 1. Because of this, any record with Type_CD=1 can overwrite other records per person if the records have the same state info. for example, Person7.

data have;
infile datalines truncover dsd;
input DummyID $ State $ Type_CD1-Type_CD6
;
datalines;
Person1,NJ,,,,15,1,1
Person1,NJ,15,15,15,,,
Person2,NY,7,7,7,7,7,
Person2,NY,,,,,,7
Person3,NY,,,,,,7
Person3,NY,,,,7,7,
Person4,NY,,,,7,7,7
Person4,NY,7,7,7,,,
Person5,NY,7,7,7,7,7,7
Person5,CA,,,,,,4
Person6,NY,7,7,7,7,7,
Person6,NJ,,,,,,15
Person7,AR,1,1,1,1,1,1
Person7,AR,,,7,7,7,7
Person8,NY,,,,1,1,1
Person8,NJ,,,,,,
;
run;

data want;
infile datalines truncover dsd;
input DummyID $ State $ Type_CD1-Type_CD6
;
datalines;
Person1,NJ,15,15,15,15,1,1
Person2,NY,7,7,7,7,7,7
Person3,NY,,,,7,7,7
Person4,NY,7,7,7,7,7,7
Person7,AR,1,1,1,1,1,1
Person8,NY,,,,1,1,1
;
run;

Any suggestions are greatly appreciated.

Thank you!

Tom · Posted 10-29-2022 11:53 AM

So first get rid of the observations with all missing codes. That is easier with a data step.

data not_missing;
  set have;
  if n(of Type_CD:);
run;

Then eliminate those with conflicting STATE codes. That is easier in PROC SQL.

proc sql;
create table no_dups as
select * from not_missing
group by dummyid
having count(distinct state)=1
;
quit;

Then it looks like you just want to do an UPDATE to collapse the non-missing codes.

data want;
  update no_dups(obs=0) no_dups;
  by dummyid ;
run;

Note that the last non-missing value will be retained by this method. So if the order matters you should have an ordering variable to make sure the de-duped data records are processed in the right order. Perhaps there is DATE or other variable you can use to order?

Implementing the any value of 1 overrides the other values rule will be harder. You could make a set of observations that just have the DUMMYID and a 1 for the particular TYPE code variable and then use that to update the previous results.

data any_one;
  set no_dups(rename=(TYPE_CD1-TYPE_CD6=x1-x6));
  array type_cd [6];
  array x[6];
  do index=1 to 6;
    if x[index]=1 then do;
      type_cd[index]=1;
      output;
      type_cd[index]=.;
    end;
  end;
  keep dummyid type_cd: ;
run;

data want;
  update want any_one;
  by dummyid;
run;

Tom · Posted 10-29-2022 01:01 PM

Here is a method using PROC SUMMARY that works for your example. But I worry it might not do what you want with more complex data. In particular it might not do the duplicate state issue right if there are more than two observations per ID. Also if the any of the type codes are zero then using the minimum to test if the code is 1 will not work. And finally (again) it is not clear what criteria you want to use to pick between conflicting types codes other than 1. This method takes the MAX of the other codes.

proc summary nway 
  data=have(where=(n(type_cd1,type_cd2,type_cd3,type_cd4,type_cd5,type_cd6))
;
  class dummyid;
  var type: ;
  output out=summary idgroup(out[2] (state)=) min=min1-min6 max=max1-max6 ;
run;

data want;
  set summary;

* Eliminate duplicate states ;
  if state_1=state_2 or state_2=' ';

* Copy the state ;
  state=state_1;

* Select the calcualted TYPE code ;
* Select 1 if present otherwise use the maximum ;
  array low min1-min6;
  array high max1-max6;
  array type Type_CD1-Type_CD6;
  do index=1 to 6;
    if low[index]=1 then type[index]=1;
    else type[index]=high[index];
  end;
  keep dummyid state type:;
run;

Compile records based on rules

Re: Compile records based on rules

Re: Compile records based on rules

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!