Re: Filling values of ID from most recent ID by values of another vari...

Walternate · Posted 02-12-2016 08:58 AM

Hi all,

I have a dataset at the person/month/categorical group variable level, like this:

ID First name Last name Month Categorical var

1 abc def 200901 a

1 abc def 200902 a

2 abc def 200903 a

3 abc def 200901 b

4 abc def 200902 b

I'm trying to clean the ID variable by setting all values of ID within levels of the categorical variable, as the value of ID as of the most recent month. So basically, as long as a person has the same name, I assume they're the same person and overwrite older values of ID (within categorical var=a) with the newest value, like this:

ID First name Last name Month Categorical var

2 abc def 200901 a

2 abc def 200902 a

2 abc def 200903 a

4 abc def 200901 b

4 abc def 200902 b

So basically, I want to retain all the same data, but with ID now consistent within each level of categorical var for each person.

Any help is much appreciated.

RW9 · Posted 02-12-2016 09:28 AM

Well, few ways to do it, you could sort and retain, or do it in SQL like:

data have;
  input id first_name $ last_name $ month categorical_var $;
datalines;
1 abc def 200901 a
1 abc def 200902 a
2 abc def 200903 a
3 abc def 200901 b
4 abc def 200902 b
;
run;

proc sql;
  create table WANT as
  select  B.ID,
          A.FIRST_NAME,
          A.LAST_NAME,
          A.MONTH,
          A.CATEGORICAL_VAR 
  from    HAVE A
  left join (select CATEGORICAL_VAR,max(ID) as ID from HAVE group by CATEGORICAL_VAR) B
  on      A.CATEGORICAL_VAR=B.CATEGORICAL_VAR;
quit;

Walternate · Posted 02-12-2016 09:41 AM

I realize that in the example the IDs got higher the later the month was, but that is not the case in my source data, so I don't think selecting max ID would get me what I want.

RW9 · Posted 02-12-2016 09:48 AM

Then change the second group per:

  left join (select distinct CATEGORICAL_VAR,ID as ID from HAVE group by CATEGORICAL_VAR having max(MONTH)) B

However that will cause duplicates if there are more than one cat/id in the last month. If you can post a datastep with some test data, exactly matching your requirements the code can be adapted, can only go on what I see.

Ksharp · Posted 02-14-2016 10:26 PM

Assuming the data has been sorted by month like your sample data did.

data have;
  input id first_name $ last_name $ month categorical_var $;
datalines;
1 abc def 200901 a
1 abc def 200902 a
2 abc def 200903 a
3 abc def 200901 b
4 abc def 200902 b
;
run;
data want;
do until(last.categorical_var);
 set have;
 by first_name  last_name  categorical_var ;
end;
_id=id;
do until(last.categorical_var);
 set have;
 by first_name  last_name  categorical_var ;
 id=_id;output;
end;
drop _id;
run;

Filling values of ID from most recent ID by values of another variable

Re: Filling values of ID from most recent ID by values of another variable

Re: Filling values of ID from most recent ID by values of another variable

Re: Filling values of ID from most recent ID by values of another variable

Re: Filling values of ID from most recent ID by values of another variable

Filling values of ID from most recent ID by values of another variable

Re: Filling values of ID from most recent ID by values of another variable

Re: Filling values of ID from most recent ID by values of another variable

Re: Filling values of ID from most recent ID by values of another variable

Re: Filling values of ID from most recent ID by values of another variable

Click image to register for webinar

Classroom Training Available!