DATA Step, Macro, Functions and more

Filling values of ID from most recent ID by values of another variable

Reply
Frequent Contributor
Posts: 138

Filling values of ID from most recent ID by values of another variable

Hi all,

 

I have a dataset at the person/month/categorical group variable level, like this:

 

ID      First name        Last name       Month       Categorical var

1        abc                    def                  200901              a

1        abc                    def                  200902              a

2        abc                    def                  200903              a

3        abc                    def                  200901              b

4        abc                    def                  200902              b

 

I'm trying to clean the ID variable by setting all values of ID within levels of the categorical variable, as the value of ID as of the most recent month. So basically, as long as a person has the same name, I assume they're the same person and overwrite older values of ID (within categorical var=a) with the newest value, like this:

 

ID      First name        Last name       Month       Categorical var

2        abc                    def                  200901              a

2        abc                    def                  200902              a

2        abc                    def                  200903              a

4        abc                    def                  200901              b

4        abc                    def                  200902              b

 

So basically, I want to retain all the same data, but with ID now consistent within each level of categorical var for each person. 

 

Any help is much appreciated. 

Super User
Super User
Posts: 7,942

Re: Filling values of ID from most recent ID by values of another variable

Posted in reply to Walternate

Well, few ways to do it, you could sort and retain, or do it in SQL like:

data have;
  input id first_name $ last_name $ month categorical_var $;
datalines;
1 abc def 200901 a
1 abc def 200902 a
2 abc def 200903 a
3 abc def 200901 b
4 abc def 200902 b
;
run;

proc sql;
  create table WANT as
  select  B.ID,
          A.FIRST_NAME,
          A.LAST_NAME,
          A.MONTH,
          A.CATEGORICAL_VAR 
  from    HAVE A
  left join (select CATEGORICAL_VAR,max(ID) as ID from HAVE group by CATEGORICAL_VAR) B
  on      A.CATEGORICAL_VAR=B.CATEGORICAL_VAR;
quit;
Frequent Contributor
Posts: 138

Re: Filling values of ID from most recent ID by values of another variable

I realize that in the example the IDs got higher the later the month was, but that is not the case in my source data, so I don't think selecting max ID would get me what I want. 

Super User
Super User
Posts: 7,942

Re: Filling values of ID from most recent ID by values of another variable

Posted in reply to Walternate

Then change the second group per:

  left join (select distinct CATEGORICAL_VAR,ID as ID from HAVE group by CATEGORICAL_VAR having max(MONTH)) B

However that will cause duplicates if there are more than one cat/id in the last month.  If you can post a datastep with some test data, exactly matching your requirements the code can be adapted, can only go on what I see. 

Super User
Posts: 10,018

Re: Filling values of ID from most recent ID by values of another variable

Posted in reply to Walternate

Assuming the data has been sorted by month like your sample data did.

 

data have;
  input id first_name $ last_name $ month categorical_var $;
datalines;
1 abc def 200901 a
1 abc def 200902 a
2 abc def 200903 a
3 abc def 200901 b
4 abc def 200902 b
;
run;
data want;
do until(last.categorical_var);
 set have;
 by first_name  last_name  categorical_var ;
end;
_id=id;
do until(last.categorical_var);
 set have;
 by first_name  last_name  categorical_var ;
 id=_id;output;
end;
drop _id;
run;
 
Ask a Question
Discussion stats
  • 4 replies
  • 242 views
  • 0 likes
  • 3 in conversation