Hi all,
I have a dataset at the person/month/categorical group variable level, like this:
ID First name Last name Month Categorical var
1 abc def 200901 a
1 abc def 200902 a
2 abc def 200903 a
3 abc def 200901 b
4 abc def 200902 b
I'm trying to clean the ID variable by setting all values of ID within levels of the categorical variable, as the value of ID as of the most recent month. So basically, as long as a person has the same name, I assume they're the same person and overwrite older values of ID (within categorical var=a) with the newest value, like this:
ID First name Last name Month Categorical var
2 abc def 200901 a
2 abc def 200902 a
2 abc def 200903 a
4 abc def 200901 b
4 abc def 200902 b
So basically, I want to retain all the same data, but with ID now consistent within each level of categorical var for each person.
Any help is much appreciated.
Well, few ways to do it, you could sort and retain, or do it in SQL like:
data have; input id first_name $ last_name $ month categorical_var $; datalines; 1 abc def 200901 a 1 abc def 200902 a 2 abc def 200903 a 3 abc def 200901 b 4 abc def 200902 b ; run; proc sql; create table WANT as select B.ID, A.FIRST_NAME, A.LAST_NAME, A.MONTH, A.CATEGORICAL_VAR from HAVE A left join (select CATEGORICAL_VAR,max(ID) as ID from HAVE group by CATEGORICAL_VAR) B on A.CATEGORICAL_VAR=B.CATEGORICAL_VAR; quit;
I realize that in the example the IDs got higher the later the month was, but that is not the case in my source data, so I don't think selecting max ID would get me what I want.
Then change the second group per:
left join (select distinct CATEGORICAL_VAR,ID as ID from HAVE group by CATEGORICAL_VAR having max(MONTH)) B
However that will cause duplicates if there are more than one cat/id in the last month. If you can post a datastep with some test data, exactly matching your requirements the code can be adapted, can only go on what I see.
Assuming the data has been sorted by month like your sample data did.
data have;
input id first_name $ last_name $ month categorical_var $;
datalines;
1 abc def 200901 a
1 abc def 200902 a
2 abc def 200903 a
3 abc def 200901 b
4 abc def 200902 b
;
run;
data want;
do until(last.categorical_var);
set have;
by first_name last_name categorical_var ;
end;
_id=id;
do until(last.categorical_var);
set have;
by first_name last_name categorical_var ;
id=_id;output;
end;
drop _id;
run;
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.