Hi, I have a dataset that looks something like below:
subject year disease
1 2008 123
1 2009 1234
1 2010 A123
2 2009 2234
2 2009 3334
2 2010 A234
2 2010 A334
2 2010 A444
So basically, 123, 1234 and A123 are same disease but was coded differently every year.
I want to keep only the last observation of each disease per subject.
Like this:
subject year disease
1 2010 A123
2 2010 A234
2 2010 A334
2 2010 A444
Could anyone help me how to solve this??
Here is one way
data have;
input subject year disease $;
datalines;
1 2008 123
1 2009 1234
1 2010 A123
2 2009 2234
2 2009 3334
2 2010 A234
2 2010 A334
2 2010 A444
;
data want (drop=y);
do _N_ = 1 by 1 until (last.subject);
set have;
by subject;
y = year;
end;
do _N_ = 1 to _N_;
set have;
if year=y then output;
end;
run;
Result:
subject year disease 1 2010 A123 2 2010 A234 2 2010 A334 2 2010 A444
Hi @lizwarr
You can also use a PROC SQL to retrieve those observations:
proc sql;
create table want as
select *
from have
group by subject
having year=max(year);
quit;
Did you simplify the problem? Are there going to be more entries in the input data sets, with additional diseases having multiple codes?
If so, I think you'll need to have a mapping table (or format) mapping all of the codes to uniform disease names. The code will be more complicated.
At first glance I thought this might be an opportunity to take advantage of the rarely-used GROUPFORMAT option, but on reflection I don't think there would be any benefit.
Indeed, I think I may have over-simplified the problem. It's data of hundreds of patients followed for 9 years with all kinds of chronic disease. I do have an excel file with codes of each year, but it's just that there are so many disease, I could't imagine unifying the codes manually...
@lizwarr wrote:
Indeed, I think I may have over-simplified the problem. It's data of hundreds of patients followed for 9 years with all kinds of chronic disease. I do have an excel file with codes of each year, but it's just that there are so many disease, I could't imagine unifying the codes manually...
In this Excel file is there any logical way to determine if two codes from two different years represent the same disease? If so then it wouldn't be too hard to create such a mapping table. We could actually simply generate a SAS Format which maps all code variations to the latest version.
Above approach is of course only possible if there is a 1:1 relationship between the code variations over the years. Things will get much harder if codes change because of changing categorisations/higher differentiation of diseases like in year 1 a decease has code A but in year two it gets either code A1 or code A2 because coding becomes more differentiated. So to what would you now map code A from year 1? Code A1 or code A2? ...not a decease but just thinking how coding for a person within the autism spectrum might have changed over the years.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.