BookmarkSubscribeRSS Feed
lizwarr
Fluorite | Level 6

Hi, I have a dataset that looks something like below:

 

subject  year   disease

1           2008   123

1           2009    1234

1           2010     A123

2           2009     2234

2           2009      3334

2           2010     A234

2           2010     A334

2           2010     A444

 

So basically, 123, 1234 and A123 are same disease but was coded differently every year.

I want to keep only the last observation of each disease per subject.

 

Like this:

 

subject  year   disease

1           2010     A123

2           2010     A234

2           2010     A334

2           2010     A444

 

 

Could anyone help me how to solve this??

5 REPLIES 5
PeterClemmensen
Tourmaline | Level 20

Here is one way

 

data have;
input subject year disease $;
datalines;
1 2008 123
1 2009 1234
1 2010 A123
2 2009 2234
2 2009 3334
2 2010 A234
2 2010 A334
2 2010 A444
;

data want (drop=y);
    do _N_ = 1 by 1 until (last.subject);
        set have;
        by subject;
        y = year;
    end;
    do _N_ = 1 to _N_;
        set have;
        if year=y then output;
    end;
run;

Result:

 

subject year  disease 
1       2010  A123 
2       2010  A234 
2       2010  A334 
2       2010  A444 
ed_sas_member
Meteorite | Level 14

Hi @lizwarr 

 

You can also use a PROC SQL to retrieve those observations:

proc sql;
	create table want as
	select *
	from have
	group by subject
	having year=max(year);
quit;

Capture d’écran 2020-03-03 à 11.04.40.png

 

JackHamilton
Lapis Lazuli | Level 10

Did you simplify the problem?  Are there going to be more entries in the input data sets, with additional diseases having multiple codes?

 

If so, I think you'll need to have a mapping table (or format) mapping all of the codes to uniform disease names.  The code will be more complicated.

 

At first glance I thought this might be an opportunity to take advantage of the rarely-used GROUPFORMAT option, but on reflection I don't think there would be any benefit.

 

lizwarr
Fluorite | Level 6

Indeed, I think I may have over-simplified the problem. It's data of hundreds of patients followed for 9 years with all kinds of chronic disease. I do have an excel file with codes of each year, but it's just that there are so many disease, I could't imagine unifying the codes manually...

Patrick
Opal | Level 21

@lizwarr wrote:

Indeed, I think I may have over-simplified the problem. It's data of hundreds of patients followed for 9 years with all kinds of chronic disease. I do have an excel file with codes of each year, but it's just that there are so many disease, I could't imagine unifying the codes manually...


In this Excel file is there any logical way to determine if two codes from two different years represent the same disease? If so then it wouldn't be too hard to create such a mapping table. We could actually simply generate a SAS Format which maps all code variations to the latest version.

Above approach is of course only possible if there is a 1:1 relationship between the code variations over the years. Things will get much harder if codes change because of changing categorisations/higher differentiation of diseases like in year 1 a decease has code A but in year two it gets either code A1 or code A2 because coding becomes more differentiated. So to what would you now map code A from year 1? Code A1 or code A2? ...not a decease but just thinking how coding for a person within the autism spectrum might have changed over the years.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 1000 views
  • 4 likes
  • 5 in conversation