Managing Duplicate Data

Posts: 21

Managing Duplicate Data


I have several thousands of patients data and unfortunately, there are a lots of repeat observations except for 1 column which might have different value. So, for example, if a patient got 2 medications, then medication 1 is entered in row 1 and medication 2 in row 2. All other variables are similar between row 1 and 2. I need both medication 1 and 2 but I want them in the same row so that it is easy to analyze these data. And I have several thousands of patient data like this trying to figure out how to condense this into 1 row for each patient. Some patients might have had 6 or 7 medications and hence, 6 or 7 rows.

Is there a way to do this in SAS?

Please note: I tried the below code:

if medication =1 then medication1=1;

if medication=2 then medication2=1;

if medication=3 then medication3=1;

BUT, it only creates a unique value for each medication1 or medication2 for that particular row and obviously in row 1, for a patient who had medication 1, I would have medication1=1 and rest of the medications as 0. I would only have medication2=1 in row2...and, when I try to remove duplicates, it doesn't help and doesn't serve the purpose that I am looking for.

Would really appreciate if any one has some experience in dealing with this...



Respected Advisor
Posts: 4,736

Re: Managing Duplicate Data

You might want to look into Proc Transpose (or the transpose wizard in EG).

Super Contributor
Super Contributor
Posts: 464

Re: Managing Duplicate Data

will this make sense?

data have;

input pat_id $ med $ var1 $ var2 $ var3 $;


100    med11    xxx    yyy    zzz

100    med12    xxx    yyy    zzz

100    med13    xxx    yyy    zzz

101    med21    aaa    bbb    ccc

101    med22    aaa    bbb    ccc

101    med23    aaa    bbb    ccc


proc print;run;

proc sort ;by pat_id var1-var3;proc print;run;

proc transpose data=have out=want(drop=_name_) prefix=med;

var med;

by pat_id var1-var3;

proc print;run;

Super User
Posts: 13,508

Re: Managing Duplicate Data

You don't mention what kind of analysis you are looking to perform. You may find that the one record per patient per row is preferred for some types of analysis. For instance, using proc means or summary with the patient id as a class variable would let you get the number of medications per patient. Using the medication as a class variable you could getn the number of patients receiving that medication. If you have demographic information or doctor then you would be able to compare on those as well.

It is also easy using a WHERE clause to only look at patient records involving specific medications.

If you have seven medication variables per patient then you are going to have to write additional code to determine which variable contains medication of interest.

Ask a Question
Discussion stats
  • 3 replies
  • 4 in conversation