BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Here's an easy way to handle the situation:

 

data want;
   update have (obs=0) have;
   by id;
run;

View solution in original post

11 REPLIES 11
emaneman
Pyrite | Level 9
Dear community,
I am having an issue with a dataset in which I have duplicate observations, which are, however, not identical. The identifier shows that it is the same observations, but on some columns/variables one line has data and on some other columns/variables it is another line that has the data. Some duplicate observations have all the values, and the duplicate has none.
In the datasets below the top part shows the initial file and the bottom part shows the result that i want to obtain.
As you can see, observations with ID A and D are duplicate in the initial file, and are "merged" in the outcome file.
Any suggestion on how to do that in SAS?
Thanks in advance.
Eman
ID var1 var2 var3 var4 var5 var6
A 1 1 1
A 2 2 2
B 1 3 2 2 3 3
C 1 2 3 3 3 3
D 2 2 2
D 2 2 2 3 3 3


ID var1 var2 var3 var4 var5 var6
A 1 1 1 2 2 2
B 1 3 2 2 3 3
C 1 2 3 3 3 3
D 2 2 2 3 3 3


emaneman
Pyrite | Level 9

I apologize. When pasting the data the format was lost. 

See attached excel file for the "have" and "wanted" files. 

PaigeMiller
Diamond | Level 26

Many of us (including me) will not download Microsoft office files as they can be a security threat. The preferred way to make your data available is as SAS data step code (instructions).

--
Paige Miller
emaneman
Pyrite | Level 9

absolutely right. The easiest is to put a picture, here it is.

Screenshot 2021-11-29 at 01.03.06.png

PaigeMiller
Diamond | Level 26

I'm not going to try to program from data in a screen capture. Please provide the data as a SAS data step, which you can create yourself, or you can use the instructions I gave.

--
Paige Miller
emaneman
Pyrite | Level 9

 

Got it. I include SAS syntax

 

data have;
   infile datalines delimiter='	'; 

input ID$ v1-v6;
datalines;
A	1	1	1	.	.	.
A	.	.	.	2	2	2
B	1	3	2	2	3	3
C	1	2	3	3	3	3
D	2	2	2	.	.	.
D	2	2	2	3	3	3
;

proc print;


data want;
   infile datalines delimiter='	'; 

input ID$ v1-v6;
datalines;
A	1	1	1	2	2	2
B	1	3	2	2	3	3
C	1	2	3	3	3	3
D	2	2	2	3	3	3
;

proc print;

 

 

Astounding
PROC Star

Here's an easy way to handle the situation:

 

data want;
   update have (obs=0) have;
   by id;
run;
emaneman
Pyrite | Level 9

Thank you, that was simple and works perfectly.

There is a further complication, which I will probably handle by hand, but which I outline below in case any of you has some idea.

For certain duplicate observations, one or more columns both have values in it, but they are not the same values. 

I include a SAS syntax with the same have dataset than before, with one more variable, V7, which for the duplicate obs A has the same value (x) in both lines while for duplicate obs D it has different values (x, y)on the two lines.

SAS seems to choose one of the values, probably the value of the observation that comes first, I would guess, and of course it has no way of choosing which one is correct. What would be useful would be to get an output in which all the duplicate obs (that is, obs that have the same ID value) that have at least one column with different values, flagged.

Eman

 

 

data have;
   infile datalines delimiter='	'; 

input ID$ v1-v7;
datalines;
A	1	1	1	.	.	.	x
A	.	.	.	2	2	2	x
B	1	3	2	2	3	3	y
C	1	2	3	3	3	3	x
D	2	2	2	.	.	.	x
D	2	2	2	3	3	3	y
;

 

eman

Astounding
PROC Star

Just for the record, with multiple observations, UPDATE takes the last non-missing value for each variable.

 

To get the sort of "different nonmissing" cases that you describe, I would switch to PROC SUMMARY.  It can calculate the minimum value and the maximum value for each variable, and you can program with that to find cases where minimum is not equal to maximum.  It won't give you the specific observations, but it will identify both the IDs and the variables.  This would be a good starting point for what you are asking, because of compications when there are multiple observations (not just two) for the same ID.

 

This would be a good starting point:

 

proc summary data=have;
   by id;
   var var1-var5;
   output out=minimums (drop=_type_ _freq_) min=;
   output out=maximums (drop=_type_ _freq_) max=;
run;

data want;
   set minimums maximums;
   by id;
   array vars {5} var1-var5;
   do _n_= 1 to 5;
      change = dif(var{_n_});
      if change ne 0 then do;
         varname = vname{var{_n_});
         output;
      end;
   end;
   keep id varname change;
run;

It's untested code, but looks about right.  See how it fits when you want.

emaneman
Pyrite | Level 9
thank you, I will check try this out.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1906 views
  • 1 like
  • 4 in conversation