- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here's an easy way to handle the situation:
data want;
update have (obs=0) have;
by id;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I am having an issue with a dataset in which I have duplicate observations, which are, however, not identical. The identifier shows that it is the same observations, but on some columns/variables one line has data and on some other columns/variables it is another line that has the data. Some duplicate observations have all the values, and the duplicate has none.
In the datasets below the top part shows the initial file and the bottom part shows the result that i want to obtain.
As you can see, observations with ID A and D are duplicate in the initial file, and are "merged" in the outcome file.
Any suggestion on how to do that in SAS?
Thanks in advance.
Eman
ID var1 var2 var3 var4 var5 var6
A 1 1 1
A 2 2 2
B 1 3 2 2 3 3
C 1 2 3 3 3 3
D 2 2 2
D 2 2 2 3 3 3
ID var1 var2 var3 var4 var5 var6
A 1 1 1 2 2 2
B 1 3 2 2 3 3
C 1 2 3 3 3 3
D 2 2 2 3 3 3
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I apologize. When pasting the data the format was lost.
See attached excel file for the "have" and "wanted" files.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Many of us (including me) will not download Microsoft office files as they can be a security threat. The preferred way to make your data available is as SAS data step code (instructions).
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
absolutely right. The easiest is to put a picture, here it is.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'm not going to try to program from data in a screen capture. Please provide the data as a SAS data step, which you can create yourself, or you can use the instructions I gave.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Got it. I include SAS syntax
data have;
infile datalines delimiter=' ';
input ID$ v1-v6;
datalines;
A 1 1 1 . . .
A . . . 2 2 2
B 1 3 2 2 3 3
C 1 2 3 3 3 3
D 2 2 2 . . .
D 2 2 2 3 3 3
;
proc print;
data want;
infile datalines delimiter=' ';
input ID$ v1-v6;
datalines;
A 1 1 1 2 2 2
B 1 3 2 2 3 3
C 1 2 3 3 3 3
D 2 2 2 3 3 3
;
proc print;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Please see the following previous SAS post
https://communities.sas.com/t5/SAS-Programming/Fill-missing-values-with-the-previous-values/td-p/326...
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here's an easy way to handle the situation:
data want;
update have (obs=0) have;
by id;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you, that was simple and works perfectly.
There is a further complication, which I will probably handle by hand, but which I outline below in case any of you has some idea.
For certain duplicate observations, one or more columns both have values in it, but they are not the same values.
I include a SAS syntax with the same have dataset than before, with one more variable, V7, which for the duplicate obs A has the same value (x) in both lines while for duplicate obs D it has different values (x, y)on the two lines.
SAS seems to choose one of the values, probably the value of the observation that comes first, I would guess, and of course it has no way of choosing which one is correct. What would be useful would be to get an output in which all the duplicate obs (that is, obs that have the same ID value) that have at least one column with different values, flagged.
Eman
data have;
infile datalines delimiter=' ';
input ID$ v1-v7;
datalines;
A 1 1 1 . . . x
A . . . 2 2 2 x
B 1 3 2 2 3 3 y
C 1 2 3 3 3 3 x
D 2 2 2 . . . x
D 2 2 2 3 3 3 y
;
eman
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Just for the record, with multiple observations, UPDATE takes the last non-missing value for each variable.
To get the sort of "different nonmissing" cases that you describe, I would switch to PROC SUMMARY. It can calculate the minimum value and the maximum value for each variable, and you can program with that to find cases where minimum is not equal to maximum. It won't give you the specific observations, but it will identify both the IDs and the variables. This would be a good starting point for what you are asking, because of compications when there are multiple observations (not just two) for the same ID.
This would be a good starting point:
proc summary data=have;
by id;
var var1-var5;
output out=minimums (drop=_type_ _freq_) min=;
output out=maximums (drop=_type_ _freq_) max=;
run;
data want;
set minimums maximums;
by id;
array vars {5} var1-var5;
do _n_= 1 to 5;
change = dif(var{_n_});
if change ne 0 then do;
varname = vname{var{_n_});
output;
end;
end;
keep id varname change;
run;
It's untested code, but looks about right. See how it fits when you want.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content