- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I am looking at subsets of data from a larger dataset to see which variables are contained in which event. The original data is structured similar to the following:
ID EVENT VAR1 VAR2 VAR3 VAR4 VAR5 ....
1 1 1 2 . . .
1 2 . . 4 0 .
1 3 . . 1 . 7
2 1 0 1 . . .
2 2 . . 3 1 .
3 1 1 1 . . .
3 2 . . 2 1 .
3 3 . . 0 . 6
etc.
After I subset by event using the WEHERE statement I plan to have something like:
Event 1
ID VAR1 VAR2 VAR3 VAR4 VAR5 ....
1 1 2 . . .
2 0 1 . . .
3 1 1 . . .
Event 2
ID VAR1 VAR2 VAR3 VAR4 VAR5 ....
1 . . 4 0 .
2 . . 3 1 .
3 . . 2 1 .
Event 3
ID VAR1 VAR2 VAR3 VAR4 VAR5 ....
1 . . 1 . 7
3 . . 0 . 6
I would then like to (1) delete any column within each subset that does not contain any information, then (2) merge the datasets to where I only have one row per ID, while (3) maintaining the information of the left most dataset if there is information for a variable in both datasets. So merging Event 3 to Event 2, I would want to keep the VAR3 observations from Event 2 and not Event 3. Is there a way to do this? (Note: actual dataset has about 100 variables) I suppose would it be better to right join Event 2 to Event 3 or left join Event 3 to Event 2?
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
No, your understanding is not correct.
Missing values in the second data set overwrite non-missing values in the first data set. When you are unsure, you should actually test your program and see what happens. You could certainly observe this happening.
You have some work to do. MERGE is an important topic, one worth digging up documentation and papers, and studying. There's no substitute.for doing that work. I have to assume that since your understanding of MERGE is shaky, you have probably never heard of UPDATE. That one is less common, but is the right tool for the job here. Since you don't have EVENT in your data, I would just create it. Any ordered values would do (they don't need to begin with 1 for each ID):
data events;
set have;
event + 1;
run;
Then sort your data, and apply the program I gave you:
proc sort data=events;
by id descending event;
run;
data want (drop=event);
update events (obs=0) events;
by id;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If I understand your objective correctly, you can get there with a minimum of fuss. Try:
proc sort data=have;
by id descending event;
run;
data want (drop=event);
update have (obs=0) have;
by id;
run;
This should work regardless of how many variables you have.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Ah. My events actually aren't numbered. Sorry that was misleading.
I ended up subsetting, and then using %dropmiss to get rid of empty columns. I then used
DATA data = WORK.want;
merge WORK.have_event3 WORK.have_event2;
by ID;
RUN;
To my understanding with how merge works, if Event 3 and Event 2 share the same name for a variable, then non-missing Event 2 observations should overwrite the data in Event 3. Please let me know if this understanding is incorrect.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
No, your understanding is not correct.
Missing values in the second data set overwrite non-missing values in the first data set. When you are unsure, you should actually test your program and see what happens. You could certainly observe this happening.
You have some work to do. MERGE is an important topic, one worth digging up documentation and papers, and studying. There's no substitute.for doing that work. I have to assume that since your understanding of MERGE is shaky, you have probably never heard of UPDATE. That one is less common, but is the right tool for the job here. Since you don't have EVENT in your data, I would just create it. Any ordered values would do (they don't need to begin with 1 for each ID):
data events;
set have;
event + 1;
run;
Then sort your data, and apply the program I gave you:
proc sort data=events;
by id descending event;
run;
data want (drop=event);
update events (obs=0) events;
by id;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Okay. Thank you for clarifying that for me. I have an events variable but they are given character names. Would you recommend I use IF THEN statement to assign numerical value to them so they arrange in the order I want?
Additionally, so I understand the code correctly when asked about it, you are first sorting ID and then descending order of event (5, 4, 3, 2, 1). When you use UPDATE and (obs = 1), you are collapsing the IDs down to one where the bottom event replaces the prior event given that the bottom event has any values but skips if missing?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Yes, assigning numeric values that match EVENT would be a good idea, necessary for sorting in the proper order.
And yes again, UPDATE ignores missing values, but replaces the current value with any nonmissing values it encounters. That's what makes the order essential.
Whatever appears in the BY statement, you will automatically get one observation for each (in this case one observation for each ID).
The obs=1 is a tricky concoction. UPDATE requires two data sets. So using obs=1 creates a second version of the data to use (so that UPDATE won't complain about having just one data set). There is only one observation brought in from that obs=1 data set, and it is immediately overwritten by an observation from the complete HAVE data set. It looks like you are well on your way to understanding the UPDATE process, so don't get bogged down in the tricky aspects at this point.
In all likelihood, this program eliminates the need to process intermediate data sets by dropping variables that are always missing. If it's desirable, you could apply that logic to the final data set.