- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I have a dataset at the person-month level with an ID variable that is not clean, another ID variable, and a few other categorical variables.
Dataset1:
ID1 ID2 categ vars.....
1 123
1.1 123
1.2 234
1.2 234
In ID1, the 1 is the valid value, while 1.1 and 1.2 are invalid values. ID 2 is the way I can distinguish one person from another, but I need to retain as many valid values of ID1 as possible. I need to do two things: One is that for every person (ie, value of ID2) that has at least one valid value of ID1, I want to fill the rest of their rows with that valid value of ID1. The other is that for every person that never has a valid value of ID1, I want a new variable, ID3, to give them a value.
Dataset 2
ID1 ID2 ID3 categ vars.....
1 123
1 123
. 234 1
. 234 2
This is what I have coded so far to accomplish this:
proc sql;
create table want1 as
select *, case when ID1 like '%.% then 1
else 0
end as invalid_ID1
from have;
quit;
proc sql;
create table want2 as
select *, count(ID2) as num_rows, count(invalid_ID1) as rows_w_invalid_ID1
from want1
group by ID2
order by ID2, invalid_ID1;
quit;
data want3;
set want2;
by ID2 invalid_ID1;
retain ID3 1;
if num_rows>rows_w_invalid_ID1 then ID3=.;
else if first.ID2 and num_rows=rows_w_invalid_ID1 then do;
ID3+1;
ID1=' ';end;
retain ID1_keep;
if first.ID2 then ID1_keep=ID1;
if rows_w_invalid_id=1 and num_rows>rows_w_invalid_id then ID1=ID1_keep;
run;
This code correctly completes my first objective of overwriting invalid values of ID1 with valid values for each person using ID2 as the person identifier. However, when assigning ID3, it resets to 1 every time it encounters a missing value, ie someone who would not be assigned a value of ID3. So it looks like this:
Dataset 2
ID1 ID2 ID3 categ vars.....
1 123
1 123
. 234 1
. 234 2
2 345
2 345
. 456 1
What I would want it to do is continue counting rather than resetting upon encountering the missing variable.
Any help is much appreciated.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
OK, that can be fixed. Unfortunately it means adding another variable to the mix:
data want;
do until (last.id2);
set have;
by id2;
if id1 = int(id1) then valid_id1 = id1;
end;
do until (last.id2);
set have;
by id2;
if valid_id1 > . then id1 = valid_id1;
else do;
if first.id2 then potential_id3 + 1;
id3 = potential_id3;
end;
output;
drop potential_id3;
end;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It might be simpler to use a single DATA step instead of a multi-step approach. Assuming your data is sorted by ID2:
data want;
do until (last.id2);
set have;
by id2;
if id1 = int(id1) then valid_id1 = id1;
end;
do until (last.id2);
set have;
by id2;
if valid_id1 > . then id1 = valid_id1;
else do;
if first.id2 then id3=1;
else id3 = id3 + 1;
end;
output;
end;
The top loop determines whether there is a valid ID1 value for that ID2 category. The bottom loop utilizes that information to populate ID1 and ID3.
drop valid_id1;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This worked except for one thing: ID3 is incrementing on every row instead of for every change in ID2, like this:
ID1 ID2 ID3
1.1 abc 1
1.1 abc 2
1.2 abc 3
1.1 def 4
Instead, it should only increase if the ID2 changes:
ID1 ID2 ID3
1.1 abc 1
1.1 abc 1
1.2 abc 1
1.1 def 2
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
OK, that can be fixed. Unfortunately it means adding another variable to the mix:
data want;
do until (last.id2);
set have;
by id2;
if id1 = int(id1) then valid_id1 = id1;
end;
do until (last.id2);
set have;
by id2;
if valid_id1 > . then id1 = valid_id1;
else do;
if first.id2 then potential_id3 + 1;
id3 = potential_id3;
end;
output;
drop potential_id3;
end;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Q1: How do you identify/differentiate good ID1 from those are bad.
Q2: If you have multiple goods and some bads, which one of the goods you want to use to populate the bads.
Q3: Will the goods always appear before the bads?
Q4: Have your data presorted by ID2 or at least you know ID2 are clustering together?
Anyway, I made some reasonable guess, and here are some codes to get you going ( Sorry, I haven't read through your code):
data have;
input (id1 id2) (:$8.);
cards;
1 123
1.1 123
1.2 234
1.2 234
;
/*using 2XDOW is to assume that you are not sure if goods appear before bads, otherwise, code can be simpler*/
/*this is to assume if anything but number appears will be redeemed as bad*/
data want;
do until (last.id2);
set have;
by id2 notsorted;
if notdigit(strip(id1))=0 then
_id1=id1;
end;
do until (last.id2);
set have;
by id2 notsorted;
if notdigit(strip(id1))>0 then
id1=_id1;
if missing(_id1) then
id3=sum(id3,1);
output;
end;
drop _id1;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Is this a continuation of what you did after this topic? https://communities.sas.com/t5/Base-SAS-Programming/Stacking-two-datasets-with-overlapping-IDs/m-p/2...
It may be that what you need is to revisit the requirements earlier in the process and describe the whole requirement. Some problems are iterative that way. You learn more about the data and sometimes it is appropriate to go back to the beginning as choices made early in a process may introduce the issues you are encountering now. The knowledge gained by working this far may give you clues as to which of other options not previously selected might help.