Hi,
I have a dataset at the person-month level with an ID variable that is not clean, another ID variable, and a few other categorical variables.
Dataset1:
ID1 ID2 categ vars.....
1 123
1.1 123
1.2 234
1.2 234
In ID1, the 1 is the valid value, while 1.1 and 1.2 are invalid values. ID 2 is the way I can distinguish one person from another, but I need to retain as many valid values of ID1 as possible. I need to do two things: One is that for every person (ie, value of ID2) that has at least one valid value of ID1, I want to fill the rest of their rows with that valid value of ID1. The other is that for every person that never has a valid value of ID1, I want a new variable, ID3, to give them a value.
Dataset 2
ID1 ID2 ID3 categ vars.....
1 123
1 123
. 234 1
. 234 2
This is what I have coded so far to accomplish this:
proc sql;
create table want1 as
select *, case when ID1 like '%.% then 1
else 0
end as invalid_ID1
from have;
quit;
proc sql;
create table want2 as
select *, count(ID2) as num_rows, count(invalid_ID1) as rows_w_invalid_ID1
from want1
group by ID2
order by ID2, invalid_ID1;
quit;
data want3;
set want2;
by ID2 invalid_ID1;
retain ID3 1;
if num_rows>rows_w_invalid_ID1 then ID3=.;
else if first.ID2 and num_rows=rows_w_invalid_ID1 then do;
ID3+1;
ID1=' ';end;
retain ID1_keep;
if first.ID2 then ID1_keep=ID1;
if rows_w_invalid_id=1 and num_rows>rows_w_invalid_id then ID1=ID1_keep;
run;
This code correctly completes my first objective of overwriting invalid values of ID1 with valid values for each person using ID2 as the person identifier. However, when assigning ID3, it resets to 1 every time it encounters a missing value, ie someone who would not be assigned a value of ID3. So it looks like this:
Dataset 2
ID1 ID2 ID3 categ vars.....
1 123
1 123
. 234 1
. 234 2
2 345
2 345
. 456 1
What I would want it to do is continue counting rather than resetting upon encountering the missing variable.
Any help is much appreciated.
OK, that can be fixed. Unfortunately it means adding another variable to the mix:
data want;
do until (last.id2);
set have;
by id2;
if id1 = int(id1) then valid_id1 = id1;
end;
do until (last.id2);
set have;
by id2;
if valid_id1 > . then id1 = valid_id1;
else do;
if first.id2 then potential_id3 + 1;
id3 = potential_id3;
end;
output;
drop potential_id3;
end;
It might be simpler to use a single DATA step instead of a multi-step approach. Assuming your data is sorted by ID2:
data want;
do until (last.id2);
set have;
by id2;
if id1 = int(id1) then valid_id1 = id1;
end;
do until (last.id2);
set have;
by id2;
if valid_id1 > . then id1 = valid_id1;
else do;
if first.id2 then id3=1;
else id3 = id3 + 1;
end;
output;
end;
The top loop determines whether there is a valid ID1 value for that ID2 category. The bottom loop utilizes that information to populate ID1 and ID3.
drop valid_id1;
run;
This worked except for one thing: ID3 is incrementing on every row instead of for every change in ID2, like this:
ID1 ID2 ID3
1.1 abc 1
1.1 abc 2
1.2 abc 3
1.1 def 4
Instead, it should only increase if the ID2 changes:
ID1 ID2 ID3
1.1 abc 1
1.1 abc 1
1.2 abc 1
1.1 def 2
OK, that can be fixed. Unfortunately it means adding another variable to the mix:
data want;
do until (last.id2);
set have;
by id2;
if id1 = int(id1) then valid_id1 = id1;
end;
do until (last.id2);
set have;
by id2;
if valid_id1 > . then id1 = valid_id1;
else do;
if first.id2 then potential_id3 + 1;
id3 = potential_id3;
end;
output;
drop potential_id3;
end;
Q1: How do you identify/differentiate good ID1 from those are bad.
Q2: If you have multiple goods and some bads, which one of the goods you want to use to populate the bads.
Q3: Will the goods always appear before the bads?
Q4: Have your data presorted by ID2 or at least you know ID2 are clustering together?
Anyway, I made some reasonable guess, and here are some codes to get you going ( Sorry, I haven't read through your code):
data have;
input (id1 id2) (:$8.);
cards;
1 123
1.1 123
1.2 234
1.2 234
;
/*using 2XDOW is to assume that you are not sure if goods appear before bads, otherwise, code can be simpler*/
/*this is to assume if anything but number appears will be redeemed as bad*/
data want;
do until (last.id2);
set have;
by id2 notsorted;
if notdigit(strip(id1))=0 then
_id1=id1;
end;
do until (last.id2);
set have;
by id2 notsorted;
if notdigit(strip(id1))>0 then
id1=_id1;
if missing(_id1) then
id3=sum(id3,1);
output;
end;
drop _id1;
run;
Is this a continuation of what you did after this topic? https://communities.sas.com/t5/Base-SAS-Programming/Stacking-two-datasets-with-overlapping-IDs/m-p/2...
It may be that what you need is to revisit the requirements earlier in the process and describe the whole requirement. Some problems are iterative that way. You learn more about the data and sometimes it is appropriate to go back to the beginning as choices made early in a process may introduce the issues you are encountering now. The knowledge gained by working this far may give you clues as to which of other options not previously selected might help.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.