Hi,
I have a dataset at the person-month level with ID and month.
ID Month
00001 Jan 2010
00001 Feb 2010
00001 Mar 2010
1 Apr 2010
00001 May 2010
The issue is that for some people in some months, the ID var has been stripped of the leading 0s. I thought that I could just add leading 0s until the length of the ID var was reached, but some of the IDs are 5 digits and some are 7. There is a second ID var which should correspond with ID, although it is not always populated (which is why I used ID in the first place). Is there a way to correct the values of ID which have been stripped that I haven't thought of?
Any help is much appreciated.
id=put(input(id,best.),z5.);
So read in the text as best. then put it out as z5 (which pads number out with 0) and put back into text ID. This should set them all the same. However you may want to fix the source issue, where does the data come from, why do some not have the "0"'s, we can't see this so can't advise.
From what I can see, you can one of the following options to standardize 'id':
1. Convert it to number. check INPUT function.
2.strip '0's for all of them. Check PRXCHANGE or some combinations of find, substr function etc.
3. you have already tried. use the length of the variable, then add leading '0's. Check LENGTH and CATS function.
If I strip 0s for all of them, don't I have to set a length? This could be difficult as without the zeroes, some would have a length of 5, some 4, some 3, some 2, some 1.
Yes, but you need to specify how many maximum you need to see. Once you know that all of them should be 5 numbers long, then fix that with z5. This will change all of them to be maximum five digits with any missing being 0 so:
54 = 00054
1234= 01234
12345=12345
Is the 5/7 length difference significant? For example is ID = 000235 different from ID = 0000235?
If not, make them all a length of 7.
I'm now finding some values of ID that are just nonsense and not corresponding with the "real" values (ie, abc when the real value is D12345), so I think I should change my question.
Each row will either have a valid value of ID1 or ID2, but not necessarily both, like this:
ID1 ID2 Month
00123 null Jan 2010
abc 12345678 Feb 2010
00123 12345678 Mar 2010
So row 3 has legitimate values for both ID1 and ID2; while row 1 has a legitimate value for ID1 and ID2=null; while row 2 has an illegitimate value for ID1 while ID2 is valid.
What I would like to do is use ID2 (which doesn't seem to have issues of invalid values like ID1, but is just missing in some rows) as the primary identifier, but somehow fill the rows where ID2=null with the correct ID2 value, which I will have to identify using ID1 so I know it's the same person on both rows. Is there a way to do that?
What would the expected output be for the above scenario?
ID1 ID2 Month
00123 null Jan 2010
abc 12345678 Feb 2010
00123 12345678 Mar 2010
Either ID2 should be filled in all 3 rows:
ID1 ID2 Month
00123 12345678 Jan 2010
abc 12345678 Feb 2010
00123 12345678 Mar 2010
Or there could be a new ID var if necessary, but I'm not sure what that would look like in terms of the other 3 IDs. The point is that I'm ultimately trying to create a person-level dataset so it's important that the dataset consistently identifies people across each month.
You have to create a master list and then either merge it back in or apply it using a format.
proc sql;
create table lookup_id as
select distinct id1, id2
from have
where not missing(id1) and not missing(id2);
quit;
proc sql;
create table want as
select a.id1, coalesce(a.id2, b.id2) as id2, a.month
from have as a
left join lookup_id as b
on a.id1=b.id1;
quit;
Now, that is a very different question. I would suggest you will need to create a list of valid pairs, then use that list to map back to original values, here is an example:
data have; length id1 id2 $50; id1="00123"; id2=""; output; id1="abc"; id2="12345678"; output; id1="00123"; id2="12345678"; output; run; data inter; set have; if lengthn(compress(id1," ","d"))=0 and lengthn(compress(id2," ","d"))=0 and id1 ne "" and id2 ne ""; run; proc sql; create table WANT as select case when lengthn(compress(A.ID1," ","d")) > 0 or A.ID1 is null then (select ID1 from INTER where ID2=A.ID2) else A.ID1 end as ID1, case when lengthn(compress(A.ID2," ","d")) > 0 or A.ID2 is null then (select ID2 from INTER where ID1=A.ID1) else A.ID2 end as ID2 from HAVE A; quit;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.