BookmarkSubscribeRSS Feed
Walternate
Obsidian | Level 7

Hi,

 

I have a dataset at the person-month level with ID and month. 

 

ID           Month

00001   Jan 2010

00001   Feb 2010

00001   Mar 2010

1           Apr 2010

00001  May 2010

 

The issue is that for some people in some months, the ID var has been stripped of the leading 0s. I thought that I could just add leading 0s until the length of the ID var was reached, but some of the IDs are 5 digits and some are 7. There is a second ID var which should correspond with ID, although it is not always populated (which is why I used ID in the first place). Is there a way to correct the values of ID which have been stripped that I haven't thought of?

 

Any help is much appreciated.

 

11 REPLIES 11
RW9
Diamond | Level 26 RW9
Diamond | Level 26

id=put(input(id,best.),z5.);

So read in the text as best. then put it out as z5 (which pads number out with 0) and put back into text ID.  This should set them all the same.  However you may want to fix the source issue, where does the data come from, why do some not have the "0"'s, we can't see this so can't advise.

Haikuo
Onyx | Level 15

From what I can see, you can one of the following options to standardize 'id':

 

1. Convert it to number. check INPUT function.

2.strip  '0's for all of them. Check PRXCHANGE or some combinations of find, substr function etc.

3. you have already tried. use the length of the variable, then add leading '0's. Check LENGTH and CATS function.

 

 

Walternate
Obsidian | Level 7

If I strip 0s for all of them, don't I have to set a length? This could be difficult as without the zeroes, some would have a length of 5, some 4, some 3, some 2, some 1. 

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Yes, but you need to specify how many maximum you need to see.  Once you know that all of them should be 5 numbers long, then fix that with z5.  This will change all of them to be maximum five digits with any missing being 0 so:

54 = 00054

1234= 01234

12345=12345

Reeza
Super User

Is the 5/7 length difference significant? For example is ID = 000235 different from ID = 0000235?

 

If not, make them all a length of 7.

Walternate
Obsidian | Level 7

I'm now finding some values of ID that are just nonsense and not corresponding with the "real" values (ie, abc when the real value is D12345), so I think I should change my question.

 

Each row will either have a valid value of ID1 or ID2, but not necessarily both, like this:

 

ID1               ID2                Month

00123           null               Jan 2010

abc            12345678        Feb 2010

00123        12345678        Mar 2010

 

So row 3 has legitimate values for both ID1 and ID2; while row 1 has a legitimate value for ID1 and ID2=null; while row 2 has an illegitimate value for ID1 while ID2 is valid.

 

What I would like to do is use ID2 (which doesn't seem to have issues of invalid values like ID1, but is just missing in some rows) as the primary identifier, but somehow fill the rows where ID2=null with the correct ID2 value, which I will have to identify using ID1 so I know it's the same person on both rows. Is there a way to do that?

 

Reeza
Super User

What would the expected output be for the above scenario?

Walternate
Obsidian | Level 7

ID1               ID2                Month

00123           null               Jan 2010

abc            12345678        Feb 2010

00123        12345678        Mar 2010

 

Either ID2 should be filled in all 3 rows:

 

ID1               ID2                Month

00123        12345678        Jan 2010

abc            12345678        Feb 2010

00123        12345678        Mar 2010

 

Or there could be a new ID var if necessary, but I'm not sure what that would look like in terms of the other 3 IDs. The point is that I'm ultimately trying to create a person-level dataset so it's important that the dataset consistently identifies people across each month. 

Reeza
Super User

You have to create a master list and then either merge it back in or apply it using a format. 

 

proc sql;
create table lookup_id as
select distinct id1, id2
from have
where not missing(id1) and not missing(id2);
quit;

proc sql;
create table want as
select a.id1, coalesce(a.id2, b.id2) as id2, a.month
from have as a
left join lookup_id as b
on a.id1=b.id1;
quit;

 

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Now, that is a very different question.  I would suggest you will need to create a list of valid pairs, then use that list to map back to original values, here is an example:

data have;
  length id1 id2 $50;
  id1="00123"; id2=""; output;
  id1="abc"; id2="12345678"; output;
  id1="00123"; id2="12345678"; output;
run;

data inter;
  set have;
  if lengthn(compress(id1," ","d"))=0 and lengthn(compress(id2," ","d"))=0 and id1 ne "" and id2 ne "";
run;

proc sql;
  create table WANT as
  select  case when lengthn(compress(A.ID1," ","d")) > 0 or A.ID1 is null then (select ID1 from INTER where ID2=A.ID2)
               else A.ID1 end as ID1,
          case when lengthn(compress(A.ID2," ","d")) > 0 or A.ID2 is null then (select ID2 from INTER where ID1=A.ID1)
               else A.ID2 end as ID2
  from    HAVE A;
quit;
LinusH
Tourmaline | Level 20
The Coalesce() function will do the job.
Data never sleeps

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 2562 views
  • 0 likes
  • 5 in conversation