DATA Step, Macro, Functions and more

Keeping ID vars consistent across multiple rows

Reply
Frequent Contributor
Posts: 138

Keeping ID vars consistent across multiple rows

Hi,

 

I have a dataset at the person-month level with ID and month. 

 

ID           Month

00001   Jan 2010

00001   Feb 2010

00001   Mar 2010

1           Apr 2010

00001  May 2010

 

The issue is that for some people in some months, the ID var has been stripped of the leading 0s. I thought that I could just add leading 0s until the length of the ID var was reached, but some of the IDs are 5 digits and some are 7. There is a second ID var which should correspond with ID, although it is not always populated (which is why I used ID in the first place). Is there a way to correct the values of ID which have been stripped that I haven't thought of?

 

Any help is much appreciated.

 

Super User
Super User
Posts: 7,401

Re: Keeping ID vars consistent across multiple rows

id=put(input(id,best.),z5.);

So read in the text as best. then put it out as z5 (which pads number out with 0) and put back into text ID.  This should set them all the same.  However you may want to fix the source issue, where does the data come from, why do some not have the "0"'s, we can't see this so can't advise.

Respected Advisor
Posts: 3,124

Re: Keeping ID vars consistent across multiple rows

From what I can see, you can one of the following options to standardize 'id':

 

1. Convert it to number. check INPUT function.

2.strip  '0's for all of them. Check PRXCHANGE or some combinations of find, substr function etc.

3. you have already tried. use the length of the variable, then add leading '0's. Check LENGTH and CATS function.

 

 

Frequent Contributor
Posts: 138

Re: Keeping ID vars consistent across multiple rows

If I strip 0s for all of them, don't I have to set a length? This could be difficult as without the zeroes, some would have a length of 5, some 4, some 3, some 2, some 1. 

Super User
Super User
Posts: 7,401

Re: Keeping ID vars consistent across multiple rows

Yes, but you need to specify how many maximum you need to see.  Once you know that all of them should be 5 numbers long, then fix that with z5.  This will change all of them to be maximum five digits with any missing being 0 so:

54 = 00054

1234= 01234

12345=12345

Super User
Posts: 17,837

Re: Keeping ID vars consistent across multiple rows

Is the 5/7 length difference significant? For example is ID = 000235 different from ID = 0000235?

 

If not, make them all a length of 7.

Frequent Contributor
Posts: 138

Re: Keeping ID vars consistent across multiple rows

I'm now finding some values of ID that are just nonsense and not corresponding with the "real" values (ie, abc when the real value is D12345), so I think I should change my question.

 

Each row will either have a valid value of ID1 or ID2, but not necessarily both, like this:

 

ID1               ID2                Month

00123           null               Jan 2010

abc            12345678        Feb 2010

00123        12345678        Mar 2010

 

So row 3 has legitimate values for both ID1 and ID2; while row 1 has a legitimate value for ID1 and ID2=null; while row 2 has an illegitimate value for ID1 while ID2 is valid.

 

What I would like to do is use ID2 (which doesn't seem to have issues of invalid values like ID1, but is just missing in some rows) as the primary identifier, but somehow fill the rows where ID2=null with the correct ID2 value, which I will have to identify using ID1 so I know it's the same person on both rows. Is there a way to do that?

 

Super User
Posts: 17,837

Re: Keeping ID vars consistent across multiple rows

What would the expected output be for the above scenario?

Frequent Contributor
Posts: 138

Re: Keeping ID vars consistent across multiple rows

ID1               ID2                Month

00123           null               Jan 2010

abc            12345678        Feb 2010

00123        12345678        Mar 2010

 

Either ID2 should be filled in all 3 rows:

 

ID1               ID2                Month

00123        12345678        Jan 2010

abc            12345678        Feb 2010

00123        12345678        Mar 2010

 

Or there could be a new ID var if necessary, but I'm not sure what that would look like in terms of the other 3 IDs. The point is that I'm ultimately trying to create a person-level dataset so it's important that the dataset consistently identifies people across each month. 

Super User
Posts: 17,837

Re: Keeping ID vars consistent across multiple rows

[ Edited ]

You have to create a master list and then either merge it back in or apply it using a format. 

 

proc sql;
create table lookup_id as
select distinct id1, id2
from have
where not missing(id1) and not missing(id2);
quit;

proc sql;
create table want as
select a.id1, coalesce(a.id2, b.id2) as id2, a.month
from have as a
left join lookup_id as b
on a.id1=b.id1;
quit;

 

Super User
Super User
Posts: 7,401

Re: Keeping ID vars consistent across multiple rows

Now, that is a very different question.  I would suggest you will need to create a list of valid pairs, then use that list to map back to original values, here is an example:

data have;
  length id1 id2 $50;
  id1="00123"; id2=""; output;
  id1="abc"; id2="12345678"; output;
  id1="00123"; id2="12345678"; output;
run;

data inter;
  set have;
  if lengthn(compress(id1," ","d"))=0 and lengthn(compress(id2," ","d"))=0 and id1 ne "" and id2 ne "";
run;

proc sql;
  create table WANT as
  select  case when lengthn(compress(A.ID1," ","d")) > 0 or A.ID1 is null then (select ID1 from INTER where ID2=A.ID2)
               else A.ID1 end as ID1,
          case when lengthn(compress(A.ID2," ","d")) > 0 or A.ID2 is null then (select ID2 from INTER where ID1=A.ID1)
               else A.ID2 end as ID2
  from    HAVE A;
quit;
Super User
Posts: 5,257

Re: Keeping ID vars consistent across multiple rows

The Coalesce() function will do the job.
Data never sleeps
Ask a Question
Discussion stats
  • 11 replies
  • 285 views
  • 0 likes
  • 5 in conversation