Re: Keeping ID vars consistent across multiple rows

Walternate · Posted 01-26-2016 11:00 AM

Hi,

I have a dataset at the person-month level with ID and month.

ID Month

00001 Jan 2010

00001 Feb 2010

00001 Mar 2010

1 Apr 2010

00001 May 2010

The issue is that for some people in some months, the ID var has been stripped of the leading 0s. I thought that I could just add leading 0s until the length of the ID var was reached, but some of the IDs are 5 digits and some are 7. There is a second ID var which should correspond with ID, although it is not always populated (which is why I used ID in the first place). Is there a way to correct the values of ID which have been stripped that I haven't thought of?

Any help is much appreciated.

RW9 · Posted 01-26-2016 11:12 AM

id=put(input(id,best.),z5.);

So read in the text as best. then put it out as z5 (which pads number out with 0) and put back into text ID. This should set them all the same. However you may want to fix the source issue, where does the data come from, why do some not have the "0"'s, we can't see this so can't advise.

Haikuo · Posted 01-26-2016 11:14 AM

From what I can see, you can one of the following options to standardize 'id':

1. Convert it to number. check INPUT function.

2.strip '0's for all of them. Check PRXCHANGE or some combinations of find, substr function etc.

3. you have already tried. use the length of the variable, then add leading '0's. Check LENGTH and CATS function.

Walternate · Posted 01-26-2016 11:18 AM

If I strip 0s for all of them, don't I have to set a length? This could be difficult as without the zeroes, some would have a length of 5, some 4, some 3, some 2, some 1.

RW9 · Posted 01-26-2016 11:23 AM

Yes, but you need to specify how many maximum you need to see. Once you know that all of them should be 5 numbers long, then fix that with z5. This will change all of them to be maximum five digits with any missing being 0 so:

54 = 00054

1234= 01234

12345=12345

Reeza · Posted 01-26-2016 11:25 AM

Is the 5/7 length difference significant? For example is ID = 000235 different from ID = 0000235?

If not, make them all a length of 7.

Walternate · Posted 01-26-2016 11:35 AM

I'm now finding some values of ID that are just nonsense and not corresponding with the "real" values (ie, abc when the real value is D12345), so I think I should change my question.

Each row will either have a valid value of ID1 or ID2, but not necessarily both, like this:

ID1 ID2 Month

00123 null Jan 2010

abc 12345678 Feb 2010

00123 12345678 Mar 2010

So row 3 has legitimate values for both ID1 and ID2; while row 1 has a legitimate value for ID1 and ID2=null; while row 2 has an illegitimate value for ID1 while ID2 is valid.

What I would like to do is use ID2 (which doesn't seem to have issues of invalid values like ID1, but is just missing in some rows) as the primary identifier, but somehow fill the rows where ID2=null with the correct ID2 value, which I will have to identify using ID1 so I know it's the same person on both rows. Is there a way to do that?

Reeza · Posted 01-26-2016 11:47 AM

What would the expected output be for the above scenario?

Walternate · Posted 01-26-2016 11:49 AM

ID1 ID2 Month

00123 null Jan 2010

abc 12345678 Feb 2010

00123 12345678 Mar 2010

Either ID2 should be filled in all 3 rows:

ID1 ID2 Month

00123 12345678 Jan 2010

abc 12345678 Feb 2010

00123 12345678 Mar 2010

Or there could be a new ID var if necessary, but I'm not sure what that would look like in terms of the other 3 IDs. The point is that I'm ultimately trying to create a person-level dataset so it's important that the dataset consistently identifies people across each month.

Reeza · Posted 01-26-2016 11:55 AM

You have to create a master list and then either merge it back in or apply it using a format.

proc sql;
create table lookup_id as
select distinct id1, id2
from have
where not missing(id1) and not missing(id2);
quit;

proc sql;
create table want as
select a.id1, coalesce(a.id2, b.id2) as id2, a.month
from have as a
left join lookup_id as b
on a.id1=b.id1;
quit;

RW9 · Posted 01-26-2016 11:54 AM

Now, that is a very different question. I would suggest you will need to create a list of valid pairs, then use that list to map back to original values, here is an example:

data have;
  length id1 id2 $50;
  id1="00123"; id2=""; output;
  id1="abc"; id2="12345678"; output;
  id1="00123"; id2="12345678"; output;
run;

data inter;
  set have;
  if lengthn(compress(id1," ","d"))=0 and lengthn(compress(id2," ","d"))=0 and id1 ne "" and id2 ne "";
run;

proc sql;
  create table WANT as
  select  case when lengthn(compress(A.ID1," ","d")) > 0 or A.ID1 is null then (select ID1 from INTER where ID2=A.ID2)
               else A.ID1 end as ID1,
          case when lengthn(compress(A.ID2," ","d")) > 0 or A.ID2 is null then (select ID2 from INTER where ID1=A.ID1)
               else A.ID2 end as ID2
  from    HAVE A;
quit;

LinusH · Posted 01-26-2016 11:49 AM

The Coalesce() function will do the job.

Data never sleeps

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away