DATA Step, Macro, Functions and more

Assigning an ID variable across missing rows

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 138
Accepted Solution

Assigning an ID variable across missing rows

Hi,

 

I have a dataset at the person-month level with an ID variable that is not clean, another ID variable, and a few other categorical variables. 

 

Dataset1:

ID1       ID2         categ vars.....

1          123

1.1      123

1.2       234

1.2       234

 

In ID1, the 1 is the valid value, while 1.1 and 1.2 are invalid values. ID 2 is the way I can distinguish one person from another, but I need to retain as many valid values of ID1 as possible. I need to do two things: One is that for every person (ie, value of ID2) that has at least one valid value of ID1, I want to fill the rest of their rows with that valid value of ID1. The other is that for every person that never has a valid value of ID1, I want a new variable, ID3, to give them a value.

 

Dataset 2

ID1       ID2       ID3     categ vars.....

1          123       

1          123

.           234       1

.           234       2

 

This is what I have coded so far to accomplish this:

 

 

proc sql;

create table want1 as

select *, case when ID1 like '%.% then 1

  else 0

end as invalid_ID1

from have;

quit;

 

proc sql;

create table want2 as

select *, count(ID2) as num_rows, count(invalid_ID1) as rows_w_invalid_ID1

from want1

group by ID2

order by ID2, invalid_ID1;

quit;

 

data want3;

set want2;

by ID2 invalid_ID1;

retain ID3 1;

if num_rows>rows_w_invalid_ID1 then ID3=.;

else if first.ID2 and num_rows=rows_w_invalid_ID1 then do;

ID3+1;

ID1=' ';end;

retain ID1_keep;

if first.ID2 then ID1_keep=ID1;

if rows_w_invalid_id=1 and num_rows>rows_w_invalid_id then ID1=ID1_keep;

run;

 

This code correctly completes my first objective of overwriting invalid values of ID1 with valid values for each person using ID2 as the person identifier. However, when assigning ID3, it resets to 1 every time it encounters a missing value, ie  someone who would not be assigned a value of ID3. So it looks like this:

 

Dataset 2

ID1       ID2       ID3     categ vars.....

1          123       

1          123

.           234       1

.           234       2

2          345      

2          345

.           456       1

 

What I would want it to do is continue counting rather than resetting upon encountering the missing variable. 

 

Any help is much appreciated. 

 

 

 


Accepted Solutions
Solution
‎02-12-2016 08:34 AM
Super User
Posts: 5,518

Re: Assigning an ID variable across missing rows

Posted in reply to Walternate

OK, that can be fixed.  Unfortunately it means adding another variable to the mix:

 

data want;

do until (last.id2);

   set have;

   by id2;

   if id1 = int(id1) then valid_id1 = id1;

end;

do until (last.id2);

   set have;

   by id2;

   if valid_id1 > . then id1 = valid_id1;

   else do;

      if first.id2 then potential_id3 + 1;

      id3 = potential_id3;

   end;

   output;

   drop potential_id3;

end;

View solution in original post


All Replies
Super User
Posts: 5,518

Re: Assigning an ID variable across missing rows

Posted in reply to Walternate

It might be simpler to use a single DATA step instead of a multi-step approach.  Assuming your data is sorted by ID2:

 

data want;

do until (last.id2);

   set have;

   by id2;

   if id1 = int(id1) then valid_id1 = id1;

end;

do until (last.id2);

   set have;

   by id2;

   if valid_id1 > . then id1 = valid_id1;

   else do;

      if first.id2 then id3=1;

      else id3 = id3 + 1;

   end;

   output;

end;

 

The top loop determines whether there is a valid ID1 value for that ID2 category.  The bottom loop utilizes that information to populate ID1 and ID3.

drop valid_id1;

run;

 

Frequent Contributor
Posts: 138

Re: Assigning an ID variable across missing rows

Posted in reply to Astounding

This worked except for one thing: ID3 is incrementing on every row instead of for every change in ID2, like this:

 

ID1     ID2     ID3

1.1     abc       1     

1.1     abc       2

1.2     abc       3

1.1     def        4

 

Instead, it should only increase if the ID2 changes:

 

ID1     ID2     ID3

1.1     abc       1     

1.1     abc       1

1.2     abc       1

1.1     def        2

Solution
‎02-12-2016 08:34 AM
Super User
Posts: 5,518

Re: Assigning an ID variable across missing rows

Posted in reply to Walternate

OK, that can be fixed.  Unfortunately it means adding another variable to the mix:

 

data want;

do until (last.id2);

   set have;

   by id2;

   if id1 = int(id1) then valid_id1 = id1;

end;

do until (last.id2);

   set have;

   by id2;

   if valid_id1 > . then id1 = valid_id1;

   else do;

      if first.id2 then potential_id3 + 1;

      id3 = potential_id3;

   end;

   output;

   drop potential_id3;

end;

Respected Advisor
Posts: 3,156

Re: Assigning an ID variable across missing rows

Posted in reply to Walternate

 

Q1: How do you identify/differentiate good ID1 from those are bad.

Q2: If you have multiple goods and some bads, which one of the goods you want to use to populate the bads.

Q3: Will the goods always appear before the bads?

Q4: Have your data presorted by ID2 or at least you know ID2 are clustering together?

 

Anyway, I made some reasonable guess, and here are some codes to get you going ( Sorry, I haven't read through your code):

data have;
	input (id1 id2) (:$8.);
	cards;
1          123
1.1      123
1.2       234
1.2       234
;

/*using 2XDOW is to assume that you are not sure if goods appear before bads, otherwise, code can be simpler*/
/*this is to assume if anything but number appears will be redeemed as bad*/
data want;
	do until (last.id2);
		set have;
		by id2 notsorted;

		if notdigit(strip(id1))=0 then 
			_id1=id1;
	end;

	do until (last.id2);
		set have;
		by id2 notsorted;

		if notdigit(strip(id1))>0 then
			id1=_id1;

		if missing(_id1) then
			id3=sum(id3,1);
		output;
	end;

	drop _id1;
run;
Super User
Posts: 11,343

Re: Assigning an ID variable across missing rows

Posted in reply to Walternate

Is this a continuation of what you did after this topic? https://communities.sas.com/t5/Base-SAS-Programming/Stacking-two-datasets-with-overlapping-IDs/m-p/2...

 

It may be that what you need is to revisit the requirements earlier in the process and describe the whole requirement. Some problems are iterative that way. You learn more about the data and sometimes it is appropriate to go back to the beginning as choices made early in a process may introduce the issues you are encountering now. The knowledge gained by working this far may give you clues as to which of other options not previously selected might help.

 

 

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 277 views
  • 0 likes
  • 4 in conversation