BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Walternate
Obsidian | Level 7

Hi,

 

I have a dataset at the person-month level with an ID variable that is not clean, another ID variable, and a few other categorical variables. 

 

Dataset1:

ID1       ID2         categ vars.....

1          123

1.1      123

1.2       234

1.2       234

 

In ID1, the 1 is the valid value, while 1.1 and 1.2 are invalid values. ID 2 is the way I can distinguish one person from another, but I need to retain as many valid values of ID1 as possible. I need to do two things: One is that for every person (ie, value of ID2) that has at least one valid value of ID1, I want to fill the rest of their rows with that valid value of ID1. The other is that for every person that never has a valid value of ID1, I want a new variable, ID3, to give them a value.

 

Dataset 2

ID1       ID2       ID3     categ vars.....

1          123       

1          123

.           234       1

.           234       2

 

This is what I have coded so far to accomplish this:

 

 

proc sql;

create table want1 as

select *, case when ID1 like '%.% then 1

  else 0

end as invalid_ID1

from have;

quit;

 

proc sql;

create table want2 as

select *, count(ID2) as num_rows, count(invalid_ID1) as rows_w_invalid_ID1

from want1

group by ID2

order by ID2, invalid_ID1;

quit;

 

data want3;

set want2;

by ID2 invalid_ID1;

retain ID3 1;

if num_rows>rows_w_invalid_ID1 then ID3=.;

else if first.ID2 and num_rows=rows_w_invalid_ID1 then do;

ID3+1;

ID1=' ';end;

retain ID1_keep;

if first.ID2 then ID1_keep=ID1;

if rows_w_invalid_id=1 and num_rows>rows_w_invalid_id then ID1=ID1_keep;

run;

 

This code correctly completes my first objective of overwriting invalid values of ID1 with valid values for each person using ID2 as the person identifier. However, when assigning ID3, it resets to 1 every time it encounters a missing value, ie  someone who would not be assigned a value of ID3. So it looks like this:

 

Dataset 2

ID1       ID2       ID3     categ vars.....

1          123       

1          123

.           234       1

.           234       2

2          345      

2          345

.           456       1

 

What I would want it to do is continue counting rather than resetting upon encountering the missing variable. 

 

Any help is much appreciated. 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

OK, that can be fixed.  Unfortunately it means adding another variable to the mix:

 

data want;

do until (last.id2);

   set have;

   by id2;

   if id1 = int(id1) then valid_id1 = id1;

end;

do until (last.id2);

   set have;

   by id2;

   if valid_id1 > . then id1 = valid_id1;

   else do;

      if first.id2 then potential_id3 + 1;

      id3 = potential_id3;

   end;

   output;

   drop potential_id3;

end;

View solution in original post

5 REPLIES 5
Astounding
PROC Star

It might be simpler to use a single DATA step instead of a multi-step approach.  Assuming your data is sorted by ID2:

 

data want;

do until (last.id2);

   set have;

   by id2;

   if id1 = int(id1) then valid_id1 = id1;

end;

do until (last.id2);

   set have;

   by id2;

   if valid_id1 > . then id1 = valid_id1;

   else do;

      if first.id2 then id3=1;

      else id3 = id3 + 1;

   end;

   output;

end;

 

The top loop determines whether there is a valid ID1 value for that ID2 category.  The bottom loop utilizes that information to populate ID1 and ID3.

drop valid_id1;

run;

 

Walternate
Obsidian | Level 7

This worked except for one thing: ID3 is incrementing on every row instead of for every change in ID2, like this:

 

ID1     ID2     ID3

1.1     abc       1     

1.1     abc       2

1.2     abc       3

1.1     def        4

 

Instead, it should only increase if the ID2 changes:

 

ID1     ID2     ID3

1.1     abc       1     

1.1     abc       1

1.2     abc       1

1.1     def        2

Astounding
PROC Star

OK, that can be fixed.  Unfortunately it means adding another variable to the mix:

 

data want;

do until (last.id2);

   set have;

   by id2;

   if id1 = int(id1) then valid_id1 = id1;

end;

do until (last.id2);

   set have;

   by id2;

   if valid_id1 > . then id1 = valid_id1;

   else do;

      if first.id2 then potential_id3 + 1;

      id3 = potential_id3;

   end;

   output;

   drop potential_id3;

end;

Haikuo
Onyx | Level 15

 

Q1: How do you identify/differentiate good ID1 from those are bad.

Q2: If you have multiple goods and some bads, which one of the goods you want to use to populate the bads.

Q3: Will the goods always appear before the bads?

Q4: Have your data presorted by ID2 or at least you know ID2 are clustering together?

 

Anyway, I made some reasonable guess, and here are some codes to get you going ( Sorry, I haven't read through your code):

data have;
	input (id1 id2) (:$8.);
	cards;
1          123
1.1      123
1.2       234
1.2       234
;

/*using 2XDOW is to assume that you are not sure if goods appear before bads, otherwise, code can be simpler*/
/*this is to assume if anything but number appears will be redeemed as bad*/
data want;
	do until (last.id2);
		set have;
		by id2 notsorted;

		if notdigit(strip(id1))=0 then 
			_id1=id1;
	end;

	do until (last.id2);
		set have;
		by id2 notsorted;

		if notdigit(strip(id1))>0 then
			id1=_id1;

		if missing(_id1) then
			id3=sum(id3,1);
		output;
	end;

	drop _id1;
run;
ballardw
Super User

Is this a continuation of what you did after this topic? https://communities.sas.com/t5/Base-SAS-Programming/Stacking-two-datasets-with-overlapping-IDs/m-p/2...

 

It may be that what you need is to revisit the requirements earlier in the process and describe the whole requirement. Some problems are iterative that way. You learn more about the data and sometimes it is appropriate to go back to the beginning as choices made early in a process may introduce the issues you are encountering now. The knowledge gained by working this far may give you clues as to which of other options not previously selected might help.

 

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 1169 views
  • 0 likes
  • 4 in conversation