BookmarkSubscribeRSS Feed
mkt_apprentice
Obsidian | Level 7

Hellow SAS community,

 

After merging several data sets I have a data set like this:

 

id  day   age  distance  gender

1    1       21      2           F

1    2        .        0           . 

1    3        .        0           .

1    4       21      5           F

1    5       21      1           F

2    1       43      2           M

2    2       43      2           M

2    3        .        0            .

2    4       43      3           M

2    5        .       0             .

3    1       24      5           F 

3    2       24      5           F

3    3       24      5           F

3    4       24      5           F

3    5        .        0           . 

4    1       18      2          M

4    2        .        0           .

4    3        .        0           .

4    4        .        0           .

4    5        .        0           .

5   1       36       10        M 

5   2       36        8         M

5   3       36        8         M 

5   4       36        9         M 

5   5       36        7         M  

 

Basically the day a person doesn't run, I create a new observation with Distance = 0. However, my merge ended up that I have lots of missing data in all of the rest variables except id and day. How could I add data back to the missing cell where we can infer information from other cells: for example, id 1 is Female (F) by looking at day 1, 4, 5, so day 2, 3 we know that the person with id 1 is Female (F). 

 

Please share what you think can help me solve this issue

4 REPLIES 4
ballardw
Super User

One way:

data have;
input id  day   age  distance  gender $;
datalines;
1    1       21      2           F
1    2        .        0           . 
1    3        .        0           .
1    4       21      5           F
1    5       21      1           F
2    1       43      2           M
2    2       43      2           M
2    3        .        0            .
2    4       43      3           M
2    5        .       0             .
3    1       24      5           F 
3    2       24      5           F
3    3       24      5           F
3    4       24      5           F
3    5        .        0           . 
4    1       18      2          M
4    2        .        0           .
4    3        .        0           .
4    4        .        0           .
4    5        .        0           .
5   1       36       10        M 
5   2       36        8         M
5   3       36        8         M 
5   4       36        9         M 
5   5       36        7         M  
;
run;

data want;
   set have;
   by id;
   length r_gender $ 1.;
   retain r_age r_gender;
   if first.id then do;
      r_gender=gender;
   end;
   if not missing(age) then r_age=age;
   if missing (age) then age=r_age;
   if missing (gender) then gender=r_gender;
   drop r_age r_gender;
run;

Please note the data provided in the form of a data step and pasted into a code box opened with the forum's {I} icon. The data step so code can be tested and the code box to preserve formatting of code or log entries. The main message windows on this forum will reformat text, often making code not run.

 

The BY id allows use of the First. logic to identify the first record for each id. Assumes that gender will be present and not change during the course of the data.

Since Age might change during the course of your data check to see if the incoming record has an age value and don't assume that all records will be for the same value.

The Retain statement creates variables whose values will be kept for the following records. So they are available to assign when missing.

Check for missing and assign the value of the retained variable when they are.

heffo
Pyrite | Level 9

I would probably change the if statements for coalesce(c). Just less code and a bit easier to read IMHO. 

	age = coalesce(age,r_age);
	gender = coalescec(gender, r_gender);
heffo
Pyrite | Level 9

Another version of it would be the following. The data does not have to have an age or a gender in the first row for each individual. NB! It does not work if the age is calculated by the day itself. So, if a person is 21 day one and 22 day two, this solution will give you 22 for all rows. 

 

proc sql;
	create table run2 as 
	select ID, Day, age2, distance, gender2
		from run as r1 left join (select id as id1, max(Age) as Age2, max(gender) as Gender2 from run r2 group by id)
		on r1.id = r2.id
	order by id, day;
quit;
Astounding
PROC Star

If you do this the easy way, you don't even need to know the variable names.  You just need data that is in sorted order by ID:

 

data want;

update have (obs=0) have;

by id;

output;

run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 728 views
  • 4 likes
  • 4 in conversation