Re: Adding data into missing cells

mkt_apprentice · Posted 03-25-2019 07:02 PM

Hellow SAS community,

After merging several data sets I have a data set like this:

id day age distance gender

1 1 21 2 F

1 2 . 0 .

1 3 . 0 .

1 4 21 5 F

1 5 21 1 F

2 1 43 2 M

2 2 43 2 M

2 3 . 0 .

2 4 43 3 M

2 5 . 0 .

3 1 24 5 F

3 2 24 5 F

3 3 24 5 F

3 4 24 5 F

3 5 . 0 .

4 1 18 2 M

4 2 . 0 .

4 3 . 0 .

4 4 . 0 .

4 5 . 0 .

5 1 36 10 M

5 2 36 8 M

5 3 36 8 M

5 4 36 9 M

5 5 36 7 M

Basically the day a person doesn't run, I create a new observation with Distance = 0. However, my merge ended up that I have lots of missing data in all of the rest variables except id and day. How could I add data back to the missing cell where we can infer information from other cells: for example, id 1 is Female (F) by looking at day 1, 4, 5, so day 2, 3 we know that the person with id 1 is Female (F).

Please share what you think can help me solve this issue

ballardw · Posted 03-25-2019 07:34 PM

One way:

data have;
input id  day   age  distance  gender $;
datalines;
1    1       21      2           F
1    2        .        0           . 
1    3        .        0           .
1    4       21      5           F
1    5       21      1           F
2    1       43      2           M
2    2       43      2           M
2    3        .        0            .
2    4       43      3           M
2    5        .       0             .
3    1       24      5           F 
3    2       24      5           F
3    3       24      5           F
3    4       24      5           F
3    5        .        0           . 
4    1       18      2          M
4    2        .        0           .
4    3        .        0           .
4    4        .        0           .
4    5        .        0           .
5   1       36       10        M 
5   2       36        8         M
5   3       36        8         M 
5   4       36        9         M 
5   5       36        7         M  
;
run;

data want;
   set have;
   by id;
   length r_gender $ 1.;
   retain r_age r_gender;
   if first.id then do;
      r_gender=gender;
   end;
   if not missing(age) then r_age=age;
   if missing (age) then age=r_age;
   if missing (gender) then gender=r_gender;
   drop r_age r_gender;
run;

Please note the data provided in the form of a data step and pasted into a code box opened with the forum's {I} icon. The data step so code can be tested and the code box to preserve formatting of code or log entries. The main message windows on this forum will reformat text, often making code not run.

The BY id allows use of the First. logic to identify the first record for each id. Assumes that gender will be present and not change during the course of the data.

Since Age might change during the course of your data check to see if the incoming record has an age value and don't assume that all records will be for the same value.

The Retain statement creates variables whose values will be kept for the following records. So they are available to assign when missing.

Check for missing and assign the value of the retained variable when they are.

heffo · Posted 03-25-2019 07:55 PM

I would probably change the if statements for coalesce(c). Just less code and a bit easier to read IMHO.

	age = coalesce(age,r_age);
	gender = coalescec(gender, r_gender);

heffo · Posted 03-25-2019 07:49 PM

Another version of it would be the following. The data does not have to have an age or a gender in the first row for each individual. NB! It does not work if the age is calculated by the day itself. So, if a person is 21 day one and 22 day two, this solution will give you 22 for all rows.

proc sql;
	create table run2 as 
	select ID, Day, age2, distance, gender2
		from run as r1 left join (select id as id1, max(Age) as Age2, max(gender) as Gender2 from run r2 group by id)
		on r1.id = r2.id
	order by id, day;
quit;

Astounding · Posted 03-25-2019 08:26 PM

If you do this the easy way, you don't even need to know the variable names. You just need data that is in sorted order by ID:

data want;

update have (obs=0) have;

by id;

output;

run;

Adding data into missing cells