DATA Step, Macro, Functions and more

Replacing missing values with closest non-missing observation within the same group

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 110
Accepted Solution

Replacing missing values with closest non-missing observation within the same group

I am trying to replace the missing industry codes for a given company with its most closest industry code (by date). I was able to write a code that replaces the missing value if there all industry codes for the given company are identical. However, in my data, occasionally the industry code changes, but I am not sure how I can replace the nearest industry code with the ones that are missing. With the following data, my code works well:

DATA work.sample;
	INPUT ID Year Industry;
	DATALINES; 
	004605 1972 . 
	004605 1973 .
	004605 1974 6798
	004610 1984 6799
	004610 1985 .
;
RUN;


data work.non_missing;
	set work.sample;
	if industry^=.;
	industry_correct=industry;
run;

proc sql;
	create table work.adding_industry as 
	select * 
	from work.sample as a 
	left join work.non_missing as b 
	on a.id=b.id;
quit;


However, with the following data (with non-identical industry codes for a given company) my code fails:

DATA work.sample;
	INPUT ID Year Industry;
	DATALINES; 
	004605 1972 . 
	004605 1973 .
	004605 1974 6798
	004605 1979 7788
	004610 1984 6799
	004610 1985 .
;
RUN;

Here is what the correct code will generate:

DATA work.want;
	INPUT ID Year Industry;
	DATALINES; 
	004605 1972 6798 
	004605 1973 6798
	004605 1974 6798
	004605 1979 7788
	004610 1984 6799
	004610 1985 6799
;
RUN;


I thought of the following, but I am wondering if there is a more robust way of doing this? 

proc sql;
	create table work.adding_industry as 
	select * 
	from work.sample as a 
	left join work.non_missing as b 
	on a.id=b.id and a.year=<b.year;
quit;

Accepted Solutions
Solution
‎06-12-2017 01:12 AM
Trusted Advisor
Posts: 1,554

Re: Replacing missing values with closest non-missing observation within the same group

Try next code:

 

proc sort data=have; by id year; run;

data temp;
 set have;
      by id ;
          retain ic;   /* industry code */
          if first.id then ic=industry;
          if industry=. then industry=ic;
          else ic=industry; drop ic;
run;

proc sort data=temp; by id descending year; run;

data want;
 set temp;
      by id ;
          retain ic;   /* industry code */
          if first.id then ic=industry;
          if industry=. then industry=ic;
          else ic=industry; drop ic;
run;
             

View solution in original post


All Replies
Solution
‎06-12-2017 01:12 AM
Trusted Advisor
Posts: 1,554

Re: Replacing missing values with closest non-missing observation within the same group

Try next code:

 

proc sort data=have; by id year; run;

data temp;
 set have;
      by id ;
          retain ic;   /* industry code */
          if first.id then ic=industry;
          if industry=. then industry=ic;
          else ic=industry; drop ic;
run;

proc sort data=temp; by id descending year; run;

data want;
 set temp;
      by id ;
          retain ic;   /* industry code */
          if first.id then ic=industry;
          if industry=. then industry=ic;
          else ic=industry; drop ic;
run;
             
Frequent Contributor
Posts: 110

Re: Replacing missing values with closest non-missing observation within the same group

[ Edited ]

Thanks, @Shmuel. The code works very well. The retain command is very powerful, I have to make use of it more often! 

The following part (let's call it part 1) of your code replaces missing observations with non-missing historical values:

data temp;
 set have;
      by id ;
          retain ic;   /* industry code */
          if first.id then ic=industry;
          if industry=. then industry=ic;
          else ic=industry; drop ic;
run;

The last part of your code (let's call it part 2) replaces the the missing observations with non-missing future values:

proc sort data=temp; by id descending year; run;

data want;
 set temp;
      by id ;
          retain ic;   /* industry code */
          if first.id then ic=industry;
          if industry=. then industry=ic;
          else ic=industry; drop ic;
run;


Suppose I have the following data:

DATA work.sample;
	INPUT ID Year Industry;
	DATALINES; 
	004610 1983 .
	004610 1984 6799
	004610 1985 9999
	004610 1986 .
	004610 1987 .
	004610 1988 8888
	004610 1989 .
	004610 1990 9999
;
RUN;

Whenever there is a future non-missing industry code and the code replaces missing industry code with that particular future non-missing idustry code, then we have the following output:

DATA work.want1;
	INPUT ID Year Industry;
	DATALINES; 
	004610 1983 6799
	004610 1984 6799
	004610 1985 9999
	004610 1986 8888
	004610 1987 8888
	004610 1988 8888
	004610 1989 9999
	004610 1990 9999
;


However, with your code the year=1989 is replaced with 8888 rather than 9999. Would you suggest just to remove the order of the code?  I.e., replace part 1 with part 2?

 

Update: @Shmuel, I just tried the reverse order (part 2, then part 1), but the output seems incorrect. 

Trusted Advisor
Posts: 1,554

Re: Replacing missing values with closest non-missing observation within the same group

You can try change sort types:

- 1st to descending

- 2nd drop the ascending to be ascening

 

I hope that may help. I haven't tried it.

 

Otherwise: maybe next code should help , to be entered after ascending sort


if last.id and industry=. then industry=9999;
Frequent Contributor
Posts: 110

Re: Replacing missing values with closest non-missing observation within the same group

This is very helpful, @Shmuel. I will give it a try. Thanks again for all of your valuable  suggestions.

Frequent Contributor
Posts: 77

Re: Replacing missing values with closest non-missing observation within the same group

DATA work.sample; INPUT ID Year Industry; DATALINES; 004605 1972 . 004605 1973 . 004605 1974 6798 004610 1984 6799 004610 1985 . ; RUN; PROC SORT DATA=SAMPLE ;BY ID Industry ;RUN; DATA SAMPLE_1; SET SAMPLE; BY ID ; IF LAST.ID THEN OUTPUT; RUN; PROC SQL; CREATE TABLE WANT AS SELECT A.*,B.Industry AS Industry_NEW FROM SAMPLE AS A LEFT JOIN SAMPLE_1 AS B ON A.ID = B.ID; QUIT;
Frequent Contributor
Posts: 110

Re: Replacing missing values with closest non-missing observation within the same group

Posted in reply to TarunKumar

 

Thanks, @TarunKumar. I have tried your code with the following dataset, but the output seems incorrect:

DATA work.sample;
	INPUT ID Year Industry;
	DATALINES; 
	004610 1983 .
	004610 1984 6799
	004610 1985 9999
	004610 1986 .
	004610 1987 .
	004610 1988 8888
	004610 1989 .
	004610 1990 9999
;
RUN;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 201 views
  • 3 likes
  • 3 in conversation