I am trying to replace the missing industry codes for a given company with its most closest industry code (by date). I was able to write a code that replaces the missing value if there all industry codes for the given company are identical. However, in my data, occasionally the industry code changes, but I am not sure how I can replace the nearest industry code with the ones that are missing. With the following data, my code works well:
DATA work.sample;
INPUT ID Year Industry;
DATALINES;
004605 1972 .
004605 1973 .
004605 1974 6798
004610 1984 6799
004610 1985 .
;
RUN;
data work.non_missing;
set work.sample;
if industry^=.;
industry_correct=industry;
run;
proc sql;
create table work.adding_industry as
select *
from work.sample as a
left join work.non_missing as b
on a.id=b.id;
quit;
However, with the following data (with non-identical industry codes for a given company) my code fails:
DATA work.sample;
INPUT ID Year Industry;
DATALINES;
004605 1972 .
004605 1973 .
004605 1974 6798
004605 1979 7788
004610 1984 6799
004610 1985 .
;
RUN;
Here is what the correct code will generate:
DATA work.want;
INPUT ID Year Industry;
DATALINES;
004605 1972 6798
004605 1973 6798
004605 1974 6798
004605 1979 7788
004610 1984 6799
004610 1985 6799
;
RUN;
I thought of the following, but I am wondering if there is a more robust way of doing this?
proc sql;
create table work.adding_industry as
select *
from work.sample as a
left join work.non_missing as b
on a.id=b.id and a.year=<b.year;
quit;
Try next code:
proc sort data=have; by id year; run;
data temp;
set have;
by id ;
retain ic; /* industry code */
if first.id then ic=industry;
if industry=. then industry=ic;
else ic=industry; drop ic;
run;
proc sort data=temp; by id descending year; run;
data want;
set temp;
by id ;
retain ic; /* industry code */
if first.id then ic=industry;
if industry=. then industry=ic;
else ic=industry; drop ic;
run;
Try next code:
proc sort data=have; by id year; run;
data temp;
set have;
by id ;
retain ic; /* industry code */
if first.id then ic=industry;
if industry=. then industry=ic;
else ic=industry; drop ic;
run;
proc sort data=temp; by id descending year; run;
data want;
set temp;
by id ;
retain ic; /* industry code */
if first.id then ic=industry;
if industry=. then industry=ic;
else ic=industry; drop ic;
run;
Thanks, @Shmuel. The code works very well. The retain command is very powerful, I have to make use of it more often!
The following part (let's call it part 1) of your code replaces missing observations with non-missing historical values:
data temp;
set have;
by id ;
retain ic; /* industry code */
if first.id then ic=industry;
if industry=. then industry=ic;
else ic=industry; drop ic;
run;
The last part of your code (let's call it part 2) replaces the the missing observations with non-missing future values:
proc sort data=temp; by id descending year; run;
data want;
set temp;
by id ;
retain ic; /* industry code */
if first.id then ic=industry;
if industry=. then industry=ic;
else ic=industry; drop ic;
run;
Suppose I have the following data:
DATA work.sample;
INPUT ID Year Industry;
DATALINES;
004610 1983 .
004610 1984 6799
004610 1985 9999
004610 1986 .
004610 1987 .
004610 1988 8888
004610 1989 .
004610 1990 9999
;
RUN;
Whenever there is a future non-missing industry code and the code replaces missing industry code with that particular future non-missing idustry code, then we have the following output:
DATA work.want1;
INPUT ID Year Industry;
DATALINES;
004610 1983 6799
004610 1984 6799
004610 1985 9999
004610 1986 8888
004610 1987 8888
004610 1988 8888
004610 1989 9999
004610 1990 9999
;
However, with your code the year=1989 is replaced with 8888 rather than 9999. Would you suggest just to remove the order of the code? I.e., replace part 1 with part 2?
Update: @Shmuel, I just tried the reverse order (part 2, then part 1), but the output seems incorrect.
You can try change sort types:
- 1st to descending
- 2nd drop the ascending to be ascening
I hope that may help. I haven't tried it.
Otherwise: maybe next code should help , to be entered after ascending sort
if last.id and industry=. then industry=9999;
This is very helpful, @Shmuel. I will give it a try. Thanks again for all of your valuable suggestions.
Thanks, @TarunKumar. I have tried your code with the following dataset, but the output seems incorrect:
DATA work.sample;
INPUT ID Year Industry;
DATALINES;
004610 1983 .
004610 1984 6799
004610 1985 9999
004610 1986 .
004610 1987 .
004610 1988 8888
004610 1989 .
004610 1990 9999
;
RUN;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.