I am trying to replace the missing industry codes for a given company with its most closest industry code (by date). I was able to write a code that replaces the missing value if there all industry codes for the given company are identical. However, in my data, occasionally the industry code changes, but I am not sure how I can replace the nearest industry code with the ones that are missing. With the following data, my code works well:
DATA work.sample;
INPUT ID Year Industry;
DATALINES;
004605 1972 .
004605 1973 .
004605 1974 6798
004610 1984 6799
004610 1985 .
;
RUN;
data work.non_missing;
set work.sample;
if industry^=.;
industry_correct=industry;
run;
proc sql;
create table work.adding_industry as
select *
from work.sample as a
left join work.non_missing as b
on a.id=b.id;
quit;
However, with the following data (with non-identical industry codes for a given company) my code fails:
DATA work.sample;
INPUT ID Year Industry;
DATALINES;
004605 1972 .
004605 1973 .
004605 1974 6798
004605 1979 7788
004610 1984 6799
004610 1985 .
;
RUN;
Here is what the correct code will generate:
DATA work.want;
INPUT ID Year Industry;
DATALINES;
004605 1972 6798
004605 1973 6798
004605 1974 6798
004605 1979 7788
004610 1984 6799
004610 1985 6799
;
RUN;
I thought of the following, but I am wondering if there is a more robust way of doing this?
proc sql;
create table work.adding_industry as
select *
from work.sample as a
left join work.non_missing as b
on a.id=b.id and a.year=<b.year;
quit;
Try next code:
proc sort data=have; by id year; run;
data temp;
set have;
by id ;
retain ic; /* industry code */
if first.id then ic=industry;
if industry=. then industry=ic;
else ic=industry; drop ic;
run;
proc sort data=temp; by id descending year; run;
data want;
set temp;
by id ;
retain ic; /* industry code */
if first.id then ic=industry;
if industry=. then industry=ic;
else ic=industry; drop ic;
run;
Try next code:
proc sort data=have; by id year; run;
data temp;
set have;
by id ;
retain ic; /* industry code */
if first.id then ic=industry;
if industry=. then industry=ic;
else ic=industry; drop ic;
run;
proc sort data=temp; by id descending year; run;
data want;
set temp;
by id ;
retain ic; /* industry code */
if first.id then ic=industry;
if industry=. then industry=ic;
else ic=industry; drop ic;
run;
Thanks, @Shmuel. The code works very well. The retain command is very powerful, I have to make use of it more often!
The following part (let's call it part 1) of your code replaces missing observations with non-missing historical values:
data temp;
set have;
by id ;
retain ic; /* industry code */
if first.id then ic=industry;
if industry=. then industry=ic;
else ic=industry; drop ic;
run;
The last part of your code (let's call it part 2) replaces the the missing observations with non-missing future values:
proc sort data=temp; by id descending year; run;
data want;
set temp;
by id ;
retain ic; /* industry code */
if first.id then ic=industry;
if industry=. then industry=ic;
else ic=industry; drop ic;
run;
Suppose I have the following data:
DATA work.sample;
INPUT ID Year Industry;
DATALINES;
004610 1983 .
004610 1984 6799
004610 1985 9999
004610 1986 .
004610 1987 .
004610 1988 8888
004610 1989 .
004610 1990 9999
;
RUN;
Whenever there is a future non-missing industry code and the code replaces missing industry code with that particular future non-missing idustry code, then we have the following output:
DATA work.want1;
INPUT ID Year Industry;
DATALINES;
004610 1983 6799
004610 1984 6799
004610 1985 9999
004610 1986 8888
004610 1987 8888
004610 1988 8888
004610 1989 9999
004610 1990 9999
;
However, with your code the year=1989 is replaced with 8888 rather than 9999. Would you suggest just to remove the order of the code? I.e., replace part 1 with part 2?
Update: @Shmuel, I just tried the reverse order (part 2, then part 1), but the output seems incorrect.
You can try change sort types:
- 1st to descending
- 2nd drop the ascending to be ascening
I hope that may help. I haven't tried it.
Otherwise: maybe next code should help , to be entered after ascending sort
if last.id and industry=. then industry=9999;
This is very helpful, @Shmuel. I will give it a try. Thanks again for all of your valuable suggestions.
Thanks, @TarunKumar. I have tried your code with the following dataset, but the output seems incorrect:
DATA work.sample;
INPUT ID Year Industry;
DATALINES;
004610 1983 .
004610 1984 6799
004610 1985 9999
004610 1986 .
004610 1987 .
004610 1988 8888
004610 1989 .
004610 1990 9999
;
RUN;
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.