Find the missing values in the year

Solved
Frequent Contributor
Posts: 140

Find the missing values in the year

Hi i am having data like this

data ;
input no year sal;
cards;
10 2001 100
10 2002 102
11 2001 90
11 2002 12
11 2003 13
23 2002  100
23 2003 112
run;

the this i want is if the no is there and if missing in the years 2001,2002,2003 then we should create new entry of the saame
no with then next value

output

no year sal  new_entry
10 2001 100
10 2002 102
10 2003 102   y
11 2001 90
11 2002 12
11 2003 13
23 2001  100   y
23 2002  100
23 2003 112

in  this the new _entry ="Y' means some of the year data is missing so we should give the next data for no 10 the 2003 is missing the data is picked for 2002
and for no 23 2001 is missing the data is picked from 2002.

Accepted Solutions
Solution
‎01-29-2013 10:45 AM
Posts: 3,852

Re: Find the missing values in the year

Good point about my look ahead that may not be far enough.  If I wasn't too worried about performance I would just sort descending and use LOCF to get NOCB.

data sal;
input no year sal;
cards;
10 2001 100
10 2002 102
11 2001 90
11 2002 12
11 2003 13
23 2002 100
23 2003 112
run;
proc summary data=sal nway completetypes;

class no year;
output out=sal2(drop=_type_) idgroup(out(sal)=);
run;
data sal3; *LOCF;

update sal2(obs=0) sal2 end=eof;
by no;
output;

run;
proc sort data=sal3;
key no;

key year / descending;

run;
data sal4; *NOCB;

update sal3(obs=0) sal3;
by no;
output;

run;

All Replies
Super Contributor
Posts: 543

Re: Find the missing values in the year

Hi!

Here is a piece of code that gets what you want but not very elegantly:smileyplain:

data in ;

input no year sal;

cards;

10 2001 100

10 2002 102

11 2001 90

11 2002 12

11 2003 13

23 2002  100

23 2003 112

run;

/*I want to change data from 'long' to 'wide*/

/*first need to sort*/

proc sort data = in;by no;run;

data temp;

set in;

by no;

/*create a couple of arrays, so now we have a column for each year and each sal*/

keep no yr2001-yr2003 t_sal2001-t_sal2003;

retain yr2001-yr2003 t_sal2001-t_sal2003;

array yr(2001:2003) yr2001-yr2003;

array t_sal(2001:2003) t_sal2001-t_sal2003;

if first.no then do;

do i = 2001 to 2003;

yr(i) = .;

t_sal(i) = .;

end;

end;

yr(year) = year;

t_sal(year) = sal;

if last.no then output;

run;

/*using a couple of IF-THEN we fill in the missing year*/

data temp_1;

set temp;

if yr2003 = . then do;

if yr2002 ne . then yr2003 = yr2002;

if t_sal2002 ne . then t_sal2003 = t_sal2002;

n_entry2003 = "y";

end;

if yr2001 = . then do;

if yr2002 ne . then yr2001 = yr2002;

if t_sal2002 ne . then t_sal2001 = t_sal2002;

n_entry2001 = "y";

end;

run;

/*re-organize the data to long*/

data want;

set temp_1;

array yr(2001:2003) yr2001-yr2003;

array t_sal(2001:2003) t_sal2001-t_sal2003;

array n_entry(2001:2003) n_entry2001-n_entry2003;

do i = 2001 to 2003;

year = yr(i);

sal = t_sal(i);

new_entry = n_entry(i);

output;

end;

drop yr: t_sal: n_entry:;

run;

There are probably more compact, efficient way to accomplish this...but this should get you started.

Good luck!

Anca.

Regular Contributor
Posts: 195

Re: Find the missing values in the year

Hi ,

try the following code...

proc sort data = test;

by no year;

run;

/* Identify the missing year */

proc sql;

create table test1 as

select no,year,count(sal) as sal

from test

group by 1

having sal < 3;

quit;

/* Insert the row for missing years */

proc sql;

insert into test

set no = 10,

year = 2003,

sal = 102

set no = 23,

year = 2001,

sal = 100;

quit;

/* And finally add the New Entry column for missing years */

data test;

set test;

if no = 10 and year = 2003 then new_entry = 'Y;

if no = 23 and year = 2001 then new_entry = 'Y;

run;

proc sort data = test;

by no year;

run;

you can also take the missing year values in macro variables from the test1 dataset and automate the process...

Thanks

Super Contributor
Posts: 578

Re: Find the missing values in the year

If you don't mind creating a tmp table to hold all of the possible records:

data have;

input no year sal;

cards;

10 2001 100

10 2002 102

11 2001 90

11 2002 12

11 2003 13

23 2002  100

23 2003 112

run;

data years;

input year;

cards;

2001

2002

2003

;

run;

proc sql;

create table work.tmp as

select

tmp.no

,tmp.year

,have.sal

from

(select distinct

t1.no

,t2.year

from have t1, years t2) tmp

left outer join have

on tmp.no=have.no and tmp.year=have.year

order by tmp.no, tmp.year;

quit;

data want(keep=no year sal);

set tmp nobs=_Nobs;

by no year;

sal_prev = lag(sal);

sal_next=.;

RecPointer=_N_+1;

if sal=. then do;

if first.no then do;

/* get next sal*/

if RecPointer < _Nobs then do;

set tmp (keep=sal rename=(sal=sal_next)) point=RecPointer;

sal=sal_next;

end;

end;

else sal = sal_prev;

end;

run;

Posts: 3,852

Re: Find the missing values in the year

First you need to create the missing years.  There are many ways to do this and SAS has some features to help make it easy.  I use PROC SUMMARY COMPLETETYPES to add the missing years.  Then the problems is a simple LOCF and Next obs carried back (NOCB?)

data sal;
input no year sal;
cards;
10 2001 100
10 2002 102
11 2001 90
11 2002 12
11 2003 13
23 2002 100
23 2003 112
run;
proc summary data=sal nway completetypes;

class no year;
output out=sal2(drop=_type_) idgroup(out(sal)=);
run;
data sal3;
update sal2(obs=0) sal2 end=eof;
by no;
if not eof then set sal2(firstobs=2 keep=sal rename=(sal=nextsal));
if first.no and not last.no and missing(sal) then sal=nextsal;

output;

drop nextsal;
run;
You could drop _FREQ_ if you like.  When it is 0 it is an indication the observation has been LOCF or NOCB.
Regular Contributor
Posts: 151

Re: Find the missing values in the year

Nice answer, I was working on something very similar (using PROC SUMMARY with COMPLETETYPES).  The bit I was struggling with to code relatively simply was missing imputation and coping with both lag and lead values (the lag part is easy).

The only problem I can see with your solution is if there is only a value for 2003 in the example data (i.e. 2001 and 2002 are missing).  Using firstobs=2 will still return a missing value here.  Ideally you would want to read sequential records until a value is encountered, then return that.  The nature of the data means that it is not possible to go to the next BY group before a non-missing value occurs, so that isn't a worry.

Solution
‎01-29-2013 10:45 AM
Posts: 3,852

Re: Find the missing values in the year

Good point about my look ahead that may not be far enough.  If I wasn't too worried about performance I would just sort descending and use LOCF to get NOCB.

data sal;
input no year sal;
cards;
10 2001 100
10 2002 102
11 2001 90
11 2002 12
11 2003 13
23 2002 100
23 2003 112
run;
proc summary data=sal nway completetypes;

class no year;
output out=sal2(drop=_type_) idgroup(out(sal)=);
run;
data sal3; *LOCF;

update sal2(obs=0) sal2 end=eof;
by no;
output;

run;
proc sort data=sal3;
key no;

key year / descending;

run;
data sal4; *NOCB;

update sal3(obs=0) sal3;
by no;
output;

run;
🔒 This topic is solved and locked.