BookmarkSubscribeRSS Feed
SeanZ
Obsidian | Level 7

There are missing values in my data set, and I want to backfill some missing values. See the example below

 

year     firm     value

2000     A          1

2001     A          .

2002     A          .

2003     A          9

2000     B          2

2001     B          .

2002     B          3

2003     B          1

 

I only want to backfill ONE year, after manipulation, I would like to get the results below.

 

 

year     firm     value

2000     A          1

2001     A          .

2002     A          9

2003     A          9

2000     B          2

2001     B          3

2002     B          3

2003     B          1

 

I actually know how to backfill values, but it will backfill all missing values. I want to know how to apply the criterion that only one year's data will be backfilled. Can any one help me with this? Thanks.

13 REPLIES 13
Reeza
Super User

In a roundabout way:

 

data have;
    input year firm  $   value;
    cards;
2000     A          1
2001     A          .
2002     A          .
2003     A          9
2000     B          2
2001     B          .
2002     B          3
2003     B          1
;
run;

proc sort data=have;
    by firm descending year;
run;

data want;
    set have;
    by firm;
    prev_value=lag(value);

    if first.firm then
        prev_value=.;

    if value=. then
        new_value=prev_value;
    else
        new_value=value;
run;

proc sort data=want;
    by firm year;
run;
SeanZ
Obsidian | Level 7

Thank you. But how can this way apply the year screening criterion? I only want to backfill three years of miss data.

Reeza
Super User

@SeanZ wrote:

Thank you. But how can this way apply the year screening criterion? I only want to backfill three years of miss data.


Then why did you ask for one year?

 

Please post data that reflects your new question.

SeanZ
Obsidian | Level 7

Thank you for your reply. I paste a new data set to reflect my question. I would like to backfill N years of missing data. In this case, I would like to set N=2. Eventually, I hope to control and set different N.

 

year     firm     value

2000     A          1

2001     A          .

2002     A          .

2003     A          .

2004     A          9

2000     B          2

2001     B          .

2002     B          3

2003     B          .

2004     B          .

2005     B          2

 

The data I want

 

year     firm     value

2000     A          1

2001     A          .

2002     A          9

2003     A          9

2004     A          9

2000     B          2

2001     B          3

2002     B          3

2003     B          2

2004     B          2

2005     B          2

 

Thanks.

Reeza
Super User

Slight variation, change the 2 to whatever number you want. Idea is basic, add a counter to determine how many records you've backfilled and then use that in your condition.

 

data have;
    input year firm  $   value;
    cards;
2000     A          1
2001     A          .
2002     A          .
2003     A          .
2004     A          9
2000     B          2
2001     B          .
2002     B          3
2003     B          .
2004     B          .
2005     B          2
;
run;

proc sort data=have;
    by firm descending year;
run;



data want;
    set have;
    by firm;
    retain new_value backfill;

    if first.firm then do;
       backfill=0;
        new_value=.;
    end;

    if value=. then do;
        backfill+1;
        if backfill > 2 then new_value=.;
    end;
    else do; 
        backfill=0;
        new_value=value;
    end;
run;

proc sort data=want;
    by firm year;
run;
PGStats
Opal | Level 21

Here is my take on it:

 

data have;
input year firm $ value;
datalines;
2000     A          1
2001     A          .
2002     A          .
2003     A          .
2004     A          9
2000     B          2
2001     B          .
2002     B          3
2003     B          .
2004     B          .
2005     B          2
;

proc sort data=have; by firm descending year; run;

%let n=2; /* The number of years to fill in */

data want(sortedby=firm descending year);
y = 99999;
do until(last.firm);
    set have; by firm;
    if missing(value) then do;
        if year + &n. >= y then value = v; 
        end;
    else do;
        y = year;
        v = value;
        end;
    output;
    end;
drop v y;
run;

proc sort data=want; by firm year; run;

proc print; run;

Note that by using the year in the replacement condition, omitted or duplicated years should be handled properly. The sortedby= dataset option is an attempt to improve the efficiency of the last sort (I don't know if it makes a real difference).

PG
PGStats
Opal | Level 21

Or use SQL:

 


proc sql;
create table want as
select a.year, a.firm, coalesce(a.value, b.value) as value
from have as a left join
    have as b on a.firm=b.firm and a.year+1 = b.year
order by firm, year;
quit;
PG
SeanZ
Obsidian | Level 7
Thanks for the solution.  I like this solution very much. But how to extend this to the case that I want to backfill 3 years instead if one?

Regards,

Sean
ballardw
Super User

@SeanZ wrote:
Thanks for the solution.  I like this solution very much. But how to extend this to the case that I want to backfill 3 years instead if one?

Regards,

Sean

Is your requirement specific numbers of years or "all but the first missing"?

SeanZ
Obsidian | Level 7

I will define the number of years to fill for missing values.

PGStats
Opal | Level 21

Although the SQL method can be extended to fill more than one year, it will not scale well to an arbitrary number of years. The same can be said for most methods presented so far, except for @Reeza's reverse sort approach. 

PG
Astounding
PROC Star

More specifically (but untested), it looks like you could take Reeza's approach and make minor changes.  First, indicate how many years you are willing to backfill:

 

%let n=2;

 

Then change one line in the middle.  Currently, the code contains:

 

if backfill > 2 then new_value = .;

 

Change that to read:

 

if backfill <= &N then value = new_value;

 

I'm not sure if Reeza intended that the proper result would be in VALUE or in NEW_VALUE.  In this modification, the proper value is VALUE, and you can drop NEW_VALUE.

Astounding
PROC Star

Assuming you don't want to backfill across firms, here's one way:

 

data want;

set have;

by firm;

set have (keep=value rename=(value=next_value) firstobs=2) have (drop=_all_);

if last.firm=0 and value=. then value = next_value;

drop next_value;

run;

 

It does replace all missing values.  However, if there are two missing values in a row, the replacement value (for the first one) will also be a missing value.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 3398 views
  • 1 like
  • 5 in conversation