Solved: set missing values based on conditions

LisaZ1 · Posted 08-10-2022 05:38 PM

Hi, I have two datasets right now and I want to merge them. The thing is they have in different size and there are missing values when merging them. Any suggestions are appreciated.

data return;
	input year month return;
	datalines;
	2011 1 34
	2011 2 33
	2011 3 88
	2011 4 67
	2011 5 39
	2011 6 45
	2011 7 20
	2011 8 29
	2011 9 39
	2011 10 23
	2011 11 28
	2011 12 38
	;
run;

data GDP;
	input year month gdp;
	datalines;
	2011 1 345
	2011 4 289
	2011 7 378
	2011 10 589
	;
run;

*this is what I did;
data combine;
	merge return(in=ina)
		gdp(in=inb);
	by year month;
	if in a;
	output;
run;
data combine;
	set combine;
	if gdp=. then gdp=lag(gdp);
run;

this is what the output is:

what I want for the output is:

data want;
	input year month return gdp;
	datalines;
	2011 1 34 345
	2011 2 33 345
	2011 3 88 345
	2011 4 67 289
	2011 5 39 289
	2011 6 45 289
	2011 7 20 378
	2011 8 29 378
	2011 9 39 378
	2011 10 23 589
	2011 11 28 589
	2011 12 38 589
	;
run;

Basically, I want the gdp not missing and if it is missing, then gdp that month should equal to the gdp that quarter the month belongs to.

Thank you!

Reeza · Posted 08-10-2022 05:58 PM

data combine;
	merge return(in=ina)
		gdp(in=inb);
	by year month;
run;
data combine2;
	set combine;
	retain gdp_filled;
      if not missing(gdp) then gdp_filled = gdp;
run;

@LisaZ1 wrote:

Hi, I have two datasets right now and I want to merge them. The thing is they have in different size and there are missing values when merging them. Any suggestions are appreciated.
data return;
	input year month return;
	datalines;
	2011 1 34
	2011 2 33
	2011 3 88
	2011 4 67
	2011 5 39
	2011 6 45
	2011 7 20
	2011 8 29
	2011 9 39
	2011 10 23
	2011 11 28
	2011 12 38
	;
run;

data GDP;
	input year month gdp;
	datalines;
	2011 1 345
	2011 4 289
	2011 7 378
	2011 10 589
	;
run;

*this is what I did;
data combine;
	merge return(in=ina)
		gdp(in=inb);
	by year month;
	if in a;
	output;
run;
data combine;
	set combine;
	if gdp=. then gdp=lag(gdp);
run;
this is what the output is:

what I want for the output is:
data want;
	input year month return gdp;
	datalines;
	2011 1 34 345
	2011 2 33 345
	2011 3 88 345
	2011 4 67 289
	2011 5 39 289
	2011 6 45 289
	2011 7 20 378
	2011 8 29 378
	2011 9 39 378
	2011 10 23 589
	2011 11 28 589
	2011 12 38 589
	;
run;
Basically, I want the gdp not missing and if it is missing, then gdp that month should equal to the gdp that quarter the month belongs to.

Thank you!

View solution in original post

Reeza · Posted 08-10-2022 05:58 PM

data combine;
	merge return(in=ina)
		gdp(in=inb);
	by year month;
run;
data combine2;
	set combine;
	retain gdp_filled;
      if not missing(gdp) then gdp_filled = gdp;
run;

@LisaZ1 wrote:

Hi, I have two datasets right now and I want to merge them. The thing is they have in different size and there are missing values when merging them. Any suggestions are appreciated.
data return;
	input year month return;
	datalines;
	2011 1 34
	2011 2 33
	2011 3 88
	2011 4 67
	2011 5 39
	2011 6 45
	2011 7 20
	2011 8 29
	2011 9 39
	2011 10 23
	2011 11 28
	2011 12 38
	;
run;

data GDP;
	input year month gdp;
	datalines;
	2011 1 345
	2011 4 289
	2011 7 378
	2011 10 589
	;
run;

*this is what I did;
data combine;
	merge return(in=ina)
		gdp(in=inb);
	by year month;
	if in a;
	output;
run;
data combine;
	set combine;
	if gdp=. then gdp=lag(gdp);
run;
this is what the output is:

what I want for the output is:
data want;
	input year month return gdp;
	datalines;
	2011 1 34 345
	2011 2 33 345
	2011 3 88 345
	2011 4 67 289
	2011 5 39 289
	2011 6 45 289
	2011 7 20 378
	2011 8 29 378
	2011 9 39 378
	2011 10 23 589
	2011 11 28 589
	2011 12 38 589
	;
run;
Basically, I want the gdp not missing and if it is missing, then gdp that month should equal to the gdp that quarter the month belongs to.

Thank you!

ballardw · Posted 08-10-2022 06:39 PM

Some details why your LAG did not work as desired. The LAG and DIF functions are queues. That means that when you use them conditionally, as in

if gdp=. then gdp=lag(gdp);

That the previous point in the queue was the last time that GDP was missing (the IF gdp=. part of your code). So that means the lag value was also missing.

The typical LAG based approaches, when appropriate, usually work a bit better when unconditionally referencing the Lag function. So you might try

Lg = lag(gdp);
if gdp=. then gdp=lg;
drop lg;

However there are some timing details as to where the Lag values are read. So if you run this code which does not drop the LG variable from the result you will see that because of when Lag pulls data from the input data that it would still be missing when you are more than one observation from the last value of the gdp variable.

data combinenew;
	set combine;
   lg = lag(gdp);
   if gdp=. then gdp=lg;
run;

The Retain keeps values of specified variables across iterations of the data step. So if you assign gdp to a variable when it is not missing it will be available for use similar to the way you intended the Lag to work.

Caution with Lag, Dif and Retain: make sure that if you are doing By group processing to reset or consider actual use of lagged or retained values at the boundaries of a by group. Suppose your data represented multiple areas for the same time frame. You would not want the first value of gdp, if missing for one region, to use the last value from the previous region. (Hint: By statment and First. and Last. automatic variables)

Tom · Posted 08-10-2022 08:18 PM

When you MERGE datasets and one of the datasets is not contributing to a particular BY group then the values of the variables contributed by that dataset are not retained.

Since your GDP values seem to be quarterly why not make a QTR variable and merge by that?

data return;
	input year month return;
  qtr = ceil(month/3);
datalines;
2011 1 34
2011 2 33
2011 3 88
2011 4 67
2011 5 39
2011 6 45
2011 7 20
2011 8 29
2011 9 39
2011 10 23
2011 11 28
2011 12 38
;

data GDP;
	input year month gdp;
  qtr = ceil(month/3);
datalines;
2011 1 345
2011 4 289
2011 7 378
2011 10 589
;

data want;
  merge return gdp ;
  by year qtr;
run;

Results:

Obs    year    month    return    qtr    gdp

  1    2011       1       34       1     345
  2    2011       2       33       1     345
  3    2011       3       88       1     345
  4    2011       4       67       2     289
  5    2011       5       39       2     289
  6    2011       6       45       2     289
  7    2011       7       20       3     378
  8    2011       8       29       3     378
  9    2011       9       39       3     378
 10    2011      10       23       4     589
 11    2011      11       28       4     589
 12    2011      12       38       4     589

Otherwise make a NEW variable so that you can retain its value.

data want;
  merge return gdp ;
  by year month;
  gdp_fixed = coalesce(gdp,gdp_fixed);
  retain gdp_fixed;
run;

Result

                                          gdp_
Obs    year    month    return    gdp    fixed

  1    2011       1       34      345     345
  2    2011       2       33        .     345
  3    2011       3       88        .     345
  4    2011       4       67      289     289
  5    2011       5       39        .     289
  6    2011       6       45        .     289
  7    2011       7       20      378     378
  8    2011       8       29        .     378
  9    2011       9       39        .     378
 10    2011      10       23      589     589
 11    2011      11       28        .     589
 12    2011      12       38        .     589

mkeintz · Posted 08-10-2022 08:25 PM

You want an LOCF (last observation carried forward) for the GDP variable. Using a conditional SET statement will do that. Once you understand the implications of the conditional SET, it is strikingly simple.

data return;
	input year month return;
	datalines;
	2011 1 34
	2011 2 33
	2011 3 88
	2011 4 67
	2011 5 39
	2011 6 45
	2011 7 20
	2011 8 29
	2011 9 39
	2011 10 23
	2011 11 28
	2011 12 38
run;

data GDP;
	input year month gdp;
	datalines;
	2011 1 345
	2011 4 289
	2011 7 378
	2011 10 589
run;

data want;
  set gdp (drop=gdp in=ingdp) return (in=inreturn);
  by year month;
  if ingdp=1 then set gdp (keep=gdp);
  if inreturn;
run;

The reason this carries forward the GDP value is because the GDP variable is only read in occasionally. Whenever the observation in hand is from the GDP dataset (ingdp=1) only then read in the GDP variable. It will never be automatically reset to missing (all vars from a SET statement are always retained until replaced by the next time that SET statement is executed). It will change only when a new GDP observation is encountered.

You could carry forward any number of variables from the gdp dataset. Just specify them in the appropriate DROP= and KEEP= dataset name options.

No need for lags. No need for RETAIN.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

set missing values based on conditions

Re: set missing values based on conditions

Re: set missing values based on conditions

Re: set missing values based on conditions

Re: set missing values based on conditions

Re: set missing values based on conditions

set missing values based on conditions

Re: set missing values based on conditions

Re: set missing values based on conditions

Re: set missing values based on conditions

Re: set missing values based on conditions

Re: set missing values based on conditions

Registration is open