Hi, I have two datasets right now and I want to merge them. The thing is they have in different size and there are missing values when merging them. Any suggestions are appreciated.
data return;
input year month return;
datalines;
2011 1 34
2011 2 33
2011 3 88
2011 4 67
2011 5 39
2011 6 45
2011 7 20
2011 8 29
2011 9 39
2011 10 23
2011 11 28
2011 12 38
;
run;
data GDP;
input year month gdp;
datalines;
2011 1 345
2011 4 289
2011 7 378
2011 10 589
;
run;
*this is what I did;
data combine;
merge return(in=ina)
gdp(in=inb);
by year month;
if in a;
output;
run;
data combine;
set combine;
if gdp=. then gdp=lag(gdp);
run;
this is what the output is:
what I want for the output is:
data want;
input year month return gdp;
datalines;
2011 1 34 345
2011 2 33 345
2011 3 88 345
2011 4 67 289
2011 5 39 289
2011 6 45 289
2011 7 20 378
2011 8 29 378
2011 9 39 378
2011 10 23 589
2011 11 28 589
2011 12 38 589
;
run;
Basically, I want the gdp not missing and if it is missing, then gdp that month should equal to the gdp that quarter the month belongs to.
Thank you!
data combine;
merge return(in=ina)
gdp(in=inb);
by year month;
run;
data combine2;
set combine;
retain gdp_filled;
if not missing(gdp) then gdp_filled = gdp;
run;
@LisaZ1 wrote:
Hi, I have two datasets right now and I want to merge them. The thing is they have in different size and there are missing values when merging them. Any suggestions are appreciated.
data return; input year month return; datalines; 2011 1 34 2011 2 33 2011 3 88 2011 4 67 2011 5 39 2011 6 45 2011 7 20 2011 8 29 2011 9 39 2011 10 23 2011 11 28 2011 12 38 ; run; data GDP; input year month gdp; datalines; 2011 1 345 2011 4 289 2011 7 378 2011 10 589 ; run; *this is what I did; data combine; merge return(in=ina) gdp(in=inb); by year month; if in a; output; run; data combine; set combine; if gdp=. then gdp=lag(gdp); run;
this is what the output is:
what I want for the output is:
data want; input year month return gdp; datalines; 2011 1 34 345 2011 2 33 345 2011 3 88 345 2011 4 67 289 2011 5 39 289 2011 6 45 289 2011 7 20 378 2011 8 29 378 2011 9 39 378 2011 10 23 589 2011 11 28 589 2011 12 38 589 ; run;
Basically, I want the gdp not missing and if it is missing, then gdp that month should equal to the gdp that quarter the month belongs to.
Thank you!
data combine;
merge return(in=ina)
gdp(in=inb);
by year month;
run;
data combine2;
set combine;
retain gdp_filled;
if not missing(gdp) then gdp_filled = gdp;
run;
@LisaZ1 wrote:
Hi, I have two datasets right now and I want to merge them. The thing is they have in different size and there are missing values when merging them. Any suggestions are appreciated.
data return; input year month return; datalines; 2011 1 34 2011 2 33 2011 3 88 2011 4 67 2011 5 39 2011 6 45 2011 7 20 2011 8 29 2011 9 39 2011 10 23 2011 11 28 2011 12 38 ; run; data GDP; input year month gdp; datalines; 2011 1 345 2011 4 289 2011 7 378 2011 10 589 ; run; *this is what I did; data combine; merge return(in=ina) gdp(in=inb); by year month; if in a; output; run; data combine; set combine; if gdp=. then gdp=lag(gdp); run;
this is what the output is:
what I want for the output is:
data want; input year month return gdp; datalines; 2011 1 34 345 2011 2 33 345 2011 3 88 345 2011 4 67 289 2011 5 39 289 2011 6 45 289 2011 7 20 378 2011 8 29 378 2011 9 39 378 2011 10 23 589 2011 11 28 589 2011 12 38 589 ; run;
Basically, I want the gdp not missing and if it is missing, then gdp that month should equal to the gdp that quarter the month belongs to.
Thank you!
Some details why your LAG did not work as desired. The LAG and DIF functions are queues. That means that when you use them conditionally, as in
if gdp=. then gdp=lag(gdp);
That the previous point in the queue was the last time that GDP was missing (the IF gdp=. part of your code). So that means the lag value was also missing.
The typical LAG based approaches, when appropriate, usually work a bit better when unconditionally referencing the Lag function. So you might try
Lg = lag(gdp); if gdp=. then gdp=lg; drop lg;
However there are some timing details as to where the Lag values are read. So if you run this code which does not drop the LG variable from the result you will see that because of when Lag pulls data from the input data that it would still be missing when you are more than one observation from the last value of the gdp variable.
data combinenew; set combine; lg = lag(gdp); if gdp=. then gdp=lg; run;
The Retain keeps values of specified variables across iterations of the data step. So if you assign gdp to a variable when it is not missing it will be available for use similar to the way you intended the Lag to work.
Caution with Lag, Dif and Retain: make sure that if you are doing By group processing to reset or consider actual use of lagged or retained values at the boundaries of a by group. Suppose your data represented multiple areas for the same time frame. You would not want the first value of gdp, if missing for one region, to use the last value from the previous region. (Hint: By statment and First. and Last. automatic variables)
When you MERGE datasets and one of the datasets is not contributing to a particular BY group then the values of the variables contributed by that dataset are not retained.
Since your GDP values seem to be quarterly why not make a QTR variable and merge by that?
data return;
input year month return;
qtr = ceil(month/3);
datalines;
2011 1 34
2011 2 33
2011 3 88
2011 4 67
2011 5 39
2011 6 45
2011 7 20
2011 8 29
2011 9 39
2011 10 23
2011 11 28
2011 12 38
;
data GDP;
input year month gdp;
qtr = ceil(month/3);
datalines;
2011 1 345
2011 4 289
2011 7 378
2011 10 589
;
data want;
merge return gdp ;
by year qtr;
run;
Results:
Obs year month return qtr gdp 1 2011 1 34 1 345 2 2011 2 33 1 345 3 2011 3 88 1 345 4 2011 4 67 2 289 5 2011 5 39 2 289 6 2011 6 45 2 289 7 2011 7 20 3 378 8 2011 8 29 3 378 9 2011 9 39 3 378 10 2011 10 23 4 589 11 2011 11 28 4 589 12 2011 12 38 4 589
Otherwise make a NEW variable so that you can retain its value.
data want;
merge return gdp ;
by year month;
gdp_fixed = coalesce(gdp,gdp_fixed);
retain gdp_fixed;
run;
Result
gdp_ Obs year month return gdp fixed 1 2011 1 34 345 345 2 2011 2 33 . 345 3 2011 3 88 . 345 4 2011 4 67 289 289 5 2011 5 39 . 289 6 2011 6 45 . 289 7 2011 7 20 378 378 8 2011 8 29 . 378 9 2011 9 39 . 378 10 2011 10 23 589 589 11 2011 11 28 . 589 12 2011 12 38 . 589
You want an LOCF (last observation carried forward) for the GDP variable. Using a conditional SET statement will do that. Once you understand the implications of the conditional SET, it is strikingly simple.
data return;
input year month return;
datalines;
2011 1 34
2011 2 33
2011 3 88
2011 4 67
2011 5 39
2011 6 45
2011 7 20
2011 8 29
2011 9 39
2011 10 23
2011 11 28
2011 12 38
run;
data GDP;
input year month gdp;
datalines;
2011 1 345
2011 4 289
2011 7 378
2011 10 589
run;
data want;
set gdp (drop=gdp in=ingdp) return (in=inreturn);
by year month;
if ingdp=1 then set gdp (keep=gdp);
if inreturn;
run;
The reason this carries forward the GDP value is because the GDP variable is only read in occasionally. Whenever the observation in hand is from the GDP dataset (ingdp=1) only then read in the GDP variable. It will never be automatically reset to missing (all vars from a SET statement are always retained until replaced by the next time that SET statement is executed). It will change only when a new GDP observation is encountered.
You could carry forward any number of variables from the gdp dataset. Just specify them in the appropriate DROP= and KEEP= dataset name options.
No need for lags. No need for RETAIN.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.