Hi, Community
I am trying to create a new variable combining two variables. My data looks like this (if pre_a has a value, post_a doesn't. If post_a has a value, pre_a doesn't..):
pre_a post_a pre_b post_b
2 . 3 .
. . . 2
. 3 3 .
. 3 . .
I would like to create a new variable 'Apple' combining pre_a and post_a, and 'Ball' combining pre_b and post_b. The dataset I would like to get looks like this:
pre_a post_a apple pre_b post_b ball
2 . 2 3 . 3
. . . . 2 2
. 3 3 3. . 3
. 3 3 . . .
I tried to use the following code:
apple=pre_a; if apple ne ' ' then output;
apple=post_a; if apple ne ' ' then output;
ball=pre_b; if ball ne ' ' then output;
ball=post_b; if ball ne ' ' then output;
keep apple ball; run;
If using the code, I got the apple variable, correctly, but not a ball variable. Could you give some advice to do this? Or, is there different way to create a new variable combining two variables?
Your answers would be very helpful! Thanks!
Part of what I believe to be your problem is that when you use OUTPUT the second variable has not been set yet. If you check the log you should see that your output data set has close to twice as many output records and input unless you have many records where both the pre and post values are missing.
I can't tell whether your variables are character or not (too many people mix statements with inappropriate operations for variable type)
Here is an example similar to the logic you are using. I am using numeric values but the behavior is similar with character.
data have; input x1 x2 y1 y2; datalines; 1 . . 2 . 2 . 2 1 . 2 . ; data example; set have; xx = x1; if xx ne . then output; xx = x2; if xx ne . then output; yy = y1; if yy ne . then output; yy = y2; if yy ne . then output; run;
Notice that the Example data set has 6 records in the output when only 3 are read in. Also the YY value is set on the 2nd, 4th and 6th records.
That is why I think you want the coalescec or coalesce function as suggested by @mkeintz . Example with the same data:
data better; set have; xx= coalesce (x1,x2); yy= coalesce (y1,y2); run;
If your variables are character then this may go what you are expecting:
data want; set have; apple = coalescec (pre_a, post_a); ball = coalescec (pre_b, post_b); run;
if the values are actually numeric instead of character the function would be Coalesce instead.
The two functions Coalesce and Coalescec return the first value found in the list of values which could be considerably more than just 2 values or variables.
When you have questions about unexpected behavior of code the best thing is to copy the entire data step or procedure code from the log with all the messages or warnings and then paste into a text box opened on the forum using the </> such as:
134 data example; 135 set have; 136 xx = x1; 137 if xx ne . then output; 138 xx = x2; 139 if xx ne . then output; 140 yy = y1; 141 if yy ne . then output; 142 yy = y2; 143 if yy ne . then output; 144 run; NOTE: There were 3 observations read from the data set WORK.HAVE. NOTE: The data set WORK.EXAMPLE has 6 observations and 6 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
The notes here clearly show reading 3 records and output 6. Since your example data has missing values for some of the pairs the number would not be exactly twice the number on the input set but I would expect it to be notably larger.
You also make it harder to tell what happened by only keeping the resulting Apple and Ball variables. When debugging code behavior you want to keep all of the variables that are involved so you can see what is calculated from which values.
Take a look at the Coalesce function.
And what should the value of "apple" be, if pre_a = 4 and post_a = 5? Or is it impossible that both variable are non-missing?
Part of what I believe to be your problem is that when you use OUTPUT the second variable has not been set yet. If you check the log you should see that your output data set has close to twice as many output records and input unless you have many records where both the pre and post values are missing.
I can't tell whether your variables are character or not (too many people mix statements with inappropriate operations for variable type)
Here is an example similar to the logic you are using. I am using numeric values but the behavior is similar with character.
data have; input x1 x2 y1 y2; datalines; 1 . . 2 . 2 . 2 1 . 2 . ; data example; set have; xx = x1; if xx ne . then output; xx = x2; if xx ne . then output; yy = y1; if yy ne . then output; yy = y2; if yy ne . then output; run;
Notice that the Example data set has 6 records in the output when only 3 are read in. Also the YY value is set on the 2nd, 4th and 6th records.
That is why I think you want the coalescec or coalesce function as suggested by @mkeintz . Example with the same data:
data better; set have; xx= coalesce (x1,x2); yy= coalesce (y1,y2); run;
If your variables are character then this may go what you are expecting:
data want; set have; apple = coalescec (pre_a, post_a); ball = coalescec (pre_b, post_b); run;
if the values are actually numeric instead of character the function would be Coalesce instead.
The two functions Coalesce and Coalescec return the first value found in the list of values which could be considerably more than just 2 values or variables.
When you have questions about unexpected behavior of code the best thing is to copy the entire data step or procedure code from the log with all the messages or warnings and then paste into a text box opened on the forum using the </> such as:
134 data example; 135 set have; 136 xx = x1; 137 if xx ne . then output; 138 xx = x2; 139 if xx ne . then output; 140 yy = y1; 141 if yy ne . then output; 142 yy = y2; 143 if yy ne . then output; 144 run; NOTE: There were 3 observations read from the data set WORK.HAVE. NOTE: The data set WORK.EXAMPLE has 6 observations and 6 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
The notes here clearly show reading 3 records and output 6. Since your example data has missing values for some of the pairs the number would not be exactly twice the number on the input set but I would expect it to be notably larger.
You also make it harder to tell what happened by only keeping the resulting Apple and Ball variables. When debugging code behavior you want to keep all of the variables that are involved so you can see what is calculated from which values.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.