Solved: Creation of new column values with respect to multiple column values.

Saser_123 · Posted 03-26-2020 09:21 AM

Hello everyone,
Thank you for taking your time to read this question!

I would like to create a variable if some conditions are satisfied by other variables.

For example:

DS-1 dataset checks for 'name' values in 'col-2' values and then if the value is found then 'col-3' gets 'name' values

DS-1

name col-2 col-3

Y Y,H,J Y

Z H,K,R

C Y,C,Z C

DS-2 dataset is like a lookup table, it has desired 'name' values along with their respective 'id' values. Plz note that the

'id' column's order is important.

DS-2

name id

Y 1

C 2

Z 3

DS-3 dataset is the output dataset based on some conditions.

It conditionally compares DS-1 and DS-2.
conditionally means- if 'id' value (1) 's 'name' value(Y) is present in DS-1 then DS-3 will have 'Y' AS DISPLAYED IN DS-3 DATASET
if 'id' value (1) 's 'name' value(Y) is NOT present in DS-1 then it checks for the next 'id' value(2)'s 'name' value(c) in DS-1.
it continues to do this conditional check for all 'id' values until it finds one. If it doesnt find any then 'col_new' will be empty value.
So, it checks chronologically and if any value is found then it stops and does not check for the rest of it.

DS-3 (desired output)

col_new

Y

Could some one please help me code this logic? If my explanation needs to be rephrased; please do let me know.

Thank you for taking your time to read my question!

Tom · Posted 03-26-2020 09:53 PM

Something like this?

proc sort data=step1 ;
  by id;
run;

data want ;
  set step1;
  length new_var $8 ;
  new_var= coalescec(new_var,col3);
  retain new_var;
run;

Result:

Obs    name    col2     col3    id    new_var

 1      Y      Y,H,J     Y       1       Y
 2      C      Y,C,Z     C       2       Y
 3      Z      H,K,R             3       Y

View solution in original post

Saser_123 · Posted 03-26-2020 11:35 AM

I have tried a couple of ways using "first. and last.", "select- when", "case-when", and "if- then-else". Below
sample code that did not work properly. I am willing to take a total new approach if needed.

Thanks again!

data ds-3;
set ds-1 ds-2;
select(id);
  when(1) do;
    if length(col_3) ne 0 then col_new = col_3;
    else col_3 = "";
    end;
  when(2) do;
    if ( length(col_3) ne 0 ) and (lag(length(col_new) eq 0))) then col_new  = col_3; /* lag is used to check if the previous obs of col_new is empty or not */
     end;
 when(3) do;
    if ( length(col_3) ne 0 ) and (lag(length(col_new) eq 0))) then col_new  = col_3;
     end;
otherwise col_new = "";
end;
run;

ballardw · Posted 03-26-2020 11:53 AM

@Saser_123 wrote:

Hello everyone,
Thank you for taking your time to read this question!

I would like to create a variable if some conditions are satisfied by other variables.

For example:

DS-1 dataset checks for 'name' values in 'col-2' values and then if the value is found then 'col-3' gets 'name' values

DS-1

name col-2 col-3

Y Y,H,J Y

Z H,K,R

C Y,C,Z C

DS-2 dataset is like a lookup table, it has desired 'name' values along with their respective 'id' values. Plz note that the

'id' column's order is important.

DS-2

name id

Y 1

C 2

Z 3

DS-3 dataset is the output dataset based on some conditions.

It conditionally compares DS-1 and DS-2.

conditionally means- if 'id' value (1) 's 'name' value(Y) is present in DS-1 then DS-3 will have 'Y' AS DISPLAYED IN DS-3 DATASET

if 'id' value (1) 's 'name' value(Y) is NOT present in DS-1 then it checks for the next 'id' value(2)'s 'name' value(c) in DS-1.

it continues to do this conditional check for all 'id' values until it finds one. If it doesnt find any then 'col_new' will be empty value.

So, it checks chronologically and if any value is found then it stops and does not check for the rest of it.

DS-3 (desired output)

col_new

Y

Could some one please help me code this logic? If my explanation needs to be rephrased; please do let me know.

Thank you for taking your time to read my question!

Since every single example of your DS2 data set has a corresponding name in DS1 then you appear to missing at least one rule. Probably involving col_3 because you create the variable but then do not use it in any rule.

Saser_123 · Posted 03-26-2020 03:00 PM

Thank you Ballardw for your reply.

Regarding col_3:- The part that I had mentioned before conditionally means- if 'id' value (1) 's 'name' value(Y) is present in DS-1 then DS-3 will have 'Y' is done comparing "col_3" from DS-1 value with "name' and 'id' in DS-3. Hope I answered your question.
Ballardw, The situation I have is I have to select only one value from "col_3" in DS-1. The selection criteria is from 'id' in DS-2. If 'id' is 1 and its corresponding 'name' is present in 'col_3' . Then I use that for 'col_new'. If 'id' (1) has no matching 'name" value in 'col_3', then I go and check for 'id' (2).....till any 'id' value mach. If nothing matches then I just leave 'col_new' blank.

Does that explanation help you? If not please let me know I could rephrase it

Thank you

Tom · Posted 03-26-2020 03:32 PM

Your are not explaining what you mean. You show two datasets. If you want code to operate on values that are sourced from the two datasets you need to combine them somehow first. Are you saying that they should be combined two NAME variables have the same values?

Let's try it with your example data.

data ds1;
  input name $  col2 $ col3 $ ;
cards;
Y           Y,H,J          Y
Z           H,K,R          .
C          Y,C,Z           C  
;
 
data ds2;
  input name $ id ;
cards;
Y                1
C                2
Z                3
;

proc sort data=ds1; by name; run;
proc sort data=ds2; by name; run;
data step1;
  merge ds1 ds2;
  by name;
run;
proc print;
run;

So we get this:

Obs    name    col2     col3    id

 1      C      Y,C,Z     C       2
 2      Y      Y,H,J     Y       1
 3      Z      H,K,R             3

Now how do you get from this dataset STEP1 to what you want as the output? It appears you only want the row where NAME='Y'. Why? Why not the row where NAME='C' or NAME='Z'?

Saser_123 · Posted 03-26-2020 05:35 PM

Hi Tom,
You are correct, there was an intermediate step( joining the two datasets) that I did not mention. Thank you for mentioning that. So from the dataset "step1" I would like to get the values of 'name' based on values in 'id'. something like the below:

if ((id =1) and (length(col3 ne 0)) ) then new_var = col3 /* length(col3...) is to make sure it is not empty) */

else if ((id = 2) and length(col3 ne 0) and (lag(new_var) ne 0))) then new_var = col3 /* lag(new_var) to make sure that the previous value of new_var is not empty */

else if ((id = 3) and length(col3 ne 0) and (lag(new_var) ne 0))) then new_var = col3

so based on the above conditions new_var will take only one value from col3...
Does the above explanation make sense? if not, please do let me know I could rephrase it.

Thank you,

Tom · Posted 03-26-2020 06:25 PM

LAG() implies there is some type of order to the observations. What order you want the observations to be processed in? The original dataset DS1 did not have any order variable. Once it is sorted by NAME to enable merging the original order is gone. Do you have some other order variable?

Also is the real data grouped in anyway? Is there some other variable that groups the data. When using LAG() you do not want to "lag" a value across group boundaries. You don't want the first observations for the new group to see the last value from the old group.

it is not clear what you mean by "one value from col3". The COL3 variable only ever has one value per observation. Just as the NEW_VAR will only ever have one value per observation. Why did you eliminate two of the observations? Perhaps you want to collapse to only one observation per group? Again, is there some type of grouping variable?

Saser_123 · Posted 03-26-2020 09:04 PM

Thank you Tom for your quick response!

I may have used lag() incorrectly. This is the first time I am using it. I thought it will sort of retain the value of the previous observation in the current observation to do some analysis. I will have to read more on lag().

lets start fresh...

Obs    name    col2     col3    id

 1      C      Y,C,Z     C       2
 2      Y      Y,H,J     Y       1
 3      Z      H,K,R             3

from the above dataset I want to create a new variable 'new_var' that will have the value of 'col_3'... if:

1) id = 1 then new_var = y

2) say id=1 's respective col3 value is blank, then only then, go to the next id value that is (2). if it has a respective observation in col3 then new_var = c.

3) say id= 1 and id =2 's corresponding col3 values are blank, then and then only go to id=3....and so on

how will I do this for the above dataset...If I were to sort I would sort by the dataset by id.

Can you please help me with the logic to do that? I hope this clears all the confusion.

Note: This is a small part of the entire program and the id is dynamic for a dataset. I will have many such datasets like above with different id values( meaning ds-1 will have id's 1,2...DS-2 will have ids 1,2,3, ds-3 will have ids 1 etc..). but the logic is the same for all the datasets...

Thank you

Tom · Posted 03-26-2020 09:53 PM

Something like this?

proc sort data=step1 ;
  by id;
run;

data want ;
  set step1;
  length new_var $8 ;
  new_var= coalescec(new_var,col3);
  retain new_var;
run;

Result:

Obs    name    col2     col3    id    new_var

 1      Y      Y,H,J     Y       1       Y
 2      C      Y,C,Z     C       2       Y
 3      Z      H,K,R             3       Y

Saser_123 · Posted 03-27-2020 06:51 AM

Thank you Tom for the code. Yes it works! I have not used coalescec() , thank you for introducing it to me. Since it is sorted by id and taking the first non empty value for new_var is a smart thought!

Once again, Thank you for taking your time to help me out 😊

Creation of new column values with respect to multiple column values.

Re: Creation of new column values with respect to multiple column values.

Re: Creation of new column values with respect to multiple column values.

Re: Creation of new column values with respect to multiple column values.

Re: Creation of new column values with respect to multiple column values.

Re: Creation of new column values with respect to multiple column values.

Re: Creation of new column values with respect to multiple column values.

Re: Creation of new column values with respect to multiple column values.

Re: Creation of new column values with respect to multiple column values.

Re: Creation of new column values with respect to multiple column values.

Re: Creation of new column values with respect to multiple column values.

Ready to join fellow brilliant minds for the SAS Hackathon?