Solved: Re: Value assignment based on multiple conditions resembling decision ...

dzurov · Posted 12-22-2021 06:54 PM

Hi

I need help translating the below table in SAS code to be able to assign the values found in the variable "Points" in another large target dataset. The target dataset contains the variables "Sport", "Tournament", "Held", "Age", "Country", "Season", "Year", "Length", and "Surface" except the variable "Points". So the table below can be seen as a map that has the assignment rules branching like tree to the "Point" value.

Some observations about the map table (below) and desired target dataset:

The variables "Sport" and "Tournament" will always have values populated and the other variables can have blank values.
The "Points" can be assigned in a decision tree format branching from first column "Sport" then down to the last column "Surface". Hence the order of the columns is very important.
The "Points" assignment is will be based on the criteria shown in each row. If perfect match is not found the row with a most closely match will be used to assign the "Points" value
Blank value can take any value not explicitly specified in the map table, e.g., see row #3... if in the target dataset a record matches all values (i.e., tennis, 34, Y,(15-30), France, Fall,2017, 5 Days) except the Surface variable which has a value "Grass" then this record will still get the Points assignment of 39

My initial idea was to create a conditional statement for each row and assign the "Points" value that way. For instance if we take the example above for row #3 I would have something like

if (Sport='Tennis'
 and Tournament=45
 and Held='Y'
 and Age='(15-30)'
 and Country='France'
 and Season='Fall'
 and Year=2017
 and Length ='5 Days'
 and Surface ne 'Hard'
 and Surface ne 'Clay'
) then Points=39;

The above seems pretty time consuming especially if the map table has 100+ conditional rows. I was wondering if there is a better way to tackle this problem.

Dataset A:

#	Sport	Tournament	Held	Age	Country	Season	Year	Length	Surface	Points
1	Tennis	45	Y	(15-30)	Italy	Fall	2017	5 Days	Hard	44
2	Tennis	45	Y	(15-30)	France	Summer	2017	5 Days	Clay	46
3	Tennis	45	Y	(15-30)	France	Fall	2017	5 Days		39
4	Tennis	45	Y	(15-30)	Austria	Summer	2016	5 Days		44
5	Tennis	45	Y	(30-60)	Italy	Summer	2014			44
6	Tennis	45	Y	(30-60)	France	Summer	2017			46
7	Tennis	45	Y	(15-30)	Italy	Summer	2014			45
8	Tennis	45	Y	(15-30)	France	Summer	2017			46
9	Tennis	45	Y	(15-30)	Austria	Summer	2014			44
10	Tennis	45	Y	(30-60)	France	Summer				44
11	Tennis	45	Y	(15-30)	France	Summer				44
12	Tennis	38	N	(5-15)	France	Spring				26
13	Tennis	38	N	(5-15)	Austria	Summer				27
14	Tennis	38	N	(30-60)	Italy	Winter				26
15	Tennis	38	N	(30-60)	Italy	Spring				26
16	Tennis	38	N	(15-30)	Austria	Spring				26
17	Tennis	38	N	(5-15)	Austria					26
18	Tennis	38	Y							30
19	Tennis	54	N							9
20	Tennis	54								21
21	Tennis	60								27
22	Golf	20	Y	(5-15)	Austria	Summer	2017	3 Day	Grass	18
23	Golf	20	Y	(30-60)	Italy	Summer	2016	3 Day		18
24	Golf	20	Y	(30-60)	France	Summer	2014			18
25	Golf	20	Y	(15-30)	Italy	Summer				18
26	Golf	20	Y	(15-30)	Austria	Summer				18
27	Golf	20	N							1
28	Golf	20								61
...	...	...	...	...	...	...	...	...	...	...

The target dataset with assigned values based on the rules provided in the table above should look like this:

Dataset B:

Sport	Tournament	Held	Age	Country	Season	Year	Length	Surface	Points
Tennis	45	Y	(15-30)	Italy	Fall	2017	5 Days	Hard	44
Tennis	45	Y	(15-30)	France	Summer	2017	5 Days	Clay	46
Tennis	45	Y	(15-30)	France	Fall	2017	5 Days	Grass	44
Tennis	45	Y	(15-30)	Austria	Summer	2016	5 Days	Green	44
Tennis	45	Y	(30-60)	Italy	Summer	2014	7 Days	Hard/Clay	44
Tennis	45	Y	(30-60)	France	Summer	2017	10 Days	Hard/Clay	46
Tennis	45	Y	(15-30)	Italy	Summer	2014	7 Days	Grass	44
Tennis	45	Y	(15-30)	France	Summer	2017	7 Days	Grass	46
Tennis	45	Y	(15-30)	Austria	Summer	2014	7 Days	Grass	44
Tennis	45	Y	(30-60)	France	Summer	2018	5 Days	Clay	44
Tennis	45	Y	(15-30)	France	Summer	2018	5 Days	Clay	44
Tennis	38	N	(5-15)	France	Spring	2017	5 Days	Clay	26
Tennis	38	N	(5-15)	Austria	Summer	2017	5 Days	Hard	27
Tennis	38	N	(30-60)	Italy	Spring	2017	5 Days	Hard	26
Tennis	38	N	(15-30)	Austria	Spring	2017	7 Days	Grass	26
Tennis	38	N	(5-15)	Austria	Winter	2017	7 Days	Grass	26
Tennis	38	Y	(15-30)	France	Summer	2017	5 Days	Clay	30
Tennis	54	N	(15-30)	Italy	Fall	2017	5 Days	Hard	9
Tennis	54	Y	(15-30)	Italy	Summer	2014	7 Days	Grass	21
Tennis	60	Y	(15-30)	France	Summer	2018	5 Days	Clay	27

Thank you for your help.

Tom · Posted 12-23-2021 11:53 AM

Note it is just as easy to post the sample data as SAS code as it is to go to the trouble of creating a table or spreadsheet.

data have;
  infile cards dsd dlm='|' truncover;
  input Sport :$20. Tournament Held :$1. Age :$20. Country :$20.
        Season :$10. Year Length :$20. Surface :$10. Points
  ;
cards;
Tennis|45|Y|(15-30)|Italy|Fall|2017|5 Days|Hard|44
Tennis|45|Y|(15-30)|France|Summer|2017|5 Days|Clay|46
Tennis|45|Y|(15-30)|France|Fall|2017|5 Days| |39
Tennis|45|Y|(15-30)|Austria|Summer|2016|5 Days| |44
Tennis|45|Y|(30-60)|Italy|Summer|2014| | |44
Tennis|45|Y|(30-60)|France|Summer|2017| | |46
;

You should be able to use the data to generate code. The data step is an excellent tool for generating text files.

filename code temp;
data _null_;
  set have ;
  file code column=col ;
  if _n_ > 1 then put 'else ' @;
  put 'if ' surface= :$quote. @;
  put 'and ' age= :$quote. @;
  if col>65 then put / @5 @;
  put 'and ' country= :$quote. @;
  if col>65 then put / @5 @;
  put 'and ' season= :$quote. @;
  if col>65 then put / @5 @;
  put 'and ' year= @;
  if not missing(surface) then put 'and ' surface= :$quote. @ ;
  if col>65 then put / @5 @;
  put 'then ' points= ';' ;
run;

Since you didn't provide any sample input data let's just use your rules database and drop the existing POINTS variable.

data want;
  set have (drop=points);
%include code / source2;
run;

Here is what the SAS log looks like:

649   data want;
650     set have (drop=points);
651   %include code / source2;
NOTE: %INCLUDE (level 1) file CODE is file C:\...\#LN00065.
652  +if Surface="Hard" and Age="(15-30)" and Country="Italy" and Season="Fall"
653  +    and Year=2017 and Surface="Hard" then Points=44 ;
654  +else if Surface="Clay" and Age="(15-30)" and Country="France" and Season="Summer"
655  +    and Year=2017 and Surface="Clay" then Points=46 ;
656  +else if Surface="" and Age="(15-30)" and Country="France" and Season="Fall"
657  +    and Year=2017 then Points=39 ;
658  +else if Surface="" and Age="(15-30)" and Country="Austria" and Season="Summer"
659  +    and Year=2016 then Points=44 ;
660  +else if Surface="" and Age="(30-60)" and Country="Italy" and Season="Summer"
661  +    and Year=2014 then Points=44 ;
662  +else if Surface="" and Age="(30-60)" and Country="France" and Season="Summer"
663  +    and Year=2017 then Points=46 ;
NOTE: %INCLUDE (level 1) ending.
664   run;

NOTE: There were 6 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 6 observations and 10 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

View solution in original post

Reeza · Posted 12-22-2021 07:11 PM

If the original table doesn’t have the points then where did 39 come from?

Where do the rules for the points come from? A table? Rules you know in your head?

It would help to show what information you start with and what you expect as output but I’m not following that example at all.

@dzurov wrote:

Hi

I need help translating the below table in SAS code to be able to assign the values found in the variable "Points" in another large target dataset. The target dataset contains the variables "Sport", "Tournament", "Held", "Age", "Country", "Season", "Year", "Length", and "Surface" except the variable "Points". So the table below can be seen as a map that has the assignment rules branching like tree to the "Point" value.

Some observations about the map table (below) and desired target dataset:

The variables "Sport" and "Tournament" will always have values populated and the other variables can have blank values.

The "Points" can be assigned in a decision tree format branching from first column "Sport" then down to the last column "Surface". Hence the order of the columns is very important.

The "Points" assignment is will be based on the criteria shown in each row. If perfect match is not found the row with a most closely match will be used to assign the "Points" value

Blank value can take any value not explicitly specified in the map table, e.g., see row #3... if in the target dataset a record matches all values (i.e., tennis, 34, Y,(15-30), France, Fall,2017, 5 Days) except the Surface variable which has a value "Grass" then this record will still get the Points assignment of 39

My initial idea was to create a conditional statement for each row and assign the "Points" value that way. For instance if we take the example above for row #3 I would have something like
if ((Sport='Tennis' and Tournament=45 and Held='Y' and Age='(15-30)' and Country='France' and 
Season='Fall' and Year=2017 and Length ='5 Days' and Surface ne 'Hard' and Surface ne 'Clay') then Points=39;
The above seems pretty time consuming especially if the map table has 100+ conditional rows. I was wondering if there is a better way to tackle this problem.

# Sport Tournament Held Age Country Season Year Length Surface Points

1 Tennis 45 Y (15-30) Italy Fall 2017 5 Days Hard 44

2 Tennis 45 Y (15-30) France Summer 2017 5 Days Clay 46

3 Tennis 45 Y (15-30) France Fall 2017 5 Days 39

4 Tennis 45 Y (15-30) Austria Summer 2016 5 Days 44

5 Tennis 45 Y (30-60) Italy Summer 2014 44

6 Tennis 45 Y (30-60) France Summer 2017 46

7 Tennis 45 Y (15-30) Italy Summer 2014 45

8 Tennis 45 Y (15-30) France Summer 2017 46

9 Tennis 45 Y (15-30) Austria Summer 2014 44

10 Tennis 45 Y (30-60) France Summer 44

11 Tennis 45 Y (15-30) France Summer 44

12 Tennis 38 N (5-15) France Spring 26

13 Tennis 38 N (5-15) Austria Summer 27

14 Tennis 38 N (30-60) Italy Retail 26

15 Tennis 38 N (30-60) Italy Spring 26

16 Tennis 38 N (15-30) Austria Spring 26

17 Tennis 38 N (5-15) Austria 26

18 Tennis 38 Y 30

19 Tennis 54 N 9

20 Tennis 54 21

21 Tennis 60 27

22 Golf 20 Y (5-15) Austria Summer 2017 3 Day Grass 18

23 Golf 20 Y (30-60) Italy Summer 2016 3 Day 18

24 Golf 20 Y (30-60) France Summer 2014 18

25 Golf 20 Y (15-30) Italy Summer 18

26 Golf 20 Y (15-30) Austria Summer 18

27 Golf 20 N 1

28 Golf 20 61

... ... ... ... ... ... ... ... ... ... ...

Thank you for your help.

dzurov · Posted 12-23-2021 10:40 AM

Q: If the original table doesn’t have the points then where did 39 come from?
A:The target dataset (dataset B) does not have the points. The mapping table (dataset A) contains the rules. These are given I didn't come up with them.

Q:Where do the rules for the points come from? A table? Rules you know in your head?
A: The rules for the points are given in the mapping table (dataset A). What I want to do is translate these in SAS code to be able to apply them on the dataset B (10000+ obs).

In the original message I included an example of how the target dataset points assignment should look like (based on the rules from dataset A)
Thank you a bunch!

Reeza · Posted 12-23-2021 11:31 AM

But you didn't provide an example of each table so we only have a part of the picture.
Please provide a small workable example if you'd like help with code. Otherwise, you're looking at fuzzy merging essentially where the criteria won't match exactly. I would consider PROC PSMATCH possibly as an option but there are also greedy matching algorithms for this type of matching.
It's similar to a case control matching process.

andreas_lds · Posted 12-23-2021 12:57 AM

Imho

and Surface ne 'Hard' and Surface ne 'Clay'

is wrong for line three. It should be

and missing(Surface)

You could use a data step to write a data step ... i don't have the time to provide code, maybe later.

dzurov · Posted 12-23-2021 10:57 AM

I didn't use a missing function because in the target dataset (dataset B) the value of the variable "Surface" will not necessarily be missing.

As I had in my example, if Surface is Grass then points should still be assigned. i.e. in dataset A blank means that these can take any value. Said in other words the Points should be assigned in a tree like branches to the closest match available based on the mapping in dataset A.

Tom · Posted 12-23-2021 11:40 AM

Whether execution time makes much difference will depend on the sizes of the tables.

You don't seem to have supplied any example input data that you want to "score".

The example code you posted does not appear to match any of the lines in your rules dataset. Can you make example code that actually matches the example rules you have posted.

Tom · Posted 12-23-2021 11:53 AM

Note it is just as easy to post the sample data as SAS code as it is to go to the trouble of creating a table or spreadsheet.

data have;
  infile cards dsd dlm='|' truncover;
  input Sport :$20. Tournament Held :$1. Age :$20. Country :$20.
        Season :$10. Year Length :$20. Surface :$10. Points
  ;
cards;
Tennis|45|Y|(15-30)|Italy|Fall|2017|5 Days|Hard|44
Tennis|45|Y|(15-30)|France|Summer|2017|5 Days|Clay|46
Tennis|45|Y|(15-30)|France|Fall|2017|5 Days| |39
Tennis|45|Y|(15-30)|Austria|Summer|2016|5 Days| |44
Tennis|45|Y|(30-60)|Italy|Summer|2014| | |44
Tennis|45|Y|(30-60)|France|Summer|2017| | |46
;

You should be able to use the data to generate code. The data step is an excellent tool for generating text files.

filename code temp;
data _null_;
  set have ;
  file code column=col ;
  if _n_ > 1 then put 'else ' @;
  put 'if ' surface= :$quote. @;
  put 'and ' age= :$quote. @;
  if col>65 then put / @5 @;
  put 'and ' country= :$quote. @;
  if col>65 then put / @5 @;
  put 'and ' season= :$quote. @;
  if col>65 then put / @5 @;
  put 'and ' year= @;
  if not missing(surface) then put 'and ' surface= :$quote. @ ;
  if col>65 then put / @5 @;
  put 'then ' points= ';' ;
run;

Since you didn't provide any sample input data let's just use your rules database and drop the existing POINTS variable.

data want;
  set have (drop=points);
%include code / source2;
run;

Here is what the SAS log looks like:

649   data want;
650     set have (drop=points);
651   %include code / source2;
NOTE: %INCLUDE (level 1) file CODE is file C:\...\#LN00065.
652  +if Surface="Hard" and Age="(15-30)" and Country="Italy" and Season="Fall"
653  +    and Year=2017 and Surface="Hard" then Points=44 ;
654  +else if Surface="Clay" and Age="(15-30)" and Country="France" and Season="Summer"
655  +    and Year=2017 and Surface="Clay" then Points=46 ;
656  +else if Surface="" and Age="(15-30)" and Country="France" and Season="Fall"
657  +    and Year=2017 then Points=39 ;
658  +else if Surface="" and Age="(15-30)" and Country="Austria" and Season="Summer"
659  +    and Year=2016 then Points=44 ;
660  +else if Surface="" and Age="(30-60)" and Country="Italy" and Season="Summer"
661  +    and Year=2014 then Points=44 ;
662  +else if Surface="" and Age="(30-60)" and Country="France" and Season="Summer"
663  +    and Year=2017 then Points=46 ;
NOTE: %INCLUDE (level 1) ending.
664   run;

NOTE: There were 6 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 6 observations and 10 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

dzurov · Posted 12-23-2021 02:45 PM

I updated my original post, Dataset B is the table that needs scoring. I assigned the points manually to illustrate how the mapping table should work. Dropping points in this Dataset B gives an exact view of how the target dataset looks like.

Thanks for the idea for generating the code in the log file. This is probably something I could use. It wont be pretty with many lines of code but it could work. With multiple IF statements like this I need to make sure the IF statements with least number of arguments get executed first (some type of sorting will be needed). Don't want to get the values overwritten if multiple IF statements can fulfil the criteria.

For example IF#1 should be executed first so it does not override IF#2 when trying to assign points to this record which should get 46

Sport	Tournament	Held	Age	Country	Season	Year	Length	Surface	Points
Tennis	45	Y	(15-30)	France	Summer	2017	5 Days	Clay

/* IF#1 rule based on row #11 in dataset A*/
if (Sport='Tennis'
 and Tournament=45
 and Held='Y'
 and Age='(15-30)'
 and Country='France'
 and Season='Summer'
) then Points=44;

/* IF#2 rule based on row #2 in dataset A */
if (Sport='Tennis'
 and Tournament=45
 and Held='Y'
 and Age='(15-30)'
 and Country='France'
 and Season='Summer'
 and Year=2017
 and Length ='5 Days'
 and Surface = 'Clay'
) then Points=46;

Tom · Posted 12-23-2021 10:57 PM

So use my program as it is generating ELSE statements so only one rule will apply.

Sort the data in the proper order before using it generate the code. To try your idea of apply the most complete rules first try using the value of CMISS() for the various conditions to sort. So rules with the fewest missing conditions will "win".

data rules;
  set rules;
  n_miss=cmiss(of Sport Tournament Held Age Country Season Year Length Surface);
run;
proc sort;
  by n_miss;
run;

Note that the code is generated to a FILE, not the log. The code in the LOG is the code that RAN because of the %INCLUDE statement.

Value assignment based on multiple if conditions

Re: Value assignment based on multiple conditions resembling decision tree

Re: Value assignment based on multiple conditions resembling decision tree

Re: Value assignment based on multiple conditions resembling decision tree

Re: Value assignment based on multiple conditions resembling decision tree

Re: Value assignment based on multiple conditions resembling decision tree

Re: Value assignment based on multiple conditions resembling decision tree

Re: Value assignment based on multiple conditions resembling decision tree

Re: Value assignment based on multiple conditions resembling decision tree

Re: Value assignment based on multiple conditions resembling decision tree

Re: Value assignment based on multiple conditions resembling decision tree

SAS Innovate 2025: Call for Content

Classroom Training Available!