Hi
I need help translating the below table in SAS code to be able to assign the values found in the variable "Points" in another large target dataset. The target dataset contains the variables "Sport", "Tournament", "Held", "Age", "Country", "Season", "Year", "Length", and "Surface" except the variable "Points". So the table below can be seen as a map that has the assignment rules branching like tree to the "Point" value.
Some observations about the map table (below) and desired target dataset:
My initial idea was to create a conditional statement for each row and assign the "Points" value that way. For instance if we take the example above for row #3 I would have something like
if (Sport='Tennis'
and Tournament=45
and Held='Y'
and Age='(15-30)'
and Country='France'
and Season='Fall'
and Year=2017
and Length ='5 Days'
and Surface ne 'Hard'
and Surface ne 'Clay'
) then Points=39;
The above seems pretty time consuming especially if the map table has 100+ conditional rows. I was wondering if there is a better way to tackle this problem.
Dataset A:
# | Sport | Tournament | Held | Age | Country | Season | Year | Length | Surface | Points |
1 | Tennis | 45 | Y | (15-30) | Italy | Fall | 2017 | 5 Days | Hard | 44 |
2 | Tennis | 45 | Y | (15-30) | France | Summer | 2017 | 5 Days | Clay | 46 |
3 | Tennis | 45 | Y | (15-30) | France | Fall | 2017 | 5 Days | 39 | |
4 | Tennis | 45 | Y | (15-30) | Austria | Summer | 2016 | 5 Days | 44 | |
5 | Tennis | 45 | Y | (30-60) | Italy | Summer | 2014 | 44 | ||
6 | Tennis | 45 | Y | (30-60) | France | Summer | 2017 | 46 | ||
7 | Tennis | 45 | Y | (15-30) | Italy | Summer | 2014 | 45 | ||
8 | Tennis | 45 | Y | (15-30) | France | Summer | 2017 | 46 | ||
9 | Tennis | 45 | Y | (15-30) | Austria | Summer | 2014 | 44 | ||
10 | Tennis | 45 | Y | (30-60) | France | Summer | 44 | |||
11 | Tennis | 45 | Y | (15-30) | France | Summer | 44 | |||
12 | Tennis | 38 | N | (5-15) | France | Spring | 26 | |||
13 | Tennis | 38 | N | (5-15) | Austria | Summer | 27 | |||
14 | Tennis | 38 | N | (30-60) | Italy | Winter | 26 | |||
15 | Tennis | 38 | N | (30-60) | Italy | Spring | 26 | |||
16 | Tennis | 38 | N | (15-30) | Austria | Spring | 26 | |||
17 | Tennis | 38 | N | (5-15) | Austria | 26 | ||||
18 | Tennis | 38 | Y | 30 | ||||||
19 | Tennis | 54 | N | 9 | ||||||
20 | Tennis | 54 | 21 | |||||||
21 | Tennis | 60 | 27 | |||||||
22 | Golf | 20 | Y | (5-15) | Austria | Summer | 2017 | 3 Day | Grass | 18 |
23 | Golf | 20 | Y | (30-60) | Italy | Summer | 2016 | 3 Day | 18 | |
24 | Golf | 20 | Y | (30-60) | France | Summer | 2014 | 18 | ||
25 | Golf | 20 | Y | (15-30) | Italy | Summer | 18 | |||
26 | Golf | 20 | Y | (15-30) | Austria | Summer | 18 | |||
27 | Golf | 20 | N | 1 | ||||||
28 | Golf | 20 | 61 | |||||||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
The target dataset with assigned values based on the rules provided in the table above should look like this:
Dataset B:
Sport | Tournament | Held | Age | Country | Season | Year | Length | Surface | Points |
Tennis | 45 | Y | (15-30) | Italy | Fall | 2017 | 5 Days | Hard | 44 |
Tennis | 45 | Y | (15-30) | France | Summer | 2017 | 5 Days | Clay | 46 |
Tennis | 45 | Y | (15-30) | France | Fall | 2017 | 5 Days | Grass | 44 |
Tennis | 45 | Y | (15-30) | Austria | Summer | 2016 | 5 Days | Green | 44 |
Tennis | 45 | Y | (30-60) | Italy | Summer | 2014 | 7 Days | Hard/Clay | 44 |
Tennis | 45 | Y | (30-60) | France | Summer | 2017 | 10 Days | Hard/Clay | 46 |
Tennis | 45 | Y | (15-30) | Italy | Summer | 2014 | 7 Days | Grass | 44 |
Tennis | 45 | Y | (15-30) | France | Summer | 2017 | 7 Days | Grass | 46 |
Tennis | 45 | Y | (15-30) | Austria | Summer | 2014 | 7 Days | Grass | 44 |
Tennis | 45 | Y | (30-60) | France | Summer | 2018 | 5 Days | Clay | 44 |
Tennis | 45 | Y | (15-30) | France | Summer | 2018 | 5 Days | Clay | 44 |
Tennis | 38 | N | (5-15) | France | Spring | 2017 | 5 Days | Clay | 26 |
Tennis | 38 | N | (5-15) | Austria | Summer | 2017 | 5 Days | Hard | 27 |
Tennis | 38 | N | (30-60) | Italy | Spring | 2017 | 5 Days | Hard | 26 |
Tennis | 38 | N | (15-30) | Austria | Spring | 2017 | 7 Days | Grass | 26 |
Tennis | 38 | N | (5-15) | Austria | Winter | 2017 | 7 Days | Grass | 26 |
Tennis | 38 | Y | (15-30) | France | Summer | 2017 | 5 Days | Clay | 30 |
Tennis | 54 | N | (15-30) | Italy | Fall | 2017 | 5 Days | Hard | 9 |
Tennis | 54 | Y | (15-30) | Italy | Summer | 2014 | 7 Days | Grass | 21 |
Tennis | 60 | Y | (15-30) | France | Summer | 2018 | 5 Days | Clay | 27 |
Thank you for your help.
Note it is just as easy to post the sample data as SAS code as it is to go to the trouble of creating a table or spreadsheet.
data have;
infile cards dsd dlm='|' truncover;
input Sport :$20. Tournament Held :$1. Age :$20. Country :$20.
Season :$10. Year Length :$20. Surface :$10. Points
;
cards;
Tennis|45|Y|(15-30)|Italy|Fall|2017|5 Days|Hard|44
Tennis|45|Y|(15-30)|France|Summer|2017|5 Days|Clay|46
Tennis|45|Y|(15-30)|France|Fall|2017|5 Days| |39
Tennis|45|Y|(15-30)|Austria|Summer|2016|5 Days| |44
Tennis|45|Y|(30-60)|Italy|Summer|2014| | |44
Tennis|45|Y|(30-60)|France|Summer|2017| | |46
;
You should be able to use the data to generate code. The data step is an excellent tool for generating text files.
filename code temp;
data _null_;
set have ;
file code column=col ;
if _n_ > 1 then put 'else ' @;
put 'if ' surface= :$quote. @;
put 'and ' age= :$quote. @;
if col>65 then put / @5 @;
put 'and ' country= :$quote. @;
if col>65 then put / @5 @;
put 'and ' season= :$quote. @;
if col>65 then put / @5 @;
put 'and ' year= @;
if not missing(surface) then put 'and ' surface= :$quote. @ ;
if col>65 then put / @5 @;
put 'then ' points= ';' ;
run;
Since you didn't provide any sample input data let's just use your rules database and drop the existing POINTS variable.
data want;
set have (drop=points);
%include code / source2;
run;
Here is what the SAS log looks like:
649 data want; 650 set have (drop=points); 651 %include code / source2; NOTE: %INCLUDE (level 1) file CODE is file C:\...\#LN00065. 652 +if Surface="Hard" and Age="(15-30)" and Country="Italy" and Season="Fall" 653 + and Year=2017 and Surface="Hard" then Points=44 ; 654 +else if Surface="Clay" and Age="(15-30)" and Country="France" and Season="Summer" 655 + and Year=2017 and Surface="Clay" then Points=46 ; 656 +else if Surface="" and Age="(15-30)" and Country="France" and Season="Fall" 657 + and Year=2017 then Points=39 ; 658 +else if Surface="" and Age="(15-30)" and Country="Austria" and Season="Summer" 659 + and Year=2016 then Points=44 ; 660 +else if Surface="" and Age="(30-60)" and Country="Italy" and Season="Summer" 661 + and Year=2014 then Points=44 ; 662 +else if Surface="" and Age="(30-60)" and Country="France" and Season="Summer" 663 + and Year=2017 then Points=46 ; NOTE: %INCLUDE (level 1) ending. 664 run; NOTE: There were 6 observations read from the data set WORK.HAVE. NOTE: The data set WORK.WANT has 6 observations and 10 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.00 seconds
If the original table doesn’t have the points then where did 39 come from?
Where do the rules for the points come from? A table? Rules you know in your head?
It would help to show what information you start with and what you expect as output but I’m not following that example at all.
@dzurov wrote:
Hi
I need help translating the below table in SAS code to be able to assign the values found in the variable "Points" in another large target dataset. The target dataset contains the variables "Sport", "Tournament", "Held", "Age", "Country", "Season", "Year", "Length", and "Surface" except the variable "Points". So the table below can be seen as a map that has the assignment rules branching like tree to the "Point" value.
Some observations about the map table (below) and desired target dataset:
- The variables "Sport" and "Tournament" will always have values populated and the other variables can have blank values.
- The "Points" can be assigned in a decision tree format branching from first column "Sport" then down to the last column "Surface". Hence the order of the columns is very important.
- The "Points" assignment is will be based on the criteria shown in each row. If perfect match is not found the row with a most closely match will be used to assign the "Points" value
- Blank value can take any value not explicitly specified in the map table, e.g., see row #3... if in the target dataset a record matches all values (i.e., tennis, 34, Y,(15-30), France, Fall,2017, 5 Days) except the Surface variable which has a value "Grass" then this record will still get the Points assignment of 39
My initial idea was to create a conditional statement for each row and assign the "Points" value that way. For instance if we take the example above for row #3 I would have something like
if ((Sport='Tennis' and Tournament=45 and Held='Y' and Age='(15-30)' and Country='France' and Season='Fall' and Year=2017 and Length ='5 Days' and Surface ne 'Hard' and Surface ne 'Clay') then Points=39;
The above seems pretty time consuming especially if the map table has 100+ conditional rows. I was wondering if there is a better way to tackle this problem.
# Sport Tournament Held Age Country Season Year Length Surface Points 1 Tennis 45 Y (15-30) Italy Fall 2017 5 Days Hard 44 2 Tennis 45 Y (15-30) France Summer 2017 5 Days Clay 46 3 Tennis 45 Y (15-30) France Fall 2017 5 Days 39 4 Tennis 45 Y (15-30) Austria Summer 2016 5 Days 44 5 Tennis 45 Y (30-60) Italy Summer 2014 44 6 Tennis 45 Y (30-60) France Summer 2017 46 7 Tennis 45 Y (15-30) Italy Summer 2014 45 8 Tennis 45 Y (15-30) France Summer 2017 46 9 Tennis 45 Y (15-30) Austria Summer 2014 44 10 Tennis 45 Y (30-60) France Summer 44 11 Tennis 45 Y (15-30) France Summer 44 12 Tennis 38 N (5-15) France Spring 26 13 Tennis 38 N (5-15) Austria Summer 27 14 Tennis 38 N (30-60) Italy Retail 26 15 Tennis 38 N (30-60) Italy Spring 26 16 Tennis 38 N (15-30) Austria Spring 26 17 Tennis 38 N (5-15) Austria 26 18 Tennis 38 Y 30 19 Tennis 54 N 9 20 Tennis 54 21 21 Tennis 60 27 22 Golf 20 Y (5-15) Austria Summer 2017 3 Day Grass 18 23 Golf 20 Y (30-60) Italy Summer 2016 3 Day 18 24 Golf 20 Y (30-60) France Summer 2014 18 25 Golf 20 Y (15-30) Italy Summer 18 26 Golf 20 Y (15-30) Austria Summer 18 27 Golf 20 N 1 28 Golf 20 61 ... ... ... ... ... ... ... ... ... ... ...
Thank you for your help.
Q: If the original table doesn’t have the points then where did 39 come from?
A:The target dataset (dataset B) does not have the points. The mapping table (dataset A) contains the rules. These are given I didn't come up with them.
Q:Where do the rules for the points come from? A table? Rules you know in your head?
A: The rules for the points are given in the mapping table (dataset A). What I want to do is translate these in SAS code to be able to apply them on the dataset B (10000+ obs).
In the original message I included an example of how the target dataset points assignment should look like (based on the rules from dataset A)
Thank you a bunch!
Imho
and Surface ne 'Hard' and Surface ne 'Clay'
is wrong for line three. It should be
and missing(Surface)
You could use a data step to write a data step ... i don't have the time to provide code, maybe later.
I didn't use a missing function because in the target dataset (dataset B) the value of the variable "Surface" will not necessarily be missing.
As I had in my example, if Surface is Grass then points should still be assigned. i.e. in dataset A blank means that these can take any value. Said in other words the Points should be assigned in a tree like branches to the closest match available based on the mapping in dataset A.
Whether execution time makes much difference will depend on the sizes of the tables.
You don't seem to have supplied any example input data that you want to "score".
The example code you posted does not appear to match any of the lines in your rules dataset. Can you make example code that actually matches the example rules you have posted.
Note it is just as easy to post the sample data as SAS code as it is to go to the trouble of creating a table or spreadsheet.
data have;
infile cards dsd dlm='|' truncover;
input Sport :$20. Tournament Held :$1. Age :$20. Country :$20.
Season :$10. Year Length :$20. Surface :$10. Points
;
cards;
Tennis|45|Y|(15-30)|Italy|Fall|2017|5 Days|Hard|44
Tennis|45|Y|(15-30)|France|Summer|2017|5 Days|Clay|46
Tennis|45|Y|(15-30)|France|Fall|2017|5 Days| |39
Tennis|45|Y|(15-30)|Austria|Summer|2016|5 Days| |44
Tennis|45|Y|(30-60)|Italy|Summer|2014| | |44
Tennis|45|Y|(30-60)|France|Summer|2017| | |46
;
You should be able to use the data to generate code. The data step is an excellent tool for generating text files.
filename code temp;
data _null_;
set have ;
file code column=col ;
if _n_ > 1 then put 'else ' @;
put 'if ' surface= :$quote. @;
put 'and ' age= :$quote. @;
if col>65 then put / @5 @;
put 'and ' country= :$quote. @;
if col>65 then put / @5 @;
put 'and ' season= :$quote. @;
if col>65 then put / @5 @;
put 'and ' year= @;
if not missing(surface) then put 'and ' surface= :$quote. @ ;
if col>65 then put / @5 @;
put 'then ' points= ';' ;
run;
Since you didn't provide any sample input data let's just use your rules database and drop the existing POINTS variable.
data want;
set have (drop=points);
%include code / source2;
run;
Here is what the SAS log looks like:
649 data want; 650 set have (drop=points); 651 %include code / source2; NOTE: %INCLUDE (level 1) file CODE is file C:\...\#LN00065. 652 +if Surface="Hard" and Age="(15-30)" and Country="Italy" and Season="Fall" 653 + and Year=2017 and Surface="Hard" then Points=44 ; 654 +else if Surface="Clay" and Age="(15-30)" and Country="France" and Season="Summer" 655 + and Year=2017 and Surface="Clay" then Points=46 ; 656 +else if Surface="" and Age="(15-30)" and Country="France" and Season="Fall" 657 + and Year=2017 then Points=39 ; 658 +else if Surface="" and Age="(15-30)" and Country="Austria" and Season="Summer" 659 + and Year=2016 then Points=44 ; 660 +else if Surface="" and Age="(30-60)" and Country="Italy" and Season="Summer" 661 + and Year=2014 then Points=44 ; 662 +else if Surface="" and Age="(30-60)" and Country="France" and Season="Summer" 663 + and Year=2017 then Points=46 ; NOTE: %INCLUDE (level 1) ending. 664 run; NOTE: There were 6 observations read from the data set WORK.HAVE. NOTE: The data set WORK.WANT has 6 observations and 10 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.00 seconds
I updated my original post, Dataset B is the table that needs scoring. I assigned the points manually to illustrate how the mapping table should work. Dropping points in this Dataset B gives an exact view of how the target dataset looks like.
Thanks for the idea for generating the code in the log file. This is probably something I could use. It wont be pretty with many lines of code but it could work. With multiple IF statements like this I need to make sure the IF statements with least number of arguments get executed first (some type of sorting will be needed). Don't want to get the values overwritten if multiple IF statements can fulfil the criteria.
For example IF#1 should be executed first so it does not override IF#2 when trying to assign points to this record which should get 46
Sport | Tournament | Held | Age | Country | Season | Year | Length | Surface | Points |
Tennis | 45 | Y | (15-30) | France | Summer | 2017 | 5 Days | Clay |
/* IF#1 rule based on row #11 in dataset A*/
if (Sport='Tennis'
and Tournament=45
and Held='Y'
and Age='(15-30)'
and Country='France'
and Season='Summer'
) then Points=44;
/* IF#2 rule based on row #2 in dataset A */
if (Sport='Tennis'
and Tournament=45
and Held='Y'
and Age='(15-30)'
and Country='France'
and Season='Summer'
and Year=2017
and Length ='5 Days'
and Surface = 'Clay'
) then Points=46;
So use my program as it is generating ELSE statements so only one rule will apply.
Sort the data in the proper order before using it generate the code. To try your idea of apply the most complete rules first try using the value of CMISS() for the various conditions to sort. So rules with the fewest missing conditions will "win".
data rules;
set rules;
n_miss=cmiss(of Sport Tournament Held Age Country Season Year Length Surface);
run;
proc sort;
by n_miss;
run;
Note that the code is generated to a FILE, not the log. The code in the LOG is the code that RAN because of the %INCLUDE statement.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.