BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
dzurov
Fluorite | Level 6

Hi 

I need help translating the below table in SAS code to be able to assign the values found in the variable "Points" in another large target dataset. The target dataset contains the variables "Sport", "Tournament", "Held", "Age", "Country", "Season", "Year", "Length", and "Surface" except the variable "Points". So the table below can be seen as a map that has the assignment rules branching like tree to the "Point" value. 

 

Some observations about the map table (below) and desired target dataset: 

  • The variables "Sport" and "Tournament" will always have values populated and the other variables can have blank values. 
  • The "Points" can be assigned in a decision tree format branching from first column "Sport" then down to the last column "Surface". Hence the order of the columns is very important. 
  • The "Points" assignment is will be based on the criteria shown in each row. If perfect match is not found the row with a most closely match will be used to assign the "Points" value  
  • Blank value can take any value not explicitly specified in the map table, e.g., see row #3... if in the target dataset a record matches all values (i.e.,  tennis, 34, Y,(15-30), France, Fall,2017, 5 Days) except the Surface variable which has a value "Grass" then this record will still get the Points assignment of 39

My initial idea was to create a conditional statement for each row and assign the "Points" value that way. For instance if we take the example above for row #3 I would have something like 

 

 

 

if (Sport='Tennis'
 and Tournament=45
 and Held='Y'
 and Age='(15-30)'
 and Country='France'
 and Season='Fall'
 and Year=2017
 and Length ='5 Days'
 and Surface ne 'Hard'
 and Surface ne 'Clay'
) then Points=39;

 

 

 

The above seems pretty time consuming especially if the map table has 100+ conditional rows. I was wondering if there is a better way to tackle this problem. 

Dataset A:

#SportTournamentHeldAgeCountrySeasonYearLengthSurfacePoints
1Tennis45Y(15-30)ItalyFall20175 DaysHard44
2Tennis45Y(15-30)FranceSummer20175 DaysClay46
3Tennis45Y(15-30)FranceFall20175 Days 39
4Tennis45Y(15-30)AustriaSummer20165 Days 44
5Tennis45Y(30-60)ItalySummer2014  44
6Tennis45Y(30-60)FranceSummer2017  46
7Tennis45Y(15-30)ItalySummer2014  45
8Tennis45Y(15-30)FranceSummer2017  46
9Tennis45Y(15-30)AustriaSummer2014  44
10Tennis45Y(30-60)FranceSummer   44
11Tennis45Y(15-30)FranceSummer   44
12Tennis38N(5-15)FranceSpring   26
13Tennis38N(5-15)AustriaSummer   27
14Tennis38N(30-60)ItalyWinter   26
15Tennis38N(30-60)ItalySpring   26
16Tennis38N(15-30)AustriaSpring   26
17Tennis38N(5-15)Austria    26
18Tennis38Y      30
19Tennis54N      9
20Tennis54       21
21Tennis60       27
22Golf20Y(5-15)AustriaSummer20173 DayGrass18
23Golf20Y(30-60)ItalySummer20163 Day 18
24Golf20Y(30-60)FranceSummer2014  18
25Golf20Y(15-30)ItalySummer   18
26Golf20Y(15-30)AustriaSummer   18
27Golf20N      1
28Golf20       61
......... ... ... ... ... ... ... ... ...

 

The target dataset with assigned values based on the rules provided in the table above should look like this:

Dataset B:

SportTournamentHeldAgeCountrySeasonYearLengthSurfacePoints
Tennis45Y(15-30)ItalyFall20175 DaysHard44
Tennis45Y(15-30)FranceSummer20175 DaysClay46
Tennis45Y(15-30)FranceFall20175 DaysGrass44
Tennis45Y(15-30)AustriaSummer20165 DaysGreen44
Tennis45Y(30-60)ItalySummer20147 DaysHard/Clay44
Tennis45Y(30-60)FranceSummer201710 DaysHard/Clay46
Tennis45Y(15-30)ItalySummer20147 DaysGrass44
Tennis45Y(15-30)FranceSummer20177 DaysGrass46
Tennis45Y(15-30)AustriaSummer20147 DaysGrass44
Tennis45Y(30-60)FranceSummer20185 DaysClay44
Tennis45Y(15-30)FranceSummer20185 DaysClay44
Tennis38N(5-15)FranceSpring20175 DaysClay26
Tennis38N(5-15)AustriaSummer20175 DaysHard27
Tennis38N(30-60)ItalySpring20175 DaysHard26
Tennis38N(15-30)AustriaSpring20177 DaysGrass26
Tennis38N(5-15)AustriaWinter20177 DaysGrass26
Tennis38Y(15-30)FranceSummer20175 DaysClay30
Tennis54N(15-30)ItalyFall20175 DaysHard9
Tennis54Y(15-30)ItalySummer20147 DaysGrass21
Tennis60Y(15-30)FranceSummer20185 DaysClay27

Thank you for your help.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Note it is just as easy to post the sample data as SAS code as it is to go to the trouble of creating a table or spreadsheet.

data have;
  infile cards dsd dlm='|' truncover;
  input Sport :$20. Tournament Held :$1. Age :$20. Country :$20.
        Season :$10. Year Length :$20. Surface :$10. Points
  ;
cards;
Tennis|45|Y|(15-30)|Italy|Fall|2017|5 Days|Hard|44
Tennis|45|Y|(15-30)|France|Summer|2017|5 Days|Clay|46
Tennis|45|Y|(15-30)|France|Fall|2017|5 Days| |39
Tennis|45|Y|(15-30)|Austria|Summer|2016|5 Days| |44
Tennis|45|Y|(30-60)|Italy|Summer|2014| | |44
Tennis|45|Y|(30-60)|France|Summer|2017| | |46
;

You should be able to use the data to generate code.  The data step is an excellent tool for generating text files.

filename code temp;
data _null_;
  set have ;
  file code column=col ;
  if _n_ > 1 then put 'else ' @;
  put 'if ' surface= :$quote. @;
  put 'and ' age= :$quote. @;
  if col>65 then put / @5 @;
  put 'and ' country= :$quote. @;
  if col>65 then put / @5 @;
  put 'and ' season= :$quote. @;
  if col>65 then put / @5 @;
  put 'and ' year= @;
  if not missing(surface) then put 'and ' surface= :$quote. @ ;
  if col>65 then put / @5 @;
  put 'then ' points= ';' ;
run;

Since you didn't provide any sample input data let's just use your rules database and drop the existing POINTS variable.

data want;
  set have (drop=points);
%include code / source2;
run;

Here is what the SAS log looks like:

649   data want;
650     set have (drop=points);
651   %include code / source2;
NOTE: %INCLUDE (level 1) file CODE is file C:\...\#LN00065.
652  +if Surface="Hard" and Age="(15-30)" and Country="Italy" and Season="Fall"
653  +    and Year=2017 and Surface="Hard" then Points=44 ;
654  +else if Surface="Clay" and Age="(15-30)" and Country="France" and Season="Summer"
655  +    and Year=2017 and Surface="Clay" then Points=46 ;
656  +else if Surface="" and Age="(15-30)" and Country="France" and Season="Fall"
657  +    and Year=2017 then Points=39 ;
658  +else if Surface="" and Age="(15-30)" and Country="Austria" and Season="Summer"
659  +    and Year=2016 then Points=44 ;
660  +else if Surface="" and Age="(30-60)" and Country="Italy" and Season="Summer"
661  +    and Year=2014 then Points=44 ;
662  +else if Surface="" and Age="(30-60)" and Country="France" and Season="Summer"
663  +    and Year=2017 then Points=46 ;
NOTE: %INCLUDE (level 1) ending.
664   run;

NOTE: There were 6 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 6 observations and 10 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

View solution in original post

9 REPLIES 9
Reeza
Super User

If the original table doesn’t have the points then where did 39 come from?

Where do the rules for the points come from? A table? Rules you know in your head?

 

It would help to show what information you start with and what you expect as output but I’m not following that example at all.

 


@dzurov wrote:

Hi 

I need help translating the below table in SAS code to be able to assign the values found in the variable "Points" in another large target dataset. The target dataset contains the variables "Sport", "Tournament", "Held", "Age", "Country", "Season", "Year", "Length", and "Surface" except the variable "Points". So the table below can be seen as a map that has the assignment rules branching like tree to the "Point" value. 

 

Some observations about the map table (below) and desired target dataset: 

  • The variables "Sport" and "Tournament" will always have values populated and the other variables can have blank values. 
  • The "Points" can be assigned in a decision tree format branching from first column "Sport" then down to the last column "Surface". Hence the order of the columns is very important. 
  • The "Points" assignment is will be based on the criteria shown in each row. If perfect match is not found the row with a most closely match will be used to assign the "Points" value  
  • Blank value can take any value not explicitly specified in the map table, e.g., see row #3... if in the target dataset a record matches all values (i.e.,  tennis, 34, Y,(15-30), France, Fall,2017, 5 Days) except the Surface variable which has a value "Grass" then this record will still get the Points assignment of 39

My initial idea was to create a conditional statement for each row and assign the "Points" value that way. For instance if we take the example above for row #3 I would have something like 

if ((Sport='Tennis' and Tournament=45 and Held='Y' and Age='(15-30)' and Country='France' and 
Season='Fall' and Year=2017 and Length ='5 Days' and Surface ne 'Hard' and Surface ne 'Clay') then Points=39;

The above seems pretty time consuming especially if the map table has 100+ conditional rows. I was wondering if there is a better way to tackle this problem. 

 

# Sport Tournament Held Age Country Season Year Length Surface Points
1 Tennis 45 Y (15-30) Italy Fall 2017 5 Days Hard 44
2 Tennis 45 Y (15-30) France Summer 2017 5 Days Clay 46
3 Tennis 45 Y (15-30) France Fall 2017 5 Days   39
4 Tennis 45 Y (15-30) Austria Summer 2016 5 Days   44
5 Tennis 45 Y (30-60) Italy Summer 2014     44
6 Tennis 45 Y (30-60) France Summer 2017     46
7 Tennis 45 Y (15-30) Italy Summer 2014     45
8 Tennis 45 Y (15-30) France Summer 2017     46
9 Tennis 45 Y (15-30) Austria Summer 2014     44
10 Tennis 45 Y (30-60) France Summer       44
11 Tennis 45 Y (15-30) France Summer       44
12 Tennis 38 N (5-15) France Spring       26
13 Tennis 38 N (5-15) Austria Summer       27
14 Tennis 38 N (30-60) Italy Retail       26
15 Tennis 38 N (30-60) Italy Spring       26
16 Tennis 38 N (15-30) Austria Spring       26
17 Tennis 38 N (5-15) Austria         26
18 Tennis 38 Y             30
19 Tennis 54 N             9
20 Tennis 54               21
21 Tennis 60               27
22 Golf 20 Y (5-15) Austria Summer 2017 3 Day Grass 18
23 Golf 20 Y (30-60) Italy Summer 2016 3 Day   18
24 Golf 20 Y (30-60) France Summer 2014     18
25 Golf 20 Y (15-30) Italy Summer       18
26 Golf 20 Y (15-30) Austria Summer       18
27 Golf 20 N             1
28 Golf 20               61
... ... ...  ...  ...  ...  ...  ...  ...  ...  ...

 

Thank you for your help.


 

dzurov
Fluorite | Level 6

Q: If the original table doesn’t have the points then where did 39 come from?
A:The target dataset (dataset B) does not have the points. The mapping table (dataset A) contains the rules. These are given I didn't come up with them.

Q:Where do the rules for the points come from? A table? Rules you know in your head?
A: The rules for the points are given in the mapping table (dataset A). What I want to do is translate these in SAS code to be able to apply them on the dataset B (10000+ obs).

In the original message I included an example of how the target dataset points assignment should look like (based on the rules from dataset A)
Thank you a bunch!

Reeza
Super User
But you didn't provide an example of each table so we only have a part of the picture.
Please provide a small workable example if you'd like help with code. Otherwise, you're looking at fuzzy merging essentially where the criteria won't match exactly. I would consider PROC PSMATCH possibly as an option but there are also greedy matching algorithms for this type of matching.
It's similar to a case control matching process.
andreas_lds
Jade | Level 19

Imho

and Surface ne 'Hard' and Surface ne 'Clay'

is wrong for line three. It should be

and missing(Surface)

You could use a data step to write a data step ... i don't have the time to provide code, maybe later.

dzurov
Fluorite | Level 6

I didn't use a missing function because in the target dataset (dataset B) the value of the variable "Surface" will not necessarily be missing.

As I had in my example, if Surface is Grass then points should still be assigned. i.e. in dataset A blank means that these can take any value. Said in other words the Points should be assigned in a tree like branches to the closest match available based on the mapping in dataset A. 

Tom
Super User Tom
Super User

Whether execution time makes much difference will depend on the sizes of the tables.

You don't seem to have supplied any example input data that you want to "score".

The example code you posted does not appear to match any of the lines in your rules dataset.  Can you make example code that actually matches the example rules you have posted.

 

Tom
Super User Tom
Super User

Note it is just as easy to post the sample data as SAS code as it is to go to the trouble of creating a table or spreadsheet.

data have;
  infile cards dsd dlm='|' truncover;
  input Sport :$20. Tournament Held :$1. Age :$20. Country :$20.
        Season :$10. Year Length :$20. Surface :$10. Points
  ;
cards;
Tennis|45|Y|(15-30)|Italy|Fall|2017|5 Days|Hard|44
Tennis|45|Y|(15-30)|France|Summer|2017|5 Days|Clay|46
Tennis|45|Y|(15-30)|France|Fall|2017|5 Days| |39
Tennis|45|Y|(15-30)|Austria|Summer|2016|5 Days| |44
Tennis|45|Y|(30-60)|Italy|Summer|2014| | |44
Tennis|45|Y|(30-60)|France|Summer|2017| | |46
;

You should be able to use the data to generate code.  The data step is an excellent tool for generating text files.

filename code temp;
data _null_;
  set have ;
  file code column=col ;
  if _n_ > 1 then put 'else ' @;
  put 'if ' surface= :$quote. @;
  put 'and ' age= :$quote. @;
  if col>65 then put / @5 @;
  put 'and ' country= :$quote. @;
  if col>65 then put / @5 @;
  put 'and ' season= :$quote. @;
  if col>65 then put / @5 @;
  put 'and ' year= @;
  if not missing(surface) then put 'and ' surface= :$quote. @ ;
  if col>65 then put / @5 @;
  put 'then ' points= ';' ;
run;

Since you didn't provide any sample input data let's just use your rules database and drop the existing POINTS variable.

data want;
  set have (drop=points);
%include code / source2;
run;

Here is what the SAS log looks like:

649   data want;
650     set have (drop=points);
651   %include code / source2;
NOTE: %INCLUDE (level 1) file CODE is file C:\...\#LN00065.
652  +if Surface="Hard" and Age="(15-30)" and Country="Italy" and Season="Fall"
653  +    and Year=2017 and Surface="Hard" then Points=44 ;
654  +else if Surface="Clay" and Age="(15-30)" and Country="France" and Season="Summer"
655  +    and Year=2017 and Surface="Clay" then Points=46 ;
656  +else if Surface="" and Age="(15-30)" and Country="France" and Season="Fall"
657  +    and Year=2017 then Points=39 ;
658  +else if Surface="" and Age="(15-30)" and Country="Austria" and Season="Summer"
659  +    and Year=2016 then Points=44 ;
660  +else if Surface="" and Age="(30-60)" and Country="Italy" and Season="Summer"
661  +    and Year=2014 then Points=44 ;
662  +else if Surface="" and Age="(30-60)" and Country="France" and Season="Summer"
663  +    and Year=2017 then Points=46 ;
NOTE: %INCLUDE (level 1) ending.
664   run;

NOTE: There were 6 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 6 observations and 10 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds
dzurov
Fluorite | Level 6

I updated my original post, Dataset B is the table that needs scoring. I assigned the points manually to illustrate how the mapping table should work. Dropping points in this Dataset B gives an exact view of how the target dataset looks like.

Thanks for the idea for generating the code in the log file. This is probably something I could use. It wont be pretty with many lines of code but it could work. With multiple IF statements like this I need to make sure the IF statements with least number of arguments get executed first (some type of sorting will be needed). Don't want to get the values overwritten if multiple IF statements can fulfil the criteria. 

 

For example IF#1 should be executed first so it does not override IF#2 when trying to assign points to this record  which should get 46 

SportTournamentHeldAgeCountrySeasonYearLengthSurfacePoints
Tennis45Y(15-30)FranceSummer20175 DaysClay 

 

 

/* IF#1 rule based on row #11 in dataset A*/
if (Sport='Tennis'
 and Tournament=45
 and Held='Y'
 and Age='(15-30)'
 and Country='France'
 and Season='Summer'
) then Points=44;

/* IF#2 rule based on row #2 in dataset A */
if (Sport='Tennis'
 and Tournament=45
 and Held='Y'
 and Age='(15-30)'
 and Country='France'
 and Season='Summer'
 and Year=2017
 and Length ='5 Days'
 and Surface = 'Clay'
) then Points=46;

 

 

Tom
Super User Tom
Super User

So use my program as it is generating ELSE statements so only one rule will apply.

Sort the data in the proper order before using it generate the code. To try your idea of apply the most complete rules first try using the value of CMISS() for the various conditions to sort.  So rules with the fewest missing conditions will "win".

data rules;
  set rules;
  n_miss=cmiss(of Sport Tournament Held Age Country Season Year Length Surface);
run;
proc sort;
  by n_miss;
run;

Note that the code is generated to a FILE, not the log.  The code in the LOG is the code that RAN because of the %INCLUDE statement.

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 2602 views
  • 4 likes
  • 4 in conversation