It has been a while since I've seen anyone post a weekend programming challenge. And, since the World Series ended early, I thought some of you might like an interesting challenge.
SAS used to distribute two macros that, together, can be used to conduct decision tree and CHAID (Chi-Square Automatic Interaction Detection) types of analyses. The macros are still available on a number of sites (see, e.g., http://support.sas.com/kb/25/035.html for the xmacro and
http://www.psych.yorku.ca/friendly/lab/files/macros/treedisc.sas for the treedisc macro).
The treedisc macro requires SAS/IML to run and, if one desires to print the decision tree, it also requires SAS/OR. However, one can see the results without actually printing the tree, thus SAS/OR isn’t essential for the task.
However, since most sites don't license IML, the challenge is to come up with base SAS (and possibly SAS/STAT) code that can adequately replace the call to IML in the treedisc macro.
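For anyone unsure which products their site actually has, a quick check along these lines should show whether SAS/IML, SAS/OR, or SAS/STAT are available. This is only a minimal sketch; note that PROC PRODUCT_STATUS reports what is installed, which can differ from what is licensed.

```sas
/* Writes the licensed products and their expiration dates to the log */
proc setinit;
run;

/* Writes the installed products and their versions to the log */
proc product_status;
run;
```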
Since I don't have SAS at home, working on this challenge would mean I'd have to work on the weekend.
Since I haven't had any responses yet, the challenge has now become an anytime challenge.
Art
Without the resources to run the macros provided, it is hard to imagine the objectives that the challenger must achieve.
Could you provide a before/after example to explain the inputs, processes, and outputs?
As I think you already know, I am always interested in discovering ways to achieve various analyses without having to purchase expensive add-ons. CHAID is one of those that I think should be achievable in base SAS, but I'm not familiar with IML, so I don't know what the substitutes would be.
What I'd like to run is:
%inc "c:\xmacro.sas";
%inc "c:\treedisc.sas";
%treedisc(data=banksize,depvar=possible_target_bank,
ordinal=total_customers, outtree=trd, options=noformat)
That wouldn't require OR, as I don't need to print the tree, but the macro does some of its work in IML.
The challenge is to provide code that accomplishes the same thing, but only using base SAS.
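As a starting point, the core statistic behind CHAID (a chi-square test of each candidate predictor against the target) is available in base SAS through PROC FREQ. The following is only a sketch of a single first-level test, not a replacement for the IML portion of treedisc. It assumes the banksize data set and the variable names from the macro call above; the 10-group binning and the names binned, cust_bin, and chisq_stats are illustrative choices, not anything taken from the macro.

```sas
/* Bin the ordinal predictor into at most 10 ordered groups.      */
/* cust_bin is a hypothetical name for the new bin variable.      */
proc rank data=banksize groups=10 out=binned;
  var total_customers;
  ranks cust_bin;
run;

/* Chi-square test of association between the bins and the target. */
/* The ODS table ChiSq is captured in the data set chisq_stats.    */
ods output ChiSq=chisq_stats;
proc freq data=binned;
  tables cust_bin*possible_target_bank / chisq;
run;

/* Keep only the Pearson chi-square row (statistic, DF, p-value). */
proc print data=chisq_stats;
  where Statistic='Chi-Square';
run;
```

A full CHAID implementation would also merge categories that do not differ significantly, adjust the p-values for the number of merges tried, and then repeat the whole search within each resulting node.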
I don't know if one needs the data for this challenge but, if so, it is the data for which a link is provided in a paper I did for MWSUG, namely
It's not CHAID, but it is ID3/C4.5
proc format;
value specname
1='SETOSA '
2='VERSICOLOR'
3='VIRGINICA ';
value specchar
1='S'
2='O'
3='V';
run;
data iris;
title 'Fisher (1936) Iris Data';
input sepallen sepalwid petallen petalwid species @@;
format species specname.;
label sepallen='Sepal Length in mm.'
sepalwid='Sepal Width in mm.'
petallen='Petal Length in mm.'
petalwid='Petal Width in mm.';
cards;
50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
63 33 60 25 3 53 37 15 02 1
;
proc export data=iris outfile='/tmp/iris.csv' dbms=csv replace; run;
proc groovy;
add classpath='/tmp/weka.jar'; *you need to get weka first...;
submit;
import weka.core.Instances
import weka.core.converters.CSVLoader
import weka.classifiers.trees.J48
loader = new CSVLoader()
loader.setSource(new File('/tmp/iris.csv'))
data = loader.getDataSet()
data.setClassIndex(4)
j48 = new J48()
j48.buildClassifier(data)
println j48
endsubmit;
run;
J48 pruned tree
------------------
petalwid <= 6: SETOSA (50.0)
petalwid > 6
| petalwid <= 17
| | petallen <= 49: VERSICOLOR (48.0/1.0)
| | petallen > 49
| | | petalwid <= 15: VIRGINICA (3.0)
| | | petalwid > 15: VERSICOLOR (3.0/1.0)
| petalwid > 17: VIRGINICA (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9
Much appreciated! I was starting to think that no one was going to accept the challenge.
I can't test your code at the moment, but I definitely will, and I'll compare the result with the one I obtained from the treedisc macro.
If I'm satisfied with the result, and no one else has provided a better alternative, I'll change your helpful rating to a correct answer.
I have had your post in the back of my mind since you originally posted it. Figured it was about time I posted something... That being said, I consider my answer a clear cheat on the intention of the challenge.
This is excellent! FYI... the Fisher iris data is a sample data set in sashelp.iris.
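Assuming the sashelp library is available, the DATA step above could be skipped by exporting sashelp.iris directly. This is a sketch only: the variable names and column order in sashelp.iris differ from the hand-entered data set, so the class index passed to setClassIndex in the Groovy code would likely need to change. PROC CONTENTS with the VARNUM option shows the column positions.

```sas
/* List variables in creation order to locate the species column */
proc contents data=sashelp.iris varnum;
run;

/* Export the shipped copy of the Fisher iris data for Weka's CSVLoader */
proc export data=sashelp.iris outfile='/tmp/iris.csv' dbms=csv replace;
run;
```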