Ksharp
Super User

So you want to overlay these two histograms?

 

data a;
   set sashelp.heart;
   where sex='Male';
run;

data b;
   set sashelp.heart;
   where sex='Female';
run;
/* combine*/

data combined;
   set a b indsname=dsn;
   source=dsn;
run;

Proc sgplot data=combined;
   histogram height /group=source datalabel scale=count  transparency=0.5;
   density height/type=kernel group=source;
run;

Ksharp_0-1629538703800.png

 

NKormanik
Barite | Level 11

Remember, the 'machine learning' has attempted to show a 'sweet spot' -- some level along a continuous range of values that supposedly results in a best chance of outcome.

 

The top histogram is for Greater Than values.  The lower histogram is for Less Than values.

 

Of the nearly 400 such sets of histograms, NONE appear to be perfect -- i.e., a reasonable clear distance between GT and LT.  So, the objective is to find the 'best' of the 400.

 

Two more examples of "obviously nuts":

 

6.png

 

See the overlap?  Like saying your 'sweet spot' is greater than 3, but less than 2.

 

Here's another:

 

1.png

 

So, on a scale of -71 to 71, the 'sweet spot' lies between -71 and 71.

 

How helpful.  (NOT!)

 

 

Reeza
Super User
Evaluate every possible interval and calculate the metric of interest for each one.
Then isolate the interval that gives the value of interest.
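
A minimal sketch of that idea, not a definitive implementation. It assumes SASHELP.HEART as stand-in data, Systolic as the continuous variable, Status ('Alive'/'Dead') as the outcome, decile endpoints as the candidate interval bounds, and a simple "share of Dead inside minus share of Alive inside" difference as the metric. All of those choices are assumptions made for illustration; none of them come from the original question.

```sas
/* Candidate interval endpoints: deciles of the variable (assumed grid) */
proc univariate data=sashelp.heart noprint;
   var systolic;
   output out=pct pctlpts=0 to 100 by 10 pctlpre=p;
run;

/* One row per candidate endpoint */
proc transpose data=pct out=cuts(rename=(col1=cut));
run;

/* Score every interval [lo, hi]: share of each outcome that falls inside it */
proc sql;
   create table intervals as
   select a.cut as lo,
          b.cut as hi,
          sum((h.systolic between a.cut and b.cut) and h.status='Dead')
             / sum(h.status='Dead')  as in_dead,
          sum((h.systolic between a.cut and b.cut) and h.status='Alive')
             / sum(h.status='Alive') as in_alive,
          abs(calculated in_dead - calculated in_alive) as separation
   from cuts as a, cuts as b, sashelp.heart as h
   where a.cut < b.cut
   group by a.cut, b.cut
   order by separation desc;
quit;

/* Intervals with the largest separation come first */
proc print data=intervals(obs=5);
run;
```

A finer grid gives more candidate intervals at the cost of a larger cross join, and a different metric can be swapped in by changing the `separation` expression.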
Ksharp
Super User

I think this is more of an OR (operations research) problem than a histogram graphing problem.

 

Calling @Rick_SAS  @RobPratt 

NKormanik
Barite | Level 11

By the way, what would the ideal set of histograms look like?

 

Glad you asked....

 

Ideal.png

 

So, the programming objective is to find among the actual 400 sets of histograms, the ones that are as close as possible to the ideal.

 

(Actually, I already know that NONE even come close.  But which one is 'best'?  Are any even usable?)

 

(This was, you know, a well-funded Government study.  Taxpayers want....  well, maybe not....  At any rate.)

 

 

Reeza
Super User
What's the ideal? How is that defined?
Rick_SAS
SAS Super FREQ

Without a statistical description of the problem, there isn't much to say. Perhaps the OP has a binary classification problem? The two histograms represent the distributions of the data that are classified into each category. The search for "overlap" and "flat regions" for 400 pairs of histograms might be an attempt to understand how the classifier works with various parameters or threshold criteria. (Or maybe for random subsets of the data?) The search for a "best" pair might be an attempt to find a threshold value that maximizes the ability of the classifier to discriminate between the two categories.

 

If this is the case, the OP might be interested in learning about ROC curves for binary classifiers. I wrote a few articles that introduce some of the important statistical ideas for binary classifiers. See:

In SAS, you can use PROC LOGISTIC to overlay and compare ROC curves from different models or rules.
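
As a hedged illustration of that last point, here is a minimal sketch that overlays ROC curves for two single-variable rules. The data set (SASHELP.HEART), the response (Status with event 'Dead'), and the two predictors (AgeAtStart and Systolic) are stand-ins chosen for the example, not anything from the original question.

```sas
ods graphics on;

proc logistic data=sashelp.heart plots(only)=roc;
   /* Full model; each ROC statement below must use effects listed here */
   model status(event='Dead') = AgeAtStart Systolic;

   /* One ROC curve per candidate rule */
   roc 'AgeAtStart only' AgeAtStart;
   roc 'Systolic only'   Systolic;

   /* Compare the areas under the curves against a reference curve */
   roccontrast reference('AgeAtStart only') / estimate;
run;
```

The curve that sits closest to the upper-left corner (largest area under the curve) belongs to the rule that discriminates best between the two categories.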

 

NKormanik
Barite | Level 11

@Rick_SAS  Yes, the initial problem was binomial classification -- along a continuum of values of a variable, can the algorithm arrive at a point where above that point is good, below that point is bad (1 vs. 0).

 

I used HPSplit to attempt this classification.  Many datasets.  The results were highly mixed, as shown above in the histograms of the results.

 

Rather than toss the lot, I wanted to at least home in on some of the results that looked most promising, and then do further testing on those.

 

So far, the winnowing of the histogram sets has to be done by eyeballing.  This post was an attempt to find a programmable way of going about it.
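
One possible way to program the winnowing, offered as a sketch under stated assumptions rather than a definitive answer: compute the two-sample Kolmogorov-Smirnov distance D between the GT and LT values for each set, then sort. D close to 1 means the two histograms barely overlap. D close to 0 means they lie on top of each other. The data set HIST_SETS and its variables SET_ID, SIDE, and VALUE are hypothetical names, and the data below are simulated only so that the steps run.

```sas
/* Hypothetical layout: one row per value, SET_ID identifies the histogram pair,
   SIDE is 'GT' or 'LT'. Simulated data: a larger shift gives less overlap. */
data hist_sets;
   call streaminit(2021);
   do set_id = 1 to 5;
      shift = set_id / 2;
      do i = 1 to 200;
         side = 'GT'; value = rand('normal', -shift, 1); output;
         side = 'LT'; value = rand('normal',  shift, 1); output;
      end;
   end;
   drop i shift;
run;

/* Two-sample K-S distance between GT and LT within each set */
proc npar1way data=hist_sets edf noprint;
   by set_id;
   class side;
   var value;
   output out=ks(keep=set_id _D_) edf;
run;

/* Best-separated sets first */
proc sort data=ks;
   by descending _D_;
run;

proc print data=ks(obs=10);
run;
```

Note that D ignores direction, so a set in which GT lies entirely above LT (the "greater than 3, but less than 2" case) would also score high. A comparison of the group means or medians would be needed to screen those out.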

 

 

 

 

NKormanik
Barite | Level 11

@Rick_SAS @Ksharp @ballardw @ChrisNZ @Reeza   

Since I have your attention, please.... 

And any other 'machine learning' experts out there.

 

Below is a set of histograms that look somewhat reasonable, compared to others.  'Sweet spot' is somewhere in the middle -- that is, the values of the variable which the 'machine learning algorithm' selected as best (again, 1 vs. 0).

 

Trouble is, where's the 'sweet spot'??  Which range would you choose, and why?  Your thoughts greatly appreciated....

 

Question.png

 

 

ChrisNZ
Tourmaline | Level 20

Being very ignorant on the matter, it seems that B1-B2, being more narrow, is more discriminant (if that's the right term). Isn't a wider range less useful? If a wider range was useful, you could at the limit accept the whole data range as your sweet spot. 

This is made somewhat easy since the 2 ranges are contained in each other. Were they overlapping, the matter would be different. Just my uneducated comment.

 

NKormanik
Barite | Level 11

@ChrisNZ wrote:

Being very ignorant on the matter, it seems that B1-B2, being more narrow, is more discriminant (if that's the right term). Isn't a wider range less useful? If a wider range was useful, you could at the limit accept the whole data range as your sweet spot. 

This is made somewhat easy since the 2 ranges are contained in each other. Were they overlapping, the matter would be different. Just my uneducated comment.


 

I've yet to encounter anyone on here that's ignorant, least of all you.

 

What you suggest seems logical to me.  Looking forward to seeing whether anyone else will venture a comment.

 

Range B is a subset of Range A.  Much narrower.

 

If we choose the whole data range, then there really is no 'sweet spot.'  Gee, thanks 'machine learning algorithm'.... (NOT!)

 

Notice that Range A has the highest histogram bars.  Shouldn't we give it, then, a whole bunch of extra cred?  Just sayin'....

 

What if Range B had the higher histogram bars??

 

Somebody has some 'splanin' to do.  Ain't me.

 

Would take someone with lots of 'machine learning' expertise, is my hunch.

 

 

Ksharp
Super User
Do these two histograms correspond to two variables?
Or are they from a single variable, with the top one from GOOD and the bottom one from BAD?
NKormanik
Barite | Level 11

The two histograms refer to only one single variable.

 

Top histogram is for values greater than, for inclusion in the supposed 'sweet spot.'

 

Lower histogram is for values less than, for inclusion in the supposed 'sweet spot.'

 

Anything outside the designated inclusion zone is rejected.

 

(Actually, just binary coding, 1 vs 0.  1 = 'sweet spot'.)

 

Ksharp
Super User

As Reeza pointed out, go through all the cutpoints and pick the one that has the maximum ROC or K-S statistic.

Here I used the K-S statistic.

 


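/****** Get cutpoint ******/
/* Assumes macro variables &score_min and &score_max (integer score range)  */
/* and a data set SCORE_CARD with TOTAL_SCORE and the actual outcome GOOD_BAD */
proc delete data=cutpoint;run;
%macro cutpoint;
%do score=&score_min %to &score_max;
data test_total_score;
 set score_card;
 /* numeric prediction at this cutpoint: 1 = predicted good, 0 = predicted bad */
 _pred=(total_score ge &score);
run;
/* Compare the predictions between the actual good/bad groups. Classing by a  */
/* status derived from total_score itself would ignore the actual outcome.    */
proc npar1way data=test_total_score edf noprint;
class good_bad;
var _pred;
output out=cutpoint_ks(keep=_KS_) edf;
run;
data temp_cutpoint;
 retain cutpoint &score ;
 set cutpoint_ks ;
 label cutpoint='Cut point' _ks_='KS value';
run;
proc append base=cutpoint data=temp_cutpoint force;run;
%end;
%mend;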

%cutpoint

proc sql noprint;
select cutpoint into : cutpoint
 from cutpoint
  having _KS_=max(_KS_);
quit;
data _null_;
 set cutpoint end=last;
 if _n_=1 then call symputx('min',cutpoint);
 if last then call symputx('max',cutpoint);
run;

data final_total_score;
 set score_card;
 /* predicted class at the selected cutpoint; PROC FREQ below needs _status */
 _status=ifc(total_score ge &cutpoint,'good','bad ');
run;


/*********** Confusion Matrix  **********/
/*********** Also Check Validate Table **********/
title "Training set confusion matrix at cut point &cutpoint";
proc freq data=final_total_score  ;
table good_bad*_status/nopercent  norow  contents=' ';
label good_bad='Actual' _status='Predicted';
run;
title ' ';


/********* Graph for cutpoint and KS value *************/
data plot_cutpoint;
 set cutpoint;
 if cutpoint=&cutpoint then group=1;
  else group=0;
run;
proc sgplot data=plot_cutpoint noautolegend;
needle x=cutpoint y=_ks_ /group=group ;
yaxis  LABELATTRS=graphdata1 VALUEATTRS=graphdata1 ;
xaxis values=(&min &cutpoint &max);
run;

title "KS Test";
proc npar1way data=final_total_score plots=edfplot edf ;
class good_bad;
var total_score;
run;
title ' ';


/*********** Graph for Score Card  *************/
data plot_bad_percent;
 set report;
 length range $ 40;
 range=cats('[',put(int(min),8.0),',',put(int(max),8.0),']');
run;

proc sgplot data=plot_bad_percent ;
series x=range y=percent_bad/datalabel markers MARKERATTRS=(symbol=circlefilled
  size=12) MARKERFILLATTRS=(color=white) MARKEROUTLINEATTRS=graphdata1
  FILLEDOUTLINEDMARKERS DATALABELPOS=right datalabelattrs=(size=10);
xaxistable n_good n /valueattrs=(size=10);
label n='Count' ;
xaxis  FITPOLICY=stagger label='Score' labelposition=left
 LABELATTRS=graphdata1 VALUEATTRS=graphdata1(size=10) grid;
format percent_bad percent7.2;
run;
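
The code above relies on inputs that are not shown in the post: a data set SCORE_CARD with a numeric TOTAL_SCORE and an actual outcome GOOD_BAD, plus the macro variables &score_min and &score_max. A minimal sketch for trying it out on made-up data follows. The sample size, the 70/30 good/bad mix, and the two normal distributions are arbitrary assumptions. Run this before calling %cutpoint.

```sas
/* Simulated stand-in for SCORE_CARD (hypothetical values, names as used above) */
data score_card;
   call streaminit(12345);
   length good_bad $4;
   do id = 1 to 1000;
      if rand('bernoulli', 0.7) then do;
         good_bad = 'good';
         total_score = round(rand('normal', 60, 10));   /* goods score higher */
      end;
      else do;
         good_bad = 'bad';
         total_score = round(rand('normal', 45, 10));   /* bads score lower */
      end;
      output;
   end;
run;

/* Integer score range for the %do loop in %cutpoint */
proc sql noprint;
   select min(total_score), max(total_score)
      into :score_min trimmed, :score_max trimmed
      from score_card;
quit;

%put NOTE: score range is &score_min to &score_max;
```

The last step ("Graph for Score Card") also reads a data set REPORT with MIN, MAX, PERCENT_BAD, N_GOOD and N. It is not created anywhere in this thread, so that step will not run with the simulated data alone.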
NKormanik
Barite | Level 11

@Ksharp   Ready to wrap it all up and move on, declaring you the winner.  But...  Can't get your script to run successfully.

 

Is it possible for you to use any of the SAS sample data?  Or show how to make up, and use, some random data?  That way we all can give your terrific code a try.

 

@Reeza   Since Ksharp referred to you, you can chime in here too, if you want.

 

Or anyone else, of course.

 

Ksharp, could you also please try to explain what your code is meant to do?  How does it relate to the histograms?

 

Now I'm wondering if other 'machine learning' algorithms would give similar results.  That doesn't even include Proc Logistic, Proc PLS, etc.

 

What a rabbit hole....

 
