Solved: Histogram Challenge - Page 2

Ksharp · Posted 08-21-2021 05:38 AM

So you want to overlay these two histograms?

data a;
   set sashelp.heart;
   where sex='Male';
run;

data b;
   set sashelp.heart;
   where sex='Female';
run;
/* Combine both subsets and record the source dataset name */

data combined;
   set a b indsname=dsn;
   source=dsn;
run;

Proc sgplot data=combined;
   histogram height /group=source datalabel scale=count  transparency=0.5;
   density height/type=kernel group=source;
run;

NKormanik · Posted 08-21-2021 10:36 PM

Remember, the 'machine learning' has attempted to show a 'sweet spot' -- some level along a continuous range of values that supposedly results in a best chance of outcome.

The top histogram is for Greater Than values. The lower histogram is for Less Than values.

Of the nearly 400 such sets of histograms, NONE appear to be perfect -- i.e., a reasonable clear distance between GT and LT. So, the objective is to find the 'best' of the 400.

Two more examples of "obviously nuts":

See the overlap? Like saying your 'sweet spot' is greater than 3, but less than 2.

Here's another:

So, on a scale of -71 to 71, the 'sweet spot' lies between -71 and 71.

How helpful. (NOT!)

Reeza · Posted 08-21-2021 10:41 PM

Compute the metric of interest for every possible interval.
Then isolate the interval with the value of interest.

Ksharp · Posted 08-22-2021 05:47 AM

I think this is more of an OR problem than a histogram graph problem.

Calling @Rick_SAS @RobPratt

NKormanik · Posted 08-22-2021 05:56 AM

By the way, what would the ideal set of histograms look like?

Glad you asked....

So, the programming objective is to find among the actual 400 sets of histograms, the ones that are as close as possible to the ideal.

(Actually, I already know that NONE even come close. But which one is 'best'? Are any even usable?)

(This was, you know, a well-funded Government study. Tax payers want.... well, maybe not.... At any rate.)

Reeza · Posted 08-23-2021 12:12 PM

What's the ideal? How is that defined?

Rick_SAS · Posted 08-22-2021 06:25 AM

Without a statistical description of the problem, there isn't much to say. Perhaps the OP has a binary classification problem? The two histograms represent the distributions of the data that are classified into each category. The search for "overlap" and "flat regions" for 400 pairs of histograms might be an attempt to understand how the classifier works with various parameters or threshold criteria. (Or maybe for random subsets of the data?) The search for a "best" pair might be an attempt to find a threshold value that maximizes the ability of the classifier to discriminate between the two categories.

If this is the case, the OP might be interested in learning about ROC curves for binary classifiers. I wrote a few articles that introduce some of the important statistical ideas for binary classifiers. See:

In SAS, you can use PROC LOGISTIC to overlay and compare ROC curves from different models or rules.
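
A minimal sketch of that kind of comparison, assuming sashelp.heart as stand-in data (the response `dead` and the predictors height and weight are illustrative choices, not the OP's variables). The NOFIT option skips fitting the full model, so each ROC statement is evaluated as its own one-variable rule:

```sas
/* Sketch only: sashelp.heart and these predictors are stand-ins, not the OP's data */
data heart;
   set sashelp.heart;
   dead = (status = 'Dead');   /* binary response: 1 = dead, 0 = alive */
run;

proc logistic data=heart plots(only)=roc;
   /* NOFIT: do not fit the two-variable model, only evaluate the ROC statements */
   model dead(event='1') = height weight / nofit;
   roc 'Height only' height;
   roc 'Weight only' weight;
   /* Test whether the areas under the two curves differ */
   roccontrast reference('Height only') / estimate;
run;
```

The overlaid curves and the ROCCONTRAST table give a numeric way to rank candidate variables or rules, rather than judging pairs of histograms by eye.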

NKormanik · Posted 08-23-2021 08:24 PM

@Rick_SAS Yes, the initial problem was binary classification: along a continuum of values of a variable, can the algorithm arrive at a point where everything above that point is good and everything below it is bad (1 vs. 0)?

I used HPSplit to attempt this classification. Many datasets. The results were highly mixed, as shown above in the histograms of the results.

Rather than toss the lot, I wanted to at least home in on some of the results that looked most promising, and then do further testing on those.

So far, the winnowing of the histogram sets has to be done by eyeballing. This post was an attempt to find a programmable way of going about it.

NKormanik · Posted 08-24-2021 08:02 PM

@Rick_SAS @Ksharp @ballardw @ChrisNZ @Reeza

Since I have your attention, please....

And any other 'machine learning' experts out there.

Below is a set of histograms that look somewhat reasonable, compared to others. 'Sweet spot' is somewhere in the middle -- that is, the values of the variable which the 'machine learning algorithm' selected as best (again, 1 vs. 0).

Trouble is, where's the 'sweet spot'?? Which range would you choose, and why? Your thoughts greatly appreciated....

ChrisNZ · Posted 08-24-2021 10:34 PM

Being very ignorant on the matter, it seems that B1-B2, being more narrow, is more discriminant (if that's the right term). Isn't a wider range less useful? If a wider range was useful, you could at the limit accept the whole data range as your sweet spot.

This is made somewhat easy since the 2 ranges are contained in each other. Were they overlapping, the matter would be different. Just my uneducated comment.

NKormanik · Posted 08-25-2021 05:00 AM

@ChrisNZ wrote:

Being very ignorant on the matter, it seems that B1-B2, being more narrow, is more discriminant (if that's the right term). Isn't a wider range less useful? If a wider range was useful, you could at the limit accept the whole data range as your sweet spot.

This is made somewhat easy since the 2 ranges are contained in each other. Were they overlapping, the matter would be different. Just my uneducated comment.

I've yet to encounter anyone on here that's ignorant, least of all you.

What you suggest seems logical to me. Looking forward to seeing whether anyone else will venture a comment.

Range B is a subset of Range A. Much narrower.

If we choose the whole data range, then there really is no 'sweet spot.' Gee, thanks 'machine learning algorithm'.... (NOT!)

Notice that Range A has the highest histogram bars. Shouldn't we give it, then, a whole bunch of extra cred? Just sayin'....

What if Range B had the higher histogram bars??

Somebody has some 'splanin' to do. Ain't me.

Would take someone with lots of 'machine learning' expertise, is my hunch.

Ksharp · Posted 08-25-2021 07:53 AM

Do these two histograms correspond to two variables?
Or are they from just one variable, where the top one is from GOOD and the bottom one is from BAD?

NKormanik · Posted 08-25-2021 08:27 PM

The two histograms refer to only one single variable.

Top histogram is for values greater than, for inclusion in the supposed 'sweet spot.'

Lower histogram is for values less than, for inclusion in the supposed 'sweet spot.'

Anything outside the designated inclusion zone is rejected.

(Actually, just binary coding, 1 vs 0. 1 = 'sweet spot'.)

Ksharp · Posted 08-26-2021 07:43 AM

As Reeza pointed out, go through all the cutpoints and pick the one that has the maximum ROC or K-S statistic.

Here I used K-S statistic.


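/****** Get cutpoint ******/
/* Assumes dataset score_card (with total_score and good_bad) already exists  */
/* and that macro variables &score_min and &score_max hold the integer score  */
/* range to scan; none of these are defined in this post.                     */
proc delete data=cutpoint;run;
%macro cutpoint;
%do score=&score_min %to &score_max;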
data test_total_score;
 set score_card;
 _status=ifc(total_score ge &score,'good','bad ');
run;
proc npar1way data=test_total_score edf noprint;
class _status;
var total_score;
output out=cutpoint_ks(keep=_KS_) edf;
run;
data temp_cutpoint;
 retain cutpoint &score ;
 set cutpoint_ks ;
 label cutpoint='Cutpoint' _ks_='KS value';
run;
proc append base=cutpoint data=temp_cutpoint force;run;
%end;
%mend;

%cutpoint

proc sql noprint;
select cutpoint into : cutpoint
 from cutpoint
  having _KS_=max(_KS_);
quit;
data _null_;
 set cutpoint end=last;
 if _n_=1 then call symputx('min',cutpoint);
 if last then call symputx('max',cutpoint);
run;


data final_total_score;
 set score_card;
 /* The leading asterisk commented this statement out, but PROC FREQ below needs _status */
 _status=ifc(total_score ge &cutpoint,'good','bad ');
run;


/*********** Confusion Matrix  **********/
/*********** Also check the validation table **********/
title "Confusion matrix for the training set at cutpoint &cutpoint";
proc freq data=final_total_score  ;
table good_bad*_status/nopercent  norow  contents=' ';
label good_bad='Actual' _status='Predicted';
run;
title ' ';


/********* Graph for cutpoint and KS value *************/
data plot_cutpoint;
 set cutpoint;
 if cutpoint=&cutpoint then group=1;
  else group=0;
run;
proc sgplot data=plot_cutpoint noautolegend;
needle x=cutpoint y=_ks_ /group=group ;
yaxis  LABELATTRS=graphdata1 VALUEATTRS=graphdata1 ;
xaxis values=(&min &cutpoint &max);
run;

title "K-S test";
proc npar1way data=final_total_score plots=edfplot edf ;
class good_bad;
var total_score;
run;
title ' ';


/*********** Graph for Score Card  *************/
data plot_bad_percent;
 set report;
 length range $ 40;
 range=cats('[',put(int(min),8.0),',',put(int(max),8.0),']');
run;

proc sgplot data=plot_bad_percent ;
series x=range y=percent_bad/datalabel markers MARKERATTRS=(symbol=circlefilled
  size=12) MARKERFILLATTRS=(color=white) MARKEROUTLINEATTRS=graphdata1
  FILLEDOUTLINEDMARKERS DATALABELPOS=right datalabelattrs=(size=10);
xaxistable n_good n /valueattrs=(size=10);
label n='Count' ;
xaxis  FITPOLICY=stagger label='Score' labelposition=left
 LABELATTRS=graphdata1 VALUEATTRS=graphdata1(size=10) grid;
format percent_bad percent7.2;
run;

NKormanik · Posted 08-27-2021 04:06 AM

@Ksharp Ready to wrap it all up and move on, declaring you the winner. But... Can't get your script to run successfully.

Is it possible for you to use any of the SAS sample data? Or show how to make up, and use, some random data? That way we all can give your terrific code a try.

@Reeza Since Ksharp referred to you, you can chime in here too, if you want.

Or anyone else, of course.

Ksharp, too, could you please try to explain what you were coding to do? How does it relate to the histograms?

Now I'm wondering if other 'machine learning' algorithms would give similar results. That doesn't even include Proc Logistic, Proc PLS, etc.

What a rabbit hole....
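
For anyone who wants to try the cutpoint idea on SAS sample data, here is a minimal sketch. It is not Ksharp's script: it assumes sashelp.heart as stand-in data, treats cholesterol as the score and `status = 'Dead'` as the "bad" class (both are illustrative assumptions), and computes the two-sample K-S distance directly from the two empirical distribution functions instead of looping over PROC NPAR1WAY.

```sas
/* Sketch only: sashelp.heart, cholesterol, and status are stand-ins for the OP's data */
data scores;
   set sashelp.heart(keep=status cholesterol);
   where cholesterol is not missing;
   bad = (status = 'Dead');              /* 1 = bad, 0 = good */
run;

/* Class totals, used as denominators of the empirical distribution functions */
proc sql noprint;
   select sum(bad), sum(1 - bad) into :n_bad, :n_good
   from scores;
quit;

proc sort data=scores;
   by cholesterol;
run;

/* Running EDF of each class, evaluated once per distinct score value */
data ks;
   set scores;
   by cholesterol;
   cum_bad  + bad;
   cum_good + (1 - bad);
   if last.cholesterol then do;
      edf_bad  = cum_bad  / &n_bad;
      edf_good = cum_good / &n_good;
      d = abs(edf_bad - edf_good);       /* vertical gap between the two EDFs */
      output;
   end;
   keep cholesterol edf_bad edf_good d;
run;

/* The candidate cutpoint is the score value where the gap is largest */
proc sql;
   select cholesterol as cutpoint, d as ks_statistic
   from ks
   where d = (select max(d) from ks);
quit;

/* Optional: plot the two EDFs to see where they separate */
proc sgplot data=ks;
   series x=cholesterol y=edf_bad;
   series x=cholesterol y=edf_good;
run;
```

The link to the histograms is that the two EDFs are the running totals of the "good" and "bad" histograms. The larger the maximum gap, the less the two histograms overlap, so ranking the 400 pairs by `ks_statistic` is one programmable alternative to eyeballing them.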
