Solved: Histogram Challenge

NKormanik · Posted 08-16-2021 08:19 PM

Two data sets, A and B. Continuous variables. X in A. Y in B.

Using Proc Univariate (or other) we can create histograms of the single variable in each dataset.

Imagine plotting the two histograms one above the other:

Objective: Sum the corresponding bins. (And then histogram that.)

Perhaps the same result can be obtained using potentially generated kernel data. Summing bin kernel data, then, and histogramming that would yield the same conclusions.

There might be other approaches as well.

Your thoughts appreciated....

Nicholas Kormanik

(As has been the case so many times before, my question above was not asked in the best of ways. Sorry. It evolves below....)

Ksharp · Posted 08-27-2021 08:23 AM

OK. I made a sample data for running my code.


data score_card;
call streaminit(123);
 do i=1 to 10000;
  *total_score=rand('integer',400,600);
  total_score=int(rand('uniform')*200) + 400;
  output;
 end;
drop i;
run;

proc sql noprint;
select min(total_score)  ,max(total_score) into :score_min, :score_max from score_card;
quit;


/****** Get cutpoint ******/
proc delete data=cutpoint;run;
%macro cutpoint;
%do score=&score_min %to &score_max;
data test_total_score;
 set score_card;
 _status=ifc(total_score ge &score,'good','bad ');
run;
proc npar1way data=test_total_score edf noprint;
class _status;
var total_score;
output out=cutpoint_ks(keep=_KS_) edf;
run;
data temp_cutpoint;
 retain cutpoint &score ;
 set cutpoint_ks ;
 label cutpoint='分割点' _ks_='KS值';
run;
proc append base=cutpoint data=temp_cutpoint force;run;
%end;
%mend;

%cutpoint

proc sql noprint;
select cutpoint into : cutpoint
 from cutpoint
  having _KS_=max(_KS_);
quit;

data final_total_score;
 set score_card;
 call streaminit(123);
good_bad=ifc(rand('bern',0.2),'bad','good');

run;

proc sgpanel data=final_total_score;
panelby good_bad/layout=rowlattice onepanel noborder novarname;
histogram total_score;
refline &cutpoint /axis=x;
run;

View solution in original post

Reeza · Posted 08-16-2021 08:43 PM

I’m not sure what the problem is … if you combine the data sets univariate allows you to specify the bins interval and widths via the options and the histogram statement in proc sgplot also allows you to control the binwidths and intervals but not density overlays.

No procedure works with multiple data sets as the main data source AFAIK (mapping excluded).

NKormanik · Posted 08-17-2021 01:11 AM

No "problem", per se. Just was interested in your thoughts.

A priori, I haven't a clue what the bins distribution or sizes might be, until Proc Univariate, Histogram, shows me.

And, as one might anticipate, this case is just one of many. Say, 400 or so.

I am particularly interested in LOW areas, and especially areas where there is as little 'overlap' as possible between the two histograms.

"No procedure works with multiple data sets as the main data source..."

Important to be aware of this. Thus, some combining of the two datasets will be necessary.

There are options to output potentially created datasets: outhistogram and outkernel. But not sure if that would be the way to go in present case. Perhaps it is.

ballardw · Posted 08-17-2021 07:24 AM

My basic approach to what I think you are asking if it is to create a histogram is to combine data sets adding an identifier for which set each record comes from. Then use Proc SGplot with a histogram and the identifier as a Group variable

/* first make a couple of data sets with common variables*/
data a;
   set sashelp.cars;
   where type='Sedan';
run;

data b;
   set sashelp.cars;
   where type='SUV';
run;
/* combine*/

data combined;
   set a b indsname=dsn;
   source=dsn;
run;

Proc sgplot data=combined;
   histogram wheelbase /group=source transparency=.5;
run;

If you haven't seen it before the INDSNAME option creates a temporary variable that holds the name of the data set contributing the current observation. The Source=dsn; then makes a variable that stays in the data set with that information.

Then the Histogram statement overlays the two data sources with different color. The transparency allows seeing a "short" bar hidden by taller one with similar values of the variable. Options in the Histogram statement allow you to control the size of bins with options like BINWIDTH, which sets the range of values represented by bars Binwidth=5 would mean bins represent multiples of 5; NBINS where you specify the number of bins and SAS calculates the widths based on that.

Proc sgplot data=combined;
   histogram wheelbase /group=source transparency=.5
                        binwidth=5
   ;
run;


Proc sgplot data=combined;
   histogram wheelbase /group=source transparency=.5
                        nbins=25
   ;
run;

There are more options available. Read the documentation.

Ksharp · Posted 08-17-2021 09:17 AM

data a;
   set sashelp.cars;
   where type='Sedan';
run;

data b;
   set sashelp.cars;
   where type='SUV';
run;
/* combine*/

data combined;
   set a b indsname=dsn;
   source=dsn;
run;

Proc sgpanel data=combined;
   panelby source/layout=rowlattice novarname;
   histogram wheelbase /datalabel scale=count dataskin=pressed ;
   density wheelbase/type=kernal;
   density wheelbase/type=normal;
run;

NKormanik · Posted 08-18-2021 08:51 PM

Further explanation....

The histograms have been generated through 'machine learning' (Proc HPSplit).

The goal was to find a 'sweet spot' for some process.

The top histogram says the 'sweet spot' is GREATER THAN the parameter values indicated.

The lower histogram says the 'sweet spot' is LESS THAN the parameter values indicated.

Where there is a lot of overlap in parameter estimation, well, that's not good. 'Sweet spot' can't be GT and LT some value simultaneously.

Thus, the notion of adding the two histograms, and looking for the low area. Not at all perfect.

Bear in mind that there are over 400 such sets of histograms. And the objective is to find out which sets represents the best 'sweet spots' possible.

Presently the approach I'm using is to eyeball it.

ChrisNZ · Posted 08-19-2021 03:26 AM

Like this?

data a;
   set sashelp.cars;
   where type='Sedan';
run;

data b;
   set sashelp.cars;
   where type='SUV' ;
   wheelbase=wheelbase*1.4;
run;

/* combine*/
data combined;
   set a b indsname=dsn;
   source=dsn;
run;

proc sql noprint; select wheelbase into : lineval from (select wheelbase, count(*) as freq from combined group by wheelbase) having freq=min(freq);

proc sgpanel data=combined;
   panelby source/layout=rowlattice novarname;
   histogram wheelbase /datalabel scale=count dataskin=pressed ;lineparm X=&lineval Y=150 slope=. ;
   density wheelbase/type=kernal;
   density wheelbase/type=normal;
run;

High-Performance SAS Coding - Third Edition

NKormanik · Posted 08-20-2021 03:23 AM

@ChrisNZ @Ksharp @ballardw @Reeza All appreciated examples of what one might do. Good to have them here.

My original question probably wasn't exactly clear enough.

Using an analogy I'll try to illustrate a bit more:

Imagine a river. Suppose there are two banks (sides, rocks, soil). Call one a "Left Bank" (Greater Than) and the other "Right Bank" (Less Than). There is water in the middle between the Left Bank and Right Bank.

I'm trying to find the 'water' between the banks. I.e., a flat area between the histograms.

Below is another image that shows too much water between the banks:

I.e., huge open area between the histograms. Not helpful.

Looking for a Goldilocks-sized river.

ballardw · Posted 08-20-2021 05:04 AM

Still clear as mud.

NKormanik · Posted 08-21-2021 10:21 PM

@ballardw Apt and clever response. That's what we see in the histograms shown at top.

Below is one that's even worse:

Clear as rocky banks and mud.

ballardw · Posted 08-22-2021 02:41 AM

@NKormanik wrote:

@ballardw Apt and clever response. That's what we see in the histograms shown at top.

Below is one that's even worse:

Clear as rocky banks and mud.

Can you add something to point out what you are finding of interest? Or mention coordinates?

I think we have, at least partially, one of those things where you have become so involved with your data that you expect others to know it without actually describing it clearly. I see lots of "low" spot. But haven't a clue what makes this "even worse" than the first one.

If you actually want to "sum" the two then the first part is to create the discrete bins, without actual data that might be interesting challenge by itself but with ODS graphics you can get usually get the graphed data using ODS output. If the two sets have x axis values/increments in common then combining two output sets with the counts and summing is a pretty trivial Proc freq or Proc means/summary step.

But starting data would be needed and likely the specific options you are /want to use for binning the data.Sgplot where you set bidwidths and and a starting offset might work best.

ChrisNZ · Posted 08-20-2021 06:04 AM

Love the analogy and the pics! 🙂

What do you look for when you eyeball the charts by hand? What's the pattern you look for that could be coded?

High-Performance SAS Coding - Third Edition

Reeza · Posted 08-20-2021 11:41 AM

Plateau - you're looking for the plateaus.

NKormanik · Posted 08-20-2021 08:20 PM

Plateau? No, the opposite. A valley. And a nice medium-sized river running through the valley.

ChrisNZ · Posted 08-21-2021 04:25 AM

> nice medium-sized river running through the valley

This does not describe an algorithm. My simple logic looks for the lowest point and plots it. What would the logic to implement be?

High-Performance SAS Coding - Third Edition

Histogram Challenge

Re: Histogram Challenge

Re: Histogram Challenge

Re: Histogram Challenge

Re: Histogram Challenge

Re: Histogram Challenge

Re: Histogram Challenge

Re: Histogram Challenge

Re: Histogram Challenge

Re: Histogram Challenge

Re: Histogram Challenge

Re: Histogram Challenge

Re: Histogram Challenge

Re: Histogram Challenge

Re: Histogram Challenge

Re: Histogram Challenge

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!