BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
NKormanik
Barite | Level 11

Two data sets, A and B.  Continuous variables.  X in A.  Y in B.

Using Proc Univariate (or other) we can create histograms of the single variable in each dataset.

 

Imagine plotting the two histograms one above the other:

 

Two Histograms.png

 

Objective:  Sum the corresponding bins.  (And then histogram that.)

 

Perhaps the same result can be obtained using potentially generated kernel data.  Summing bin kernel data, then, and histogramming that would yield the same conclusions.

 

There might be other approaches as well.

 

Your thoughts appreciated....

 

Nicholas Kormanik

 

(As has been the case so many times before, my question above was not asked in the best of ways.  Sorry.  It evolves below....)

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

OK. I made a sample data for running my code.

 


data score_card;
call streaminit(123);
 do i=1 to 10000;
  *total_score=rand('integer',400,600);
  total_score=int(rand('uniform')*200) + 400;
  output;
 end;
drop i;
run;

proc sql noprint;
select min(total_score)  ,max(total_score) into :score_min, :score_max from score_card;
quit;


/****** Get cutpoint ******/
proc delete data=cutpoint;run;
%macro cutpoint;
%do score=&score_min %to &score_max;
data test_total_score;
 set score_card;
 _status=ifc(total_score ge &score,'good','bad ');
run;
proc npar1way data=test_total_score edf noprint;
class _status;
var total_score;
output out=cutpoint_ks(keep=_KS_) edf;
run;
data temp_cutpoint;
 retain cutpoint &score ;
 set cutpoint_ks ;
 label cutpoint='分割点' _ks_='KS值';
run;
proc append base=cutpoint data=temp_cutpoint force;run;
%end;
%mend;

%cutpoint

proc sql noprint;
select cutpoint into : cutpoint
 from cutpoint
  having _KS_=max(_KS_);
quit;

data final_total_score;
 set score_card;
call streaminit(123);
good_bad=ifc(rand('bern',0.2),'bad','good');
run; proc sgpanel data=final_total_score; panelby good_bad/layout=rowlattice onepanel noborder novarname; histogram total_score; refline &cutpoint /axis=x; run;

View solution in original post

31 REPLIES 31
Reeza
Super User
I’m not sure what the problem is … if you combine the data sets univariate allows you to specify the bins interval and widths via the options and the histogram statement in proc sgplot also allows you to control the binwidths and intervals but not density overlays.

No procedure works with multiple data sets as the main data source AFAIK (mapping excluded).

NKormanik
Barite | Level 11

No "problem", per se.  Just was interested in your thoughts.

 

A priori, I haven't a clue what the bins distribution or sizes might be, until Proc Univariate, Histogram, shows me.

 

And, as one might anticipate, this case is just one of many.  Say, 400 or so.

 

I am particularly interested in LOW areas, and especially areas where there is as little 'overlap' as possible between the two histograms.

 

"No procedure works with multiple data sets as the main data source..."

Important to be aware of this.  Thus, some combining of the two datasets will be necessary.

 

There are options to output potentially created datasets: outhistogram and outkernel.  But not sure if that would be the way to go in present case.  Perhaps it is.

 

 

ballardw
Super User

My basic approach to what I think you are asking if it is to create a histogram is to combine data sets adding an identifier for which set each record comes from. Then use Proc SGplot with a histogram and the identifier as a Group variable

 

/* first make a couple of data sets with common variables*/
data a;
   set sashelp.cars;
   where type='Sedan';
run;

data b;
   set sashelp.cars;
   where type='SUV';
run;
/* combine*/

data combined;
   set a b indsname=dsn;
   source=dsn;
run;

Proc sgplot data=combined;
   histogram wheelbase /group=source transparency=.5;
run;

If you haven't seen it before the INDSNAME option creates a temporary variable that holds the name of the data set contributing the current observation. The Source=dsn; then makes a variable that stays in the data set with that information.

 

Then the Histogram statement overlays the two data sources with different color. The transparency allows seeing a "short" bar hidden by taller one with similar values of the variable. Options in the Histogram statement allow you to control the size of bins with options like BINWIDTH, which sets the range of values represented by bars Binwidth=5 would mean bins represent multiples of 5; NBINS where you specify the number of bins and SAS calculates the widths based on that.

Proc sgplot data=combined;
   histogram wheelbase /group=source transparency=.5
                        binwidth=5
   ;
run;


Proc sgplot data=combined;
   histogram wheelbase /group=source transparency=.5
                        nbins=25
   ;
run;

There are more options available. Read the documentation.

Ksharp
Super User
data a;
   set sashelp.cars;
   where type='Sedan';
run;

data b;
   set sashelp.cars;
   where type='SUV';
run;
/* combine*/

data combined;
   set a b indsname=dsn;
   source=dsn;
run;

Proc sgpanel data=combined;
   panelby source/layout=rowlattice novarname;
   histogram wheelbase /datalabel scale=count dataskin=pressed ;
   density wheelbase/type=kernal;
   density wheelbase/type=normal;
run;

Ksharp_0-1629206236262.png

 

NKormanik
Barite | Level 11

Further explanation....

 

The histograms have been generated through 'machine learning' (Proc HPSplit).

 

The goal was to find a 'sweet spot' for some process.

 

The top histogram says the 'sweet spot' is GREATER THAN the parameter values indicated.

 

The lower histogram says the 'sweet spot' is LESS THAN the parameter values indicated.

 

Where there is a lot of overlap in parameter estimation, well, that's not good.  'Sweet spot' can't be GT and LT some value simultaneously.

 

Thus, the notion of adding the two histograms, and looking for the low area.  Not at all perfect.

 

Bear in mind that there are over 400 such sets of histograms.  And the objective is to find out which sets represents the best 'sweet spots' possible.

 

Presently the approach I'm using is to eyeball it.

 

ChrisNZ
Tourmaline | Level 20

Like this?

data a;
   set sashelp.cars;
   where type='Sedan';
run;

data b;
   set sashelp.cars;
   where type='SUV' ;
   wheelbase=wheelbase*1.4;
run;

/* combine*/
data combined;
   set a b indsname=dsn;
   source=dsn;
run;

proc sql noprint; select wheelbase into : lineval from (select wheelbase, count(*) as freq from combined group by wheelbase) having freq=min(freq);

proc sgpanel data=combined;
   panelby source/layout=rowlattice novarname;
   histogram wheelbase /datalabel scale=count dataskin=pressed ;lineparm X=&lineval Y=150 slope=. ;
   density wheelbase/type=kernal;
   density wheelbase/type=normal;
run;

ChrisNZ_0-1629357921742.png

 

 

NKormanik
Barite | Level 11

@ChrisNZ @Ksharp @ballardw @Reeza  All appreciated examples of what one might do.  Good to have them here.

 

My original question probably wasn't exactly clear enough.

 

Using an analogy I'll try to illustrate a bit more:

 

Imagine a river.  Suppose there are two banks (sides, rocks, soil).  Call one a "Left Bank" (Greater Than) and the other "Right Bank" (Less Than). There is water in the middle between the Left Bank and Right Bank.

 

Two Banks.jpg

 

I'm trying to find the 'water' between the banks.  I.e., a flat area between the histograms.

 

Below is another image that shows too much water between the banks:

 

Crested Banks.jpg

 

I.e., huge open area between the histograms.  Not helpful.

 

Looking for a Goldilocks-sized river.

 

 

ballardw
Super User

Still clear as mud.

 

 

NKormanik
Barite | Level 11

@ballardw  Apt and clever response.  That's what we see in the histograms shown at top.

 

Below is one that's even worse:

 

2​.png

 

Clear as rocky banks and mud.

 

ballardw
Super User

@NKormanik wrote:

@ballardw  Apt and clever response.  That's what we see in the histograms shown at top.

 

Below is one that's even worse:

 

2​.png

 

Clear as rocky banks and mud.

 


Can you add something to point out what you are finding of interest? Or mention coordinates?

I think we have, at least partially, one of those things where you have become so involved with your data that you expect others to know it without actually describing it clearly. I see lots of "low" spot. But haven't a clue what makes this "even worse" than the first one.

 

If you actually want to "sum" the two then the first part is to create the discrete bins, without actual data that might be interesting challenge by itself but with ODS graphics you can get usually get the graphed data using ODS output. If the two sets have x axis values/increments in common then combining two output sets with the counts and summing is a pretty trivial Proc freq or Proc means/summary step.

But starting data would be needed and likely the specific options you are /want to use for binning the data.Sgplot where you set bidwidths and and a starting offset might work best.

 

ChrisNZ
Tourmaline | Level 20

Love the analogy and the pics! 🙂

What do you look for when you eyeball the charts by hand? What's the pattern you look for that could be coded?

Reeza
Super User
Plateau - you're looking for the plateaus.
NKormanik
Barite | Level 11

Plateau?  No, the opposite.  A valley.  And a nice medium-sized river running through the valley.

 

ChrisNZ
Tourmaline | Level 20

> nice medium-sized river running through the valley

This does not describe an algorithm. My simple logic looks for the lowest point and plots it. What would the logic to implement be?

 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 31 replies
  • 1214 views
  • 26 likes
  • 6 in conversation