BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jtorourke
Fluorite | Level 6

I have a large database and created a Tabulate process that gives me the stats that I need.  However, I also need to go back into the tabulated table and erase certain variables in certain rows, and then reprint the table.  I need to include the "ALL" tabulations in the "sanitized" output, but using the ALL function in a Proc Report does not calculate the correct means and percentiles.

 

Is there a way to output the correct ALL tabulations in the resulting table before I sanitize the output or task Proc Report to use a different table than the one I am reporting to calculate the correct ALL stats?  Thanks for anything you can do for me.

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@jtorourke

You should go for a coding option which never prints the value in the table (hidden or not) independent of the output destination.

 

"if I really blank out the values, then my "Summarize" Stats will be incorrect"

The compute block sets the calculated variable salary_MEAN to missing (not the source variable salary). I believe this doesn't impact on the correctness of other calculations. At least in my sample the Mean in the total row is correct and uses all values.

 

But I agree: Explicitly setting a variable to missing when it's only about printing is "ugly". Using a format conditionally feels like a better way as this will definitely only change "display" and not "calculation".

Unlike using a style to hide the value using a format will never write the actual value to the output destination - plus using a format would also allow you to easily replace values with something else than a blank (i.e. text like "hidden" or a dash). 

 

Sample 38776: How to apply a format to a cell or multiple cells within a row with PROC REPORT

 

Applying above to my sample code could then look like: 

 

  compute salary_MEAN;
    if person_N<=5 then
      do;
        call define(_col_,'format','blank.');
      end;
  endcomp;

 

And here the full code just to have everything together in a single post.

data have;
  group='A';
  do person=1 to 10;
    salary=1;
    output;
  end;
  group='B';
  do person=11 to 15;
    salary=4;
    output;
  end;
run;

proc format;
  value blank (default=1)
    other=' '
    ;
quit;

proc report data=WORK.HAVE nowd;
  column group person, N=person_N salary, MEAN=salary_MEAN;
  define group / group 'group' missing;
  define person / analysis N 'person' missing;
  define salary / analysis MEAN 'salary' missing;

  compute group;
    if _break_ eq ' ' then do;
      if group ne ' ' then hold1=group;
    end;
    if upcase(_break_)="GROUP" then do;
      call define("group", 'style', 'style=[pretext="Subtotal"]');
    end;
    if _break_='_RBREAK_' then do;
      call define("group", 'style', 'style=[pretext="Total"]');
    end;
  endcomp;

  compute salary_MEAN;
    if person_N<=5 then
      do;
        call define(_col_,'format','blank.');
      end;
  endcomp;

  rbreak after / summarize;
  run;
quit;
TITLE; FOOTNOTE;

 

View solution in original post

14 REPLIES 14
ballardw
Super User

@jtorourke wrote:

I have a large database and created a Tabulate process that gives me the stats that I need.  However, I also need to go back into the tabulated table and erase certain variables in certain rows, and then reprint the table.  I need to include the "ALL" tabulations in the "sanitized" output, but using the ALL function in a Proc Report does not calculate the correct means and percentiles.

 

Is there a way to output the correct ALL tabulations in the resulting table before I sanitize the output or task Proc Report to use a different table than the one I am reporting to calculate the correct ALL stats?  Thanks for anything you can do for me.


I strongly suggest that you provide a small example data set in the form of data step code the is similar to your data and tabulate code to generate your output. Be sure to include some values that you need to "exclude". Then tell use, referencing the appearance of the output generated by your proc tabulate code what needs to "be erased", and if erased do you leave blank cells or remove entire rows/columns of output.

 

Depending on your actual example I think can think of a couple possible approaches but we really need to know the extent and rules behind what is erased.

 

jtorourke
Fluorite | Level 6

Sorry, attached is a sample of the output before I blank out some of the data:

 

Report Heading.GIFReport Detail.GIF

 

 I want to insert blanks when the Sum of incumbents for any breakout of the data is less than 5.  But I want the calculations for the "ALL" summary to remain the same because they are accurate.  Here is what happens when I run a report of this table using Summarize in the report code:

 

PDF Report Headings.GIFPDF Report Detail.GIF

 

You can see that the "Summarize" function could not produce the correct "Average" or Percentiles because the detailed data table was not used.  Not shown in this example is how I blanked the salary data when the number of incumbents reported in a peer group was less than 5.  Here is the code I used to build the initial tabulation and then blank the variables:

 

PROC TABULATE Data=WORK.SABA_DATA CLASSDATA=WORK.SABA_DATA out=Work.OUT1;
options nodate pageno=1 linesize=64 
        pagesize=40;

	Class JobCode JobTitle Peer_Code / missing ;
	var Numincum Salary Bonus Total_Cash;

	table JobCode*JobTitle*(Peer_Code ALL), (Sum*Numincum="Incumbents"*f=6.) (Mean="Average"*Salary*f=COMMA8.) 
			(p25="25th Percentile"*Salary*f=COMMA8.)(Median*Salary*f=COMMA8.) (p75="75th Percentile"*Salary*f=COMMA8.) 
			(N="Count"*Bonus) (Mean="Average"*Bonus*f=COMMA8.) (Mean="Average Total Cash"*Total_Cash*f=COMMA8.) ;


	
Run;

Data Work.OUT1;
	Set Work.out1;
	If Numincum_Sum<5 Then
	DO;
		Salary_Mean="";
		Salary_P25="";
		Salary_Median=" ";
		Salary_P75=" ";
		Bonus_Mean=" ";
		Total_Cash_Mean=" ";
	End;
Run;

The final report after blanking the variables where necessary was created by the following code:

Proc Report Data=Work.Out1;

define JobCode / order ;
define JobTitle / order ;
define Peer_Code / order order=formatted ;

Column JobCode JobTitle Peer_Code Numincum_Sum Salary_Mean Salary_P25 Salary_Median Salary_P75 Bonus_N Bonus_Mean Total_Cash_Mean; 
define Numincum_Sum / format=COMMA8. "Incumbents" '';
define Salary_Mean / format=COMMA9. 'Average Salary' Mean;
define Salary_P25 / format=COMMA9. '25th Percentile Salary' P25;
define Salary_Median / format=COMMA9. 'Median Salary' Median;
define Salary_P75 / format=COMMA9. '75th Percentile Salary' P75;
define Bonus_N / format=COMMA8. 'Number of Bonuses';
define Bonus_Mean / format=COMMA9. 'Average Bonus' Mean;
define Total_Cash_Mean / format=COMMA9. 'Average Total Cash' Mean;



Break AFTER JobCode / ul Ol summarize suppress skip;
compute after;
      line 'ALL' Format=Comma9.;
EndComp; 
Run;

So, I thought about adding code to the Proc Report to calculate the weighted average salary, but that will not help with the Percentiles.  It would be great if I could blank variables as needed in the Proc Tabulate for a number of reason. 

 

Thanks for your help.

 

 

ballardw
Super User

No data so can't test code.

The image of the "before" is very hard to read and when clicked on to view large seems to become 3 separate images. So I still can't quite tell what is offensive.

 

Perhaps you need to investigate using PROC SUMMARY to create a data set.

Proc summary with this

Class JobCode JobTitle Peer_Code ;

would create different summaries of the variables that would include the "all".

Examine the output data set of this code (you should have this set in your SAS install)

proc summary data=sashelp.class;
  class sex age;
  var height weight;
  output out=work.classsummary mean= sum= n= median= p25= p75= /autoname;
run;

The variable _type_ can be used to select or group results. The value for _type_ of 0 is the "ALL" for all records, 1 is the "all" for each Age, 2 is the "all" for each gender. Since you have 3 class variables you would end up with 8 levels of "type" and each one except 7 would be the "all" of all records, one or two level "groups" of data. The /autoname creates a new variable with the Var variable suffixed with the statistic.

 

 

So since there is an N per variable (or the overall _freq_) that indicates the number records used for any given row of data you could test and not display those rows but the remaining values could be left in.

 

I use this approach for summarizing data in one pass at STATE, Region, county, school district, school type and grade. Then select the appropriate _type_ for a Proc Print/Tabulate or Report output.

 

You may or may not want to do something like assigning specific values to the missing values for some types to indicate the "all" nature desired and use a custom format to display desired text. For example I have 7 regions but I use the _type_=0 to assign 0 for the region in that block and have a format to display "State" when 0 appears in the region field. Then Type 0,1 are used for a single table.

Reeza
Super User

Build a custom format that masks the numbers that are less than 5 instead. However, you’ll have a residual disclosure possibility with this situation. 

Patrick
Opal | Level 21

@Reeza

Totally agree with your disclosure point. I don't believe though that a custom format will do as this is about hiding cell values based on other cell values.

Reeza
Super User

@Patrick conditional formatting? One way would be to use PROC TABULATE on the data, use the OUT= option to save the data set and then use PROC REPORT to present the data using conditional formatting or you could pre-process the data and use random rounding to make the numbers consistent and prevent residual disclosure. 

 

Page 13/14 on conditional formatting with PROC REPORT. 

https://support.sas.com/resources/papers/proceedings13/366-2013.pdf

 

Patrick
Opal | Level 21

@jtorourke

I'm not sure that I fully understand what you're trying to do. It would help if you could reduce the complexity of your problem to the core challenge and then post some sample data for this (a SAS data step creating the data) and then show us the desired result.

 

What I don't fully understand: 

- Is this just about not displaying cell values under certain conditions or do you also need to exclude values from actual calculations under certain conditions?

- Why can't you do everything in a single Proc Report?

 

And just as a thought: If this is simply about not displaying cell values under certain conditions then conditional formatting could be the way to go. Just assign to such values a style of invisible.

https://communities.sas.com/t5/SAS-Procedures/Conditionally-formatting-data-using-proc-tabulate/td-p...

https://support.sas.com/resources/papers/proceedings13/366-2013.pdf 

 

If you Google with terms like "Proc Report Conditional Formatting" you'll find quite a few additional code samples.

jtorourke
Fluorite | Level 6

Many thanks to @Patrick@Reeza and @ballardw for all of your suggestions.  I am new to SAS as you must have figured out by now.  I will look into all of your suggestions and get back with you.  Sorry for the confusion.  My core issue is that we have promised our clients that we will not reveal the salary data they submit to our survey if the total number of individual salaries reported for a particular breakout is fewer than 5.  However, I want to include all of the data in the averages and percentiles calculated in the ALL rows.  More later when I have access to my database.

Reeza
Super User

My core issue is that we have promised our clients that we will not reveal the salary data they submit to our survey if the total number of individual salaries reported for a particular breakout is fewer than 5.  However, I want to include all of the data in the averages and percentiles calculated in the ALL rows.  More later when I have access to my database.

 

This may not be possible because unless you suppress some of the information it's a trivial exercise to recalculate the suppressed information. 

Patrick
Opal | Level 21

@jtorourke wrote:

Many thanks to @Patrick@Reeza and @ballardw for all of your suggestions.  I am new to SAS as you must have figured out by now.  I will look into all of your suggestions and get back with you.  Sorry for the confusion.  My core issue is that we have promised our clients that we will not reveal the salary data they submit to our survey if the total number of individual salaries reported for a particular breakout is fewer than 5.  However, I want to include all of the data in the averages and percentiles calculated in the ALL rows.  More later when I have access to my database.


@jtorourke

It appears the links I've posted in my previous post aren't working anymore. Weird! 

I believe what you want to do is very possible with Proc Report. As you've realized already, you can't just modify source values (set them to missing) as this will impact on all your stats also for the Total row. What you can do in Proc Report is apply styles conditional. So you actually calculate all values but when it comes to printing these values you apply a style to the cells where you don't want to show anything (you could also replace the value with a text like "hidden").

 

The following code illustrates the approach.

data have;
  group='A';
  do person=1 to 10;
    salary=1;
    output;
  end;
  group='B';
  do person=11 to 15;
    salary=4;
    output;
  end;
run;

proc report data=WORK.HAVE nowd;
  column group person, N=person_N salary, MEAN=salary_MEAN;
  define group / group 'group' missing;
  define person / analysis N 'person' missing;
  define salary / analysis MEAN 'salary' missing;

  compute group;
    if _break_ eq ' ' then do;
      if group ne ' ' then hold1=group;
    end;
    if upcase(_break_)="GROUP" then do;
      call define("group", 'style', 'style=[pretext="Subtotal"]');
    end;
    if _break_='_RBREAK_' then do;
      call define("group", 'style', 'style=[pretext="Total"]');
    end;
  endcomp;

  compute salary_MEAN;
    if person_N<=5 then
      do;
        call define("salary_MEAN", 'style', 'style=[color=transparent]');
      end;
  endcomp;

  rbreak after / summarize;
  run;
quit;
TITLE; FOOTNOTE;

Result

Capture.JPG

 

The relevant code section is:

  compute salary_MEAN;
    if person_N<=5 then
      do;
        call define("salary_MEAN", 'style', 'style=[color=transparent]');
      end;
  endcomp;

And once you understand how that works the door is open for almost anything.

 

 

....and thinking a bit more: Above would work for printing BUT you'd still have the value in the source (i.e. for HTML output). So a safer way is eventually to really blank out the value either by setting it to missing or by applying a format. 

Code like below should do:

  compute salary_MEAN;
    if person_N<=5 then
      do;
        salary_MEAN=.;
      end;
jtorourke
Fluorite | Level 6

@Patrickthanks for the careful thought you put into my issue and for the code examples.  on your last point, if I really blank out the values, then my "Summarize" Stats will be incorrect.  If there is no way to save the "ALL" row for all records in the JobCode into the out table, then it looks like with will need to use the transparent formatting you suggested and avoid using HTML output.  Am I missing something.

Reeza
Super User

Be careful with that. I'd also check if you render it with PDF and then use one of the automated tools to scrape the data back out and see if it was visible. Adobe PDF Pro and R both have methods to do this.

 

Your safest bet is to set it as missing.

Patrick
Opal | Level 21

@jtorourke

You should go for a coding option which never prints the value in the table (hidden or not) independent of the output destination.

 

"if I really blank out the values, then my "Summarize" Stats will be incorrect"

The compute block sets the calculated variable salary_MEAN to missing (not the source variable salary). I believe this doesn't impact on the correctness of other calculations. At least in my sample the Mean in the total row is correct and uses all values.

 

But I agree: Explicitly setting a variable to missing when it's only about printing is "ugly". Using a format conditionally feels like a better way as this will definitely only change "display" and not "calculation".

Unlike using a style to hide the value using a format will never write the actual value to the output destination - plus using a format would also allow you to easily replace values with something else than a blank (i.e. text like "hidden" or a dash). 

 

Sample 38776: How to apply a format to a cell or multiple cells within a row with PROC REPORT

 

Applying above to my sample code could then look like: 

 

  compute salary_MEAN;
    if person_N<=5 then
      do;
        call define(_col_,'format','blank.');
      end;
  endcomp;

 

And here the full code just to have everything together in a single post.

data have;
  group='A';
  do person=1 to 10;
    salary=1;
    output;
  end;
  group='B';
  do person=11 to 15;
    salary=4;
    output;
  end;
run;

proc format;
  value blank (default=1)
    other=' '
    ;
quit;

proc report data=WORK.HAVE nowd;
  column group person, N=person_N salary, MEAN=salary_MEAN;
  define group / group 'group' missing;
  define person / analysis N 'person' missing;
  define salary / analysis MEAN 'salary' missing;

  compute group;
    if _break_ eq ' ' then do;
      if group ne ' ' then hold1=group;
    end;
    if upcase(_break_)="GROUP" then do;
      call define("group", 'style', 'style=[pretext="Subtotal"]');
    end;
    if _break_='_RBREAK_' then do;
      call define("group", 'style', 'style=[pretext="Total"]');
    end;
  endcomp;

  compute salary_MEAN;
    if person_N<=5 then
      do;
        call define(_col_,'format','blank.');
      end;
  endcomp;

  rbreak after / summarize;
  run;
quit;
TITLE; FOOTNOTE;

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 1997 views
  • 4 likes
  • 4 in conversation