Obsidian | Level 7

## Anonymize data in "proc tabulate"

Hello everyone!

I have the following sample data:

DATA offenders;
INPUT group :\$1. age :3. residenceStatus :\$1. sex :\$1.;
INFILE DATALINES DSD;
DATALINES;
A,10,A,m
A,14,A,m
A,11,B,m
B,21,A,f
B,20,B,m
B,,B,m
B,34,C,m
A,60,C,f
A,32,C,f
A,23,B,f
A,11,B,m
A,9,B,m
A,9,C,m
C,55,A,m
C,56,A,m
C,70,B,w
C,31,B,w
C,45,B,w
C,66,A,w
C,80,A,w
C,,C,w
C,53,A,w
C,38,C,m
D,10,C,m
;
RUN;

PROC FORMAT;
VALUE agegrp 0-10 = "0-10"
11-20 = "11-20"
21-30 = "21-30"
31-40 = "31-40"
41-high = "41+"
. = "unknown";
RUN;

And the following cross table:

PROC TABULATE;
class group sex;
class residenceStatus;
TABLE group, (ALL="Total residence status" residenceStatus="") * (ALL="Total age" age="") * (ALL="Total sex" sex="") /misstext="0";
FORMAT age agegrp.;
RUN;

As to privacy concerns I have to anonymize all the data if the total number is below a certain threshold. Let's suppose that if this threshold is 5 all the values on that line should be masked by "x":

9  3  6  0  0  0  0  3  3 ...
4  x  x  x  x  x  x  x  x ...  (4 is smaller than 5)
10  0  3  7  1  0  1  0  0 ...
1  x  x  x  x  x  x  x  x  ... (1 is smaller than 5)

Is it possible to do this in SAS at all? My first idea was to export it as an excel-sheet and then use VBA-Script instead. Thanks for your ideas!

5 REPLIES 5
Super User

## Re: Anonymize data in "proc tabulate"

You are going to have two basic choices:

1) summarize the data and do some severe data modification such as comparing all of the values needed for a row and setting key value to all of them when the rule is met prior to display.

Which may be awkward getting all the summaries to compare as need since you showing several nesting levels in you columns. Proc Summary with all of your class variables as class in Proc summary should work but you'll need to do some post processing to get the "totals" as separate values, based on _type_ out of proc summary.

This approach may yield a set more appropriate for Proc Print than tabulate.

2) Proc Report will let you examine the statistics generated and modify other "stuff" but may not be trivial and past my experience level at this time. Maybe calling @Cynthia_sas can provide a suggestion

Opal | Level 21

## Re: Anonymize data in "proc tabulate"

Could be done like this:

proc summary data=offenders(keep=group sex age residenceStatus) nway completetypes;
format age ageGrp.;
class group sex residenceStatus;
output out=offTableRaw;
run;

proc sql;
create table offTable as
select
group,
age,
sex,
residenceStatus,
case when sum(_freq_) < 5 then .x else _freq_ end as x
from
offTableRaw
group by group;
quit;

PROC TABULATE data=offTable format=3.0;
class group sex age;
class residenceStatus;
var x;
TABLE group,
(ALL="Total residence status" residenceStatus="") *
(ALL="Total age" age="") *
(ALL="Total sex" sex="") *
x = "" * sum="" /misstext="x";
RUN;
PG
Tourmaline | Level 20

## Re: Anonymize data in "proc tabulate"

This does what you want, but it's a bit convoluted.

Proc report might be better.

I set all anonymised data to negative numbers and format that to x.

The total is formatted to its value.

If there is only one record, a second one is created to ensure the total has a unique value in the row.

Ensure there is no sparse data to get rid of the zeros.

proc sql ;
create table SMALL as
select unique GROUP, count(*) as N
from OFFENDERS group by GROUP having count(*) < 5;
quit;

proc sort data=OFFENDERS;
by GROUP;
run;

data REPORT;
length STR \$800;
retain STR;
merge OFFENDERS SMALL(in=SMALL);
by GROUP;
if SMALL then do;
GRPNO+1;
LINENO+1;
NB=-1;
if last.GROUP then do;
NB=-1000*GRPNO;
STR=catx(' ',STR,NB-LINENO+1-first.GROUP,'="',N,'"');
LINENO=0;
end;
end;
else NB=1;
output;
if first.GROUP and last.GROUP then do;
if SEX='m' then SEX='f'; else SEX='m';
NB=-1; output;
end;
call symput('fmt', STR);
run;
%put &=fmt;

proc format;
value grp  0-10  = "0-10"
11-20  = "11-20"
21-30  = "21-30"
31-40  = "31-40"
41-high= "41+"
. = "unknown";
value anon .='0'  &fmt. 0-high=[5.] other='x';
run;

proc tabulate data=REPORT missing ;
class GROUP SEX AGE RESIDENCESTATUS;
var NB;
table GROUP, (ALL="Total residence status" RESIDENCESTATUS="")
* (ALL="Total age" AGE="")
* (ALL="Total sex" SEX="")
* NB=' '*sum=' '*f=anon. ;
format AGE agegrp. ;
run;

Barite | Level 11

## Re: Anonymize data in "proc tabulate"

A PROC REPORT offers much finer control over conditional data value rendering.

A stacked column specification (level1, level2, level3)of /ACROSS columns is equivalent to crossed classes level1 * level2 * level3 in a TABULATE TABLE statement.  REPORT does not have an in-built ALL concept like TABULATE, so additional data processing is needed to 'synthetically' create rows for all combinations of the ALL states to be grouped.  The AGEGRP format also needs to be modified to handle a coded value representing an ALL grouping (I chose negative one (-1))

Example:

PROC FORMAT;
VALUE agegrp (notsorted)
-1 = " Total age "
.  = "unknown"
0-10 = "0-10"
11-20 = "11-20"
21-30 = "21-30"
31-40 = "31-40"
41-high = "41+"
;
RUN;

Data processing to add observations for ALL 'grouping' in a PROC REPORT across variable.

data have_all;
length status sex \$50;
set have;

h2 = age;  drop h2;
h3 = sex;  drop h3;

output;

* outputs to force the tiers of ALL groupings;
*----;                         * ALL tierage;

sex    = ' Total sex';
output;                        * . . A ;

*----;

age = -1;
sex = h3;
output;                        * . A . ;

sex = ' Total sex';
output;                        * . A A ;

*----;

status = ' Total residence status';
age = h2;
sex = h3;
output;                        * A . . ;

sex = ' Total sex';
output;                        * A . A ;

age = -1;
sex = h3;
output;                        * A A . ;

sex = ' Total sex';
output;                        * A A A ;
run;

PROC Report with COMPUTE block that applies masking format XFORMAT on a conditional basis

proc format;
value XFORMAT low-high, other = 'X';

proc report data=have_all missing;
columns
group
(status,age,sex)
sentinel
;

define group / group;
define status / ' ' across NOZERO;
define age    / ' ' across NOZERO format=agegrp. preloadfmt order=data;
define sex    / ' ' across NOZERO;
define sentinel / computed noprint;

* _c1_ is group;
* _c2_ is all*all*all total;
* NOTE: _c#_ references will include NOZERO columns that do not render;

compute before;
* count the number of columns created by stacking the across columns;
length colref var \$32;

do index = 3 by 1 until (missing(var));
colref = cats('_c',index,'_');
var = vnamex(colref);
end;
topcol = index - 1;
endcomp;

compute sentinel;
if missing (topcol) then return;
do index = 3 to topcol;
colref = cats('_c',index,'_');

* conditionally apply the masking format;
if _c2_ < 5 then call define(trim(colref), 'FORMAT', 'XFORMAT.');
end;

endcomp;
run;

Compare Tabulate to Report output

Tourmaline | Level 20

## Re: Anonymize data in "proc tabulate"

Something a bit simpler: Find the rows to hide and apply the hiding format in a compute block.

proc sql noprint;
select quote(put(NAME,\$1.)) into :hide separated by ',' from SASHELP.CLASS group by put(NAME,\$1.) having count(*)=1;
quit;
%put &=hide;

proc format ;
value anon other='x' ;
run;

%macro hide;
%local i;
%do i=2 %to 13;
if NAME in:(&hide) then call define(&i.+1,'format', 'anon.') ;
%end;
%mend;

options missing='0';
proc report data=sashelp.class;
columns NAME  (N (SEX, (AGE ,N))) DUMMY;
define NAME / group format=\$1.;
define SEX / across;
define AGE / across;
define DUMMY / noprint;
compute DUMMY;
%hide
endcomp;
run;
options missing='.';

Discussion stats
• 5 replies
• 867 views
• 1 like
• 5 in conversation