Solved: Mask data based on condition

kevsma · Posted 09-30-2023 01:00 AM

Hello folks, I need to mask values of three data columns using the following condition:

if the value is less than 11 and not equal to 0, mask with one asterisk (primary suppression)
if only one value was masked, mask the second smallest value (complementary suppression)

the table (named: tbl25) i am trying to mask values looks like below, data points are made-up:

I figured out a very clumsy way of doing it and below is my code, as you can see there're three issues with it:

i have to mask the three columns one by one: nopos, yespos, and totcnt
i also noticed that my "eth" column was missing in the following code
after i did the 3 columns separately, i had to merge them back again to come up with something like tbl25

Given this is just one of the many tables i need to mask values, can anyone provide a concise method to achieve the same goal? many many thanks!!!

data want1;
flg=0;
array s(8);
do i = 1 to 8;
    set tbl25;
    if nopos LE 11 AND NOPOS NE 0 then flg=1;
    	else s(i) = NOPOS;	/*<11 became missing*/
    end;
	if nmiss(flg) ne 1 then do;
    	second = min(of s(*));
		end;
    do i = 1 to 8;
        if s(i) = second and second ne 0 then do;
		s(i)=-100; /*complementary became -100*/
       end;
	   end;
	 
	do i = 1 to 8;
    	nopos = s(i);
		output;
    end;
keep nopos;
run;

**;
data want2;
array s(8);
do i = 1 to 8;
    set tbl25;
    if yespos LE 11 AND yesPOS NE 0 then flg=1;
    	else s(i) = yesPOS;	/*<11 became missing*/
    end;
	if nmiss(flg) ne 1 then do;
    	second = min(of s(*));
		end;
    do i = 1 to 8;
        if s(i) = second and second ne 0 then do;
		s(i)=-100; /*complementary became -100*/
       end;
	   end;
	 
do i = 1 to 8;
    yespos = s(i);
	output;
    end;

keep yespos;
run;

**;
data want3;
array s(8);
do i = 1 to 8;
    set tbl25;
    if totcnt LE 11 AND totcnt NE 0 then flg=1;
    	else s(i) = totcnt;	/*<11 became missing*/
    end;
	if nmiss(flg) ne 1 then do;
    	second = min(of s(*));
		end;
    do i = 1 to 8;
        if s(i) = second and second ne 0 then do;
		s(i)=-100; /*complementary became -100*/
       end;
	   end;
	 
do i = 1 to 8;
    totcnt = s(i);
	output;
    end;

keep totcnt;
run;

data want1; set want1;
	id=_N_; run;
data want2; set want2;
	id=_N_; run;
data want3; set want3;
	id=_N_; run;

proc sort data=want1; by id; run;
proc sort data=want2; by id; run;
proc sort data=want3; by id; run;

data all;
	merge want1(in=a) want2(in=b) want3(in=c);
	by id;
	if a or b or c; run;

mkeintz · Posted 09-30-2023 10:18 PM

You effectively have to read your dataset twice, because if you have only 1 value in a column that is not zero (I take that to mean greater than zero) and less than eleven, you need to know what the second smallest value is, in order to indicate it as complimentary suppression.

You can do that via a single data step, as below (I'm using .P for primary suppression and .C for complimentary).

data have ;
  input eth $3. nopos yespos totcnt ;
  cards ;
gp1 111   100  200
gp2 119   100   60
gp3  15     7  800
gp4 100   584  693
gp5   3     5    8
gp6 999   300  400
gp7 593  2400 3000
gp8 5081 9000 4000
run;
data want (drop=i);
  set have (in=firstpass) have (in=secondpass);
  array vars {*} nopos yespos totcnt ;
  array count_primaries{3}        _temporary_;
  array smallest_complimentary{3} _temporary_;

  if firstpass then do i=1 to dim(vars);
    if 0<vars{i}<11 then count_primaries{i}+1;
    else smallest_complimentary{i}=min(smallest_complimentary{i},vars{i});
  end;
  if secondpass;
  do i=1 to dim(vars); 
    if 0<vars{i}<11 then vars{i}=.P;
    if count_primaries{i}=1 and vars{i}=smallest_complimentary{i} then vars{i}=.C;
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

ballardw · Posted 09-30-2023 01:55 AM

I don't understand what this means: "if only one value was masked, mask the second smallest value (complementary suppression)".

Walk us through some example data indicating which is second smallest (in each variable across all three) describe what role that ETH variable may have.

Show what the output for this is supposed to be.

Arrays only use values from one observation so that is pretty much a complete bust if you were expecting that loop to use different observations.

kevsma · Posted 09-30-2023 02:25 AM

Hello @ballardw

sorry for not explaining it well, so take my following table as an example: when i mask the column "nopos", value 3 will be masked as a primary suppression because it's smaller than 11, while all the rest of the values are > =11, I need to mask 15 as a complementary suppression because gp8 is my column total, and one can easily derive 3 (although masked) using gp8 (total) minus the rest of the values if 15 is not suppressed.

for the column "yespos", things will be easier because i have two values < 11 (7,5) so all I need to do is the primary suppression.

eth is my ethnicity class variable and each gp represents a race/ethnicity whereas gp8 represents the total row.

Hope this makes sense.

Quentin · Posted 09-30-2023 09:36 AM

It helps if you post sample data as a DATA step with CARDS data, instead of a picture. Like:

data have ;
  input eth $3. nopos yespos totcnt ;
  cards ;
gp1 111   100  200
gp2 119   100   60
gp3  15     7  800
gp4 100   584  693
gp5   3     5    8
gp6 999   300  400
gp7 593  2400 3000
gp8 5081 9000 4000
;

For the first step, you can use an array to recode all the numeric values that are between 0 and 10 (your text says less than 11, but your code says less than or equal to 11):

data m1 ;
  set have ;
  array nums {*} _numeric_ ;
  do i=1 to dim(nums) ;
    if 0<=nums{i}<11 then nums{i}=.M ;
  end ;
  drop i ;
run ;

For the second step, I use PROC SQL to test whether there is one missing value in a column and whether a value is equal to the minimum of the column:

proc sql ;
  create table m2 as
  select eth
        ,ifn( nopos=min( nopos) and nmiss( nopos)=1,.S, nopos) as  nopos
        ,ifn(yespos=min(yespos) and nmiss(yespos)=1,.S,yespos) as yespos
        ,ifn(totcnt=min(totcnt) and nmiss(totcnt)=1,.S,totcnt) as totcnt
  from m1
  ;
quit ;

Since you have a lot of datasets to mask, you could use a macro to generate the list of IFN() expressions for the select statement.

Result is (I think) what you want for this data. Assuming you want to mask TOTCNT as well:

 Obs    eth    nopos    yespos    totcnt

  1     gp1      111      100       200
  2     gp2      119      100         S
  3     gp3        S        M       800
  4     gp4      100      584       693
  5     gp5        M        M         M
  6     gp6      999      300       400
  7     gp7      593     2400      3000
  8     gp8     5081     9000      4000

One potential issue with this approach is that in the second step, if there are two records that are tied for having the minimum value, they will both be masked. Which may not be what you want. If in that circumstance you would only want to mask one value, I think I would use a DATA step approach for the second step.

I used special missing values .M and .S for masking just to differentiate results from each step. When you print the results, you could format these values as asterisks or whatever you want.

The Boston Area SAS Users Group is hosting free webinars!

Register now at https://www.basug.org/events.

mkeintz · Posted 09-30-2023 06:40 PM

And what do you intend to do if you have a tie for second smallest? I.e. assume nopos has a single 3, and two 15's.

Oh yes, and what if a variable has NO values less than 11 but not equal to zero?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 09-30-2023 10:18 PM

You effectively have to read your dataset twice, because if you have only 1 value in a column that is not zero (I take that to mean greater than zero) and less than eleven, you need to know what the second smallest value is, in order to indicate it as complimentary suppression.

You can do that via a single data step, as below (I'm using .P for primary suppression and .C for complimentary).

data have ;
  input eth $3. nopos yespos totcnt ;
  cards ;
gp1 111   100  200
gp2 119   100   60
gp3  15     7  800
gp4 100   584  693
gp5   3     5    8
gp6 999   300  400
gp7 593  2400 3000
gp8 5081 9000 4000
run;
data want (drop=i);
  set have (in=firstpass) have (in=secondpass);
  array vars {*} nopos yespos totcnt ;
  array count_primaries{3}        _temporary_;
  array smallest_complimentary{3} _temporary_;

  if firstpass then do i=1 to dim(vars);
    if 0<vars{i}<11 then count_primaries{i}+1;
    else smallest_complimentary{i}=min(smallest_complimentary{i},vars{i});
  end;
  if secondpass;
  do i=1 to dim(vars); 
    if 0<vars{i}<11 then vars{i}=.P;
    if count_primaries{i}=1 and vars{i}=smallest_complimentary{i} then vars{i}=.C;
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Quentin · Posted 10-01-2023 10:52 AM

Thanks for sharing a great DATA step solution @mkeintz .

I was thinking along the same lines but with a double DOW loop. I don't think I've seen this approach of using SET statement concatenation with a subsetting IF to achieve a similar effect. I'm sure you've probably showed it before, but either I didn't notice what you were doing or I managed to forget it.

One thing I like about the DATA step approach is it's easy to extend to handle ties differently. So if in the case where there is a tie you only want to mask one value, you can use an additional array to keep track of whether or not you have masked a value yet, e.g.:


*I added duplicate records for gp3 with nopos=15, but only one value will be masked;
data have ;
  input eth $3. nopos yespos totcnt ;
  cards ;
gp1 111   100  200
gp2 119   100   60
gp3  15     7  800
gp3  15     7  800
gp4 100   584  693
gp5   3     5    8
gp6 999   300  400
gp7 593  2400 3000
gp8 5081 9000 4000
run;

data want (drop=i);
  set have (in=firstpass) have (in=secondpass);
  array vars {*} nopos yespos totcnt ;
  array count_primaries{3}        _temporary_;
  array smallest_complimentary{3} _temporary_;
  array masked_complimentary{3} _temporary_;

  if firstpass then do i=1 to dim(vars);
    if 0<vars{i}<11 then count_primaries{i}+1;
    else smallest_complimentary{i}=min(smallest_complimentary{i},vars{i});
  end;

  if secondpass;

  do i=1 to dim(vars); 
    if 0<vars{i}<11 then vars{i}=.P;
    if count_primaries{i}=1 and vars{i}=smallest_complimentary{i} and not (masked_complimentary{i}) then do ;
      vars{i}=.C;
      masked_complimentary{i}=1 ;
    end ;
  end;
run;

proc print data=want ;
run ;

The Boston Area SAS Users Group is hosting free webinars!

Register now at https://www.basug.org/events.

kevsma · Posted 10-03-2023 11:57 AM

@mkeintz thanks so very much!! this works exactly as what I'd want, thank you sooo much!!

Mask data based on condition

Re: Mask data based on condition

Re: Mask data based on condition

Re: Mask data based on condition

Re: Mask data based on condition

Re: Mask data based on condition

Re: Mask data based on condition

Re: Mask data based on condition

Re: Mask data based on condition

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away