Set values to missing if less than 10 records in any of two classes

Frk_bolsen · Posted 01-05-2024 06:29 AM

Hello,

I am working with a larger version of the dataset shown below:

yield	cellcount	htd	hya
2.2	260	1	11120101
.	110	1	11120101
5	90	1	11120101
2	400	1	11120101
3	560	1	11120101
4.5	90	2	11120101
2.2	60	2	11120101
3.4	.	2	11120101
1.2	.	2	11120101
1.6	280	2	11120101

I want to set yield and cellcount to missing if the htd or hya have less than 10 records. Edits in htd will affect hya, and the other way around.

I use the code below and repeat this a number of times so that I am sure that all htd’s and hya’s have minimum 10 records each in the end. Can you help me with a loop, or something like that, that does this automatically and stops when all htd’s and hya’s have at least 10 records?

Title 'Minimum 10 records per htd - first test';
proc means noprint n data=test maxdec=3;
class htd;
var yield cellcount;
output out=htd_n
n=yieldn cellcountn;
run;

proc sort data=htd_n;	by htd;	run;
proc sort data=test;	by htd;	run;

data test;
merge test (in=a) htd_n;
by htd;
if a=1;
    if yieldn<10 then yield=.;
	if cellcountn<10 then cellcount=.;
	drop _TYPE_ _FREQ_;
run;
proc means data=test ;
	var yield cellcount;
run;

Title 'Minimum 10 records per hya – first test';
proc means noprint n data=test maxdec=3;
class hya;
var yield cellcount;
output out=hya_n
n=yieldn cellcountn;
run;

proc sort data=test;	by hya;	run;
proc sort data=hya_n;	by hya;	run;

data test;
merge test (in=a) hya_n;
by hya;
if a=1;
    if yieldn<10 then yield=.;
	if cellcountn<10 then cellcount=.;
	drop _TYPE_ _FREQ_;
run;
proc means data=test;
	var yield cellcount;
run;

PaigeMiller · Posted 01-05-2024 07:12 AM

Some clarification would be helpful. Your words are:

I want to set yield and cellcount to missing if the htd or hya have less than 10 records. Edits in htd will affect hya, and the other way around.

I use the code below and repeat this a number of times so that I am sure that all htd’s and hya’s have minimum 10 records each in the end. Can you help me with a loop, or something like that, that does this automatically and stops when all htd’s and hya’s have at least 10 records?

The part that really confuses me is "that does this automatically and stops when all htd’s and hya’s have at least 10 records". I am also very unclear about this part: "Edits in htd will affect hya, and the other way around" as it doesn't seem to fit in with the rest of the problem.

Does this mean:

You are working with data that has some htd/hya combinations that have <10 records and some that have >=10 records, and you want to set cellcount and yield to missing if the count for the htd/hya combination is <10; OR
You are deleting records such that what remains is only htd/hya combinations that have >=10 records; OR
You are adding records such that what every htd/hya combinations have >=10 records; OR
something else?

--
Paige Miller

Frk_bolsen · Posted 01-05-2024 07:31 AM

Sorry about the confusion. i will try to clarify.

In the first htd test, I check number of records of yield and cellcount in all classes of htd. If any htd have less than 10 records, I set all yield and/or cellcount values to missing in that group.

Then I do the same for hya, but doing this will affect the number of records in htd, meaning that I have to repeat these steps a number of times to end up with a dataset that has at least 10 records of both htd and hya. The code I put here, is repeated 5 times in my program, but I didnt show that here.

I am thinking that it has to be a better way of doing this, having sas repeating these steps for me and stopping when the number of records are above 10 for yield and cellcount in both htd and hya.

PaigeMiller · Posted 01-05-2024 07:37 AM

In the first htd test, I check number of records of yield and cellcount in all classes of htd. If any htd have less than 10 records, I set all yield and/or cellcount values to missing in that group.

Then I do the same for yha, but doing this will affect the number of records in htd

Not following this either. If there were five records for a htd level, and you perform this activity of setting missing value of yield and cell count to this htd level, there are still the same number of records for each yha level. The number of records doesn't change when you set yield and cell count to missing.

--
Paige Miller

Frk_bolsen · Posted 01-05-2024 08:15 AM

Yes, they are depended on each other. By number of records I mean number of non-missing records. Excluding records of htd will affect some hya's and the other way around. I understand I have to come back with a better explanation, but that has to be at a later time.

PaigeMiller · Posted 01-05-2024 08:30 AM

@Frk_bolsen wrote:

By number of records I mean number of non-missing records.

Records cannot be missing. Perhaps you mean number of records with non-missing value for BOTH yield and cellcount? What about those records where either yield is missing or cellcount is missing, but they are not both missing, how are those counted?

--
Paige Miller

ballardw · Posted 01-05-2024 01:23 PM

Suggestion: Change your rules for demonstration purposes to something smaller than 10, say 3 or 4.

Then provide a working data step to create source such that a couple of your rule values are met and few where not met.

Then show us a working data step of the desired result after your rules are implemented. NOT code to create the result, the actual result. You may have to point out which rule(s) were met or violated for some of the source records.

The recommendation for a smaller count for the rule is to reduce possible complications when trying to match up larger number of observations.

Patrick · Posted 01-05-2024 11:49 PM

Agree with all that has been said in regards of how to provide sample data, desired result etc.

I tried to understand and work with what you've provided. Is below returning what you're after? And if not what needs change and why?

I didn't overwrite variables but populated new ones so it's easier to follow what's happening.

data work.have;
  infile datalines truncover dsd;
  input yield cellcount htd hya;
  datalines;
2.2,260,1,11120101
.,110,1,11120101
5,90,1,11120101
2,400,1,11120101
3,560,1,11120101
4.5,90,2,11120101
2.2,60,2,11120101
3.4,.,2,11120101
1.2,.,2,11120101
1.6,280,2,11120101
;

proc sort data=work.have out=work.inter;
  by hya htd;
run;

proc means data=work.inter noprint n maxdec=3 nway;
  class hya htd;
  var yield cellcount;
  output 
    out=work.hya_htd_n
    n=yieldn cellcountn
    ;
run;

%let hya_htd_threshold=4;
%let hya_threshold    =8;

data  
   work.lev_hya_htd (keep=hya htd yieldn_hya_htd cellcountn_hya_htd)
   work.lev_hya     (keep=hya     yieldn_hya cellcountn_hya)
  ;
  set work.hya_htd_n;
  by hya htd;
  
  if yieldn     >= &hya_htd_threshold then yieldn_hya_htd     =yieldn;
  if cellcountn >= &hya_htd_threshold then cellcountn_hya_htd =cellcountn;
  output work.lev_hya_htd;

  yieldn_hya+yieldn_hya_htd;
  cellcountn_hya+cellcountn_hya_htd;

  if last.hya then
    do;
      if yieldn_hya     < &hya_threshold then call missing(yieldn_hya);
      if cellcountn_hya < &hya_threshold then call missing(cellcountn_hya);
      output work.lev_hya;
      call missing(yieldn_hya, cellcountn_hya);
    end;
run;

data work.want;
  merge work.inter work.lev_hya_htd;
  by hya htd;
  if _n_=1 then
    do;
      if 0 then set work.lev_hya(keep=yieldn_hya cellcountn_hya);
      dcl hash h1(dataset:'work.lev_hya');
      h1.defineKey('hya');
      h1.defineData('yieldn_hya','cellcountn_hya');
      h1.defineDone();
    end;
  if h1.find() ne 0 then call missing(yieldn_hya,cellcountn_hya);
run;

proc print data=want;
run;

proc print data=lev_hya;
run;
proc print data=lev_hya_htd;
run;

FreelanceReinh · Posted 01-09-2024 11:02 AM

Hello @Frk_bolsen,

Your question reminded me of a thread that I contributed a solution to in 2016. So let me suggest a somewhat similar approach using a SAS macro which iteratively alternates between two "elimination" tasks until a stopping condition is met.

/* Create sample data for demonstration */

data have;
call streaminit(27182818);
do _n_=1 to 1000;
  hya=11120100+rand('integer',60);
  htd=rand('integer',60);
  yield=ifn(rand('bern',0.83),round(rand('uniform',9),0.1),.);
  cellcount=ifn(rand('bern',0.85),round(rand('integer',1000),10),.);
  output;
end;
run;

proc sort data=have;
by hya htd;
run;

/* Add a temporary unique key _SEQNO */

data _tmp / view=_tmp;
set have;
by hya htd;
_seqno=_n_;
run;

/* Iteratively eliminate HYA and HTD groups with fewer than 10 non-missing &VAR values
   by setting their &VAR values to missing */

%macro elim(data=, clvar1=, clvar2=, var=, need=10, out=);
%local i j empty min1 min2;

%* Initial count of non-missing &VAR values;

proc summary data=&data;
class &clvar1 &clvar2;
var &var;
types &clvar1 &clvar2 &clvar1*&clvar2;
output out=_cnt(drop=_freq_) n=;
run;

%* Check if any groups need to be eliminated;

proc sql noprint;
select min(&var) into :min1-
from _cnt
where _type_<3
group by _type_
order by _type_ desc;  /* _type_=1 corresponds to &clvar2, */
quit;                  /* _type_=2 corresponds to &clvar1. */

%if &min1<&need | &min2<&need %then %do;

  %if &min1<&min2 %then %let i=1;
  %else %let i=2;

  %do %until(&&min&i>=&need | &empty);
    %let j=%eval(3-&i);

    %* Elimination step;

    proc sql;
    create table _cnt1 as
    select a.* from _cnt(where=(_type_=3)) a
               join _cnt(where=(_type_=&j & &var>=&need)) b
    on a.&&clvar&i=b.&&clvar&i;
    quit;

    %let empty=%eval(&sqlobs=0);

    %* Count non-missing &VAR values after elimination;

    proc summary data=_cnt1;
    class &clvar1 &clvar2;
    var &var;
    freq &var;
    types &&clvar&j &clvar1*&clvar2;
    output out=_cnt(drop=_freq_) n=;
    run;

    %* Check if more groups need to be eliminated;

    proc sql noprint;
    select min(&var) into :min&i trimmed
    from _cnt
    where _type_<3;
    quit;

    %let i=&j;
  %end;

  %* Create output dataset with &VAR values set to missing in eliminated groups;

  proc sql;
  create table &out as
  select a.&clvar1, a.&clvar2, a._seqno,
         case when b.&clvar1=. then . else &var end as &var
  from _tmp a left join _cnt1(drop=&var) b
  on a.&clvar1=b.&clvar1 & a.&clvar2=b.&clvar2
  order by _seqno;
  quit;

  %if &empty %then %put WAR%str(NING): All %upcase(&var) values have been set to missing!;

%end;

%else %do;
  %put All %upcase(&clvar1) and %upcase(&clvar2) groups in dataset %upcase(&data) have >=%upcase(&need) non-missing %upcase(&var) values.;
  data &out;
  set _tmp(keep=&clvar1 &clvar2 &var _seqno);
  run;
%end;
%mend elim;

%elim(data=have, clvar1=hya, clvar2=htd, var=yield,     out=out_y)
%elim(data=have, clvar1=hya, clvar2=htd, var=cellcount, out=out_c)

/* Combine results for YIELD and CELLCOUNT */

data want(drop=_seqno);
merge out_y out_c(drop=hya htd);
by _seqno;
run;

/* Optional: Verify that all HYA and HTD groups with non-missing YIELD values have >=10 of those */

proc summary data=want;
where yield>.z;
class hya htd;
ways 1;
var yield;
output out=stats_y(drop=_freq_) n=n_yield;
run;

proc print data=stats_y;
run;

/* Optional: Verify that all HYA and HTD groups with non-missing CELLCOUNT values have >=10 of those */

proc summary data=want;
where cellcount>.z;
class hya htd;
ways 1;
var cellcount;
output out=stats_c(drop=_freq_) n=n_cellcount;
run;

proc print data=stats_c;
run;

/* Optional: Select HYA and HTD groups with both >=10 non-missing YIELD values and >=10 non-missing CELLCOUNT values */

proc sql;
create table sel as
select * from want
where hya in (select hya from stats_y where _type_=2
                intersect
              select hya from stats_c where _type_=2)
  and htd in (select htd from stats_y where _type_=1
                intersect
              select htd from stats_c where _type_=1);
quit;

Set values to missing if less than 10 records in any of two classes

Re: Set values to missing if less than 10 records in any of two classes

Re: Set values to missing if less than 10 records in any of two classes

Re: Set values to missing if less than 10 records in any of two classes

Re: Set values to missing if less than 10 records in any of two classes

Re: Set values to missing if less than 10 records in any of two classes

Re: Set values to missing if less than 10 records in any of two classes

Re: Set values to missing if less than 10 records in any of two classes

Re: Set values to missing if less than 10 records in any of two classes

SAS Innovate 2025: Call for Content

Classroom Training Available!