Solved: identify records which are coming maximum twice (only if coming in 1st...

sahoositaram555 · Posted 08-15-2023 04:56 PM

data have;
        input col1 col2;
        datalines;
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
    1001 7
    1001 8
    1002 4.25
    1002 4.25
    1002 2.5
    1002 2.5
    1002 2.5
    1002 2.5
    1002 1.5
    1003 6.5
    1003 6.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 1.5
    1004 6.5
    ;
    run;

I have this above dataset, in which I need to identify and flag them if the first.COL2 value matches with the below row only. eg: for 1001, in col2 1.5 is repeated more than 2 times then we shallnot flag any observation of 1001. but for 1002, 4.25 is repeated only once( present in first and below row) , in this case both the rows should be flagged with "Y". same for 1003. for 1004 no flag will be populated as 8.7 is repeated more than 2 times.

pls guide how to go about it.

Ksharp · Posted 08-16-2023 08:02 AM

data have;
        input col1 col2;
        datalines;
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
    1001 7
    1001 8
    1002 4.25
    1002 4.25
    1002 2.5
    1002 2.5
    1002 2.5
    1002 2.5
    1002 1.5
    1003 6.5
    1003 6.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 1.5
    1004 6.5
    ;
    run;
data want;
do n=1 by 1 until(last.col2);
 set have;
 by col1 col2 notsorted;
 if first.col1 then first=1;
end;
do until(last.col2);
 set have;
 by col1 col2 notsorted;
 if first and n=2 then col3='Y';
 output;
end;
drop n first;
run;

View solution in original post

Astounding · Posted 08-15-2023 05:00 PM

Must the flag appear on the first row, or can it appear on the second row?

sahoositaram555 · Posted 08-15-2023 05:04 PM

Hi @Astounding,
the flag must appear on the first and second row(not any other rows), if it satisfies the condition that the repeat should be ONLY ON 1ST and 2ND row for COL2. if repeat value flows to 3rd rows and afterwards, ignore that col1 category .

should be as below

data have;
        input col1 col2 col3;
        datalines;
    1001 1.5 ""
    1001 1.5 ""
    1001 1.5 ""
    1001 1.5 ""
    1001 1.5 ""
    1001 7 ""
    1001 8 ""
    1002 4.25 "Y"
    1002 4.25 "Y"
    1002 2.5 ""
    1002 2.5 ""
    1002 2.5 ""
    1002 2.5 ""
    1002 1.5 ""
    1003 6.5 "Y"
    1003 6.5 "Y" 
    1003 1.5 ""
    1003 1.5 ""
    1003 1.5 ""
    1003 1.5 ""
    1003 1.5 ""
    1003 8.7 ""
    1004 8.7 ""
    1004 8.7 ""
    1004 8.7 ""
    1004 8.7 ""
    1004 8.7 ""
    1004 1.5 ""
    1004 6.5 ""
    ;
    run;

ballardw · Posted 08-15-2023 05:28 PM

Suggestion: For ANY flag that takes only two values True/False Yes/No or similar strongly suggest setting the value of True or Yes as a numeric 1 and False or No as numeric 0.

Character values like 'Y' require lots of extra coding to do things like get counts, percentages and answer questions. If you have 1/0 coded values you can use functions or statistics in report or analysis procedures like:

N => total number of answers

Sum => total number of Yes

Mean => percentage of Yes as decimal (0.10 is 10 percent yes)

Max => if 1 means that at least one of a group of value is true

Range => if 1 means that at least one Yes and one No. If 0 then all Yes or all No.

N - Sum => number No

There are some others useful as well.

Additionally SAS will assign 1/0 as the result of any logical comparison : Flag = (y>x); creates numeric 1/0 coded variable when the value of y is > x. So no "if /then/else" for such.

Astounding · Posted 08-16-2023 03:46 PM

OK, just being lazy here. You have seen some very good programmers post complex solutions. And the solutions may or may not do exactly what you are looking for. Here is a relatively simple program that does half the job. It flags the second observation only, when the conditions you specify are met. I leave it up to you to take that output and expand it to flag both the first and second observations. (After all, you should be doing some of the programming work!)

data want;
   set have;
   by col1 col2 notsorted;
   if first.col1 then n=1;
   else n+1;
   if n=2 and first.col2=0 and last.col2=1 then flag="Y";
   drop n;
run;

mkeintz · Posted 08-15-2023 11:13 PM

You can use a self-merge (with the "firstobs=" option) to look ahead at the next two COL1 values (_next_col1 and _after_next_col1) and next two COL2 values.

data have;
        input col1 col2;
        datalines;
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
    1001 7
    1001 8
    1002 4.25
    1002 4.25
    1002 2.5
    1002 2.5
    1002 2.5
    1002 2.5
    1002 1.5
    1003 6.5
    1003 6.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 1.5
    1004 6.5
run;


data rule1 (drop=_:);
  set have;
  by col1;
  merge have 
       have (firstobs=2 rename=(col1=_next_col1 col2=_next_col2)
              keep=col1 col2)
        have (firstobs=3 rename=(col1=_after_next_col1 col2=_after_next_col2)
              keep=col1 col2) ;
  retain col3 ' ';
  if first.col1 then do;
    if col1=_next_col1 and col2=_next_col2 and
      (col1^=_after_next_col1 or col2^=_after_next_col2)
    then col3='Y';
    else col3=' ';
  end;
  if lag2(first.col1)=1 and lag2(col1)=col1 then col3=' ';
run;

If you set COL3='Y', then when the third obs for a given COL1 is encountered, then COL3 becomes a blank. This is tested for via the "if lag2(first.col1)=1" test.

Note: I added the "lag2(col1)=col1" condition to defend against instances with exactly two obs for a given col1.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 08-17-2023 01:38 PM

Here's another slightly simpler approach. It uses the "point=" option of the SET statement. The code maintains a counter _N_Y, tracking the number of 'Y' values queued up for output. With each observation, it is decremented by 1. _N_Y never exceeds 2, and takes that value only under the conditions specified by @sahoositaram555

data want (drop=_:);
  set have;
  by col1 col2 notsorted;
  length col3 $1;
  if first.col1=1 and last.col2=0 then do;      /** At least one repeat **/
    _ptr=_n_+2;
    set have (keep=col1 col2 rename=(col2=_col2 col1=_col1)) point=_ptr;
    if col1^=_col1 or col2^=_col2 then _n_y=2;  /** But not two repeats **/
  end;
  col3=ifc(_n_y>0,'Y',' ');
  _n_y + (-1);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 08-16-2023 08:02 AM

data have;
        input col1 col2;
        datalines;
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
    1001 7
    1001 8
    1002 4.25
    1002 4.25
    1002 2.5
    1002 2.5
    1002 2.5
    1002 2.5
    1002 1.5
    1003 6.5
    1003 6.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 1.5
    1004 6.5
    ;
    run;
data want;
do n=1 by 1 until(last.col2);
 set have;
 by col1 col2 notsorted;
 if first.col1 then first=1;
end;
do until(last.col2);
 set have;
 by col1 col2 notsorted;
 if first and n=2 then col3='Y';
 output;
end;
drop n first;
run;

Tom · Posted 08-16-2023 09:01 AM

Sounds like you just need to COUNT how many repetitions there are and then test is that is 2 or not.

data have;
    input col1 col2 col3 $;
datalines;
1001 1.5 .
1001 1.5 .
1001 1.5 .
1001 1.5 .
1001 1.5 .
1001 7 .
1001 8 .
1002 4.25 Y
1002 4.25 Y
1002 2.5 .
1002 2.5 .
1002 2.5 .
1002 2.5 .
1002 1.5 .
1003 6.5 Y
1003 6.5 Y 
1003 1.5 .
1003 1.5 .
1003 1.5 .
1003 1.5 .
1003 1.5 .
1003 8.7 .
1004 8.7 .
1004 8.7 .
1004 8.7 .
1004 8.7 .
1004 8.7 .
1004 1.5 .
1004 6.5 .
;

data want;
  do nreps=1 by 1 until(last.col2);
    set have;
    by col1  col2 notsorted;
  end;
  flag = (nreps=2);
  do until(last.col2);
    set have;
    by col1  col2 notsorted;
    output;
  end;
run;

Obs    nreps    col1    col2    col3    flag

  1      5      1001    1.50              0
  2      5      1001    1.50              0
  3      5      1001    1.50              0
  4      5      1001    1.50              0
  5      5      1001    1.50              0
  6      1      1001    7.00              0
  7      1      1001    8.00              0
  8      2      1002    4.25     Y        1
  9      2      1002    4.25     Y        1
 10      4      1002    2.50              0
 11      4      1002    2.50              0
 12      4      1002    2.50              0
 13      4      1002    2.50              0
 14      1      1002    1.50              0
 15      2      1003    6.50     Y        1
 16      2      1003    6.50     Y        1
 17      5      1003    1.50              0
 18      5      1003    1.50              0
 19      5      1003    1.50              0
 20      5      1003    1.50              0
 21      5      1003    1.50              0
 22      1      1003    8.70              0
 23      5      1004    8.70              0
 24      5      1004    8.70              0
 25      5      1004    8.70              0
 26      5      1004    8.70              0
 27      5      1004    8.70              0
 28      1      1004    1.50              0
 29      1      1004    6.50              0

Astounding · Posted 08-16-2023 10:10 AM

@Tom ,

In your example data, all instances of nreps=2 occur at the beginning of a col1 set of values. Would it still work (by failing to flag) if a set of nreps=2 occur in the middle of a set of col1 values?

Tom · Posted 08-16-2023 10:37 AM

It is counting the number of observations in the groups defined by the values of COL2 nested under COL1.

So yes. Any group with exactly two observations will be flagged. Whether or not if it is the first group within that value of COL1.

identify records which are coming maximum twice (only if coming in 1st 2 rows)

Re: identify records which are coming maximum twice (only if coming in 1st 2 rows)

Re: identify records which are coming maximum twice (only if coming in 1st 2 rows)

Re: identify records which are coming maximum twice (only if coming in 1st 2 rows)

Re: identify records which are coming maximum twice (only if coming in 1st 2 rows)

Re: identify records which are coming maximum twice (only if coming in 1st 2 rows)

Re: identify records which are coming maximum twice (only if coming in 1st 2 rows)

Re: identify records which are coming maximum twice (only if coming in 1st 2 rows)

Re: identify records which are coming maximum twice (only if coming in 1st 2 rows)

Re: identify records which are coming maximum twice (only if coming in 1st 2 rows)

Re: identify records which are coming maximum twice (only if coming in 1st 2 rows)

Re: identify records which are coming maximum twice (only if coming in 1st 2 rows)

Registration is open

SAS Training: Just a Click Away