data have;
input col1 col2;
datalines;
1001 1.5
1001 1.5
1001 1.5
1001 1.5
1001 1.5
1001 7
1001 8
1002 4.25
1002 4.25
1002 2.5
1002 2.5
1002 2.5
1002 2.5
1002 1.5
1003 6.5
1003 6.5
1003 1.5
1003 1.5
1003 1.5
1003 1.5
1003 1.5
1003 8.7
1004 8.7
1004 8.7
1004 8.7
1004 8.7
1004 8.7
1004 1.5
1004 6.5
;
run;
I have this above dataset, in which I need to identify and flag them if the first.COL2 value matches with the below row only. eg: for 1001, in col2 1.5 is repeated more than 2 times then we shallnot flag any observation of 1001. but for 1002, 4.25 is repeated only once( present in first and below row) , in this case both the rows should be flagged with "Y". same for 1003. for 1004 no flag will be populated as 8.7 is repeated more than 2 times.
pls guide how to go about it.
data have;
input col1 col2;
datalines;
1001 1.5
1001 1.5
1001 1.5
1001 1.5
1001 1.5
1001 7
1001 8
1002 4.25
1002 4.25
1002 2.5
1002 2.5
1002 2.5
1002 2.5
1002 1.5
1003 6.5
1003 6.5
1003 1.5
1003 1.5
1003 1.5
1003 1.5
1003 1.5
1003 8.7
1004 8.7
1004 8.7
1004 8.7
1004 8.7
1004 8.7
1004 1.5
1004 6.5
;
run;
data want;
do n=1 by 1 until(last.col2);
set have;
by col1 col2 notsorted;
if first.col1 then first=1;
end;
do until(last.col2);
set have;
by col1 col2 notsorted;
if first and n=2 then col3='Y';
output;
end;
drop n first;
run;
Must the flag appear on the first row, or can it appear on the second row?
Hi @Astounding,
the flag must appear on the first and second row(not any other rows), if it satisfies the condition that the repeat should be ONLY ON 1ST and 2ND row for COL2. if repeat value flows to 3rd rows and afterwards, ignore that col1 category .
should be as below
data have;
input col1 col2 col3;
datalines;
1001 1.5 ""
1001 1.5 ""
1001 1.5 ""
1001 1.5 ""
1001 1.5 ""
1001 7 ""
1001 8 ""
1002 4.25 "Y"
1002 4.25 "Y"
1002 2.5 ""
1002 2.5 ""
1002 2.5 ""
1002 2.5 ""
1002 1.5 ""
1003 6.5 "Y"
1003 6.5 "Y"
1003 1.5 ""
1003 1.5 ""
1003 1.5 ""
1003 1.5 ""
1003 1.5 ""
1003 8.7 ""
1004 8.7 ""
1004 8.7 ""
1004 8.7 ""
1004 8.7 ""
1004 8.7 ""
1004 1.5 ""
1004 6.5 ""
;
run;
Suggestion: For ANY flag that takes only two values True/False Yes/No or similar strongly suggest setting the value of True or Yes as a numeric 1 and False or No as numeric 0.
Character values like 'Y' require lots of extra coding to do things like get counts, percentages and answer questions. If you have 1/0 coded values you can use functions or statistics in report or analysis procedures like:
N => total number of answers
Sum => total number of Yes
Mean => percentage of Yes as decimal (0.10 is 10 percent yes)
Max => if 1 means that at least one of a group of value is true
Range => if 1 means that at least one Yes and one No. If 0 then all Yes or all No.
N - Sum => number No
There are some others useful as well.
Additionally SAS will assign 1/0 as the result of any logical comparison : Flag = (y>x); creates numeric 1/0 coded variable when the value of y is > x. So no "if /then/else" for such.
OK, just being lazy here. You have seen some very good programmers post complex solutions. And the solutions may or may not do exactly what you are looking for. Here is a relatively simple program that does half the job. It flags the second observation only, when the conditions you specify are met. I leave it up to you to take that output and expand it to flag both the first and second observations. (After all, you should be doing some of the programming work!)
data want;
set have;
by col1 col2 notsorted;
if first.col1 then n=1;
else n+1;
if n=2 and first.col2=0 and last.col2=1 then flag="Y";
drop n;
run;
You can use a self-merge (with the "firstobs=" option) to look ahead at the next two COL1 values (_next_col1 and _after_next_col1) and next two COL2 values.
data have;
input col1 col2;
datalines;
1001 1.5
1001 1.5
1001 1.5
1001 1.5
1001 1.5
1001 7
1001 8
1002 4.25
1002 4.25
1002 2.5
1002 2.5
1002 2.5
1002 2.5
1002 1.5
1003 6.5
1003 6.5
1003 1.5
1003 1.5
1003 1.5
1003 1.5
1003 1.5
1003 8.7
1004 8.7
1004 8.7
1004 8.7
1004 8.7
1004 8.7
1004 1.5
1004 6.5
run;
data rule1 (drop=_:);
set have;
by col1;
merge have
have (firstobs=2 rename=(col1=_next_col1 col2=_next_col2)
keep=col1 col2)
have (firstobs=3 rename=(col1=_after_next_col1 col2=_after_next_col2)
keep=col1 col2) ;
retain col3 ' ';
if first.col1 then do;
if col1=_next_col1 and col2=_next_col2 and
(col1^=_after_next_col1 or col2^=_after_next_col2)
then col3='Y';
else col3=' ';
end;
if lag2(first.col1)=1 and lag2(col1)=col1 then col3=' ';
run;
If you set COL3='Y', then when the third obs for a given COL1 is encountered, then COL3 becomes a blank. This is tested for via the "if lag2(first.col1)=1" test.
Note: I added the "lag2(col1)=col1" condition to defend against instances with exactly two obs for a given col1.
Here's another slightly simpler approach. It uses the "point=" option of the SET statement. The code maintains a counter _N_Y, tracking the number of 'Y' values queued up for output. With each observation, it is decremented by 1. _N_Y never exceeds 2, and takes that value only under the conditions specified by @sahoositaram555
data want (drop=_:);
set have;
by col1 col2 notsorted;
length col3 $1;
if first.col1=1 and last.col2=0 then do; /** At least one repeat **/
_ptr=_n_+2;
set have (keep=col1 col2 rename=(col2=_col2 col1=_col1)) point=_ptr;
if col1^=_col1 or col2^=_col2 then _n_y=2; /** But not two repeats **/
end;
col3=ifc(_n_y>0,'Y',' ');
_n_y + (-1);
run;
data have;
input col1 col2;
datalines;
1001 1.5
1001 1.5
1001 1.5
1001 1.5
1001 1.5
1001 7
1001 8
1002 4.25
1002 4.25
1002 2.5
1002 2.5
1002 2.5
1002 2.5
1002 1.5
1003 6.5
1003 6.5
1003 1.5
1003 1.5
1003 1.5
1003 1.5
1003 1.5
1003 8.7
1004 8.7
1004 8.7
1004 8.7
1004 8.7
1004 8.7
1004 1.5
1004 6.5
;
run;
data want;
do n=1 by 1 until(last.col2);
set have;
by col1 col2 notsorted;
if first.col1 then first=1;
end;
do until(last.col2);
set have;
by col1 col2 notsorted;
if first and n=2 then col3='Y';
output;
end;
drop n first;
run;
Sounds like you just need to COUNT how many repetitions there are and then test is that is 2 or not.
data have;
input col1 col2 col3 $;
datalines;
1001 1.5 .
1001 1.5 .
1001 1.5 .
1001 1.5 .
1001 1.5 .
1001 7 .
1001 8 .
1002 4.25 Y
1002 4.25 Y
1002 2.5 .
1002 2.5 .
1002 2.5 .
1002 2.5 .
1002 1.5 .
1003 6.5 Y
1003 6.5 Y
1003 1.5 .
1003 1.5 .
1003 1.5 .
1003 1.5 .
1003 1.5 .
1003 8.7 .
1004 8.7 .
1004 8.7 .
1004 8.7 .
1004 8.7 .
1004 8.7 .
1004 1.5 .
1004 6.5 .
;
data want;
do nreps=1 by 1 until(last.col2);
set have;
by col1 col2 notsorted;
end;
flag = (nreps=2);
do until(last.col2);
set have;
by col1 col2 notsorted;
output;
end;
run;
Obs nreps col1 col2 col3 flag 1 5 1001 1.50 0 2 5 1001 1.50 0 3 5 1001 1.50 0 4 5 1001 1.50 0 5 5 1001 1.50 0 6 1 1001 7.00 0 7 1 1001 8.00 0 8 2 1002 4.25 Y 1 9 2 1002 4.25 Y 1 10 4 1002 2.50 0 11 4 1002 2.50 0 12 4 1002 2.50 0 13 4 1002 2.50 0 14 1 1002 1.50 0 15 2 1003 6.50 Y 1 16 2 1003 6.50 Y 1 17 5 1003 1.50 0 18 5 1003 1.50 0 19 5 1003 1.50 0 20 5 1003 1.50 0 21 5 1003 1.50 0 22 1 1003 8.70 0 23 5 1004 8.70 0 24 5 1004 8.70 0 25 5 1004 8.70 0 26 5 1004 8.70 0 27 5 1004 8.70 0 28 1 1004 1.50 0 29 1 1004 6.50 0
@Tom ,
In your example data, all instances of nreps=2 occur at the beginning of a col1 set of values. Would it still work (by failing to flag) if a set of nreps=2 occur in the middle of a set of col1 values?
It is counting the number of observations in the groups defined by the values of COL2 nested under COL1.
So yes. Any group with exactly two observations will be flagged. Whether or not if it is the first group within that value of COL1.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.