BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
data have;
        input col1 col2;
        datalines;
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
1001 7
1001 8
1002 4.25 1002 4.25 1002 2.5 1002 2.5 1002 2.5
1002 2.5
1002 1.5
1003 6.5 1003 6.5 1003 1.5 1003 1.5 1003 1.5
1003 1.5
1003 1.5
1003 8.7
1004 8.7 1004 8.7 1004 8.7 1004 8.7
1004 8.7
1004 1.5
1004 6.5 ; run;

 I have this above dataset, in which I need to identify and flag them if the first.COL2 value matches with the below row only. eg: for 1001, in col2 1.5 is repeated more than 2 times then we shallnot flag any observation of 1001.  but for 1002, 4.25 is repeated only once( present in first and below row) , in this case both the rows should be flagged with "Y". same for 1003. for 1004 no flag will be populated as 8.7 is repeated more than 2 times.

pls guide how to go about it.

 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
data have;
        input col1 col2;
        datalines;
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
    1001 7
    1001 8
    1002 4.25
    1002 4.25
    1002 2.5
    1002 2.5
    1002 2.5
    1002 2.5
    1002 1.5
    1003 6.5
    1003 6.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 1.5
    1004 6.5
    ;
    run;
data want;
do n=1 by 1 until(last.col2);
 set have;
 by col1 col2 notsorted;
 if first.col1 then first=1;
end;
do until(last.col2);
 set have;
 by col1 col2 notsorted;
 if first and n=2 then col3='Y';
 output;
end;
drop n first;
run;

View solution in original post

10 REPLIES 10
Astounding
PROC Star

Must the flag appear on the first row, or can it appear on the second row?

sahoositaram555
Pyrite | Level 9

Hi @Astounding,
the flag must appear on the first and second row(not any other rows), if it satisfies the condition that the repeat should be ONLY ON 1ST and 2ND row for COL2. if repeat value flows to 3rd rows and afterwards, ignore that col1 category .

 

should be as below

data have;
        input col1 col2 col3;
        datalines;
    1001 1.5 ""
    1001 1.5 ""
    1001 1.5 ""
    1001 1.5 ""
    1001 1.5 ""
1001 7 ""
1001 8 ""
1002 4.25 "Y" 1002 4.25 "Y" 1002 2.5 "" 1002 2.5 "" 1002 2.5 ""
1002 2.5 ""
1002 1.5 ""
1003 6.5 "Y" 1003 6.5 "Y" 1003 1.5 "" 1003 1.5 "" 1003 1.5 ""
1003 1.5 ""
1003 1.5 ""
1003 8.7 ""
1004 8.7 "" 1004 8.7 "" 1004 8.7 "" 1004 8.7 ""
1004 8.7 ""
1004 1.5 ""
1004 6.5 "" ; run;
ballardw
Super User

Suggestion: For ANY flag that takes only two values True/False  Yes/No or similar strongly suggest setting the value of True or Yes as a numeric 1 and False or No as numeric 0.

Character values like 'Y' require lots of extra coding to do things like get counts, percentages and answer questions. If you have 1/0 coded values you can use functions or statistics in report or analysis procedures  like:

N => total number of answers

Sum => total number of Yes

Mean => percentage of Yes as decimal  (0.10  is 10 percent yes)

Max => if 1 means that at least one of a group of value is true

Range => if 1 means that at least one Yes and one No. If 0 then all Yes or all No.

N - Sum => number No

There are some others useful as well.

 

Additionally SAS will assign 1/0 as the result of any logical comparison :   Flag = (y>x);  creates numeric 1/0 coded variable when the value of y is > x.  So no "if /then/else" for such.

Astounding
PROC Star

OK, just being lazy here.  You have seen some very good programmers post complex solutions.  And the solutions may or may not do exactly what you are looking for.  Here is a relatively simple program that does half the job.  It flags the second observation only, when the conditions you specify are met.  I leave it up to you to take that output and expand it to flag both the first and second observations.  (After all, you should be doing some of the programming work!)

data want;
   set have;
   by col1 col2 notsorted;
   if first.col1 then n=1;
   else n+1;
   if n=2 and first.col2=0 and last.col2=1 then flag="Y";
   drop n;
run;
mkeintz
PROC Star

You can use a self-merge (with the "firstobs=" option) to look ahead at the next two COL1 values (_next_col1 and _after_next_col1) and next two COL2 values.

 

data have;
        input col1 col2;
        datalines;
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
    1001 7
    1001 8
    1002 4.25
    1002 4.25
    1002 2.5
    1002 2.5
    1002 2.5
    1002 2.5
    1002 1.5
    1003 6.5
    1003 6.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 1.5
    1004 6.5
run;


data rule1 (drop=_:);
  set have;
  by col1;
  merge have 
       have (firstobs=2 rename=(col1=_next_col1 col2=_next_col2)
              keep=col1 col2)
        have (firstobs=3 rename=(col1=_after_next_col1 col2=_after_next_col2)
              keep=col1 col2) ;
  retain col3 ' ';
  if first.col1 then do;
    if col1=_next_col1 and col2=_next_col2 and
      (col1^=_after_next_col1 or col2^=_after_next_col2)
    then col3='Y';
    else col3=' ';
  end;
  if lag2(first.col1)=1 and lag2(col1)=col1 then col3=' ';
run;

If you set COL3='Y', then when the third obs for a given COL1 is encountered, then COL3 becomes a blank.  This is tested for via the "if lag2(first.col1)=1" test.

 

Note:  I added the "lag2(col1)=col1" condition to defend against instances with exactly two obs for a given col1.  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

 

Here's another slightly simpler approach.  It uses the "point=" option of the SET statement.  The code maintains a counter _N_Y, tracking the number of 'Y' values queued up for output.  With each observation, it is decremented by 1.  _N_Y never exceeds 2, and takes that value only under the conditions specified  by @sahoositaram555 

 

 

 

data want (drop=_:);
  set have;
  by col1 col2 notsorted;
  length col3 $1;
  if first.col1=1 and last.col2=0 then do;      /** At least one repeat **/
    _ptr=_n_+2;
    set have (keep=col1 col2 rename=(col2=_col2 col1=_col1)) point=_ptr;
    if col1^=_col1 or col2^=_col2 then _n_y=2;  /** But not two repeats **/
  end;
  col3=ifc(_n_y>0,'Y',' ');
  _n_y + (-1);
run;

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User
data have;
        input col1 col2;
        datalines;
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
    1001 1.5
    1001 7
    1001 8
    1002 4.25
    1002 4.25
    1002 2.5
    1002 2.5
    1002 2.5
    1002 2.5
    1002 1.5
    1003 6.5
    1003 6.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 1.5
    1003 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 8.7
    1004 1.5
    1004 6.5
    ;
    run;
data want;
do n=1 by 1 until(last.col2);
 set have;
 by col1 col2 notsorted;
 if first.col1 then first=1;
end;
do until(last.col2);
 set have;
 by col1 col2 notsorted;
 if first and n=2 then col3='Y';
 output;
end;
drop n first;
run;
Tom
Super User Tom
Super User

Sounds like you just need to COUNT how many repetitions there are and then test is that is 2 or not.

data have;
    input col1 col2 col3 $;
datalines;
1001 1.5 .
1001 1.5 .
1001 1.5 .
1001 1.5 .
1001 1.5 .
1001 7 .
1001 8 .
1002 4.25 Y
1002 4.25 Y
1002 2.5 .
1002 2.5 .
1002 2.5 .
1002 2.5 .
1002 1.5 .
1003 6.5 Y
1003 6.5 Y 
1003 1.5 .
1003 1.5 .
1003 1.5 .
1003 1.5 .
1003 1.5 .
1003 8.7 .
1004 8.7 .
1004 8.7 .
1004 8.7 .
1004 8.7 .
1004 8.7 .
1004 1.5 .
1004 6.5 .
;

data want;
  do nreps=1 by 1 until(last.col2);
    set have;
    by col1  col2 notsorted;
  end;
  flag = (nreps=2);
  do until(last.col2);
    set have;
    by col1  col2 notsorted;
    output;
  end;
run;
Obs    nreps    col1    col2    col3    flag

  1      5      1001    1.50              0
  2      5      1001    1.50              0
  3      5      1001    1.50              0
  4      5      1001    1.50              0
  5      5      1001    1.50              0
  6      1      1001    7.00              0
  7      1      1001    8.00              0
  8      2      1002    4.25     Y        1
  9      2      1002    4.25     Y        1
 10      4      1002    2.50              0
 11      4      1002    2.50              0
 12      4      1002    2.50              0
 13      4      1002    2.50              0
 14      1      1002    1.50              0
 15      2      1003    6.50     Y        1
 16      2      1003    6.50     Y        1
 17      5      1003    1.50              0
 18      5      1003    1.50              0
 19      5      1003    1.50              0
 20      5      1003    1.50              0
 21      5      1003    1.50              0
 22      1      1003    8.70              0
 23      5      1004    8.70              0
 24      5      1004    8.70              0
 25      5      1004    8.70              0
 26      5      1004    8.70              0
 27      5      1004    8.70              0
 28      1      1004    1.50              0
 29      1      1004    6.50              0
Astounding
PROC Star

@Tom ,

 

In your example data, all instances of nreps=2 occur at the beginning of a col1 set of values.  Would it still work (by failing to flag) if a set of nreps=2 occur in the middle of a set of col1 values?

Tom
Super User Tom
Super User

It is counting the number of observations in the groups defined by the values of COL2 nested under COL1.

So yes. Any group with exactly two observations will be flagged. Whether or not if it is the first group within that value of COL1.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 545 views
  • 0 likes
  • 6 in conversation