BookmarkSubscribeRSS Feed
seohyeonjeong
Obsidian | Level 7

Hi everyone,

 

I have to make a new column named DupCheck which compares right below row and set a flag.

Excel function: =IF(B1=B2, 0, 1) is used for putting the flag of the duplication.

The table is ordered by contractNo

 

For example,

 

rowN ContractNo    DupCheck

 1        ZA3LE   0

 2        ZA3LE   1

 3        ZC2G3    1

 4        ZV3GS    0

 5        ZV3GS    0

 6        ZV3GS    1

       

 

I made a function below but it has a big problem when it comes to having more than two same variables and the wrong first row flag.

 

 

SELECT 
g1.*
,case when g1.contractNo =g2.contractNo then 0 else 1 end as dupCheck
FROM table g1
INNER JOIN table g2
ON g1.rowN = g1.rowN-1

 

 

I will appreciate your help!

11 REPLIES 11
Ksharp
Super User
SELECT
g1.*
, ifn( g1.contractNo =g2.contractNo , 0 ,1 ) as dupCheck
FROM table g1
Amir
PROC Star

Hi @seohyeonjeong,

 

If I've understood your requirements, the following code might help. It merges the data with itself, with one of the inputs offset to the next record:

 

 

/* set up input data */
data have;
	input contractno $char5.;

	datalines;
ZA3LE
ZA3LE
ZC2G3
ZV3GS
ZV3GS
ZV3GS
;


/* merge data with itself, starting one input from the 2nd obs */
data want(drop = next_contractno);
	merge have(firstobs = 2 rename = contractno = next_contractno)
		  have;
		  
	/* true gives 1, false give 0 */
	dupcheck = contractno ne next_contractno;
run;


/* display results */
proc print data = want;
run;

 

 

 

 

Kind regards,

Amir.

 

Edit: Improved comment.

seohyeonjeong
Obsidian | Level 7
Thank you, Amir.

This will help me in the future, but I'm finding a way in proc sql.
My manager asked me to make it work in proc sql.

But thank you so much!
Patrick
Opal | Level 21

@seohyeonjeong Especially given your data is already sorted by contractno using sequential SAS data step by group processing is so much more efficient than SQL in your case - both from a coding and processing perspective.

If your source data resides in a database then using SQL might be the right thing to do - but else there is really no reason for using SQL here.

 

Compare all the code and processing required to create want_1 as compared to the little it takes to create want_2.

/* set up input data */
data have;
	input contractno $char5.;
	datalines;
ZA3LE
ZA3LE
ZC2G3
ZV3GS
ZV3GS
ZV3GS
;

data inter;
  set have;
  row_id=_n_;
run;

proc sql;
  create table want_1 as
  select 
    o.contractno,
    case
      when not missing(i.row_id) then 1
      else 0
      end as dupcheck
  from 
    inter o
    left join
    (
      select
        contractno,
        max(row_id) as row_id
      from inter
      group by contractno
    ) i
    on o.contractno=i.contractno
      and o.row_id=i.row_id
    order by o.row_id
  ;
quit;

data want_2;
  set have;
  by contractno;
  dupcheck= last.contractno;
run;
seohyeonjeong
Obsidian | Level 7
Our dept uses SAS Enterprise Guide which is the only way to access DB now, but we are planning to change our work environment on AWS (Athena). That's the reason I have to make the code in proc SQL as possible I can.
Thank you for your help!
ChrisHemedinger
Community Manager

I think @Amir provides a good answer for your case, where you need to compare a value from one record with that of the next record.

 

The SAS functions IFN and IFC offer the equivalent feature to the Excel IF function. But if your goal is a binary flag only (0 or 1), then you can rely on the fact that a condition in SQL will result in a 1 or 0 output.

 

Example:

 

 

proc sql;
 select (age>=13) as isTeen 
  from sashelp.class;
quit;

 

 

The result isTeen column will have a 1 if the subject is 13 or older, 0 if not.

 

Using a SAS function like IFN or IFC is valid in PROC SQL but it may not pass to the underlying database if your data source is in something like Oracle or SQL Server -- and that can result in more data movement behind the scenes. For more complex condition checking or non-binary flags, I recommend using the CASE-WHEN-THEN-ELSE structure, which is standard in SQL. Example:

 

 

proc sql;
 select 
 case 
  when (age <=13) then 'Young'
  when (age between 14 and 15) then 'Mid'
  else 'Old' 
  end as ageRange
  from sashelp.class;
quit;

 

 

 

Learn from the Experts! Check out the huge catalog of free sessions in the Ask the Expert webinar series.
Tom
Super User Tom
Super User

This will be much easier if you process the data the way normal SAS works.

 

The result of a boolean expression is already going to be code 0/1 so no need for the IF() function (called IFN() or IFC() in SAS or CASE expression in SQL).

 

SQL does not have any concept of ordered data.  That is why you had to introduce the ROWN variable.  But using a DATA step the records are processed sequentially.

 

It is much easier to remember something from the past than it is to predict the future.  So use the LAG() function.

data have;
  rowN+1;
  input ContractNo :$10. @@ ;
cards;
ZA3LE ZA3LE ZC2G3 ZV3GS ZV3GS ZV3GS
;

data want ;
  set have;
  dup = contractno=lag(contractno);
run;

If you really need to have the flag on the first duplicate instead of the second duplicate then use your ROWN variable to re-order the data.

proc sort data=have out=descending;
  by descending rown;
run;

data want2;
  set descending;
  by descending rown;
  dup_check = contractno=lag(contractno);
run;

Results:

               Contract     dup_
Obs    rowN       No       check

 1       6      ZV3GS        0
 2       5      ZV3GS        1
 3       4      ZV3GS        1
 4       3      ZC2G3        0
 5       2      ZA3LE        0
 6       1      ZA3LE        1

But I am not sure how useful either flag is going to be.  If you want to check for values of CONTRACTNO that repeat (form a run of the same value) then use BY group processing.  Then you can flag ALL of the observations in the duplicate run.  Of locate the FIRST one or the LAST one.

data want3;
  set have;
  by contractno notsorted;
  dup = not (first.contractno and last.contractno);
  first=first.contractno;
  last=last.contractno;
run;

Results

               Contract
Obs    rowN       No       dup    first    last

 1       1      ZA3LE       1       1        0
 2       2      ZA3LE       1       0        1
 3       3      ZC2G3       0       1        1
 4       4      ZV3GS       1       1        0
 5       5      ZV3GS       1       0        0
 6       6      ZV3GS       1       0        1

 

seohyeonjeong
Obsidian | Level 7
Thank you for your help.
How generous you are. Thank you for the examples. It is easy to understand.
Amir
PROC Star

Hi @seohyeonjeong,

 

If there is no choice, except to use sql, then the below code shows a possible solution, but there is a lot more preparation work involved. This is why the data step is a more straightforward approach, so it might be worth finding out why you are being asked to use sql instead of a data step, perhaps there is a good reason but it does not appear to be obvious to any of us at the moment.

 

/* set up input data */
data have;
	input contractno $char5.;

	datalines;
ZA3LE
ZA3LE
ZC2G3
ZV3GS
ZV3GS
ZV3GS
;


/* generate two input data sets with sequence numbers */
data have1 have2;
	set have;
	
	/* only output to have2 from the 2nd observation */
	if seq then
		output have2;
		
	seq + 1;
		
	output have1;
run;


/* use sql to left join on seq */
proc sql;
	create table
		want
	as
	select
		 have1.contractno
		,have1.contractno ne have2.contractno as dupcheck
	from
		have1
	left join
		have2
	on
		have1.seq eq have2.seq
	;
quit;


/* display results */
proc print data = want;
run;

 

HTH.

 

 

Kind regards,

Amir.

seohyeonjeong
Obsidian | Level 7
I truly understand what you mean. It's really annoying that I have to use only PROC SQL.
I will ask my boss to do this on normal SAS and will change the code when we move this one to AWS Athena. Thank you for your help:)
ballardw
Super User

Why is Row 5 not a duplicate of Row 4 in your example? Is that a typo or do you actually mean something other than the Excel example?

 


@seohyeonjeong wrote:

Hi everyone,

 

I have to make a new column named DupCheck which compares right below row and set a flag.

Excel function: =IF(B1=B2, 0, 1) is used for putting the flag of the duplication.

The table is ordered by contractNo

 

For example,

 

rowN ContractNo    DupCheck

 1        ZA3LE   0

 2        ZA3LE   1

 3        ZC2G3    1

 4        ZV3GS    0

 5        ZV3GS    0

 6        ZV3GS    1

       

 

I made a function below but it has a big problem when it comes to having more than two same variables and the wrong first row flag.

 

 

SELECT 
g1.*
,case when g1.contractNo =g2.contractNo then 0 else 1 end as dupCheck
FROM table g1
INNER JOIN table g2
ON g1.rowN = g1.rowN-1

 

 

I will appreciate your help!


 

 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1241 views
  • 6 likes
  • 7 in conversation