Re: How to make an excel function =IF(B1=B2, 0, 1) in SAS PROC SQL

seohyeonjeong · Posted 01-28-2022 11:18 PM

Hi everyone,

I have to make a new column named DupCheck which compares right below row and set a flag.

Excel function: =IF(B1=B2, 0, 1) is used for putting the flag of the duplication.

The table is ordered by contractNo

For example,

rowN ContractNo DupCheck

1 ZA3LE　　　0

2 ZA3LE　　　1

3 ZC2G3　　 1

4 ZV3GS　　 0

5 ZV3GS　　 0

6 ZV3GS　　 1

I made a function below but it has a big problem when it comes to having more than two same variables and the wrong first row flag.

SELECT 
g1.*
,case when g1.contractNo =g2.contractNo then 0 else 1 end as dupCheck
FROM table g1
INNER JOIN table g2
ON g1.rowN = g1.rowN-1

I will appreciate your help!

Ksharp · Posted 01-29-2022 12:38 AM

SELECT
g1.*
, ifn( g1.contractNo =g2.contractNo , 0 ,1 ) as dupCheck
FROM table g1

Amir · Posted 01-29-2022 08:35 AM

Hi @seohyeonjeong,

If I've understood your requirements, the following code might help. It merges the data with itself, with one of the inputs offset to the next record:

/* set up input data */
data have;
	input contractno $char5.;

	datalines;
ZA3LE
ZA3LE
ZC2G3
ZV3GS
ZV3GS
ZV3GS
;


/* merge data with itself, starting one input from the 2nd obs */
data want(drop = next_contractno);
	merge have(firstobs = 2 rename = contractno = next_contractno)
		  have;
		  
	/* true gives 1, false give 0 */
	dupcheck = contractno ne next_contractno;
run;


/* display results */
proc print data = want;
run;

Kind regards,

Amir.

Edit: Improved comment.

seohyeonjeong · Posted 01-29-2022 09:22 AM

Thank you, Amir.

This will help me in the future, but I'm finding a way in proc sql.
My manager asked me to make it work in proc sql.

But thank you so much!

Patrick · Posted 01-29-2022 09:55 PM

@seohyeonjeong Especially given your data is already sorted by contractno using sequential SAS data step by group processing is so much more efficient than SQL in your case - both from a coding and processing perspective.

If your source data resides in a database then using SQL might be the right thing to do - but else there is really no reason for using SQL here.

Compare all the code and processing required to create want_1 as compared to the little it takes to create want_2.

/* set up input data */
data have;
	input contractno $char5.;
	datalines;
ZA3LE
ZA3LE
ZC2G3
ZV3GS
ZV3GS
ZV3GS
;

data inter;
  set have;
  row_id=_n_;
run;

proc sql;
  create table want_1 as
  select 
    o.contractno,
    case
      when not missing(i.row_id) then 1
      else 0
      end as dupcheck
  from 
    inter o
    left join
    (
      select
        contractno,
        max(row_id) as row_id
      from inter
      group by contractno
    ) i
    on o.contractno=i.contractno
      and o.row_id=i.row_id
    order by o.row_id
  ;
quit;

data want_2;
  set have;
  by contractno;
  dupcheck= last.contractno;
run;

seohyeonjeong · Posted 02-01-2022 06:24 PM

Our dept uses SAS Enterprise Guide which is the only way to access DB now, but we are planning to change our work environment on AWS (Athena). That's the reason I have to make the code in proc SQL as possible I can.
Thank you for your help!

ChrisHemedinger · Posted 01-29-2022 08:37 AM

I think @Amir provides a good answer for your case, where you need to compare a value from one record with that of the next record.

The SAS functions IFN and IFC offer the equivalent feature to the Excel IF function. But if your goal is a binary flag only (0 or 1), then you can rely on the fact that a condition in SQL will result in a 1 or 0 output.

Example:

proc sql;
 select (age>=13) as isTeen 
  from sashelp.class;
quit;

The result isTeen column will have a 1 if the subject is 13 or older, 0 if not.

Using a SAS function like IFN or IFC is valid in PROC SQL but it may not pass to the underlying database if your data source is in something like Oracle or SQL Server -- and that can result in more data movement behind the scenes. For more complex condition checking or non-binary flags, I recommend using the CASE-WHEN-THEN-ELSE structure, which is standard in SQL. Example:

proc sql;
 select 
 case 
  when (age <=13) then 'Young'
  when (age between 14 and 15) then 'Mid'
  else 'Old' 
  end as ageRange
  from sashelp.class;
quit;

Learn from the Experts! Check out the huge catalog of free sessions in the Ask the Expert webinar series.

Tom · Posted 01-29-2022 12:15 PM

This will be much easier if you process the data the way normal SAS works.

The result of a boolean expression is already going to be code 0/1 so no need for the IF() function (called IFN() or IFC() in SAS or CASE expression in SQL).

SQL does not have any concept of ordered data. That is why you had to introduce the ROWN variable. But using a DATA step the records are processed sequentially.

It is much easier to remember something from the past than it is to predict the future. So use the LAG() function.

data have;
  rowN+1;
  input ContractNo :$10. @@ ;
cards;
ZA3LE ZA3LE ZC2G3 ZV3GS ZV3GS ZV3GS
;

data want ;
  set have;
  dup = contractno=lag(contractno);
run;

If you really need to have the flag on the first duplicate instead of the second duplicate then use your ROWN variable to re-order the data.

proc sort data=have out=descending;
  by descending rown;
run;

data want2;
  set descending;
  by descending rown;
  dup_check = contractno=lag(contractno);
run;

Results:

               Contract     dup_
Obs    rowN       No       check

 1       6      ZV3GS        0
 2       5      ZV3GS        1
 3       4      ZV3GS        1
 4       3      ZC2G3        0
 5       2      ZA3LE        0
 6       1      ZA3LE        1

But I am not sure how useful either flag is going to be. If you want to check for values of CONTRACTNO that repeat (form a run of the same value) then use BY group processing. Then you can flag ALL of the observations in the duplicate run. Of locate the FIRST one or the LAST one.

data want3;
  set have;
  by contractno notsorted;
  dup = not (first.contractno and last.contractno);
  first=first.contractno;
  last=last.contractno;
run;

Results

               Contract
Obs    rowN       No       dup    first    last

 1       1      ZA3LE       1       1        0
 2       2      ZA3LE       1       0        1
 3       3      ZC2G3       0       1        1
 4       4      ZV3GS       1       1        0
 5       5      ZV3GS       1       0        0
 6       6      ZV3GS       1       0        1

seohyeonjeong · Posted 02-01-2022 06:27 PM

Thank you for your help.
How generous you are. Thank you for the examples. It is easy to understand.

Amir · Posted 01-29-2022 03:10 PM

Hi @seohyeonjeong,

If there is no choice, except to use sql, then the below code shows a possible solution, but there is a lot more preparation work involved. This is why the data step is a more straightforward approach, so it might be worth finding out why you are being asked to use sql instead of a data step, perhaps there is a good reason but it does not appear to be obvious to any of us at the moment.

/* set up input data */
data have;
	input contractno $char5.;

	datalines;
ZA3LE
ZA3LE
ZC2G3
ZV3GS
ZV3GS
ZV3GS
;


/* generate two input data sets with sequence numbers */
data have1 have2;
	set have;
	
	/* only output to have2 from the 2nd observation */
	if seq then
		output have2;
		
	seq + 1;
		
	output have1;
run;


/* use sql to left join on seq */
proc sql;
	create table
		want
	as
	select
		 have1.contractno
		,have1.contractno ne have2.contractno as dupcheck
	from
		have1
	left join
		have2
	on
		have1.seq eq have2.seq
	;
quit;


/* display results */
proc print data = want;
run;

HTH.

Kind regards,

Amir.

seohyeonjeong · Posted 02-01-2022 06:29 PM

I truly understand what you mean. It's really annoying that I have to use only PROC SQL.
I will ask my boss to do this on normal SAS and will change the code when we move this one to AWS Athena. Thank you for your help:)

ballardw · Posted 01-30-2022 07:57 AM

Why is Row 5 not a duplicate of Row 4 in your example? Is that a typo or do you actually mean something other than the Excel example?

@seohyeonjeong wrote:

Hi everyone,

I have to make a new column named DupCheck which compares right below row and set a flag.

Excel function: =IF(B1=B2, 0, 1) is used for putting the flag of the duplication.

The table is ordered by contractNo

For example,

rowN ContractNo DupCheck

1 ZA3LE　　　0

2 ZA3LE　　　1

3 ZC2G3　　 1

4 ZV3GS　　 0

5 ZV3GS　　 0

6 ZV3GS　　 1

I made a function below but it has a big problem when it comes to having more than two same variables and the wrong first row flag.
SELECT 
g1.*
,case when g1.contractNo =g2.contractNo then 0 else 1 end as dupCheck
FROM table g1
INNER JOIN table g2
ON g1.rowN = g1.rowN-1
I will appreciate your help!

SAS Innovate 2025: Call for Content

Classroom Training Available!