Hi everyone,
I have to make a new column named DupCheck which compares right below row and set a flag.
Excel function: =IF(B1=B2, 0, 1) is used for putting the flag of the duplication.
The table is ordered by contractNo
For example,
rowN ContractNo DupCheck
1 ZA3LE 0
2 ZA3LE 1
3 ZC2G3 1
4 ZV3GS 0
5 ZV3GS 0
6 ZV3GS 1
I made a function below but it has a big problem when it comes to having more than two same variables and the wrong first row flag.
SELECT g1.* ,case when g1.contractNo =g2.contractNo then 0 else 1 end as dupCheck FROM table g1 INNER JOIN table g2 ON g1.rowN = g1.rowN-1
I will appreciate your help!
Hi @seohyeonjeong,
If I've understood your requirements, the following code might help. It merges the data with itself, with one of the inputs offset to the next record:
/* set up input data */
data have;
input contractno $char5.;
datalines;
ZA3LE
ZA3LE
ZC2G3
ZV3GS
ZV3GS
ZV3GS
;
/* merge data with itself, starting one input from the 2nd obs */
data want(drop = next_contractno);
merge have(firstobs = 2 rename = contractno = next_contractno)
have;
/* true gives 1, false give 0 */
dupcheck = contractno ne next_contractno;
run;
/* display results */
proc print data = want;
run;
Kind regards,
Amir.
Edit: Improved comment.
@seohyeonjeong Especially given your data is already sorted by contractno using sequential SAS data step by group processing is so much more efficient than SQL in your case - both from a coding and processing perspective.
If your source data resides in a database then using SQL might be the right thing to do - but else there is really no reason for using SQL here.
Compare all the code and processing required to create want_1 as compared to the little it takes to create want_2.
/* set up input data */
data have;
input contractno $char5.;
datalines;
ZA3LE
ZA3LE
ZC2G3
ZV3GS
ZV3GS
ZV3GS
;
data inter;
set have;
row_id=_n_;
run;
proc sql;
create table want_1 as
select
o.contractno,
case
when not missing(i.row_id) then 1
else 0
end as dupcheck
from
inter o
left join
(
select
contractno,
max(row_id) as row_id
from inter
group by contractno
) i
on o.contractno=i.contractno
and o.row_id=i.row_id
order by o.row_id
;
quit;
data want_2;
set have;
by contractno;
dupcheck= last.contractno;
run;
I think @Amir provides a good answer for your case, where you need to compare a value from one record with that of the next record.
The SAS functions IFN and IFC offer the equivalent feature to the Excel IF function. But if your goal is a binary flag only (0 or 1), then you can rely on the fact that a condition in SQL will result in a 1 or 0 output.
Example:
proc sql;
select (age>=13) as isTeen
from sashelp.class;
quit;
The result isTeen column will have a 1 if the subject is 13 or older, 0 if not.
Using a SAS function like IFN or IFC is valid in PROC SQL but it may not pass to the underlying database if your data source is in something like Oracle or SQL Server -- and that can result in more data movement behind the scenes. For more complex condition checking or non-binary flags, I recommend using the CASE-WHEN-THEN-ELSE structure, which is standard in SQL. Example:
proc sql;
select
case
when (age <=13) then 'Young'
when (age between 14 and 15) then 'Mid'
else 'Old'
end as ageRange
from sashelp.class;
quit;
This will be much easier if you process the data the way normal SAS works.
The result of a boolean expression is already going to be code 0/1 so no need for the IF() function (called IFN() or IFC() in SAS or CASE expression in SQL).
SQL does not have any concept of ordered data. That is why you had to introduce the ROWN variable. But using a DATA step the records are processed sequentially.
It is much easier to remember something from the past than it is to predict the future. So use the LAG() function.
data have;
rowN+1;
input ContractNo :$10. @@ ;
cards;
ZA3LE ZA3LE ZC2G3 ZV3GS ZV3GS ZV3GS
;
data want ;
set have;
dup = contractno=lag(contractno);
run;
If you really need to have the flag on the first duplicate instead of the second duplicate then use your ROWN variable to re-order the data.
proc sort data=have out=descending;
by descending rown;
run;
data want2;
set descending;
by descending rown;
dup_check = contractno=lag(contractno);
run;
Results:
Contract dup_ Obs rowN No check 1 6 ZV3GS 0 2 5 ZV3GS 1 3 4 ZV3GS 1 4 3 ZC2G3 0 5 2 ZA3LE 0 6 1 ZA3LE 1
But I am not sure how useful either flag is going to be. If you want to check for values of CONTRACTNO that repeat (form a run of the same value) then use BY group processing. Then you can flag ALL of the observations in the duplicate run. Of locate the FIRST one or the LAST one.
data want3;
set have;
by contractno notsorted;
dup = not (first.contractno and last.contractno);
first=first.contractno;
last=last.contractno;
run;
Results
Contract Obs rowN No dup first last 1 1 ZA3LE 1 1 0 2 2 ZA3LE 1 0 1 3 3 ZC2G3 0 1 1 4 4 ZV3GS 1 1 0 5 5 ZV3GS 1 0 0 6 6 ZV3GS 1 0 1
Hi @seohyeonjeong,
If there is no choice, except to use sql, then the below code shows a possible solution, but there is a lot more preparation work involved. This is why the data step is a more straightforward approach, so it might be worth finding out why you are being asked to use sql instead of a data step, perhaps there is a good reason but it does not appear to be obvious to any of us at the moment.
/* set up input data */
data have;
input contractno $char5.;
datalines;
ZA3LE
ZA3LE
ZC2G3
ZV3GS
ZV3GS
ZV3GS
;
/* generate two input data sets with sequence numbers */
data have1 have2;
set have;
/* only output to have2 from the 2nd observation */
if seq then
output have2;
seq + 1;
output have1;
run;
/* use sql to left join on seq */
proc sql;
create table
want
as
select
have1.contractno
,have1.contractno ne have2.contractno as dupcheck
from
have1
left join
have2
on
have1.seq eq have2.seq
;
quit;
/* display results */
proc print data = want;
run;
HTH.
Kind regards,
Amir.
Why is Row 5 not a duplicate of Row 4 in your example? Is that a typo or do you actually mean something other than the Excel example?
@seohyeonjeong wrote:
Hi everyone,
I have to make a new column named DupCheck which compares right below row and set a flag.
Excel function: =IF(B1=B2, 0, 1) is used for putting the flag of the duplication.
The table is ordered by contractNo
For example,
rowN ContractNo DupCheck
1 ZA3LE 0
2 ZA3LE 1
3 ZC2G3 1
4 ZV3GS 0
5 ZV3GS 0
6 ZV3GS 1
I made a function below but it has a big problem when it comes to having more than two same variables and the wrong first row flag.
SELECT g1.* ,case when g1.contractNo =g2.contractNo then 0 else 1 end as dupCheck FROM table g1 INNER JOIN table g2 ON g1.rowN = g1.rowN-1
I will appreciate your help!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.