topic Re: Identify duplicate records in a dataset without specifying all variables in SAS Programming

Identify duplicate records in a dataset without specifying all variables

MART1 — Fri, 17 Sep 2021 15:24:29 GMT

Hello

I have a number of datasets with different variables; I need to return the duplicate records for each dataset (if they exist).

I'd like to do this by not having to specify all the variables (so I can wrap it in a macro which will go through all the datasets).

My example below: in the dataset TEST, Nick's row is entered twice.

I can use PROC SORT nouniquekeys to identify that, but in the BY I need to specify all the variables ( Name Age Car).

I cannot find how to use, say, *, without having to list all the variables?

data TEST;                                                                                                                            
   input Name $ Age Car $;                                                                                                                 
   datalines;                                                                                                                           
Mike 40 Volvo
Nick 35 Nissan
Susan 51 BMW
Bill 60 Volvo
Tom 35 Ford
Nick 35 Nissan
Nadia 49 Nissan
;                                                                                                                                       
run;                    

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  Name Age Car;
run;

many thanks

Re: Identify duplicate records in a dataset without specifying all variables

Reeza — Fri, 17 Sep 2021 15:29:04 GMT

Use the keyword _ALL_ instead which is a shortcut for all variables.

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  _all_;
run;

Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html

@MART1 wrote:

Hello

I have a number of datasets with different variables; I need to return the duplicate records for each dataset (if they exist).

I'd like to do this by not having to specify all the variables (so I can wrap it in a macro which will go through all the datasets).

My example below: in the dataset TEST, Nick's row is entered twice.

I can use PROC SORT nouniquekeys to identify that, but in the BY I need to specify all the variables ( Name Age Car).

I cannot find how to use, say, *, without having to list all the variables?
data TEST;                                                                                                                            
   input Name $ Age Car $;                                                                                                                 
   datalines;                                                                                                                           
Mike 40 Volvo
Nick 35 Nissan
Susan 51 BMW
Bill 60 Volvo
Tom 35 Ford
Nick 35 Nissan
Nadia 49 Nissan
;                                                                                                                                       
run;                    

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  Name Age Car;
run;
many thanks

Re: Identify duplicate records in a dataset without specifying all variables

MART1 — Fri, 17 Sep 2021 15:48:24 GMT

That's great thanks so much @Reeza

Very interesting paper - just one question (not strictly related to the topic I appreciate) ; the _ALL_ cannot be used in PROC SQL, for example

PROC SQL;
	CREATE TABLE DUPS AS 
		SELECT COUNT(DISTINCT _ALL_) FROM TEST;
QUIT;

Do you know if there's an _ALL_ equivalent for SQL? (* can be used, but not with DISTINCT, or in a GROUP BY)

Thanks

Re: Identify duplicate records in a dataset without specifying all variables

Reeza — Fri, 17 Sep 2021 15:57:49 GMT

Unfortunately there is no equivalent to SQL. In general, SQL doesn't work well with shortcut references or formats.

You could create a macro variable list from sashelp but _all_ is faster and more dynamic.

Re: Identify duplicate records in a dataset without specifying all variables

Tom — Fri, 17 Sep 2021 16:40:39 GMT

The SQL equivalent of _ALL_ is *.

So this code will reduce HAVE from 19 observations to 11 distinct observations.

data have;
  set sashelp.class;
  keep age sex ;
run;

proc sql;
create table distinct as select distinct * from have ;
quit;

Re: Identify duplicate records in a dataset without specifying all variables

Reeza — Fri, 17 Sep 2021 16:43:12 GMT

 69         proc sql;
 70         create table test as select count(distinct *) from sashelp.class ;
 ERROR: * used in an illegal position.
 NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
 71         quit;

Cannot be used for counts, but could be used to de-duplicate

Re: Identify duplicate records in a dataset without specifying all variables

Tom — Fri, 17 Sep 2021 17:54:04 GMT

But that is just the same invalid syntax you sent before.

If you want to count the number of distinct records use:

select count(*) as number_distinct from (select distinct * from have);