BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MART1
Quartz | Level 8

Hello

 

I have a number of datasets with different variables; I need to return the duplicate records for each dataset (if they exist).

 

I'd like to do this by not having to specify all the variables (so I can wrap it in a macro which will go through all the datasets).

 

My example below: in the dataset TEST, Nick's row is entered twice.

 

I can use PROC SORT nouniquekeys to identify that, but in the BY  I need to specify all the variables ( Name Age Car).

 

I cannot find how to use, say, *, without having to list all the variables?

 

 

 

data TEST;                                                                                                                            
   input Name $ Age Car $;                                                                                                                 
   datalines;                                                                                                                           
Mike 40 Volvo
Nick 35 Nissan
Susan 51 BMW
Bill 60 Volvo
Tom 35 Ford
Nick 35 Nissan
Nadia 49 Nissan
;                                                                                                                                       
run;                    

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  Name Age Car;
run;

many thanks

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Use the keyword _ALL_ instead which is a shortcut for all variables.

 

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  _all_;
run;

Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html

 


@MART1 wrote:

Hello

 

I have a number of datasets with different variables; I need to return the duplicate records for each dataset (if they exist).

 

I'd like to do this by not having to specify all the variables (so I can wrap it in a macro which will go through all the datasets).

 

My example below: in the dataset TEST, Nick's row is entered twice.

 

I can use PROC SORT nouniquekeys to identify that, but in the BY  I need to specify all the variables ( Name Age Car).

 

I cannot find how to use, say, *, without having to list all the variables?

 

 

 

data TEST;                                                                                                                            
   input Name $ Age Car $;                                                                                                                 
   datalines;                                                                                                                           
Mike 40 Volvo
Nick 35 Nissan
Susan 51 BMW
Bill 60 Volvo
Tom 35 Ford
Nick 35 Nissan
Nadia 49 Nissan
;                                                                                                                                       
run;                    

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  Name Age Car;
run;

many thanks

 

 


 

View solution in original post

6 REPLIES 6
Reeza
Super User

Use the keyword _ALL_ instead which is a shortcut for all variables.

 

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  _all_;
run;

Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html

 


@MART1 wrote:

Hello

 

I have a number of datasets with different variables; I need to return the duplicate records for each dataset (if they exist).

 

I'd like to do this by not having to specify all the variables (so I can wrap it in a macro which will go through all the datasets).

 

My example below: in the dataset TEST, Nick's row is entered twice.

 

I can use PROC SORT nouniquekeys to identify that, but in the BY  I need to specify all the variables ( Name Age Car).

 

I cannot find how to use, say, *, without having to list all the variables?

 

 

 

data TEST;                                                                                                                            
   input Name $ Age Car $;                                                                                                                 
   datalines;                                                                                                                           
Mike 40 Volvo
Nick 35 Nissan
Susan 51 BMW
Bill 60 Volvo
Tom 35 Ford
Nick 35 Nissan
Nadia 49 Nissan
;                                                                                                                                       
run;                    

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  Name Age Car;
run;

many thanks

 

 


 

MART1
Quartz | Level 8

That's great thanks so much @Reeza 

 

Very interesting paper  - just one question (not strictly related to the topic I appreciate) ; the _ALL_ cannot be used in PROC SQL, for example

 

PROC SQL;
	CREATE TABLE DUPS AS 
		SELECT COUNT(DISTINCT _ALL_) FROM TEST;
QUIT;

Do you know if there's an _ALL_ equivalent for SQL? (* can be used, but not with DISTINCT, or in a GROUP BY)

 

Thanks

 

 

 

Reeza
Super User
Unfortunately there is no equivalent to SQL. In general, SQL doesn't work well with shortcut references or formats.

You could create a macro variable list from sashelp but _all_ is faster and more dynamic.
Tom
Super User Tom
Super User

The SQL equivalent of _ALL_ is *.

 

So this code will reduce HAVE from 19 observations to 11 distinct observations.

 

data have;
  set sashelp.class;
  keep age sex ;
run;

proc sql;
create table distinct as select distinct * from have ;
quit;
Reeza
Super User
 69         proc sql;
 70         create table test as select count(distinct *) from sashelp.class ;
 ERROR: * used in an illegal position.
 NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
 71         quit;

Cannot be used for counts, but could be used to de-duplicate

Tom
Super User Tom
Super User

But that is just the same invalid syntax you sent before.

If you want to count the number of distinct records use:

select count(*) as number_distinct from (select distinct * from have);

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 2331 views
  • 1 like
  • 3 in conversation