BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MART1
Quartz | Level 8

Hello

 

I have a number of datasets with different variables; I need to return the duplicate records for each dataset (if they exist).

 

I'd like to do this by not having to specify all the variables (so I can wrap it in a macro which will go through all the datasets).

 

My example below: in the dataset TEST, Nick's row is entered twice.

 

I can use PROC SORT nouniquekeys to identify that, but in the BY  I need to specify all the variables ( Name Age Car).

 

I cannot find how to use, say, *, without having to list all the variables?

 

 

 

data TEST;                                                                                                                            
   input Name $ Age Car $;                                                                                                                 
   datalines;                                                                                                                           
Mike 40 Volvo
Nick 35 Nissan
Susan 51 BMW
Bill 60 Volvo
Tom 35 Ford
Nick 35 Nissan
Nadia 49 Nissan
;                                                                                                                                       
run;                    

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  Name Age Car;
run;

many thanks

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Use the keyword _ALL_ instead which is a shortcut for all variables.

 

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  _all_;
run;

Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html

 


@MART1 wrote:

Hello

 

I have a number of datasets with different variables; I need to return the duplicate records for each dataset (if they exist).

 

I'd like to do this by not having to specify all the variables (so I can wrap it in a macro which will go through all the datasets).

 

My example below: in the dataset TEST, Nick's row is entered twice.

 

I can use PROC SORT nouniquekeys to identify that, but in the BY  I need to specify all the variables ( Name Age Car).

 

I cannot find how to use, say, *, without having to list all the variables?

 

 

 

data TEST;                                                                                                                            
   input Name $ Age Car $;                                                                                                                 
   datalines;                                                                                                                           
Mike 40 Volvo
Nick 35 Nissan
Susan 51 BMW
Bill 60 Volvo
Tom 35 Ford
Nick 35 Nissan
Nadia 49 Nissan
;                                                                                                                                       
run;                    

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  Name Age Car;
run;

many thanks

 

 


 

View solution in original post

6 REPLIES 6
Reeza
Super User

Use the keyword _ALL_ instead which is a shortcut for all variables.

 

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  _all_;
run;

Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html

 


@MART1 wrote:

Hello

 

I have a number of datasets with different variables; I need to return the duplicate records for each dataset (if they exist).

 

I'd like to do this by not having to specify all the variables (so I can wrap it in a macro which will go through all the datasets).

 

My example below: in the dataset TEST, Nick's row is entered twice.

 

I can use PROC SORT nouniquekeys to identify that, but in the BY  I need to specify all the variables ( Name Age Car).

 

I cannot find how to use, say, *, without having to list all the variables?

 

 

 

data TEST;                                                                                                                            
   input Name $ Age Car $;                                                                                                                 
   datalines;                                                                                                                           
Mike 40 Volvo
Nick 35 Nissan
Susan 51 BMW
Bill 60 Volvo
Tom 35 Ford
Nick 35 Nissan
Nadia 49 Nissan
;                                                                                                                                       
run;                    

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  Name Age Car;
run;

many thanks

 

 


 

MART1
Quartz | Level 8

That's great thanks so much @Reeza 

 

Very interesting paper  - just one question (not strictly related to the topic I appreciate) ; the _ALL_ cannot be used in PROC SQL, for example

 

PROC SQL;
	CREATE TABLE DUPS AS 
		SELECT COUNT(DISTINCT _ALL_) FROM TEST;
QUIT;

Do you know if there's an _ALL_ equivalent for SQL? (* can be used, but not with DISTINCT, or in a GROUP BY)

 

Thanks

 

 

 

Reeza
Super User
Unfortunately there is no equivalent to SQL. In general, SQL doesn't work well with shortcut references or formats.

You could create a macro variable list from sashelp but _all_ is faster and more dynamic.
Tom
Super User Tom
Super User

The SQL equivalent of _ALL_ is *.

 

So this code will reduce HAVE from 19 observations to 11 distinct observations.

 

data have;
  set sashelp.class;
  keep age sex ;
run;

proc sql;
create table distinct as select distinct * from have ;
quit;
Reeza
Super User
 69         proc sql;
 70         create table test as select count(distinct *) from sashelp.class ;
 ERROR: * used in an illegal position.
 NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
 71         quit;

Cannot be used for counts, but could be used to de-duplicate

Tom
Super User Tom
Super User

But that is just the same invalid syntax you sent before.

If you want to count the number of distinct records use:

select count(*) as number_distinct from (select distinct * from have);

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1419 views
  • 1 like
  • 3 in conversation