BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MART1
Quartz | Level 8

Hello

 

I have a number of datasets with different variables; I need to return the duplicate records for each dataset (if they exist).

 

I'd like to do this by not having to specify all the variables (so I can wrap it in a macro which will go through all the datasets).

 

My example below: in the dataset TEST, Nick's row is entered twice.

 

I can use PROC SORT nouniquekeys to identify that, but in the BY  I need to specify all the variables ( Name Age Car).

 

I cannot find how to use, say, *, without having to list all the variables?

 

 

 

data TEST;                                                                                                                            
   input Name $ Age Car $;                                                                                                                 
   datalines;                                                                                                                           
Mike 40 Volvo
Nick 35 Nissan
Susan 51 BMW
Bill 60 Volvo
Tom 35 Ford
Nick 35 Nissan
Nadia 49 Nissan
;                                                                                                                                       
run;                    

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  Name Age Car;
run;

many thanks

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Use the keyword _ALL_ instead which is a shortcut for all variables.

 

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  _all_;
run;

Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html

 


@MART1 wrote:

Hello

 

I have a number of datasets with different variables; I need to return the duplicate records for each dataset (if they exist).

 

I'd like to do this by not having to specify all the variables (so I can wrap it in a macro which will go through all the datasets).

 

My example below: in the dataset TEST, Nick's row is entered twice.

 

I can use PROC SORT nouniquekeys to identify that, but in the BY  I need to specify all the variables ( Name Age Car).

 

I cannot find how to use, say, *, without having to list all the variables?

 

 

 

data TEST;                                                                                                                            
   input Name $ Age Car $;                                                                                                                 
   datalines;                                                                                                                           
Mike 40 Volvo
Nick 35 Nissan
Susan 51 BMW
Bill 60 Volvo
Tom 35 Ford
Nick 35 Nissan
Nadia 49 Nissan
;                                                                                                                                       
run;                    

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  Name Age Car;
run;

many thanks

 

 


 

View solution in original post

6 REPLIES 6
Reeza
Super User

Use the keyword _ALL_ instead which is a shortcut for all variables.

 

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  _all_;
run;

Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html

 


@MART1 wrote:

Hello

 

I have a number of datasets with different variables; I need to return the duplicate records for each dataset (if they exist).

 

I'd like to do this by not having to specify all the variables (so I can wrap it in a macro which will go through all the datasets).

 

My example below: in the dataset TEST, Nick's row is entered twice.

 

I can use PROC SORT nouniquekeys to identify that, but in the BY  I need to specify all the variables ( Name Age Car).

 

I cannot find how to use, say, *, without having to list all the variables?

 

 

 

data TEST;                                                                                                                            
   input Name $ Age Car $;                                                                                                                 
   datalines;                                                                                                                           
Mike 40 Volvo
Nick 35 Nissan
Susan 51 BMW
Bill 60 Volvo
Tom 35 Ford
Nick 35 Nissan
Nadia 49 Nissan
;                                                                                                                                       
run;                    

proc sort data=WORK.TEST nouniquekeys out=duplicates;
    by  Name Age Car;
run;

many thanks

 

 


 

MART1
Quartz | Level 8

That's great thanks so much @Reeza 

 

Very interesting paper  - just one question (not strictly related to the topic I appreciate) ; the _ALL_ cannot be used in PROC SQL, for example

 

PROC SQL;
	CREATE TABLE DUPS AS 
		SELECT COUNT(DISTINCT _ALL_) FROM TEST;
QUIT;

Do you know if there's an _ALL_ equivalent for SQL? (* can be used, but not with DISTINCT, or in a GROUP BY)

 

Thanks

 

 

 

Reeza
Super User
Unfortunately there is no equivalent to SQL. In general, SQL doesn't work well with shortcut references or formats.

You could create a macro variable list from sashelp but _all_ is faster and more dynamic.
Tom
Super User Tom
Super User

The SQL equivalent of _ALL_ is *.

 

So this code will reduce HAVE from 19 observations to 11 distinct observations.

 

data have;
  set sashelp.class;
  keep age sex ;
run;

proc sql;
create table distinct as select distinct * from have ;
quit;
Reeza
Super User
 69         proc sql;
 70         create table test as select count(distinct *) from sashelp.class ;
 ERROR: * used in an illegal position.
 NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
 71         quit;

Cannot be used for counts, but could be used to de-duplicate

Tom
Super User Tom
Super User

But that is just the same invalid syntax you sent before.

If you want to count the number of distinct records use:

select count(*) as number_distinct from (select distinct * from have);
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 2388 views
  • 1 like
  • 3 in conversation