Hello
I have a number of datasets with different variables; I need to return the duplicate records for each dataset (if they exist).
I'd like to do this by not having to specify all the variables (so I can wrap it in a macro which will go through all the datasets).
My example below: in the dataset TEST, Nick's row is entered twice.
I can use PROC SORT nouniquekeys to identify that, but in the BY I need to specify all the variables ( Name Age Car).
I cannot find how to use, say, *, without having to list all the variables?
data TEST;
input Name $ Age Car $;
datalines;
Mike 40 Volvo
Nick 35 Nissan
Susan 51 BMW
Bill 60 Volvo
Tom 35 Ford
Nick 35 Nissan
Nadia 49 Nissan
;
run;
proc sort data=WORK.TEST nouniquekeys out=duplicates;
by Name Age Car;
run;
many thanks
Use the keyword _ALL_ instead which is a shortcut for all variables.
proc sort data=WORK.TEST nouniquekeys out=duplicates;
by _all_;
run;
Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html
@MART1 wrote:
Hello
I have a number of datasets with different variables; I need to return the duplicate records for each dataset (if they exist).
I'd like to do this by not having to specify all the variables (so I can wrap it in a macro which will go through all the datasets).
My example below: in the dataset TEST, Nick's row is entered twice.
I can use PROC SORT nouniquekeys to identify that, but in the BY I need to specify all the variables ( Name Age Car).
I cannot find how to use, say, *, without having to list all the variables?
data TEST; input Name $ Age Car $; datalines; Mike 40 Volvo Nick 35 Nissan Susan 51 BMW Bill 60 Volvo Tom 35 Ford Nick 35 Nissan Nadia 49 Nissan ; run; proc sort data=WORK.TEST nouniquekeys out=duplicates; by Name Age Car; run;many thanks
Use the keyword _ALL_ instead which is a shortcut for all variables.
proc sort data=WORK.TEST nouniquekeys out=duplicates;
by _all_;
run;
Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html
@MART1 wrote:
Hello
I have a number of datasets with different variables; I need to return the duplicate records for each dataset (if they exist).
I'd like to do this by not having to specify all the variables (so I can wrap it in a macro which will go through all the datasets).
My example below: in the dataset TEST, Nick's row is entered twice.
I can use PROC SORT nouniquekeys to identify that, but in the BY I need to specify all the variables ( Name Age Car).
I cannot find how to use, say, *, without having to list all the variables?
data TEST; input Name $ Age Car $; datalines; Mike 40 Volvo Nick 35 Nissan Susan 51 BMW Bill 60 Volvo Tom 35 Ford Nick 35 Nissan Nadia 49 Nissan ; run; proc sort data=WORK.TEST nouniquekeys out=duplicates; by Name Age Car; run;many thanks
That's great thanks so much @Reeza
Very interesting paper - just one question (not strictly related to the topic I appreciate) ; the _ALL_ cannot be used in PROC SQL, for example
PROC SQL; CREATE TABLE DUPS AS SELECT COUNT(DISTINCT _ALL_) FROM TEST; QUIT;
Do you know if there's an _ALL_ equivalent for SQL? (* can be used, but not with DISTINCT, or in a GROUP BY)
Thanks
The SQL equivalent of _ALL_ is *.
So this code will reduce HAVE from 19 observations to 11 distinct observations.
data have;
set sashelp.class;
keep age sex ;
run;
proc sql;
create table distinct as select distinct * from have ;
quit;
69 proc sql; 70 create table test as select count(distinct *) from sashelp.class ; ERROR: * used in an illegal position. NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements. 71 quit;
Cannot be used for counts, but could be used to de-duplicate
But that is just the same invalid syntax you sent before.
If you want to count the number of distinct records use:
select count(*) as number_distinct from (select distinct * from have);
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.