I have a dataset with many variables, and I would like to flag all the variables that have any duplicated values. I first did
proc contents data=foo out=tmp noprint; run;
so I have a table i can work with called tmp, with one row per variable. I would like to add a new column, call it "DUPLICATES", to tmp that is TRUE if the variable contains any duplicated values, and is FALSE, if it contains all unique values.
I am having a very hard time figuring out what the best way to do this is. Can someone please help? Also, there are many variables in the dataset so I would prefer a way that does not require me to know and type out the variable names individually.
Thanks!
Please look at:
data have; input A B $ C D $; datalines; 1 a 1 a 2 b 1 b 3 c 2 c 4 d 3 d 5 e 4 a ; run; /* replace WORK and HAVE with the name of your actual dataset*/ proc sql noprint; select nobs into : rowcount from dictionary.tables where libname='WORK' and memname='HAVE' ; run; %put &rowcount; /*put the libname.dataset in this procfreq code*/ proc freq data=have nlevels ; ods output Nlevels=Havelevels; run; data want; set HaveLevels (drop=NMissLevels NNonMissLevels); If Nlevels = &rowcount then Duplicates='FALSE'; else Duplicates='TRUE'; run;
I had to create a dataset have for the example. Use your dataset libname and dataset name in place of "have".
Note that in the Proc Sql part the library name and data set are stored in uppercase so the values have to be upper or use functions to make them match.
Proc Freq and Proc Univariate are two common procedures used for data profiling.
You have a table that is a transpose of the normal data table. So a lot of the normal PROC's will not be of help unless you transpose back to obs as rows and variables as columns.
You could run a DATA step. Let's say you have a data set with a var called VARNAME, and a set of numeric values in vars X1-X20:
data want (drop=I);
set have;
array x {*} x1-x20;
call sortn(of x{*});
duplicate='FALSE';
do I=1 to 19 while (duplicate='FALSE');
if x{I}=x{I+1} then duplicate='TRUE';
end;
set have;
run;
There is a second SET statement in this data step. That's because the CALL SORTN statement changes the order of the values in X1 through X20, which of course you don't want in the output data set. So a second SET statement is executed after the value of duplicate is established, rereading the original X1-X20 values. I.e. the 2 set statements read from HAVE in parallel, not as interleaved records.
The "while (duplicate='FALSE')" clause allows the DO loop iteration to stop as soon as duplicates are found. Not need to continue once the first duplicate is encountered.
Is this solution looking to see if values in one variable are duplicated in another variable? If so, this is not what I want. I only want to look within each variable in the dataset, to see if that variable has any duplicate values. So in this example, I want to look within each variable in foo, and check whether there are duplicates within that variable, or if all the values within that variable are unique.
Doing PROC CONTENTS on foo yields a table (which has one row per variable in the foo dataset), call it tmp, that does not have a column with this information, so I want to add such a column to it, call it "DUPLICATES". For each row in tmp (corresponding to a variable in foo) it would be TRUE if that variable listed in that row has any duplicates, and FALSE if it does not. For example, if the first variable in the dataset foo is A, then the first row of the PROC CONTENTS table (call it tmp) is for A, and if A in foo has any duplicates then the value of DUPLICATES in the first row of tmp would be "TRUE".
Does that make sense?
Thanks very much for your help!
Hi,
Here is only a partial answer-I will add to it if I have more time later. To get a list of duplicate values you can use the freq procedure:
PROC FREQ DATA=my_data NOPRINT;
TABLES Employee_ID / OUT=EmpFreq;
RUN;
PROC PRINT DATA=EmpFreq;
WHERE count>1;
RUN;
Or you can use a WHERE= data set option wen creating the output dataset with counts:
PROC FREQ DATA= my_data NOPRINT;
TABLES Employee_ID / OUT=EmpFreq2 (WHERE=(count>1));
run;
PROC PRINT DATA=EmpFreq2;
RUN;
an easy way to get a list of variables in a dataset that lets you avoid typing them by hand is:
PROC CONTENTS DATA=my_data SHORT;
RUN;
this will give you just the variable names. You can copy and paste them back into your editor window to avoid typing the variables
I believe @jjknknl is NOT looking for instances of a repeated employee id, which your program does effectively, but rather the presence of duplicate values among the H values for a given row.
@jjknknl, am I correct?
I am looking to see if each variable has repeated values/records/rows/observations or not. I'm NOT looking for duplicates across variables, only within variables. Each variable can be considered completely independently of all others. For example, a variable may contain ID numbers, and i want to see if any ID numbers are repeated in that variable (ignoring all other variables in the dataset). I don't care what the repeated ID numbers, i just ultimately want a way to output TRUE if the variable has any duplicates, and FALSE if it does not. Thanks!
@jjknknl wrote:
I have a dataset with many variables, and I would like to flag all the variables that have any duplicated values. I first did
proc contents data=foo out=tmp noprint; run;
so I have a table i can work with called tmp, with one row per variable. I would like to add a new column, call it "DUPLICATES", to tmp that is TRUE if the variable contains any duplicated values, and is FALSE, if it contains all unique values.
I am having a very hard time figuring out what the best way to do this is. Can someone please help? Also, there are many variables in the dataset so I would prefer a way that does not require me to know and type out the variable names individually.
Thanks!
How about providing some example data, maybe with 5 variables or so with different combinations of numbers of duplicates and without and then show what the final result would look like for that example data.
The reason I suggest the example desired output is it really depends on what you think you want and I'm pretty sure I do not know what that is. If you want something that shows var1=var2 in some way we need to know what it is. We also need to know if you need to distinguish between var1=var2 and var1=var3 separately or if a single variable can hold all of the values if any that equal var1. One approach could require as many as 10 additional variables to track the comparisons of 5 variables. It appears as a minimum that you may want one flag variable per variable but I'm not exactly sure.
Also are all of your variables numeric, all character or is there a mix? And if you have mixed numeric and character is it a "duplicate" if one variable has numeric 1 and a character variable has character "1"?
Thanks very much for your help. Please see my response to mkeintz. Also, to give a concrete example, please consider the following dataset "foo"
A | B | C | D |
1 | a | 1 | a |
2 | b | 1 | b |
3 | c | 2 | c |
4 | d | 3 | d |
5 | e | 4 | a |
When i do PROC CONTENTS on this, i get the following table, call it "tmp"
Name | Label |
A | First variable |
B | Second variable |
C | Third variable |
D | Fourth variable |
It actually has many more columns than just Name and Label but i've omitted them for the sake of clarity. What I ultimately want is the following
Name | Label | DUPLICATES |
A | First variable | FALSE |
B | Second variable | FALSE |
C | Third variable | TRUE |
D | Fourth variable | TRUE |
Does that make it clear? Thanks again!
A,B,C, and D are the variable names in this dataset, and there are 5 values/rows/observations/records per variable. The variable C in this dataset has duplicates (the first and second values/rows/observations/records are both 1), and the variable D has duplicates (the first and fifth values/rows/observations/records are both "a"). That is why DUPLICATES=TRUE for them. The variable A has no duplicates, i.e., all values are unique, and the variable B has no duplicates either. That is why DUPLICATES=FALSE for them.
Please look at:
data have; input A B $ C D $; datalines; 1 a 1 a 2 b 1 b 3 c 2 c 4 d 3 d 5 e 4 a ; run; /* replace WORK and HAVE with the name of your actual dataset*/ proc sql noprint; select nobs into : rowcount from dictionary.tables where libname='WORK' and memname='HAVE' ; run; %put &rowcount; /*put the libname.dataset in this procfreq code*/ proc freq data=have nlevels ; ods output Nlevels=Havelevels; run; data want; set HaveLevels (drop=NMissLevels NNonMissLevels); If Nlevels = &rowcount then Duplicates='FALSE'; else Duplicates='TRUE'; run;
I had to create a dataset have for the example. Use your dataset libname and dataset name in place of "have".
Note that in the Proc Sql part the library name and data set are stored in uppercase so the values have to be upper or use functions to make them match.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.