Solved: How can I identify all the variables with duplicate values

jjknknl · Posted 09-21-2017 06:32 PM

I have a dataset with many variables, and I would like to flag all the variables that have any duplicated values. I first did

proc contents data=foo out=tmp noprint; run;

so I have a table i can work with called tmp, with one row per variable. I would like to add a new column, call it "DUPLICATES", to tmp that is TRUE if the variable contains any duplicated values, and is FALSE, if it contains all unique values.

I am having a very hard time figuring out what the best way to do this is. Can someone please help? Also, there are many variables in the dataset so I would prefer a way that does not require me to know and type out the variable names individually.

Thanks!

ballardw · Posted 09-22-2017 06:37 PM

Please look at:

data have;
  input A  B $ C D $;
datalines;
1 a 1 a 
2 b 1 b 
3 c 2 c 
4 d 3 d 
5 e 4 a 
;
run;
/* replace WORK and HAVE with the name of your actual dataset*/
proc sql noprint;
   select nobs into : rowcount
   from dictionary.tables where
   libname='WORK' and memname='HAVE'
   ;
run;
%put &rowcount;

/*put the libname.dataset in this procfreq code*/
proc freq data=have nlevels  ;
ods output Nlevels=Havelevels;
run;

data want;
   set HaveLevels (drop=NMissLevels NNonMissLevels);
 
   If Nlevels = &rowcount then Duplicates='FALSE';
   else Duplicates='TRUE';

run;

I had to create a dataset have for the example. Use your dataset libname and dataset name in place of "have".

Note that in the Proc Sql part the library name and data set are stored in uppercase so the values have to be upper or use functions to make them match.

View solution in original post

Patrick · Posted 09-21-2017 06:45 PM

@jjknknl

Proc Freq and Proc Univariate are two common procedures used for data profiling.

mkeintz · Posted 09-21-2017 07:43 PM

You have a table that is a transpose of the normal data table. So a lot of the normal PROC's will not be of help unless you transpose back to obs as rows and variables as columns.

You could run a DATA step. Let's say you have a data set with a var called VARNAME, and a set of numeric values in vars X1-X20:

data want (drop=I);
  set have;
  array x {*} x1-x20;
  call sortn(of x{*});

  duplicate='FALSE';
  do I=1 to 19 while (duplicate='FALSE');
    if x{I}=x{I+1} then duplicate='TRUE';
  end;

  set have;
run;

There is a second SET statement in this data step. That's because the CALL SORTN statement changes the order of the values in X1 through X20, which of course you don't want in the output data set. So a second SET statement is executed after the value of duplicate is established, rereading the original X1-X20 values. I.e. the 2 set statements read from HAVE in parallel, not as interleaved records.

The "while (duplicate='FALSE')" clause allows the DO loop iteration to stop as soon as duplicates are found. Not need to continue once the first duplicate is encountered.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

jjknknl · Posted 09-22-2017 11:11 AM

Is this solution looking to see if values in one variable are duplicated in another variable? If so, this is not what I want. I only want to look within each variable in the dataset, to see if that variable has any duplicate values. So in this example, I want to look within each variable in foo, and check whether there are duplicates within that variable, or if all the values within that variable are unique.

Doing PROC CONTENTS on foo yields a table (which has one row per variable in the foo dataset), call it tmp, that does not have a column with this information, so I want to add such a column to it, call it "DUPLICATES". For each row in tmp (corresponding to a variable in foo) it would be TRUE if that variable listed in that row has any duplicates, and FALSE if it does not. For example, if the first variable in the dataset foo is A, then the first row of the PROC CONTENTS table (call it tmp) is for A, and if A in foo has any duplicates then the value of DUPLICATES in the first row of tmp would be "TRUE".

Does that make sense?

Thanks very much for your help!

Tarek_Elnaccash · Posted 09-21-2017 07:48 PM

Hi,

Here is only a partial answer-I will add to it if I have more time later. To get a list of duplicate values you can use the freq procedure:

PROC FREQ DATA=my_data NOPRINT;

TABLES Employee_ID / OUT=EmpFreq;

RUN;

PROC PRINT DATA=EmpFreq;

WHERE count>1;

RUN;

Or you can use a WHERE= data set option wen creating the output dataset with counts:

PROC FREQ DATA= my_data NOPRINT;

TABLES Employee_ID / OUT=EmpFreq2 (WHERE=(count>1));

run;

PROC PRINT DATA=EmpFreq2;

RUN;

an easy way to get a list of variables in a dataset that lets you avoid typing them by hand is:

PROC CONTENTS DATA=my_data SHORT;

RUN;

this will give you just the variable names. You can copy and paste them back into your editor window to avoid typing the variables

mkeintz · Posted 09-22-2017 11:04 AM

@Tarek_Elnaccash

I believe @jjknknl is NOT looking for instances of a repeated employee id, which your program does effectively, but rather the presence of duplicate values among the H values for a given row.

@jjknknl, am I correct?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

jjknknl · Posted 09-22-2017 02:39 PM

I am looking to see if each variable has repeated values/records/rows/observations or not. I'm NOT looking for duplicates across variables, only within variables. Each variable can be considered completely independently of all others. For example, a variable may contain ID numbers, and i want to see if any ID numbers are repeated in that variable (ignoring all other variables in the dataset). I don't care what the repeated ID numbers, i just ultimately want a way to output TRUE if the variable has any duplicates, and FALSE if it does not. Thanks!

ballardw · Posted 09-22-2017 11:23 AM

@jjknknl wrote:

I have a dataset with many variables, and I would like to flag all the variables that have any duplicated values. I first did

proc contents data=foo out=tmp noprint; run;

so I have a table i can work with called tmp, with one row per variable. I would like to add a new column, call it "DUPLICATES", to tmp that is TRUE if the variable contains any duplicated values, and is FALSE, if it contains all unique values.

I am having a very hard time figuring out what the best way to do this is. Can someone please help? Also, there are many variables in the dataset so I would prefer a way that does not require me to know and type out the variable names individually.

Thanks!

How about providing some example data, maybe with 5 variables or so with different combinations of numbers of duplicates and without and then show what the final result would look like for that example data.

The reason I suggest the example desired output is it really depends on what you think you want and I'm pretty sure I do not know what that is. If you want something that shows var1=var2 in some way we need to know what it is. We also need to know if you need to distinguish between var1=var2 and var1=var3 separately or if a single variable can hold all of the values if any that equal var1. One approach could require as many as 10 additional variables to track the comparisons of 5 variables. It appears as a minimum that you may want one flag variable per variable but I'm not exactly sure.

Also are all of your variables numeric, all character or is there a mix? And if you have mixed numeric and character is it a "duplicate" if one variable has numeric 1 and a character variable has character "1"?

jjknknl · Posted 09-22-2017 11:55 AM

Thanks very much for your help. Please see my response to mkeintz. Also, to give a concrete example, please consider the following dataset "foo"

A	B	C	D
1	a	1	a
2	b	1	b
3	c	2	c
4	d	3	d
5	e	4	a

When i do PROC CONTENTS on this, i get the following table, call it "tmp"

Name	Label
A	First variable
B	Second variable
C	Third variable
D	Fourth variable

It actually has many more columns than just Name and Label but i've omitted them for the sake of clarity. What I ultimately want is the following

Name	Label	DUPLICATES
A	First variable	FALSE
B	Second variable	FALSE
C	Third variable	TRUE
D	Fourth variable	TRUE

Does that make it clear? Thanks again!

jjknknl · Posted 09-22-2017 02:43 PM

A,B,C, and D are the variable names in this dataset, and there are 5 values/rows/observations/records per variable. The variable C in this dataset has duplicates (the first and second values/rows/observations/records are both 1), and the variable D has duplicates (the first and fifth values/rows/observations/records are both "a"). That is why DUPLICATES=TRUE for them. The variable A has no duplicates, i.e., all values are unique, and the variable B has no duplicates either. That is why DUPLICATES=FALSE for them.

ballardw · Posted 09-22-2017 06:37 PM

Please look at:

data have;
  input A  B $ C D $;
datalines;
1 a 1 a 
2 b 1 b 
3 c 2 c 
4 d 3 d 
5 e 4 a 
;
run;
/* replace WORK and HAVE with the name of your actual dataset*/
proc sql noprint;
   select nobs into : rowcount
   from dictionary.tables where
   libname='WORK' and memname='HAVE'
   ;
run;
%put &rowcount;

/*put the libname.dataset in this procfreq code*/
proc freq data=have nlevels  ;
ods output Nlevels=Havelevels;
run;

data want;
   set HaveLevels (drop=NMissLevels NNonMissLevels);
 
   If Nlevels = &rowcount then Duplicates='FALSE';
   else Duplicates='TRUE';

run;

I had to create a dataset have for the example. Use your dataset libname and dataset name in place of "have".

Note that in the Proc Sql part the library name and data set are stored in uppercase so the values have to be upper or use functions to make them match.

How can I identify all the variables with duplicate values

Re: How can I identify all the variables with duplicate values

Re: How can I identify all the variables with duplicate values

Re: How can I identify all the variables with duplicate values

Re: How can I identify all the variables with duplicate values

Re: How can I identify all the variables with duplicate values

Re: How can I identify all the variables with duplicate values

Re: How can I identify all the variables with duplicate values

Re: How can I identify all the variables with duplicate values

Re: How can I identify all the variables with duplicate values

Re: How can I identify all the variables with duplicate values

Re: How can I identify all the variables with duplicate values

Registration is open

SAS Training: Just a Click Away