Hey everyone,
So I agregated a table using proc sql's group by but I see there are a few lines that have duplicate keys. However I have 300 hundreds columns and I can't seem to find the difference manually. Is there a way to see which column are differents between these duplicates?
My data have numbers and strings.
Thanks
Ahh ok. Yeah I didn't encounter that with the dummy data because all of the variable names being reviewed were the same length.
So what you need to do is set the length of the Diff_Variable to at minimum the value of the longest variable.
I cleaned up the logic a little bit and created a new macro variable that contains the length for the longest variable name to use in the PROC SQL query below it.
%MACRO VARIABLEDIFFS(LIBNAME,MEMNAME);
PROC SQL noprint;
select nvar INTO :TOTALVAR from dictionary.tables where libname="&LIBNAME." and memname="&MEMNAME.";
select name INTO :Variable1- from dictionary.COLUMNS where libname="&LIBNAME." and memname="&MEMNAME.";
select MAX(LENGTH(name)) INTO :MaxVarLength from dictionary.COLUMNS where libname="&LIBNAME." and memname="&MEMNAME.";
QUIT;
%DO n=2 %TO &TOTALVAR;
PROC SQL;
CREATE TABLE WORK.WANT_&n. AS
SELECT DISTINCT a.&Variable1., "&&Variable&n." AS Diff_Variable LENGTH=&MaxVarLength.
FROM WORK.HAVE AS a
INNER JOIN WORK.HAVE AS b ON a.&Variable1.=b.&Variable1.
WHERE a.&&Variable&n. NE b.&&Variable&n.;
QUIT;
PROC APPEND BASE=WORK.WANT_ALL
DATA=WORK.WANT_&N.;
RUN;
PROC DELETE DATA=WORK.WANT_&N.;
%END;
%MEND VARIABLEDIFFS;
%VARIABLEDIFFS(WORK,HAVE);
I tried this, but it only seems to be comparing variables between them. I am looking for something that can help me compare the same columns between two rows so I can see which columns are different.
I threw together a dummy table and a macro statement to identify observations that have a duplicate id but a varying value in a different variable. The dummy table only has 6 variables total, but is built to accommodate tables with any number of variables.
Here is the logic:
DATA WORK.HAVE;
FORMAT ID $3. Var2 $16. Var3 $16. Var4 8. Var5 $16. Var6 $16.;
INFORMAT ID $3. Var2 $16. Var3 $16. Var4 8. Var5 $16. Var6 $16.;
INPUT ID VAR2 VAR3 VAR4 VAR5 VAR6;
INFILE DATALINES DLM='|' DSD;
DATALINES;
123|Red|Blue|100|Yellow|Black
123|Red|Black|100|Yellow|Black
234|Orange|Yellow|80|White|Purple
234|Orange|Yellow|65|White|Purple
345|Yellow|White|90|Green|Brown
456|Green|Brown|25|Red|Orange
;
%MACRO VARIABLEDIFFS(LIBNAME,MEMNAME);
PROC SQL noprint;
select nvar INTO :TOTALVAR
from dictionary.tables
where libname="&LIBNAME."
and memname="&MEMNAME.";
QUIT;
%PUT &=TOTALVAR.;
proc sql noprint;
select name INTO :Variable1-
from dictionary.COLUMNS
where libname="&LIBNAME."
and memname="&MEMNAME."
ORDER BY VARNUM;
QUIT;
%DO n=2 %TO &TOTALVAR;
PROC SQL;
CREATE TABLE WORK.WANT_&n. AS
SELECT DISTINCT a.&Variable1., "&&Variable&n." AS Diff_Variable
FROM WORK.HAVE AS a
INNER JOIN WORK.HAVE AS b ON a.&Variable1.=b.&Variable1.
WHERE a.&&Variable&n. NE b.&&Variable&n.;
QUIT;
PROC APPEND DATA=WORK.WANT_&N.
BASE=WORK.WANT_ALL;
RUN;
PROC DELETE DATA=WORK.WANT_&N.;
%END;
%MEND VARIABLEDIFFS;
%VARIABLEDIFFS(WORK,HAVE);
The result in the end, on the table WORK.WANT_ALL is:
ID Diff_Variable
123 Var3 234 Var4
Showing that the observations with ID '123' vary on the variable 'Var3' and ID '234' varies on the variable 'Var4'.
Looking at the dummy data that was loaded for those ids:
For ID '123':
123|Red|Blue|100|Yellow|Black
123|Red|Black|100|Yellow|Black
We can confirm that Var3 is different between the two observations. The first observation has 'Blue' for its value of Var3 and the next observation has the value of 'Black'.
When calling the macro statement, you include the libname and the table name. In my example it was 'WORK' and 'HAVE'.
I hope this helps
Thanks for the answer ! I'm having some trouble executing your code. I did create a table named have into my work library with only duplicate data but I'm getting this error :
WARNING: Variable Diff_Variable has different lengths on BASE and DATA files (BASE 10 DATA 16). ERROR: No appending done because of anomalies listed above. Use FORCE option to append these files.
How do I reformat Diff_Variable in this case?
Ahh ok. Yeah I didn't encounter that with the dummy data because all of the variable names being reviewed were the same length.
So what you need to do is set the length of the Diff_Variable to at minimum the value of the longest variable.
I cleaned up the logic a little bit and created a new macro variable that contains the length for the longest variable name to use in the PROC SQL query below it.
%MACRO VARIABLEDIFFS(LIBNAME,MEMNAME);
PROC SQL noprint;
select nvar INTO :TOTALVAR from dictionary.tables where libname="&LIBNAME." and memname="&MEMNAME.";
select name INTO :Variable1- from dictionary.COLUMNS where libname="&LIBNAME." and memname="&MEMNAME.";
select MAX(LENGTH(name)) INTO :MaxVarLength from dictionary.COLUMNS where libname="&LIBNAME." and memname="&MEMNAME.";
QUIT;
%DO n=2 %TO &TOTALVAR;
PROC SQL;
CREATE TABLE WORK.WANT_&n. AS
SELECT DISTINCT a.&Variable1., "&&Variable&n." AS Diff_Variable LENGTH=&MaxVarLength.
FROM WORK.HAVE AS a
INNER JOIN WORK.HAVE AS b ON a.&Variable1.=b.&Variable1.
WHERE a.&&Variable&n. NE b.&&Variable&n.;
QUIT;
PROC APPEND BASE=WORK.WANT_ALL
DATA=WORK.WANT_&N.;
RUN;
PROC DELETE DATA=WORK.WANT_&N.;
%END;
%MEND VARIABLEDIFFS;
%VARIABLEDIFFS(WORK,HAVE);
Works just fine ! Thank you very much !
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.