Desktop productivity for business analysts and programmers

Compare 2 datasets and eliminate matching variables

Reply
Occasional Contributor
Posts: 11

Compare 2 datasets and eliminate matching variables

I need to compare Table1 and Table2 by ID. If Data1 to Data5 in Table2 is equal to Data in Table1, set the field to blank.

 

Example:

Table1 
IDData
ABCR5
ABCJ3
ABCE1
ABCC7
DEFD0
DEFZa
DEFZ1
DEFX4
GHIR2
GHIA7
GHIE5
GHIF3
GHIH4

 

Table2        
IDData1Data2Data3Data4Data5Var1Va2Var3
ABCR5MxC7C8J3   
ABCR5MxC7C8J3   
ABCR5MxC7C8J3   
ABCR5MxC7C8J3   
ABCR5MxC7C8J3   
ABCR5MxC7C8J3   
ABCR5MxC7C8J3   
DEFZ1E3IwZaD1   
DEFZ1E3IwZaD1   
DEFZ1E3IwZaD1   
DEFZ1E3IwZaD1   
DEFZ1E3IwZaD1   
DEFZ1E3IwZaD1   
DEFZ1E3IwZaD1   
DEFZ1E3IwZaD1   
DEFZ1E3IwZaD1   
GHIR2E5Q4I9Jk   
GHIR2E5Q4I9Jk   
GHIR2E5Q4I9Jk   
GHIR2E5Q4I9Jk   
GHIR2E5Q4I9Jk   
GHIR2E5Q4I9Jk   
GHIR2E5Q4I9Jk   

 

Expected Output       
IDData1Data2Data3Data4Data5Var1Va2Var3
ABC Mx C8    
ABC Mx C8    
ABC Mx C8    
ABC Mx C8    
ABC Mx C8    
ABC Mx C8    
ABC Mx C8    
DEF E3Iw D1   
DEF E3Iw D1   
DEF E3Iw D1   
DEF E3Iw D1   
DEF E3Iw D1   
DEF E3Iw D1   
DEF E3Iw D1   
DEF E3Iw D1   
DEF E3Iw D1   
GHI  Q4I9Jk   
GHI  Q4I9Jk   
GHI  Q4I9Jk   
GHI  Q4I9Jk   
GHI  Q4I9Jk   
GHI  Q4I9Jk   
GHI  Q4I9Jk   

 

Thanks so much for your help!

PROC Star
Posts: 403

Re: Compare 2 datasets and eliminate matching variables

Posted in reply to jsphnwllms

From the first data set, create a new one with one row per ID and a new variable (ex: conc) containing the data value separated by ,

ex:

ID     conc

ABC R5,J3,E1,C7

DEF D0, Za, Z1, X4

...

 

Then merge that data set with the second one on ID.

For each data1-data5, check if the concatenated contains the value, and if so blank it.

if conc contains data1 then data1 = '';

if conc contains data2 then data2 = '';

...

 

Just as an example of how to solve it, it can of course be made nicer with macros and stuff, but try the simple way first Smiley Happy

 

//Fredrik

Occasional Contributor
Posts: 11

Re: Compare 2 datasets and eliminate matching variables

Thanks for your response!

Super User
Posts: 10,852

Re: Compare 2 datasets and eliminate matching variables

Posted in reply to jsphnwllms

data Table1	 ;
input (ID	Data) ($);
cards;
ABC	R5
ABC	J3
ABC	E1
ABC	C7
DEF	D0
DEF	Za
DEF	Z1
DEF	X4
GHI	R2
GHI	A7
GHI	E5
GHI	F3
GHI	H4
;
run;
 
data Table2; 
input (ID	Data1	Data2	Data3	Data4	Data5) ($);
cards;
ABC	R5	Mx	C7	C8	J3	 	 	 
ABC	R5	Mx	C7	C8	J3	 	 	 
ABC	R5	Mx	C7	C8	J3	 	 	 
ABC	R5	Mx	C7	C8	J3	 	 	 
ABC	R5	Mx	C7	C8	J3	 	 	 
ABC	R5	Mx	C7	C8	J3	 	 	 
ABC	R5	Mx	C7	C8	J3	 	 	 
DEF	Z1	E3	Iw	Za	D1	 	 	 
DEF	Z1	E3	Iw	Za	D1	 	 	 
DEF	Z1	E3	Iw	Za	D1	 	 	 
DEF	Z1	E3	Iw	Za	D1	 	 	 
DEF	Z1	E3	Iw	Za	D1	 	 	 
DEF	Z1	E3	Iw	Za	D1	 	 	 
DEF	Z1	E3	Iw	Za	D1	 	 	 
DEF	Z1	E3	Iw	Za	D1	 	 	 
DEF	Z1	E3	Iw	Za	D1	 	 	 
GHI	R2	E5	Q4	I9	Jk	 	 	 
GHI	R2	E5	Q4	I9	Jk	 	 	 
GHI	R2	E5	Q4	I9	Jk	 	 	 
GHI	R2	E5	Q4	I9	Jk	 	 	 
GHI	R2	E5	Q4	I9	Jk	 	 	 
GHI	R2	E5	Q4	I9	Jk	 	 	 
GHI	R2	E5	Q4	I9	Jk
;
run;
data want;
 if _n_=1 then do;
  if 0 then set table1;
  declare hash h(dataset:'table1');
  h.definekey('id','data');
  h.definedone();
 end;
set table2;
array x{*} $ data1-data5;
do i=1 to dim(x);
 data=x{i};if h.check()=0 then call missing(x{i});
end;
drop i data;
run;

Occasional Contributor
Posts: 11

Re: Compare 2 datasets and eliminate matching variables

Thanks for your response. The code from Ksharp did not quite produce the expected output.

 

Here's what I've done:

 

libname test 'C:\Data';

PROC SORT data=test.Table1;
by ID;
RUN;

 

PROC TRANSPOSE data=test.Table1 out=test.Table3;
var Data;
by ID;
RUN;

 

PROC SQL;
CREATE TABLE test.combine AS
SELECT
t1.*
,t2.col1
,t2.col2
,t2.col3
,t2.col4
,t2.col5
FROM
test.table2 t1
LEFT JOIN
test.table3 t2 on
t1.id = t2.id;
QUIT;

 

PROC SQL;
CREATE TABLE test.final AS
SELECT
ID
,Case When ((Data1 = col1) or
(Data1 = col2) or
(Data1 = col3) or
(Data1 = col4) or
(Data1 = col5)) then '' else Data1 end as Data1
,Case When ((Data2 = col1) or
(Data2 = col2) or
(Data2 = col3) or
(Data2 = col4) or
(Data2 = col5)) then '' else Data2 end as Data2
,Case When ((Data3 = col1) or
(Data3 = col2) or
(Data3 = col3) or
(Data3 = col4) or
(Data3 = col5)) then '' else Data3 end as Data3
,Case When ((Data4 = col1) or
(Data4 = col2) or
(Data4 = col3) or
(Data4 = col4) or
(Data4 = col5)) then '' else Data4 end as Data4
,Case When ((Data5 = col1) or
(Data5 = col2) or
(Data5 = col3) or
(Data5 = col4) or
(Data5 = col5)) then '' else Data5 end as Data5
FROM test.combine;
QUIT;

 

Here's the problem:

I'm dealing with thousands, even millions, of records. Also, the data variable could go from Data1 to Data25, which means this would be a long program. Is there a way to shorten the code in the last PROC SQL step? Thanks!

Ask a Question
Discussion stats
  • 4 replies
  • 241 views
  • 0 likes
  • 3 in conversation