Solved: Re: Summary report of duplicate values from multiple tables

jpprovost · Posted 10-29-2019 01:35 PM

Hi there,

I have 5 datasets (set1 to set5) containing the same variable (eg : var1). Those 5 datasets are 5 different sources.

By example,

set1 contains 5506 var1

set2 contains 6371 var1

set3 contains 56 var1

set4 contains 1237 var1

set5 contains 6371 var1

Of course, there's a lot of duplicates of var1 between those datasets.

(By the way, those totals were made with a PROC SQL and select distinct, so in each set there's no duplicate value.)

Is there a way to produce a summary report that could ventilate those duplicates between the 5 datasets?

Let's say, for example, that in set1 there's 2333 var1 which are also in set2.

I would like to have a table (dataset) like a report which will display as follows:

set1 set2 set3 set4 set5

set1 5506 2333 ...........................

set2 2333 6371 ............................

set3 .............................................

set4 ...............................................

set5 ...................................................

Is this something possible?

Ksharp · Posted 10-30-2019 09:27 AM

EDITED.

data T1 T2 T3 T4 T5;
  do I=1 to 1000;
    if ranuni(0)>.6 then output T1;   
    if ranuni(0)>.6 then output T2;   
    if ranuni(0)>.6 then output T3;   
    if ranuni(0)>.6 then output T4;   
    if ranuni(0)>.6 then output T5;
  end;
run;

%macro intersect(list= T1 T2 T3 T4 T5  );
%let n=%sysfunc(countw(&list , %str( ) ));

proc sql;
create table temp as

%do i=1 %to &n ;
%let a=%scan(&list,&i,%str( ));
%do j=1 %to &n ;
%let b=%scan(&list,&j,%str( ));

select "&a" as dsn1 length=40,"&b" as dsn2 length=40,count(*) as n from(
select * from &a 
intersect
select * from &b 
)

%if &i ne &n or &j ne &n %then %do; union %end;

%end;
%end;

;quit;

%mend;

options mprint mlogic symbolgen;
%intersect()

proc transpose data=temp out=want(drop=_:);
by dsn1;
id dsn2;
var n;
run;

View solution in original post

ChrisNZ · Posted 10-29-2019 04:56 PM

How do you show values that are present in 3 tables?

High-Performance SAS Coding - Third Edition

jpprovost · Posted 10-29-2019 05:19 PM

@ChrisNZ wrote:

How do you show values that are present in 3 tables?

Hi @ChrisNZ
I was thinking of something like this:

data test ABC;
set set1 (IN=A)
set2 (IN=B)
set3 (IN=C);
IF A AND B AND C THEN OUTPUT ABC;
run;

But I am not sure...

ChrisNZ · Posted 10-29-2019 08:25 PM

> I was thinking of something like this:

This does not answer the question.

You have a multidimensional crossing. How do you want to show it?

@mkeintz 's solution give you a summarised overview (WANT2 below).

You could also want a detailed overview (WANT1 below).

Or anything in between.

data T1 T2 T3 T4 T5;
  do I=1 to 1000;
    if ranuni(0)>.6 then output T1;   
    if ranuni(0)>.6 then output T2;   
    if ranuni(0)>.6 then output T3;   
    if ranuni(0)>.6 then output T4;   
    if ranuni(0)>.6 then output T5;
  end;
run;

data WANT1;
  merge T1(in=A) T2(in=B) T3(in=C) T4(in=D) T5(in=E);
  by I;
  src=catx('-',ifc(A,'T1',''),ifc(B,'T2',''),ifc(C,'T3',''),ifc(D,'T4',''),ifc(E,'T5',''));
run;

proc tabulate data=WANT1;
  class SRC;
  table src=' '*n=' '; 
run;

data WANT2 (keep=ROW COL );
  length ROW COL $4;
  merge T1(in=A) T2(in=B) T3(in=C) T4(in=D) T5(in=E);
  by I;
  array _in {5} in1-in5;
  do C=1 to 5;
    COL=cats('SET',C);
    if _in{C}=1 then do R=C to 5;
      ROW=cats('SET',R);
      if _in{r}=1 then output;
    end;
  end;
run;

proc tabulate data=WANT2  noseps;
  class ROW  COL;
  table ROW=' ' , COL=' '*n=' ';
run;

	SET1	SET2	SET3	SET4	SET5
SET1	405	.	.	.	.
SET2	163	389	.	.	.
SET3	162	151	397	.	.
SET4	151	156	161	396	.
SET5	158	160	167	178	396

T1	T1-T2	T1-T2-T3	T1-T2-T3-T4	T1-T2-T3-T4-T5	T1-T2-T3-T5	T1-T2-T4	T1-T2-T4-T5	T1-T2-T5	T1-T3	T1-T3-T4	T1-T3-T4-T5	T1-T3-T5	T1-T4	T1-T4-T5	T1-T5	T2	T2-T3	T2-T3-T4	T2-T3-T4-T5	T2-T3-T5	T2-T4	T2-T4-T5	T2-T5	T3	T3-T4	T3-T4-T5	T3-T5	T4	T4-T5	T5
57	35	24	8	13	18	29	13	23	45	19	20	15	30	19	37	61	25	23	20	20	24	26	27	54	32	26	35	53	41	43

High-Performance SAS Coding - Third Edition

jpprovost · Posted 10-29-2019 09:29 PM

@ChrisNZ wrote:

> I was thinking of something like this:

This does not answer the question.

You have a multidimensional crossing. How do you want to show it?

@mkeintz 's solution give you a summarised overview (WANT2 below).

You could also want a detailed overview (WANT1 below).

Or anything in between.
data T1 T2 T3 T4 T5;
  do I=1 to 1000;
    if ranuni(0)>.6 then output T1;   
    if ranuni(0)>.6 then output T2;   
    if ranuni(0)>.6 then output T3;   
    if ranuni(0)>.6 then output T4;   
    if ranuni(0)>.6 then output T5;
  end;
run;

data WANT1;
  merge T1(in=A) T2(in=B) T3(in=C) T4(in=D) T5(in=E);
  by I;
  src=catx('-',ifc(A,'T1',''),ifc(B,'T2',''),ifc(C,'T3',''),ifc(D,'T4',''),ifc(E,'T5',''));
run;

proc tabulate data=WANT1;
  class SRC;
  table src=' '*n=' '; 
run;

data WANT2 (keep=ROW COL );
  length ROW COL $4;
  merge T1(in=A) T2(in=B) T3(in=C) T4(in=D) T5(in=E);
  by I;
  array _in {5} in1-in5;
  do C=1 to 5;
    COL=cats('SET',C);
    if _in{C}=1 then do R=C to 5;
      ROW=cats('SET',R);
      if _in{r}=1 then output;
    end;
  end;
run;

proc tabulate data=WANT2  noseps;
  class ROW  COL;
  table ROW=' ' , COL=' '*n=' ';
run;
SET1 SET2 SET3 SET4 SET5

SET1 405 . . . .

SET2 163 389 . . .

SET3 162 151 397 . .

SET4 151 156 161 396 .

SET5 158 160 167 178 396

T1 T1-T2 T1-T2-T3 T1-T2-T3-T4 T1-T2-T3-T4-T5 T1-T2-T3-T5 T1-T2-T4 T1-T2-T4-T5 T1-T2-T5 T1-T3 T1-T3-T4 T1-T3-T4-T5 T1-T3-T5 T1-T4 T1-T4-T5 T1-T5 T2 T2-T3 T2-T3-T4 T2-T3-T4-T5 T2-T3-T5 T2-T4 T2-T4-T5 T2-T5 T3 T3-T4 T3-T4-T5 T3-T5 T4 T4-T5 T5

57 35 24 8 13 18 29 13 23 45 19 20 15 30 19 37 61 25 23 20 20 24 26 27 54 32 26 35 53 41 43

Hi @ChrisNZ

I will try this tomorrow morning. To be honest, I do not understand some lines of the code. However, this is amazing and it's way better that I wanted.

I will deeply look into it to at least understand a small part of it 🙂

EDIT:

@ChrisNZ

Tried WANT1 and it works perfectly, thanks to you. However, for WANT2, it doesn't seem to works. It gives me an empty dataset with the two variables (ROW COL)...

mkeintz · Posted 10-29-2019 05:17 PM

If you were to first merge the 5 datasets, by VAR1, then for each VAR1 value you could write out observations with a collection of ROW and COL values. Let's say that for VAR1=88, it is present in datasets DS1, DS3, and DS5. The you could write out these observations.

ROW    COL
SET1   SET1
SET3   SET1
SET5   SET1
SET3   SET3    /*Editted addition */
SET5   SET3    /*Editted addition */
SET5   SET5    /*Editted addition */

The you could just do a crosstabulation of ROW * COL:


proc sort data=ds1; by var1; run;
proc sort data=ds2; by var1; run;
proc sort data=ds3; by var1; run;
proc sort data=ds4; by var1; run;
proc sort data=ds5; by var1; run;

data want (keep=row col var1 pattern);
  length row col $4;
  merge ds1 (in=in1) ds2 (in=in2) ds3 (in=in3) ds4 (in=in4) ds5 (in=in5);
  by var1;

  array _in {5} in1-in5;
  do c=1 to 5;
    col=cats('SET',c);
    if _in{c}=1 then do r=c to 5;
      row=cats('SET',r);
      if _in{r}=1 then output;
    end;
  end;
run;

proc tabulate data=want  noseps;
  class row  col;
  table row=' ' , col=' ';
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

jpprovost · Posted 10-29-2019 09:30 PM

@mkeintz
Thanks also for your help. Will try it tomorrow.

PGStats · Posted 10-29-2019 06:25 PM

I would try:

data test;
set set1 set2 set3 set4 set5 indsname=dsn;
fromDS = scan(dsn, 2);
run;

proc sort data=test; by var1 fromDS; run;

data temp;
length DS $100;
do until(last.var1);
	set test; by var1;
	ds = catx("-", ds, fromDS); 
	end;
keep var1 DS;
run;

proc sql;
create table want as
select DS, count(*) as n
from temp
group by DS;
quit;

(untested)

PG

jpprovost · Posted 10-29-2019 09:31 PM

@PGStats
Thank you also for your time and help. I will try your piece of code also tomorrow.
🙂

Ksharp · Posted 10-30-2019 09:27 AM

EDITED.

data T1 T2 T3 T4 T5;
  do I=1 to 1000;
    if ranuni(0)>.6 then output T1;   
    if ranuni(0)>.6 then output T2;   
    if ranuni(0)>.6 then output T3;   
    if ranuni(0)>.6 then output T4;   
    if ranuni(0)>.6 then output T5;
  end;
run;

%macro intersect(list= T1 T2 T3 T4 T5  );
%let n=%sysfunc(countw(&list , %str( ) ));

proc sql;
create table temp as

%do i=1 %to &n ;
%let a=%scan(&list,&i,%str( ));
%do j=1 %to &n ;
%let b=%scan(&list,&j,%str( ));

select "&a" as dsn1 length=40,"&b" as dsn2 length=40,count(*) as n from(
select * from &a 
intersect
select * from &b 
)

%if &i ne &n or &j ne &n %then %do; union %end;

%end;
%end;

;quit;

%mend;

options mprint mlogic symbolgen;
%intersect()

proc transpose data=temp out=want(drop=_:);
by dsn1;
id dsn2;
var n;
run;

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away