Solved: Re: Capture duplicate records based on three different conditon

Tanja · Posted 06-18-2015 12:29 PM

Hello Everyone!

As per below sample data, I need to capture all the study with only duplicate event. (e.g.Hemorge for study CT001).

After that I retrieve only those records which contain events with exactly same gender or exactly same risk_factor or exactly same values for five variables (freq, SD, median, mode, covariance)

For example for same gender with all duplicate events, data will be like below

STUDY	SEQ	EVENT	GENDER	RISK_FACTOR	FREQ	SD	MEDIAN	MODE	COVARIANCE
CT003	4	Pregnancy	3	1	11	1	1	0	1
CT003	5	Pregnancy	3	2	0	1	1	0	1

For same risk_factor with duplicate events

STUDY	SEQ	EVENT	GENDER	RISK_FACTOR	FREQ	SD	MEDIAN	MODE	COVARIANCE
CT006	1	Oedema	0	0	1	0	1	1	0
CT006	2	Oedema	1	0	1	1	1	1	0
CT006	3	Oedema	1	0	0	1	1	1	0

For same five variables (freq, SD, median, mode, covariance) with exactly same value for duplicate events

STUDY	SEQ	EVENT	GENDER	RISK_FACTOR	FREQ	SD	MEDIAN	MODE	COVARIANCE
CT005	1	Skinlson	1	1	0	0	0	0	0
CT005	2	Skinlson	2	1	0	0	0	0	0
CT005	3	Skinlson	1	0	0	0	0	0	0

data clinical;

input study $ seq event $ gender risk_factor freq SD median mode covariance;

cards;

CT001 1 Hemorge 1 1 1 1 1 1 1

CT001 2 Hemorge 1 2 0 3 1 0 11

CT001 3 Hemorge 2 2 0 1 1 0 11

CT001 8 Hemoglobin 2 2 0 1 1 0 11

CT001 4 SBP 2 3 1 2 1 0 11

CT001 5 SBP 1 1 1 2 1 0 11

CT001 6 Hemorge 3 3 1 1 1 1 1

CT002 2 Gastro 0 1 1 1 1 1 1

CT002 1 Gastro 1 0 2 1 1 11 1

CT002 3 Gastro 1 0 3 1 1 0 1

CT003 1 Malignt 1 0 1 1 1 0 1

CT003 2 Malignt 0 0 1 1 1 0 1

CT003 3 Malignt 3 1 11 1 1 0 1

CT003 4 Pregnant 3 1 11 1 1 0 1

CT003 5 Pregnant 3 2 0 1 1 0 1

CT004 4 Pregnant 3 1 11 1 1 0 1

CT005 1 Skinlson 1 1 0 0 0 0 0

CT005 2 Skinlson 2 1 0 0 0 0 0

CT005 3 Skinlson 1 0 0 0 0 0 0

CT006 1 Oedema 0 0 1 0 1 1 0

CT006 2 Oedema 1 0 1 1 1 1 0

CT006 4 STD 1 0 1 1 1 1 0

CT006 3 Oedema 1 0 0 1 1 1 0

;

run;

Output

STUDY	SEQ	EVENT	GENDER	RISK_FACTOR	FREQ	SD	MEDIAN	MODE	COVARIANCE
CT006	1	Oedema	0	0	1	0	1	1	0
CT006	2	Oedema	1	0	1	1	1	1	0
CT006	3	Oedema	1	0	0	1	1	1	0
CT003	4	Pregnant	3	1	11	1	1	0	1
CT003	5	Pregnant	3	2	0	1	1	0	1
CT005	1	Skinlson	1	1	0	0	0	0	0
CT005	2	Skinlson	2	1	0	0	0	0	0
CT005	3	Skinlson	1	0	0	0	0	0	0
CT001	4	SBP	2	3	1	2	1	0	11
CT001	5	SBP	1	1	1	2	1	0	11

Thanks in advance

dataset value change by tiki

Ksharp · Posted 06-20-2015 12:27 AM

SAS Standard SQL also could achieve your Oracle's Logic .

Code: Program

data clinical;
input study $ seq event $ gender risk_factor freq  SD  median  mode  covariance;
cards;
CT001 1 Hemorge 1 1 1 1 1 1 1
CT001 2 Hemorge 1 2 0 3 1 0 11
CT001 3 Hemorge 2 2 0 1 1 0 11
CT001 8 Hemoglobin 2 2 0 1 1 0 11
CT001 4 SBP 2 3 1 2 1 0 11
CT001 5 SBP 1 1 1 2 1 0 11
CT001 6 Hemorge 3 3 1 1 1 1 1
CT002 2 Gastro 0 1 1 1 1 1 1
CT002 1 Gastro 1 0 2 1 1 11 1
CT002 3 Gastro 1 0 3 1 1 0 1
CT003 1 Malignt 1 0 1 1 1 0 1
CT003 2 Malignt 0 0 1 1 1 0 1
CT003 3 Malignt 3 1 11 1 1 0 1
CT003 4 Pregnant 3 1 11 1 1 0 1
CT003 5 Pregnant 3 2 0 1 1 0 1
CT004 4 Pregnant 3 1 11 1 1 0 1
CT005 1 Skinlson 1 1 0 0 0 0 0
CT005 2 Skinlson 2 1 0 0 0 0 0
CT005 3 Skinlson 1 0 0 0 0 0 0
CT006 1 Oedema 0 0 1 0 1 1 0
CT006 2 Oedema 1 0 1 1 1 1 0
CT006 4 STD 1 0 1 1 1 1 0
CT006 3 Oedema 1 0 0 1 1 1 0
;
run;
proc sql;
 create table want as 
  select *
   from clinical
   group by study, event
   having count(*) gt 1 and (
   count(distinct GENDER)=1 or
   count(distinct RISK_FACTOR)=1 or
   count(distinct cats(FREQ,SD,MEDIAN,MODE,COVARIANCE))=1
 );
quit;

Xia Keshan

View solution in original post

PGStats · Posted 06-18-2015 02:27 PM

Your output doesn't match your input. For example, your output shows two records with negative FREQ values but there aren't any in your input. Please fix.

PG

Tanja · Posted 06-18-2015 11:37 PM

Apology PG.

I made a mistake to copy dataset. I updated the table along with output. Please check once.

The problem is that partition by clause is not supported by SAS. The below SQL approach is taken.

select study,seq,event,gender,risk_factor,freq,SD,median,mode,covariance

from (select study,seq,event,gender,risk_factor,freq,SD,median,mode,covariance,

count(distinct gender) over (partition by event) cnt1,

count(distinct risk_factor) over (partition by event) cnt2,

count(distinct freq||'#'||SD||'#'||median||'#'||mode||'#'||covariance) over (partition by event) cnt3,

count(*) over(partition by event) cnt

from trial_study)

where cnt > 1

and 1 in (cnt1,cnt2,cnt3)

Ksharp · Posted 06-19-2015 10:16 AM

Code: Program

data clinical;
input study $ seq event $ gender risk_factor freq  SD  median  mode  covariance;
cards;
CT001 1 Hemorge 1 1 1 1 1 1 1
CT001 2 Hemorge 1 2 0 3 1 0 11
CT001 3 Hemorge 2 2 0 1 1 0 11
CT001 8 Hemoglobin 2 2 0 1 1 0 11
CT001 4 SBP 2 3 1 2 1 0 11
CT001 5 SBP 1 1 1 2 1 0 11
CT001 6 Hemorge 3 3 1 1 1 1 1
CT002 2 Gastro 0 1 1 1 1 1 1
CT002 1 Gastro 1 0 2 1 1 11 1
CT002 3 Gastro 1 0 3 1 1 0 1
CT003 1 Malignt 1 0 1 1 1 0 1
CT003 2 Malignt 0 0 1 1 1 0 1
CT003 3 Malignt 3 1 11 1 1 0 1
CT003 4 Pregnant 3 1 11 1 1 0 1
CT003 5 Pregnant 3 2 0 1 1 0 1
CT004 4 Pregnant 3 1 11 1 1 0 1
CT005 1 Skinlson 1 1 0 0 0 0 0
CT005 2 Skinlson 2 1 0 0 0 0 0
CT005 3 Skinlson 1 0 0 0 0 0 0
CT006 1 Oedema 0 0 1 0 1 1 0
CT006 2 Oedema 1 0 1 1 1 1 0
CT006 4 STD 1 0 1 1 1 1 0
CT006 3 Oedema 1 0 0 1 1 1 0
;
run;
proc sql;
 create table want as 
  select *
   from clinical
   group by study, event
   having count(*) gt 1 and (
   range(GENDER)=0 or
   range(RISK_FACTOR)=0 or
 (range(FREQ)=0 and range(SD)=0 and range(MEDIAN)=0 and range(MODE)=0 and range(COVARIANCE)=0)
 );
quit;

Tanja · Posted 06-19-2015 10:23 AM

Thanks keshan. But the value of variables are not fixed. It changes randomly. I can not enter zero or any fixed value. Is there any alternative approach to find sample results. Advance thanks

Ksharp · Posted 06-19-2015 11:05 AM

Try my code on your sample data. If there were any problem , tell me .

ballardw · Posted 06-19-2015 11:18 AM

Why aren't CT001 seq 2 and 3 not in the result? They have the same risk_factor.

ballardw · Posted 06-19-2015 11:19 AM

Look up what the RANGE function does. It is not looking at specific values or your data though it does assume they will be numeric.

Tanja · Posted 06-19-2015 11:55 AM

Thanks Keshan for providing the logic. It is working fine.

Suppose the database contain value like below (other than zero). In that case, we can not fix the program with zero. We need to use some other approach.

In oracle, I resolve it through below logic using function.

select * from

(select study,seq,event,gender,risk_factor,freq,SD,median,mode,covariance,

count(distinct freq||'#'||SD||'#'||median||'#'||mode||'#'||covariance) over (partition by event) cnt3,

count(*) over(partition by event) cnt

from trial_study)

where cnt > 1

and cnt3=1

FREQ	SD	MEDIAN	MODE	COVARIANCE
11	11	11	11	11
11	11	11	11	11
44	44	44	44	44
44	44	44	44	44
-2	-2	-2	-2	-2
-2	-2	-2	-2	-2

Dear ballarw,

You are correct. yes, those two record should be captured by the program. I keep only few records to display as output. Sorry for the confusion

Ksharp · Posted 06-20-2015 12:27 AM

SAS Standard SQL also could achieve your Oracle's Logic .

Code: Program

data clinical;
input study $ seq event $ gender risk_factor freq  SD  median  mode  covariance;
cards;
CT001 1 Hemorge 1 1 1 1 1 1 1
CT001 2 Hemorge 1 2 0 3 1 0 11
CT001 3 Hemorge 2 2 0 1 1 0 11
CT001 8 Hemoglobin 2 2 0 1 1 0 11
CT001 4 SBP 2 3 1 2 1 0 11
CT001 5 SBP 1 1 1 2 1 0 11
CT001 6 Hemorge 3 3 1 1 1 1 1
CT002 2 Gastro 0 1 1 1 1 1 1
CT002 1 Gastro 1 0 2 1 1 11 1
CT002 3 Gastro 1 0 3 1 1 0 1
CT003 1 Malignt 1 0 1 1 1 0 1
CT003 2 Malignt 0 0 1 1 1 0 1
CT003 3 Malignt 3 1 11 1 1 0 1
CT003 4 Pregnant 3 1 11 1 1 0 1
CT003 5 Pregnant 3 2 0 1 1 0 1
CT004 4 Pregnant 3 1 11 1 1 0 1
CT005 1 Skinlson 1 1 0 0 0 0 0
CT005 2 Skinlson 2 1 0 0 0 0 0
CT005 3 Skinlson 1 0 0 0 0 0 0
CT006 1 Oedema 0 0 1 0 1 1 0
CT006 2 Oedema 1 0 1 1 1 1 0
CT006 4 STD 1 0 1 1 1 1 0
CT006 3 Oedema 1 0 0 1 1 1 0
;
run;
proc sql;
 create table want as 
  select *
   from clinical
   group by study, event
   having count(*) gt 1 and (
   count(distinct GENDER)=1 or
   count(distinct RISK_FACTOR)=1 or
   count(distinct cats(FREQ,SD,MEDIAN,MODE,COVARIANCE))=1
 );
quit;

Xia Keshan

Tanja · Posted 06-20-2015 01:28 PM

Thanks Keshan for your great help

Code: Program

Code: Program

Code: Program

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!