check duplicates

Contributor
Posts: 42
Hi all,

 

I have a list of medications with thousands of observations. How can I efficiently flag the same medication taken by the same subject (duplicate entries)? Any ideas?

 

SUBJECT  CMTRT                    cmstdtc     CMONGO   cmendtc
117961   Acetylsalicylic Acid     .                    .
117961   Dextrose                 .                    .
117961   Midazolam                .                    .
117961   Oxygen                   .                    .
117961   Testosterone             2014-12-31           2015-03-26
117961   Atorvastatin             2015-04-01  Checked  .
117961   Diltiazen                2015-03-28           2015-05-30
117961   Diltiazem                2015-03-30           2015-03-31
117961   Amiodarone               2015-03-31           2015-05-05
117961   Sennoside                2015-04-01           2015-04-01
117961   Apixaban                 2015-04-05           2015-04-07
117961   Magnesium sulphate       2015-03-31           2015-03-31
117961   TPA                      2015-03-26           2015-03-26
117961   acetaminophen            2015-03-27  Checked  .
117961   enoxaparin               2015-03-31           2015-04-03
117961   Cetirizine               2014-12-31  Checked  .
117961   Diltiazem                2015-06-10  Checked  .
117961   acetylsalicyclic acid    2015-03-31           2015-04-05
117961   perindopril              2015-03-30  Checked  .
117961   Indapamide               2015-03-30  Checked  .
117961   Gentamicin               2015-05-08           2015-05-08
117961   2% Xylocaine             2015-05-08           2015-05-08
117961   Cephazolin               2015-05-08           2015-05-08
117961   dalteparin               2015-04-04           2015-04-04
117961   Docusate Sodium          2015-04-01           2015-04-08
117961   Omnaris Nasal Spray      2014-12-31           2015-03-26
117961   Amiodarone               2015-03-31           2015-05-06
117961   Morphine                 2015-03-27           2015-05-28
117961   Perindopril/ indapamide  2014-12-31  Checked  .
117961   Zopiclone                2014-12-31           2015-03-26
117961   acetaminophen            2015-03-27           2015-05-28
117961   Dimenhydrinate           2015-03-28           2015-03-28
117961   Baclofen                 2014-12-31           2015-03-26
117961   Amiodarone               2015-03-28           2015-05-28
117961   Magnesium Sulphate       2015-03-28           2015-03-28
117961   Potassium Chloride       2015-03-28           2015-05-28
118036   Acetylsalicylic Acid     .                    .

Accepted Solutions
Solution
4 weeks ago
Regular Contributor
Posts: 170

Re: check duplicates

Sort by subject and cmtrt, then maybe use "proc sort data=.... nodupkey dupout=XXX" and look at the dataset XXX for the duplicates. Alternatively, once the data are sorted, you can identify duplicates in a DATA step.
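
A minimal sketch of both suggestions, assuming the input data set is named have (the name used in the PROC SQL reply in this thread) and that an exact match on SUBJECT and CMTRT defines a duplicate. The output names nodups, dups, sorted, and flagged are hypothetical.

```sas
/* Option 1: PROC SORT with NODUPKEY.                              */
/* NODUPS keeps the first row of each SUBJECT/CMTRT pair;          */
/* DUPS receives the second and later rows of each repeated pair.  */
proc sort data=have out=nodups nodupkey dupout=dups;
    by subject cmtrt;
run;

/* Option 2: sort, then flag duplicates in a DATA step. */
proc sort data=have out=sorted;
    by subject cmtrt;
run;

data flagged;
    set sorted;
    by subject cmtrt;
    /* A pair that occurs only once is both first and last in its group, */
    /* so dup_flag is 1 for every row of a repeated pair and 0 otherwise. */
    dup_flag = not (first.cmtrt and last.cmtrt);
run;
```

Note that the DUPOUT= data set holds only the discarded repeats, not the first occurrence, whereas the DATA step version flags every row of a repeated pair. Both compare CMTRT exactly, so "Magnesium sulphate" and "Magnesium Sulphate" would not be treated as the same medication.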

--------------
blog: papersandprograms.com


All Replies
Respected Advisor
Posts: 3,271

Re: check duplicates

Are there duplicate entries in this data? Can you give a specific example?

--
Paige Miller
Contributor
Posts: 42

Re: check duplicates

Posted in reply to PaigeMiller

I am trying to check whether there are duplicate medications taken by the same subject.

I know this can be done with a sort and compare. I am just wondering if there is a simpler approach, since the list is huge.

Super User
Posts: 2,061

Re: check duplicates

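proc sql;
    /* dup_flag = 1 when a subject/CMTRT pair occurs more than once, else 0. */
    /* SAS remerges the group count onto every detail row (see the log NOTE). */
    create table want as
    select *, count(CMTRT) > 1 as dup_flag
    from have
    group by subject, CMTRT;
quit;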

 

untested

Contributor
Posts: 42

Re: check duplicates

Posted in reply to novinosrin

Tested it; it works too.

 

Thanks

zimcom

Respected Advisor
Posts: 3,271

Re: check duplicates

So the dates shown in the data set have no bearing on whether or not something is a duplicate? This was not stated in the original problem statement. Why show us information not related to the problem at hand?

--
Paige Miller
Frequent Contributor
Posts: 112

Re: check duplicates

Your data sample gives no good idea about (a) what you mean by duplicates and (b) how you want to search for them:

 

(a) For example, for subject 117961, neither Atorvastatin nor Cetirizine appears among its records more than once, and yet in your sample they are checked as dupes.

(b) You don't say whether medication A in one record and medications A/B in another (such as Perindopril/ indapamide) are considered dupes. If they are, a program to check for them would be more involved than if they are not, because CMTRT could then no longer serve as a key for the deduplication process: entries like A/B would first have to be parsed into components.
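
If combination entries were to count as containing each of their components, one possible first step is sketched below. It assumes that "/" is the only separator used in CMTRT and that the input data set is named have; the data set components, the variable component, and its length of 200 are hypothetical.

```sas
/* Split entries such as "Perindopril/ indapamide" into one row per component. */
data components;
    set have;
    length component $200;   /* assumed length; adjust to the real CMTRT length */
    /* COUNTW counts the "/"-separated parts; a blank CMTRT yields no rows. */
    do i = 1 to countw(cmtrt, '/');
        /* Trim surrounding blanks and upper-case so the parts compare consistently. */
        component = upcase(strip(scan(cmtrt, i, '/')));
        output;
    end;
    drop i;
run;
```

The resulting component variable, rather than CMTRT, could then serve as the key together with SUBJECT when checking for duplicates.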

 

Besides, it's unclear whether your sample data represents your input or your desired output, that is, the input augmented with the variable CMONGO. Also, it looks as though your input hasn't been cleansed: for example, you have DiltiazeM in one record and DiltiazeN (likely a data entry typo) in the prior one. If you wanted your program to recognize such things as identical, it would have to contain some sort of fuzzy match routine, which seems to be well beyond the scope of your question.
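
For illustration only, a rough sketch of what such a cleansing and fuzzy-match step might look like. It assumes the input data set is named have; the names norm, near_dups, and cmtrt_norm, the length of 200, and the edit-distance threshold of 2 are all hypothetical choices, not anything stated in this thread. COMPLEV returns the Levenshtein edit distance between two strings.

```sas
/* Step 1: normalize case and spacing so trivial variants compare equal. */
data norm;
    set have;
    length cmtrt_norm $200;   /* assumed length */
    cmtrt_norm = upcase(strip(compbl(cmtrt)));
run;

/* Step 2: list pairs of different names within a subject that are */
/* at most 2 single-character edits apart (arbitrary threshold).   */
proc sql;
    create table near_dups as
    select distinct a.subject,
           a.cmtrt_norm as name1,
           b.cmtrt_norm as name2
    from norm as a, norm as b
    where a.subject = b.subject
      and a.cmtrt_norm < b.cmtrt_norm   /* report each pair once, skip identical names */
      and complev(strip(a.cmtrt_norm), strip(b.cmtrt_norm)) <= 2;
quit;
```

DILTIAZEM and DILTIAZEN differ by one substitution, so that pair would be listed. Any such list is only a set of candidates for manual review, since short or similar drug names can match by accident, and the self-join may be slow on a large data set.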

 

Generally, it would serve you (and those trying to help here) well if you presented your sample input and desired output unambiguously, tersely, and, as Paige has noted, with no extraneous information (such as the dates in your sample data).

 

Paul D.  