DATA Step, Macro, Functions and more

How do I merge two databases based on an imprecise character variable that includes numbers?

Reply
Contributor
Posts: 51

How do I merge two databases based on an imprecise character variable that includes numbers?

Hi,

 

I am working in SAS 9.3, trying to categorize a large list of medications based on the type of medication (aka indication for taking the medication). I have two databases, one which is a list of medications, and the other which lists the medications AND their indication. The only problem is that the names of the medications are not necessarily identical across both databases.

See below for sample data (please note: the databases are not comprehensive, since I am just showing sections of data). In one database, the medication is titled "Script_Name" and in the other database the medication is titled "Generic_Name"). I want to match based on Script_Name and Generic_Name, so that I ultimately get a database that has all three variables "Common_use", "Generic_name", and "Script_Name" where Generic_name and Script_Name are basically the same thing and Common_use is properly applied to each Script_Name value. For example, where Generic_name="Lidocaine 1%/Epi 1:100000", I would want to match it with Script_Name="Lidocaine". Similarly, where Generic_name="Methyl Salicylate/Menthol", I would want to match it to Script_Name="Methyl Salicylate".

 

I think there is a way to do this using the SUBSTR function, but I could use some help working out the details. Some options would be to truncate the Script_Name and/or Generic_Name variables at the first number (1-9) or potentially even the first space or slash (although that might leave more room for error). 

 

Thank you!

 

Obs Common_use Generic_name

1 Analgesia; muscle relaxant Acetaminophen

2 Analgesia; muscle relaxant Aspirin

3 Analgesia; muscle relaxant Baclofen

4 Analgesia; muscle relaxant Cyclobenzaprine

5 Analgesia; muscle relaxant Diflunisal

6 Analgesia; muscle relaxant Ibuprofen

7 Analgesia; muscle relaxant Indomethacin

8 Analgesia; muscle relaxant Indomethacin SR

9 Analgesia; muscle relaxant Ketorolac

10 Analgesia; muscle relaxant Lidocaine 1%

11 Analgesia; muscle relaxant Lidocaine 1%/Epi 1:100000

12 Analgesia; muscle relaxant Lidocaine 2% Syr

13 Analgesia; muscle relaxant Methyl Salicylate/Menthol

14 Analgesia; muscle relaxant Naproxen

15 Analgesia; muscle relaxant Piroxicam

16 Antihistamine Chlorpheniramine Maleate

17 Antihistamine Diphenhydramine

18 Antihistamine loratadine

19 Antimicrobial Acyclovir

20 Antimicrobial Amoxicillin

21 Antimicrobial Amoxycillin/clavulanate

22 Antimicrobial Ampicillin

23 Antimicrobial Ampicillin Sodium/Sulbactam Sodium 24 Antimicrobial Atovaquone

25 Antimicrobial Azithromycin

26 Antimicrobial Cefazolin

27 Antimicrobial Ceftazidime Pentahydrate

28 Antimicrobial Ceftriaxone

29 Antimicrobial Cephalexin

30 Antimicrobial Ciprofloxacin

31 Antimicrobial Clarithromycin

32 Antimicrobial Clindamycin

33 Antimicrobial Clotrimazole

34 Antimicrobial Dapsone

35 Antimicrobial Dicloxacillin

36 Antimicrobial Doxycycline

37 Antimicrobial Erythromycin Base

38 Antimicrobial Erythromycin Base/Ethanol

39 Antimicrobial Erythromycin lactobionate

40 Antimicrobial Ethambutol

 

 

Obs Script_Name

1 TRAZODONE

2 TRAZODONE

3 INSULIN NPH HUMAN 10ML

4 RITONAVIR

5 ATAZANAVIR

6 EMTRICITABINE/TENOFOVIR

7 SULF/TRIMETH DS

8 IBUPROFEN

9 CLOTRIMAZOLE 1% CREAM

10 CLOTRIMAZOLE 1% CREAM

11 INSULIN NPH HUMAN 10ML

12 INSULIN NPH HUMAN 10ML

13 INSULIN NPH HUMAN 10ML

14 LISINOPRIL

15 TRAZODONE

16 IBUPROFEN

17 GLYBURIDE

18 LISINOPRIL

19 INSULIN NPH HUMAN 10ML

20 ASPIRIN

21 UREA

20 CREAM

22 GLYBURIDE

23 TRAZODONE

24 CLOTRIMAZOLE 1% CREAM

25 INSULIN NPH HUMAN 10ML

26 TRAZODONE

27 MULTIVITAMINS

28 RITONAVIR

29 ATAZANAVIR

30 EMTRICITABINE/TENOFOVIR

31 GLYBURIDE 32 LISINOPRIL

33 ASPIRIN

34 IBUPROFEN

35 INSULIN NPH HUMAN 10ML

36 IBUPROFEN

37 AMLODIPINE

38 HYDROCHLOROTHIAZIDE

39 ATENOLOL

40 LISINOPRIL

Super User
Posts: 11,343

Re: How do I merge two databases based on an imprecise character variable that includes numbers?

You will need to provide some details on how you you handle script name LIDOCAINE as you have

10 Analgesia; muscle relaxant Lidocaine 1%

11 Analgesia; muscle relaxant Lidocaine 1%/Epi 1:100000

12 Analgesia; muscle relaxant Lidocaine 2% Syr

 

Should your script match all three of those on only one? If only one what would the RULE be other similar issues.

 

Instead of SUBSTR you would likely want to match on something like Index(upcase(generic_name), script_name) > 1 as that will search the generic name for the text of the script name. The upcase is to make sure that the comparison is upper case to upper case as other wise 'a' is not equal to 'A'.

 

You also have duplicates of the script name. Why? If there are other variables to bring in it may be a good idea to describe them as well.

Contributor
Posts: 51

Re: How do I merge two databases based on an imprecise character variable that includes numbers?

The Script_Name should match all three (they are just variations of the same medication). The same idea applies to other semi-duplicate entries. It sounds like the code you mentioned will get me on the right track. Are you able to provide any additional information regarding the code? Thank you so much.

Regarding your question about why there are duplicates of the Script_Name, the reason is that the original database (from which I took the sample data) is actually a list of Script_Names for different individuals accompanied by the dates they were prescribed and the dates the prescription was discontinued. So the database that contains the Script_Name variable also contains variables "StudyID" (for participant), "Script_Dose", "Script_KOPDOT" (how it was administered), "ScriptStartDate", and "ScriptEndDate". Taking these other variables into account, there should no longer any duplicate Script_Name entries. I can provide sample data of the full database, but I thought it would be easier to keep it simple.

Regarding the other response from PGStats, this is a good point. The code I am attempting to create will help me to roughly match medication names with medication indications based on medical/pharmacologic guidelines, after which I plan to go through manually and further modify based on medical knowledge and my research question. However, having a code that would allow me to do this initial match will save me from having to manually code every single medication into a indication category. Thank you both.

Super User
Posts: 11,343

Re: How do I merge two databases based on an imprecise character variable that includes numbers?

The following program snippet creates a look up table that has the matches between script_name and generic_name data. Then uses that look up table to bring in the vales with the script data.

Note: Spelling will be critical. If the script_name finds no matches then the final result won't have any result from the generic.

The intermediate lookup table is there to reduce the very large number of comparisons trying a direct match with the generic which would be use the (number of script records*number of generic records) comparisons.

 

proc sql;
   create table LookUp as
   select distinct a.Script_name, b.*
   from (select distinct script_name from Script_data_table) as a, generic_table_name as b
   where index(upcase(b.generic_name),upcase(a.script_name))>0;
quit;
proc sql;
   create table want as
   select a.*,b.*
   from script_data_table_name as a left join LookUP as b
   on a.script_name=b.script_name;

quit;

If the ONLY thing in the generic data you need to bring in is the generic_name then replace the last B.* with B.Generic_name. The code as provided will generate a warning about script_name already in the set because it is both the data and the look up data. That would not be an error just a warning and the previous change would get rid of it. I tend to be a tad lazy in some of these common steps because I know that I want all the variables from the set except the join. Listing the names for each explicitly is generally better practice unless your dataset structure changes.

 

Respected Advisor
Posts: 4,919

Re: How do I merge two databases based on an imprecise character variable that includes numbers?

There is NO WAY that medications can be matched solely on their names. Commercial medication names are chosen for marketing purposes, and not necessarily meant to be informative. You need an authoritative table of commercial name equivalents that is based on science.

PG
Ask a Question
Discussion stats
  • 4 replies
  • 275 views
  • 1 like
  • 3 in conversation