BookmarkSubscribeRSS Feed
newsas007
Quartz | Level 8

Hi all, 

This is my first time working with text data in SAS. I am working on scanning and merging by text data. The Dataset A contains a list of drugs and Dataset B has the actual drug. Each drug_list in data A is associated with a gene type. I want to merge the two datasets if the drug variable in Dataset B is listed in one of the drug list in Dataset A corresponding to the Gene type (G_type).

data A

IDDate_aG_typeDrug_list
11/2/2018aAfatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
12/10/2019bCetuximab,Panitumumab
23/3/2019cBinimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
31/2/2018dAlpelisib,Everolimus,Temsirolimus
42/10/2019eBosutinib,Dasatinib,Everolimus,Temsirolimus
53/3/2019aAfatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
61/2/2018cBinimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
72/10/2019dAlpelisib,Everolimus,Temsirolimus

 

Data B

IDDrugDate_b
1Panitumumab2/1/2018
1Binimetinib3/2/2019
1Cetuximab3/28/2019
2Trametinib3/23/2019
2Temsirolimus3/25/2019
3Alpelisib2/1/2018
5Dacomitinib3/21/2019
4Everolimus3/1/2019
6Idelalisib1/17/2018
7Temsirolimus3/2/2019

 

Data Want:

IDDrugDate_bDate_aG_typeDrug_List
1Panitumumab2/1/20181/2/2018aAfatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1Binimetinib3/2/2019   
1Cetuximab3/28/20192/10/2019bCetuximab,Panitumumab
2Trametinib3/23/20193/3/2019cBinimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
2Temsirolimus3/25/2019   
3Alpelisib2/1/20181/2/2018dAlpelisib,Everolimus,Temsirolimus
5Dacomitinib3/21/20193/3/2019aAfatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
4Everolimus3/1/20192/10/2019eBosutinib,Dasatinib,Everolimus,Temsirolimus
6Idelalisib1/17/20181/2/2018cBinimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7Temsirolimus3/2/20192/10/2019dAlpelisib,Everolimus,Temsirolimus
11 REPLIES 11
smantha
Lapis Lazuli | Level 10
You can use a proc sql join
Proc sql;
Create table c as select a.*, b.* from a a, b b
Where index(b.drug, a.drug_list) >0;
Quit;
newsas007
Quartz | Level 8

Here is the program and it doesn't work.


data A;
informat id 3. date_a mmddyy10. Drug_list $200.;
input id date_a Drug_list;
format id 3. date_a mmddyy10. Drug_list $200.;
cards;
1 1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1 2/10/2019 Cetuximab,Panitumumab
2 3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
3 1/2/2018 Alpelisib,Everolimus,Temsirolimus
4 2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus
5 3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
6 1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7 2/10/2019 Alpelisib,Everolimus,Temsirolimus
;;;

data B;
informat id 3. date_b mmddyy10. Drug $20.;
input id date_b Drug;
format id 3. date_b mmddyy10. Drug $200.;
cards;
1 2/01/2018 Panitumumab
1 3/02/2019 Binimetinib
1 3/28/2019 Cetuximab
2 3/23/2019 Trametinib
2 3/25/2019 Temsirolimus
3 2/01/2018 Alpelisib
5 3/21/2019 Dacomitinib
4 3/01/2019 Everolimus
6 1/17/2018 Idelalisib
7 3/02/2019 Temsirolimus
;;;


Proc sql;
Create table c as select a.*, b.date_b, b.Drug from a a, b b
Where index(b.drug, a.drug_list) > 0;
Quit;

 

Here is the log:

137 Where index(b.drug, a.drug_list) > 0;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can
not be optimized.
NOTE: Table WORK.C created, with 0 rows and 5 columns.

newsas007
Quartz | Level 8

This doesn't work...

 

data have;

informat id 3. date_a mmddyy10. Drug_list $200.;

input id A Drug_list;

format id 3. A mmddyy10. Drug_list $200.;

cards;

1     1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab

1     2/10/2019   Cetuximab,Panitumumab

2     3/3/2019    Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib

3     1/2/2018    Alpelisib,Everolimus,Temsirolimus

4     2/10/2019   Bosutinib,Dasatinib,Everolimus,Temsirolimus

5     3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab

6     1/2/2018    Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib

7     2/10/2019   Alpelisib,Everolimus,Temsirolimus

;;;

 

data want;

informat id 3. date_b mmddyy10. Drug $20.;

input id A Drug;

format id 3. A mmddyy10. Drug $200.;

cards;

ID    Date_b      Drug

1     2/1/2018 Panitumumab

1     3/2/2019 Binimetinib

1     3/28/2019 Cetuximab

2     3/23/2019 Trametinib

2     3/25/2019 Temsirolimus

3     2/1/2018 Alpelisib

5     3/21/2019 Dacomitinib

4     3/1/2019 Everolimus

6     1/17/2018 Idelalisib

7     3/2/2019 Temsirolimus

;;;

Proc sql;
Create table c as select a.*, b.* from a a, b b
Where index(b.drug, a.drug_list) >0;
Quit;

smantha
Lapis Lazuli | Level 10
data A;
informat id 3. date_a mmddyy10. Drug_list $200.;
input id date_a Drug_list;
format id 3. date_a mmddyy10. Drug_list $200.;
cards;
1 1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1 2/10/2019 Cetuximab,Panitumumab
2 3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
3 1/2/2018 Alpelisib,Everolimus,Temsirolimus
4 2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus
5 3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
6 1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7 2/10/2019 Alpelisib,Everolimus,Temsirolimus
;;;

data B;
informat id 3. date_b mmddyy10. Drug $20.;
input id date_b Drug;
format id 3. date_b mmddyy10. Drug $200.;
cards;
1 2/01/2018 Panitumumab
1 3/02/2019 Binimetinib
1 3/28/2019 Cetuximab
2 3/23/2019 Trametinib
2 3/25/2019 Temsirolimus
3 2/01/2018 Alpelisib
5 3/21/2019 Dacomitinib
4 3/01/2019 Everolimus
6 1/17/2018 Idelalisib
7 3/02/2019 Temsirolimus
;;;


Proc sql;
Create table c as select a.*, b.date_b, b.Drug from a a, b b
Where index( a.drug_list,strip(b.drug)) > 0;
Quit;

Got the order of values wrong.

newsas007
Quartz | Level 8
Although it runs and matches, This code is creating 29 rows, whereas my dataset will have the same number of rows as dataset B. It also has a couple of missing values for ID#1 and ID#2.
mkeintz
PROC Star

@smantha 

 

I recommend you make two changes:

  1. add   "a.id=b.id" to the where clause.  This is why @newsas007 notices and excess number of cases.
  2. consider replacing
        index( a.drug_list,strip(b.drug)) > 0
    with
       findw(a.drug_list,strip(b.drug),', ') > 0

    The latter will prevent inadvertently matching a substring of a drug name in the drug_list variable. It's not a problem in this sample data, but if you had a drug name=Erlotin, the INDEX function would match it to Erlotinib in the drug list.   The third argument of findw above contains a blank and a comma, telling SAS to consider those characters as word separators, and the job of FINDW is to match only whole words.
        
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PGStats
Opal | Level 21

Why is Cetuximab associated with the second but not the first drug list for ID=1?

PG
newsas007
Quartz | Level 8
sorry i made that change and added the G_type associated with the specific drug list
smantha
Lapis Lazuli | Level 10

What you showed in the output is this. @mkeintz , @newsas007 please notice this response

 

data A;
informat id 3. date_a mmddyy10. Drug_list $200.;
input id date_a Drug_list;
format id 3. date_a mmddyy10. Drug_list $200.;
cards;
1 1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1 2/10/2019 Cetuximab,Panitumumab
2 3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
3 1/2/2018 Alpelisib,Everolimus,Temsirolimus
4 2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus
5 3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
6 1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7 2/10/2019 Alpelisib,Everolimus,Temsirolimus
;;;
proc sort; by id; run;

data B;
informat id 3. date_b mmddyy10. Drug $20.;
input id date_b Drug;
format id 3. date_b mmddyy10. Drug $200.;
cards;
1 2/01/2018 Panitumumab
1 3/02/2019 Binimetinib
1 3/28/2019 Cetuximab
2 3/23/2019 Trametinib
2 3/25/2019 Temsirolimus
3 2/01/2018 Alpelisib
5 3/21/2019 Dacomitinib
4 3/01/2019 Everolimus
6 1/17/2018 Idelalisib
7 3/02/2019 Temsirolimus
;;;

proc sort; by id; run;

data c;
merge b a;
by id;
if findw( drug_list,strip(drug),',') = 0 then drug_list='';
run;

Which might be wrong as you get this warning in the log

 

 

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 10 observations read from the data set WORK.B.
NOTE: There were 8 observations read from the data set WORK.A.
NOTE: The data set WORK.C has 10 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

instead the result should be 

Proc sql;
Create table c2 as select a.date_a,a.drug_list, b.date_b, b.Drug,b.id from a a right join b b
on a.id=b.id and  findw( a.drug_list,strip(b.drug),',') > 0 and (b.date_b);
Quit;
date_a Drug_list date_b Drug id
1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab 2/1/2018 Panitumumab 1
1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab 3/28/2019 Cetuximab 1
2/10/2019 Cetuximab,Panitumumab 2/1/2018 Panitumumab 1
2/10/2019 Cetuximab,Panitumumab 3/28/2019 Cetuximab 1
    3/2/2019 Binimetinib 1
3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib 3/23/2019 Trametinib 2
    3/25/2019 Temsirolimus 2
1/2/2018 Alpelisib,Everolimus,Temsirolimus 2/1/2018 Alpelisib 3
2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus 3/1/2019 Everolimus 4
3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab 3/21/2019 Dacomitinib 5
1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib 1/17/2018 Idelalisib 6
2/10/2019 Alpelisib,Everolimus,Temsirolimus 3/2/2019 Temsirolimus 7

 

newsas007
Quartz | Level 8
I edited the dataset a bit to make more sense. I added the G_type variable for the drug to match the drug list. Hope this makes more sense...
smantha
Lapis Lazuli | Level 10

Adding a new variable does not change the result any way unless there is a business logic associated with it or the column is on both datasets. See my earlier response. You get your desired result using a data step merge, but this is wrong.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 881 views
  • 4 likes
  • 4 in conversation