BookmarkSubscribeRSS Feed
newsas007
Quartz | Level 8

Hi all, 

This is my first time working with text data in SAS. I am working on scanning and merging by text data. The Dataset A contains a list of drugs and Dataset B has the actual drug. Each drug_list in data A is associated with a gene type. I want to merge the two datasets if the drug variable in Dataset B is listed in one of the drug list in Dataset A corresponding to the Gene type (G_type).

data A

IDDate_aG_typeDrug_list
11/2/2018aAfatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
12/10/2019bCetuximab,Panitumumab
23/3/2019cBinimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
31/2/2018dAlpelisib,Everolimus,Temsirolimus
42/10/2019eBosutinib,Dasatinib,Everolimus,Temsirolimus
53/3/2019aAfatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
61/2/2018cBinimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
72/10/2019dAlpelisib,Everolimus,Temsirolimus

 

Data B

IDDrugDate_b
1Panitumumab2/1/2018
1Binimetinib3/2/2019
1Cetuximab3/28/2019
2Trametinib3/23/2019
2Temsirolimus3/25/2019
3Alpelisib2/1/2018
5Dacomitinib3/21/2019
4Everolimus3/1/2019
6Idelalisib1/17/2018
7Temsirolimus3/2/2019

 

Data Want:

IDDrugDate_bDate_aG_typeDrug_List
1Panitumumab2/1/20181/2/2018aAfatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1Binimetinib3/2/2019   
1Cetuximab3/28/20192/10/2019bCetuximab,Panitumumab
2Trametinib3/23/20193/3/2019cBinimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
2Temsirolimus3/25/2019   
3Alpelisib2/1/20181/2/2018dAlpelisib,Everolimus,Temsirolimus
5Dacomitinib3/21/20193/3/2019aAfatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
4Everolimus3/1/20192/10/2019eBosutinib,Dasatinib,Everolimus,Temsirolimus
6Idelalisib1/17/20181/2/2018cBinimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7Temsirolimus3/2/20192/10/2019dAlpelisib,Everolimus,Temsirolimus
11 REPLIES 11
smantha
Lapis Lazuli | Level 10
You can use a proc sql join
Proc sql;
Create table c as select a.*, b.* from a a, b b
Where index(b.drug, a.drug_list) >0;
Quit;
newsas007
Quartz | Level 8

Here is the program and it doesn't work.


data A;
informat id 3. date_a mmddyy10. Drug_list $200.;
input id date_a Drug_list;
format id 3. date_a mmddyy10. Drug_list $200.;
cards;
1 1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1 2/10/2019 Cetuximab,Panitumumab
2 3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
3 1/2/2018 Alpelisib,Everolimus,Temsirolimus
4 2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus
5 3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
6 1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7 2/10/2019 Alpelisib,Everolimus,Temsirolimus
;;;

data B;
informat id 3. date_b mmddyy10. Drug $20.;
input id date_b Drug;
format id 3. date_b mmddyy10. Drug $200.;
cards;
1 2/01/2018 Panitumumab
1 3/02/2019 Binimetinib
1 3/28/2019 Cetuximab
2 3/23/2019 Trametinib
2 3/25/2019 Temsirolimus
3 2/01/2018 Alpelisib
5 3/21/2019 Dacomitinib
4 3/01/2019 Everolimus
6 1/17/2018 Idelalisib
7 3/02/2019 Temsirolimus
;;;


Proc sql;
Create table c as select a.*, b.date_b, b.Drug from a a, b b
Where index(b.drug, a.drug_list) > 0;
Quit;

 

Here is the log:

137 Where index(b.drug, a.drug_list) > 0;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can
not be optimized.
NOTE: Table WORK.C created, with 0 rows and 5 columns.

newsas007
Quartz | Level 8

This doesn't work...

 

data have;

informat id 3. date_a mmddyy10. Drug_list $200.;

input id A Drug_list;

format id 3. A mmddyy10. Drug_list $200.;

cards;

1     1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab

1     2/10/2019   Cetuximab,Panitumumab

2     3/3/2019    Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib

3     1/2/2018    Alpelisib,Everolimus,Temsirolimus

4     2/10/2019   Bosutinib,Dasatinib,Everolimus,Temsirolimus

5     3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab

6     1/2/2018    Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib

7     2/10/2019   Alpelisib,Everolimus,Temsirolimus

;;;

 

data want;

informat id 3. date_b mmddyy10. Drug $20.;

input id A Drug;

format id 3. A mmddyy10. Drug $200.;

cards;

ID    Date_b      Drug

1     2/1/2018 Panitumumab

1     3/2/2019 Binimetinib

1     3/28/2019 Cetuximab

2     3/23/2019 Trametinib

2     3/25/2019 Temsirolimus

3     2/1/2018 Alpelisib

5     3/21/2019 Dacomitinib

4     3/1/2019 Everolimus

6     1/17/2018 Idelalisib

7     3/2/2019 Temsirolimus

;;;

Proc sql;
Create table c as select a.*, b.* from a a, b b
Where index(b.drug, a.drug_list) >0;
Quit;

smantha
Lapis Lazuli | Level 10
data A;
informat id 3. date_a mmddyy10. Drug_list $200.;
input id date_a Drug_list;
format id 3. date_a mmddyy10. Drug_list $200.;
cards;
1 1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1 2/10/2019 Cetuximab,Panitumumab
2 3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
3 1/2/2018 Alpelisib,Everolimus,Temsirolimus
4 2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus
5 3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
6 1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7 2/10/2019 Alpelisib,Everolimus,Temsirolimus
;;;

data B;
informat id 3. date_b mmddyy10. Drug $20.;
input id date_b Drug;
format id 3. date_b mmddyy10. Drug $200.;
cards;
1 2/01/2018 Panitumumab
1 3/02/2019 Binimetinib
1 3/28/2019 Cetuximab
2 3/23/2019 Trametinib
2 3/25/2019 Temsirolimus
3 2/01/2018 Alpelisib
5 3/21/2019 Dacomitinib
4 3/01/2019 Everolimus
6 1/17/2018 Idelalisib
7 3/02/2019 Temsirolimus
;;;


Proc sql;
Create table c as select a.*, b.date_b, b.Drug from a a, b b
Where index( a.drug_list,strip(b.drug)) > 0;
Quit;

Got the order of values wrong.

newsas007
Quartz | Level 8
Although it runs and matches, This code is creating 29 rows, whereas my dataset will have the same number of rows as dataset B. It also has a couple of missing values for ID#1 and ID#2.
mkeintz
PROC Star

@smantha 

 

I recommend you make two changes:

  1. add   "a.id=b.id" to the where clause.  This is why @newsas007 notices and excess number of cases.
  2. consider replacing
        index( a.drug_list,strip(b.drug)) > 0
    with
       findw(a.drug_list,strip(b.drug),', ') > 0

    The latter will prevent inadvertently matching a substring of a drug name in the drug_list variable. It's not a problem in this sample data, but if you had a drug name=Erlotin, the INDEX function would match it to Erlotinib in the drug list.   The third argument of findw above contains a blank and a comma, telling SAS to consider those characters as word separators, and the job of FINDW is to match only whole words.
        
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PGStats
Opal | Level 21

Why is Cetuximab associated with the second but not the first drug list for ID=1?

PG
newsas007
Quartz | Level 8
sorry i made that change and added the G_type associated with the specific drug list
smantha
Lapis Lazuli | Level 10

What you showed in the output is this. @mkeintz , @newsas007 please notice this response

 

data A;
informat id 3. date_a mmddyy10. Drug_list $200.;
input id date_a Drug_list;
format id 3. date_a mmddyy10. Drug_list $200.;
cards;
1 1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1 2/10/2019 Cetuximab,Panitumumab
2 3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
3 1/2/2018 Alpelisib,Everolimus,Temsirolimus
4 2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus
5 3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
6 1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7 2/10/2019 Alpelisib,Everolimus,Temsirolimus
;;;
proc sort; by id; run;

data B;
informat id 3. date_b mmddyy10. Drug $20.;
input id date_b Drug;
format id 3. date_b mmddyy10. Drug $200.;
cards;
1 2/01/2018 Panitumumab
1 3/02/2019 Binimetinib
1 3/28/2019 Cetuximab
2 3/23/2019 Trametinib
2 3/25/2019 Temsirolimus
3 2/01/2018 Alpelisib
5 3/21/2019 Dacomitinib
4 3/01/2019 Everolimus
6 1/17/2018 Idelalisib
7 3/02/2019 Temsirolimus
;;;

proc sort; by id; run;

data c;
merge b a;
by id;
if findw( drug_list,strip(drug),',') = 0 then drug_list='';
run;

Which might be wrong as you get this warning in the log

 

 

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 10 observations read from the data set WORK.B.
NOTE: There were 8 observations read from the data set WORK.A.
NOTE: The data set WORK.C has 10 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

instead the result should be 

Proc sql;
Create table c2 as select a.date_a,a.drug_list, b.date_b, b.Drug,b.id from a a right join b b
on a.id=b.id and  findw( a.drug_list,strip(b.drug),',') > 0 and (b.date_b);
Quit;
date_a Drug_list date_b Drug id
1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab 2/1/2018 Panitumumab 1
1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab 3/28/2019 Cetuximab 1
2/10/2019 Cetuximab,Panitumumab 2/1/2018 Panitumumab 1
2/10/2019 Cetuximab,Panitumumab 3/28/2019 Cetuximab 1
    3/2/2019 Binimetinib 1
3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib 3/23/2019 Trametinib 2
    3/25/2019 Temsirolimus 2
1/2/2018 Alpelisib,Everolimus,Temsirolimus 2/1/2018 Alpelisib 3
2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus 3/1/2019 Everolimus 4
3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab 3/21/2019 Dacomitinib 5
1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib 1/17/2018 Idelalisib 6
2/10/2019 Alpelisib,Everolimus,Temsirolimus 3/2/2019 Temsirolimus 7

 

newsas007
Quartz | Level 8
I edited the dataset a bit to make more sense. I added the G_type variable for the drug to match the drug list. Hope this makes more sense...
smantha
Lapis Lazuli | Level 10

Adding a new variable does not change the result any way unless there is a business logic associated with it or the column is on both datasets. See my earlier response. You get your desired result using a data step merge, but this is wrong.

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 2127 views
  • 4 likes
  • 4 in conversation