Hi all,
This is my first time working with text data in SAS. I am working on scanning and merging by text data. The Dataset A contains a list of drugs and Dataset B has the actual drug. Each drug_list in data A is associated with a gene type. I want to merge the two datasets if the drug variable in Dataset B is listed in one of the drug list in Dataset A corresponding to the Gene type (G_type).
data A
ID | Date_a | G_type | Drug_list |
1 | 1/2/2018 | a | Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab |
1 | 2/10/2019 | b | Cetuximab,Panitumumab |
2 | 3/3/2019 | c | Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib |
3 | 1/2/2018 | d | Alpelisib,Everolimus,Temsirolimus |
4 | 2/10/2019 | e | Bosutinib,Dasatinib,Everolimus,Temsirolimus |
5 | 3/3/2019 | a | Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab |
6 | 1/2/2018 | c | Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib |
7 | 2/10/2019 | d | Alpelisib,Everolimus,Temsirolimus |
Data B
ID | Drug | Date_b |
1 | Panitumumab | 2/1/2018 |
1 | Binimetinib | 3/2/2019 |
1 | Cetuximab | 3/28/2019 |
2 | Trametinib | 3/23/2019 |
2 | Temsirolimus | 3/25/2019 |
3 | Alpelisib | 2/1/2018 |
5 | Dacomitinib | 3/21/2019 |
4 | Everolimus | 3/1/2019 |
6 | Idelalisib | 1/17/2018 |
7 | Temsirolimus | 3/2/2019 |
Data Want:
ID | Drug | Date_b | Date_a | G_type | Drug_List |
1 | Panitumumab | 2/1/2018 | 1/2/2018 | a | Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab |
1 | Binimetinib | 3/2/2019 | |||
1 | Cetuximab | 3/28/2019 | 2/10/2019 | b | Cetuximab,Panitumumab |
2 | Trametinib | 3/23/2019 | 3/3/2019 | c | Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib |
2 | Temsirolimus | 3/25/2019 | |||
3 | Alpelisib | 2/1/2018 | 1/2/2018 | d | Alpelisib,Everolimus,Temsirolimus |
5 | Dacomitinib | 3/21/2019 | 3/3/2019 | a | Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab |
4 | Everolimus | 3/1/2019 | 2/10/2019 | e | Bosutinib,Dasatinib,Everolimus,Temsirolimus |
6 | Idelalisib | 1/17/2018 | 1/2/2018 | c | Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib |
7 | Temsirolimus | 3/2/2019 | 2/10/2019 | d | Alpelisib,Everolimus,Temsirolimus |
Here is the program and it doesn't work.
data A;
informat id 3. date_a mmddyy10. Drug_list $200.;
input id date_a Drug_list;
format id 3. date_a mmddyy10. Drug_list $200.;
cards;
1 1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1 2/10/2019 Cetuximab,Panitumumab
2 3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
3 1/2/2018 Alpelisib,Everolimus,Temsirolimus
4 2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus
5 3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
6 1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7 2/10/2019 Alpelisib,Everolimus,Temsirolimus
;;;
data B;
informat id 3. date_b mmddyy10. Drug $20.;
input id date_b Drug;
format id 3. date_b mmddyy10. Drug $200.;
cards;
1 2/01/2018 Panitumumab
1 3/02/2019 Binimetinib
1 3/28/2019 Cetuximab
2 3/23/2019 Trametinib
2 3/25/2019 Temsirolimus
3 2/01/2018 Alpelisib
5 3/21/2019 Dacomitinib
4 3/01/2019 Everolimus
6 1/17/2018 Idelalisib
7 3/02/2019 Temsirolimus
;;;
Proc sql;
Create table c as select a.*, b.date_b, b.Drug from a a, b b
Where index(b.drug, a.drug_list) > 0;
Quit;
Here is the log:
137 Where index(b.drug, a.drug_list) > 0;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can
not be optimized.
NOTE: Table WORK.C created, with 0 rows and 5 columns.
This doesn't work...
data have;
informat id 3. date_a mmddyy10. Drug_list $200.;
input id A Drug_list;
format id 3. A mmddyy10. Drug_list $200.;
cards;
1 1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1 2/10/2019 Cetuximab,Panitumumab
2 3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
3 1/2/2018 Alpelisib,Everolimus,Temsirolimus
4 2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus
5 3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
6 1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7 2/10/2019 Alpelisib,Everolimus,Temsirolimus
;;;
data want;
informat id 3. date_b mmddyy10. Drug $20.;
input id A Drug;
format id 3. A mmddyy10. Drug $200.;
cards;
ID Date_b Drug
1 2/1/2018 Panitumumab
1 3/2/2019 Binimetinib
1 3/28/2019 Cetuximab
2 3/23/2019 Trametinib
2 3/25/2019 Temsirolimus
3 2/1/2018 Alpelisib
5 3/21/2019 Dacomitinib
4 3/1/2019 Everolimus
6 1/17/2018 Idelalisib
7 3/2/2019 Temsirolimus
;;;
Proc sql;
Create table c as select a.*, b.* from a a, b b
Where index(b.drug, a.drug_list) >0;
Quit;
data A;
informat id 3. date_a mmddyy10. Drug_list $200.;
input id date_a Drug_list;
format id 3. date_a mmddyy10. Drug_list $200.;
cards;
1 1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1 2/10/2019 Cetuximab,Panitumumab
2 3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
3 1/2/2018 Alpelisib,Everolimus,Temsirolimus
4 2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus
5 3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
6 1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7 2/10/2019 Alpelisib,Everolimus,Temsirolimus
;;;
data B;
informat id 3. date_b mmddyy10. Drug $20.;
input id date_b Drug;
format id 3. date_b mmddyy10. Drug $200.;
cards;
1 2/01/2018 Panitumumab
1 3/02/2019 Binimetinib
1 3/28/2019 Cetuximab
2 3/23/2019 Trametinib
2 3/25/2019 Temsirolimus
3 2/01/2018 Alpelisib
5 3/21/2019 Dacomitinib
4 3/01/2019 Everolimus
6 1/17/2018 Idelalisib
7 3/02/2019 Temsirolimus
;;;
Proc sql;
Create table c as select a.*, b.date_b, b.Drug from a a, b b
Where index( a.drug_list,strip(b.drug)) > 0;
Quit;
Got the order of values wrong.
I recommend you make two changes:
Why is Cetuximab associated with the second but not the first drug list for ID=1?
What you showed in the output is this. @mkeintz , @newsas007 please notice this response
data A;
informat id 3. date_a mmddyy10. Drug_list $200.;
input id date_a Drug_list;
format id 3. date_a mmddyy10. Drug_list $200.;
cards;
1 1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1 2/10/2019 Cetuximab,Panitumumab
2 3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
3 1/2/2018 Alpelisib,Everolimus,Temsirolimus
4 2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus
5 3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
6 1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7 2/10/2019 Alpelisib,Everolimus,Temsirolimus
;;;
proc sort; by id; run;
data B;
informat id 3. date_b mmddyy10. Drug $20.;
input id date_b Drug;
format id 3. date_b mmddyy10. Drug $200.;
cards;
1 2/01/2018 Panitumumab
1 3/02/2019 Binimetinib
1 3/28/2019 Cetuximab
2 3/23/2019 Trametinib
2 3/25/2019 Temsirolimus
3 2/01/2018 Alpelisib
5 3/21/2019 Dacomitinib
4 3/01/2019 Everolimus
6 1/17/2018 Idelalisib
7 3/02/2019 Temsirolimus
;;;
proc sort; by id; run;
data c;
merge b a;
by id;
if findw( drug_list,strip(drug),',') = 0 then drug_list='';
run;
Which might be wrong as you get this warning in the log
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 10 observations read from the data set WORK.B.
NOTE: There were 8 observations read from the data set WORK.A.
NOTE: The data set WORK.C has 10 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
instead the result should be
Proc sql;
Create table c2 as select a.date_a,a.drug_list, b.date_b, b.Drug,b.id from a a right join b b
on a.id=b.id and findw( a.drug_list,strip(b.drug),',') > 0 and (b.date_b);
Quit;
date_a | Drug_list | date_b | Drug | id |
1/2/2018 | Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab | 2/1/2018 | Panitumumab | 1 |
1/2/2018 | Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab | 3/28/2019 | Cetuximab | 1 |
2/10/2019 | Cetuximab,Panitumumab | 2/1/2018 | Panitumumab | 1 |
2/10/2019 | Cetuximab,Panitumumab | 3/28/2019 | Cetuximab | 1 |
3/2/2019 | Binimetinib | 1 | ||
3/3/2019 | Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib | 3/23/2019 | Trametinib | 2 |
3/25/2019 | Temsirolimus | 2 | ||
1/2/2018 | Alpelisib,Everolimus,Temsirolimus | 2/1/2018 | Alpelisib | 3 |
2/10/2019 | Bosutinib,Dasatinib,Everolimus,Temsirolimus | 3/1/2019 | Everolimus | 4 |
3/3/2019 | Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab | 3/21/2019 | Dacomitinib | 5 |
1/2/2018 | Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib | 1/17/2018 | Idelalisib | 6 |
2/10/2019 | Alpelisib,Everolimus,Temsirolimus | 3/2/2019 | Temsirolimus | 7 |
Adding a new variable does not change the result any way unless there is a business logic associated with it or the column is on both datasets. See my earlier response. You get your desired result using a data step merge, but this is wrong.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.