Hi all,
This is my first time working with text data in SAS. I am working on scanning and merging by text data. The Dataset A contains a list of drugs and Dataset B has the actual drug. Each drug_list in data A is associated with a gene type. I want to merge the two datasets if the drug variable in Dataset B is listed in one of the drug list in Dataset A corresponding to the Gene type (G_type).
data A
| ID | Date_a | G_type | Drug_list |
| 1 | 1/2/2018 | a | Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab |
| 1 | 2/10/2019 | b | Cetuximab,Panitumumab |
| 2 | 3/3/2019 | c | Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib |
| 3 | 1/2/2018 | d | Alpelisib,Everolimus,Temsirolimus |
| 4 | 2/10/2019 | e | Bosutinib,Dasatinib,Everolimus,Temsirolimus |
| 5 | 3/3/2019 | a | Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab |
| 6 | 1/2/2018 | c | Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib |
| 7 | 2/10/2019 | d | Alpelisib,Everolimus,Temsirolimus |
Data B
| ID | Drug | Date_b |
| 1 | Panitumumab | 2/1/2018 |
| 1 | Binimetinib | 3/2/2019 |
| 1 | Cetuximab | 3/28/2019 |
| 2 | Trametinib | 3/23/2019 |
| 2 | Temsirolimus | 3/25/2019 |
| 3 | Alpelisib | 2/1/2018 |
| 5 | Dacomitinib | 3/21/2019 |
| 4 | Everolimus | 3/1/2019 |
| 6 | Idelalisib | 1/17/2018 |
| 7 | Temsirolimus | 3/2/2019 |
Data Want:
| ID | Drug | Date_b | Date_a | G_type | Drug_List |
| 1 | Panitumumab | 2/1/2018 | 1/2/2018 | a | Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab |
| 1 | Binimetinib | 3/2/2019 | |||
| 1 | Cetuximab | 3/28/2019 | 2/10/2019 | b | Cetuximab,Panitumumab |
| 2 | Trametinib | 3/23/2019 | 3/3/2019 | c | Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib |
| 2 | Temsirolimus | 3/25/2019 | |||
| 3 | Alpelisib | 2/1/2018 | 1/2/2018 | d | Alpelisib,Everolimus,Temsirolimus |
| 5 | Dacomitinib | 3/21/2019 | 3/3/2019 | a | Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab |
| 4 | Everolimus | 3/1/2019 | 2/10/2019 | e | Bosutinib,Dasatinib,Everolimus,Temsirolimus |
| 6 | Idelalisib | 1/17/2018 | 1/2/2018 | c | Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib |
| 7 | Temsirolimus | 3/2/2019 | 2/10/2019 | d | Alpelisib,Everolimus,Temsirolimus |
Here is the program and it doesn't work.
data A;
informat id 3. date_a mmddyy10. Drug_list $200.;
input id date_a Drug_list;
format id 3. date_a mmddyy10. Drug_list $200.;
cards;
1 1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1 2/10/2019 Cetuximab,Panitumumab
2 3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
3 1/2/2018 Alpelisib,Everolimus,Temsirolimus
4 2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus
5 3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
6 1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7 2/10/2019 Alpelisib,Everolimus,Temsirolimus
;;;
data B;
informat id 3. date_b mmddyy10. Drug $20.;
input id date_b Drug;
format id 3. date_b mmddyy10. Drug $200.;
cards;
1 2/01/2018 Panitumumab
1 3/02/2019 Binimetinib
1 3/28/2019 Cetuximab
2 3/23/2019 Trametinib
2 3/25/2019 Temsirolimus
3 2/01/2018 Alpelisib
5 3/21/2019 Dacomitinib
4 3/01/2019 Everolimus
6 1/17/2018 Idelalisib
7 3/02/2019 Temsirolimus
;;;
Proc sql;
Create table c as select a.*, b.date_b, b.Drug from a a, b b
Where index(b.drug, a.drug_list) > 0;
Quit;
Here is the log:
137 Where index(b.drug, a.drug_list) > 0;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can
not be optimized.
NOTE: Table WORK.C created, with 0 rows and 5 columns.
This doesn't work...
data have;
informat id 3. date_a mmddyy10. Drug_list $200.;
input id A Drug_list;
format id 3. A mmddyy10. Drug_list $200.;
cards;
1 1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1 2/10/2019 Cetuximab,Panitumumab
2 3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
3 1/2/2018 Alpelisib,Everolimus,Temsirolimus
4 2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus
5 3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
6 1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7 2/10/2019 Alpelisib,Everolimus,Temsirolimus
;;;
data want;
informat id 3. date_b mmddyy10. Drug $20.;
input id A Drug;
format id 3. A mmddyy10. Drug $200.;
cards;
ID Date_b Drug
1 2/1/2018 Panitumumab
1 3/2/2019 Binimetinib
1 3/28/2019 Cetuximab
2 3/23/2019 Trametinib
2 3/25/2019 Temsirolimus
3 2/1/2018 Alpelisib
5 3/21/2019 Dacomitinib
4 3/1/2019 Everolimus
6 1/17/2018 Idelalisib
7 3/2/2019 Temsirolimus
;;;
Proc sql;
Create table c as select a.*, b.* from a a, b b
Where index(b.drug, a.drug_list) >0;
Quit;
data A;
informat id 3. date_a mmddyy10. Drug_list $200.;
input id date_a Drug_list;
format id 3. date_a mmddyy10. Drug_list $200.;
cards;
1 1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1 2/10/2019 Cetuximab,Panitumumab
2 3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
3 1/2/2018 Alpelisib,Everolimus,Temsirolimus
4 2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus
5 3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
6 1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7 2/10/2019 Alpelisib,Everolimus,Temsirolimus
;;;
data B;
informat id 3. date_b mmddyy10. Drug $20.;
input id date_b Drug;
format id 3. date_b mmddyy10. Drug $200.;
cards;
1 2/01/2018 Panitumumab
1 3/02/2019 Binimetinib
1 3/28/2019 Cetuximab
2 3/23/2019 Trametinib
2 3/25/2019 Temsirolimus
3 2/01/2018 Alpelisib
5 3/21/2019 Dacomitinib
4 3/01/2019 Everolimus
6 1/17/2018 Idelalisib
7 3/02/2019 Temsirolimus
;;;
Proc sql;
Create table c as select a.*, b.date_b, b.Drug from a a, b b
Where index( a.drug_list,strip(b.drug)) > 0;
Quit;
Got the order of values wrong.
I recommend you make two changes:
Why is Cetuximab associated with the second but not the first drug list for ID=1?
What you showed in the output is this. @mkeintz , @newsas007 please notice this response
data A;
informat id 3. date_a mmddyy10. Drug_list $200.;
input id date_a Drug_list;
format id 3. date_a mmddyy10. Drug_list $200.;
cards;
1 1/2/2018 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
1 2/10/2019 Cetuximab,Panitumumab
2 3/3/2019 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
3 1/2/2018 Alpelisib,Everolimus,Temsirolimus
4 2/10/2019 Bosutinib,Dasatinib,Everolimus,Temsirolimus
5 3/3/2019 Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab
6 1/2/2018 Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib
7 2/10/2019 Alpelisib,Everolimus,Temsirolimus
;;;
proc sort; by id; run;
data B;
informat id 3. date_b mmddyy10. Drug $20.;
input id date_b Drug;
format id 3. date_b mmddyy10. Drug $200.;
cards;
1 2/01/2018 Panitumumab
1 3/02/2019 Binimetinib
1 3/28/2019 Cetuximab
2 3/23/2019 Trametinib
2 3/25/2019 Temsirolimus
3 2/01/2018 Alpelisib
5 3/21/2019 Dacomitinib
4 3/01/2019 Everolimus
6 1/17/2018 Idelalisib
7 3/02/2019 Temsirolimus
;;;
proc sort; by id; run;
data c;
merge b a;
by id;
if findw( drug_list,strip(drug),',') = 0 then drug_list='';
run;
Which might be wrong as you get this warning in the log
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 10 observations read from the data set WORK.B.
NOTE: There were 8 observations read from the data set WORK.A.
NOTE: The data set WORK.C has 10 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
instead the result should be
Proc sql;
Create table c2 as select a.date_a,a.drug_list, b.date_b, b.Drug,b.id from a a right join b b
on a.id=b.id and findw( a.drug_list,strip(b.drug),',') > 0 and (b.date_b);
Quit;
| date_a | Drug_list | date_b | Drug | id |
| 1/2/2018 | Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab | 2/1/2018 | Panitumumab | 1 |
| 1/2/2018 | Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab | 3/28/2019 | Cetuximab | 1 |
| 2/10/2019 | Cetuximab,Panitumumab | 2/1/2018 | Panitumumab | 1 |
| 2/10/2019 | Cetuximab,Panitumumab | 3/28/2019 | Cetuximab | 1 |
| 3/2/2019 | Binimetinib | 1 | ||
| 3/3/2019 | Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib | 3/23/2019 | Trametinib | 2 |
| 3/25/2019 | Temsirolimus | 2 | ||
| 1/2/2018 | Alpelisib,Everolimus,Temsirolimus | 2/1/2018 | Alpelisib | 3 |
| 2/10/2019 | Bosutinib,Dasatinib,Everolimus,Temsirolimus | 3/1/2019 | Everolimus | 4 |
| 3/3/2019 | Afatinib,Cetuximab,Dacomitinib,Erlotinib,Gefitinib,Lapatinib,Necitumumab,Osimertinib,Panitumumab | 3/21/2019 | Dacomitinib | 5 |
| 1/2/2018 | Binimetinib,Cetuximab,Cobimetinib,Idelalisib,Panitumumab,Trametinib | 1/17/2018 | Idelalisib | 6 |
| 2/10/2019 | Alpelisib,Everolimus,Temsirolimus | 3/2/2019 | Temsirolimus | 7 |
Adding a new variable does not change the result any way unless there is a business logic associated with it or the column is on both datasets. See my earlier response. You get your desired result using a data step merge, but this is wrong.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.