Hi,
I have a character variable named "Title and Authors" that contains long text. I want to extract the last author's name, which lies between the last comma of the author list and the first period (Thomas PG and Throm RE in the examples below):
Title and Authors
1- Minervina AA, Pogorelyy MV, Kirk AM, Crawford JC, Allen EK, Chou CH, Mettelman RC, Allison KJ, Lin CY, Brice DC, Zhu X, Vegesana K, Wu G, Trivedi S, Kottapalli P, Darnell D, McNeely S, Olsen SR, Schultz-Cherry S, McGargill MA, Wolf J, Thomas PG. SARS-CoV-2 antigen exposure history shapes phenotypes and specificity of memory CD8(+) T cells.
2- Bauler M, Roberts JK, Wu CC, Fan B, Ferrara F, Yip BH, Diao S, Kim YI, Moore J, Zhou S, Wielgosz MM, Ryu B, Throm RE. Production of lentiviral vectors using suspension cells grown in serum-free media.
Thanks
Since the style does not seem to put periods after author initials, finding the last author looks pretty simple: take the text before the first period, then take the last comma-separated item from it.
data have;
infile cards truncover;
input line $500.;
cards4;
1- Minervina AA, Pogorelyy MV, Kirk AM, Crawford JC, Allen EK, Chou CH, Mettelman RC, Allison KJ, Lin CY, Brice DC, Zhu X, Vegesana K, Wu G, Trivedi S, Kottapalli P, Darnell D, McNeely S, Olsen SR, Schultz-Cherry S, McGargill MA, Wolf J, Thomas PG. SARS-CoV-2 antigen exposure history shapes phenotypes and specificity of memory CD8(+) T cells.
2- Bauler M, Roberts JK, Wu CC, Fan B, Ferrara F, Yip BH, Diao S, Kim YI, Moore J, Zhou S, Wielgosz MM, Ryu B, Throm RE. Production of lentiviral vectors using suspension cells grown in serum-free media.
;;;;
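data want;
length author $80;
set have;
/* Inner SCAN: text before the first period (the author list). */
/* Outer SCAN: last comma-separated item; LEFT drops the leading blank. */
author=left(scan(scan(line,1,'.'),-1,','));
run;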
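If the variable name really contains blanks (rather than "Title and Authors" being only a label), the same logic can be written with a SAS name literal. The following is a minimal sketch under that assumption; the data set name `mydata` is hypothetical, and the sample rows are copied from the question.

```sas
/* Assumption: the variable is literally named "Title and Authors". */
/* VALIDVARNAME=ANY allows names with blanks, written as name literals. */
options validvarname=any;

data mydata;   /* hypothetical data set name */
infile cards truncover;
input 'Title and Authors'n $500.;
cards4;
1- Minervina AA, Pogorelyy MV, Kirk AM, Crawford JC, Allen EK, Chou CH, Mettelman RC, Allison KJ, Lin CY, Brice DC, Zhu X, Vegesana K, Wu G, Trivedi S, Kottapalli P, Darnell D, McNeely S, Olsen SR, Schultz-Cherry S, McGargill MA, Wolf J, Thomas PG. SARS-CoV-2 antigen exposure history shapes phenotypes and specificity of memory CD8(+) T cells.
2- Bauler M, Roberts JK, Wu CC, Fan B, Ferrara F, Yip BH, Diao S, Kim YI, Moore J, Zhou S, Wielgosz MM, Ryu B, Throm RE. Production of lentiviral vectors using suspension cells grown in serum-free media.
;;;;

data want;
length author $80;
set mydata;
/* Same two-step SCAN as above, applied to the name literal. */
author=left(scan(scan('Title and Authors'n,1,'.'),-1,','));
run;
```

If "Title and Authors" turns out to be a label, PROC CONTENTS will show the actual variable name to use instead.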
The code below should work as long as the author list is always the second-to-last period-delimited segment of your source string, that is, the title contains no period other than the final one.
data _null_;
string='2- Bauler M, Roberts JK, Wu CC, Fan B, Ferrara F, Yip BH, Diao S, Kim YI, Moore J, Zhou S, Wielgosz MM, Ryu B, Throm RE. Production of lentiviral vectors using suspension cells grown in serum-free media.';
length lastAuthor $40 listAuthors $200;
listAuthors=scan(string,-2,'.');
lastAuthor=scan(listAuthors,-1,',');
put lastAuthor=;
run;
Thank you. The code you provided worked only on the second observation. However, I have about 1900 rows/observations from which I want to extract the last author's name. Is there a way to make this code work on all observations?
The length for listAuthors was too short for your first sample. If you increase the length then the code as posted works for both rows of your sample data.
Whether it works for all your data depends on the assumption that the text after the authors does not contain an embedded period.
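To run this over every row and spot the ones where the period assumption may not hold, something along these lines could be used. It is only a sketch: the names `have` and `line` follow the other reply in this thread, the variables `nPeriods` and `needsReview` are hypothetical, and the lengths are guesses that may need adjusting for the real data.

```sas
/* Sample rows copied from the question. */
data have;
infile cards truncover;
input line $500.;
cards4;
1- Minervina AA, Pogorelyy MV, Kirk AM, Crawford JC, Allen EK, Chou CH, Mettelman RC, Allison KJ, Lin CY, Brice DC, Zhu X, Vegesana K, Wu G, Trivedi S, Kottapalli P, Darnell D, McNeely S, Olsen SR, Schultz-Cherry S, McGargill MA, Wolf J, Thomas PG. SARS-CoV-2 antigen exposure history shapes phenotypes and specificity of memory CD8(+) T cells.
2- Bauler M, Roberts JK, Wu CC, Fan B, Ferrara F, Yip BH, Diao S, Kim YI, Moore J, Zhou S, Wielgosz MM, Ryu B, Throm RE. Production of lentiviral vectors using suspension cells grown in serum-free media.
;;;;

data want;
set have;
/* listAuthors is long enough to hold the first sample's author list. */
length listAuthors $500 lastAuthor $80;
/* STRIP removes padding blanks so the final period ends the string. */
listAuthors=scan(strip(line),-2,'.');
lastAuthor=strip(scan(listAuthors,-1,','));
/* Hypothetical check: anything other than two periods deserves a look. */
nPeriods=countc(line,'.');
needsReview=(nPeriods ne 2);
drop listAuthors;
run;

proc print data=want;
var lastAuthor nPeriods needsReview;
run;
```

Rows with `needsReview=1` are the ones where the title (or an author entry) probably contains an extra period, so the extracted name should be checked by hand or handled with the first-period approach instead.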