The program below creates variables using PRXMATCH() and SCAN() in SAS 9.4 M6.
The output for the variable FILE_ID seems to be correct while the values for FILE_ID_X are incorrect.
data test;
length file_id $ 10;
infile datalines length = reclen;
input string $varying32767. reclen;
position = prxmatch('m/"HC-\w+/i',string);
position2 = prxmatch('m/[HC]-\w+ /i',string);
if position ^= 0 then do;
file_id = scan(string, 2, '"');
file_id_x = scan(string, 2, '>');
output;
end;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
;
run;
options nocenter ls=132;
options obs=max;
proc print data=test;
var file_id: ;
run;
The current PROC PRINT output:
Obs file_id file_id_x
1 HC-220A MEPS HC-220A: 2020 Prescribed Medicines File</option
2 HC-224 MEPS HC-219: 2020 Full Year Population Characteristics File</option
3 HC-010I HC-010I Appendix to MEPS 1996 Event Files</option
The desired PROC PRINT output:
Obs file_id file_id_x
1 HC-220A HC-220A
2 HC-224 HC-219
3 HC-010I HC-010I
Question: What changes would I make to the program to get the desired output for FILE_ID_X?
Any help would be appreciated.
Thank you,
How about replacing the following two statements
position = prxmatch('m/"HC-\w+/i',string);
if position ^= 0 then do;
with this statement below?
if find(string,'HC-')>0 then do;
File_id_x is created and shows the values I would expect given the code and values that you show.
The statement below tells SCAN to treat only the > character as a delimiter.
file_id_x = scan(string, 2, '>');
In your first value
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
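The first > closes the opening <option ...> tag and the second > ends </option>. SCAN returns the second word, which is everything between those two delimiters.
So given that code the result should be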
MEPS HC-220A: 2020 Prescribed Medicines File</option
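To see how SCAN splits that value, a small sketch like the following writes each word to the log. The variable names i and word are made up for this illustration and are not part of the original program.

```sas
data _null_;
  length word $ 100;
  string = '<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>';
  /* COUNTW and SCAN are both given the same single delimiter, '>' */
  do i = 1 to countw(string, '>');
    word = scan(string, i, '>');
    put i= word=;
  end;
run;
```

The second word printed is the text between the two > characters, which is what the original program stored in FILE_ID_X.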
If the value you want always starts with HC- (your examples imply this, but you have not stated it), I might find the position of the second HC- (are there always two?) and take the text from there. But I am not sure you have given a clear enough rule for what you are actually looking for.
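A minimal sketch of that idea, assuming the rule really is "take the ID that starts at the second HC- in the line" and that the ID ends at the next blank, colon, or < character. The data set name test2 and the variables first and second are hypothetical.

```sas
data test2;
  length file_id file_id_x $ 10;
  infile datalines truncover;
  input string $200.;
  /* position of the first HC-, then of the next one after it */
  first = find(string, 'HC-');
  if first > 0 then second = find(string, 'HC-', first + 1);
  if second > 0 then do;
    file_id   = scan(string, 2, '"');
    /* text from the second HC- up to the next blank, colon, or < */
    file_id_x = scan(substr(string, second), 1, ' :<');
    output;
  end;
  drop first second;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
;
run;
```

Lines with fewer than two occurrences of HC- are not output, so this only works if the assumed rule holds for the whole file.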
Kind of elementary, but it should do if it's just the first three lines. I also don't understand why you've used regex:
data test;
length file_id $ 10;
infile datalines length = reclen;
input string $varying32767. reclen;
/*position = prxmatch('m/"HC-\w+/i',string); */
/*position2 = prxmatch('m/>PSHC-\w+ /i',string); */
/* if position ^= 0 then do;*/
file_id = scan(string, 2, '"');
file_id_x = scan(string, 2, '>');
if file_id_x =: "ME" then file_id_x = substr(file_id_x,6,7);
else if file_id_x =: "H" then file_id_x = substr(file_id_x,1,7);
output;
drop string;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
;
run;
Thanks to Qoit and Ballardw. I revised the solution suggested by Qoit to accommodate additional scenarios (the last two records) from the data. My revision uses LENGTH(file_id) as the third argument of SUBSTR(), so the extracted value is as long as FILE_ID rather than a fixed 7 characters. Any other solutions (e.g., regex) are welcome.
data test;
length file_id $ 10;
infile datalines length = reclen;
input string $varying32767. reclen;
position = prxmatch('m/"HC-\w+/i',string);
if position ^= 0 then do;
file_id = scan(string, 2, '"');
file_id_x = scan(string, 2, '>');
if file_id_x =: "ME" then file_id_x = substr(file_id_x,6,LENGTH(file_id));
else if file_id_x =: "H" then file_id_x = substr(file_id_x,1,LENGTH(file_id));
output;
end;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
<option value="HC-036BRR">MEPS HC-036BRR: MEPS 1996-2020 Replicate File for BRR Variance Estimation</option>
<td width="76" height="0"><strong><label for="datatype2"><span class="small">File type:</label></strong></span></td>
;
run;
proc print data=test;
var file_id file_id_x;
run;
Obs file_id file_id_x
1 HC-220A HC-220A
2 HC-224 HC-219
3 HC-010I HC-010I
4 HC-036BRR HC-036BRR
Thank you so much for explaining the code.
Using regex functions:
data test;
length file_id file_id_x $ 10;
infile datalines truncover;
input string $32767.;
if not id then id + prxParse('|value="(HC-\w+)".*(HC-\w+).*</option>|i');
if prxmatch(id, string) then do;
file_id = prxPosn(id, 1, string);
file_id_x = prxPosn(id, 2, string);
output;
end;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
<option value="HC-036BRR">MEPS HC-036BRR: MEPS 1996-2020 Replicate File for BRR Variance Estimation</option>
<td width="76" height="0"><strong><label for="datatype2"><span class="small">File type:</label></strong></span></td>
;
run;
prxPosn returns the substring from the nth capture buffer. A capture buffer is part of a match, enclosed in parentheses, that is specified in a regular expression.
The matching pattern could be made more (or less) stringent, according to your needs.
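As one hedged illustration of a more stringent pattern: the sketch below requires the <option ...> tag itself and uses a lazy .*? so that the second capture is the first HC- after the closing >, whereas the greedy .* in the pattern above would pick the last HC- on the line. The data set name test3 is hypothetical, and the pattern is only one of many possible variants.

```sas
data test3;
  length file_id file_id_x $ 10;
  infile datalines truncover;
  input string $200.;
  retain id;
  /* stricter variant: <option value="..."> must be present, */
  /* and the lazy .*? stops at the first HC- after the tag   */
  if _n_ = 1 then id = prxParse('|<option value="(HC-\w+)">.*?(HC-\w+)|i');
  if prxmatch(id, string) then do;
    file_id   = prxPosn(id, 1, string);
    file_id_x = prxPosn(id, 2, string);
    output;
  end;
  drop id;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
;
run;
```

For the sample records both patterns capture the same values, because each line has only one HC- after the tag. The difference would show up only on lines where the description mentions more than one file ID.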
Thanks to PGStat! This is certainly an interesting solution.
Could you please explain the code below if possible?
if not id then id + prxParse('|value="(HC-\w+)".*(HC-\w+).*</option>|i');
if prxmatch(id, string) then do;
file_id = prxPosn(id, 1, string);
file_id_x = prxPosn(id, 2, string);
I'll explain the first bit, since it is a bit tricky. The rest is well explained in SAS documentation for the PRX functions.
if not id then id + prxParse('...');
This statement parses the regex pattern only once. Because id appears on the left side of a sum statement, SAS initializes it to 0 (not missing) and retains its value across iterations. On the first iteration the test if not id is therefore true, and the sum statement id + prxParse('...') runs, which adds the pattern identifier returned by prxParse to id. On later iterations id is nonzero, so the pattern is not parsed again.
This statement is equivalent to:
retain id;
if _n_ = 1 then id = prxParse('...');