Solved: Creating variables using PRXMATCH() and SCAN()

pkm_edu · Posted 05-01-2023 12:39 AM

The program below creates variables using PRXMATCH() and SCAN() in SAS 9.4 M6.

The values for FILE_ID look correct, but the values for FILE_ID_X are not what I want.

data test;
length file_id $ 10;  
infile datalines length = reclen; 
input string $varying32767. reclen;
position  = prxmatch('m/"HC-\w+/i',string); 
position2 = prxmatch('m/[HC]-\w+ /i',string); 
if position ^= 0 then do;
    file_id = scan(string, 2, '"');
    file_id_x = scan(string, 2, '>');
    output;
end;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
;
run;

options nocenter ls=132;
options obs=max;
proc print data=test;
var file_id: ;
run;

The current PROC PRINT output:

Obs file_id file_id_x

1 HC-220A MEPS HC-220A: 2020 Prescribed Medicines File</option
2 HC-224 MEPS HC-219: 2020 Full Year Population Characteristics File</option
3 HC-010I HC-010I Appendix to MEPS 1996 Event Files</option

The desired PROC PRINT output:

Obs file_id file_id_x

1 HC-220A HC-220A
2 HC-224 HC-219
3 HC-010I HC-010I

Question: What changes should I make to the program to get the desired output for FILE_ID_X?

Any help would be appreciated.

Thank you,


ballardw · Posted 05-01-2023 02:55 AM

File_id_x is created and shows the values I would expect given the code and values that you show.

This says that SCAN is supposed to only consider the > character as a delimiter.

file_id_x = scan(string, 2, '>');

In your first value

<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
                       ^First delimiter                                     ^Second delimiter

So given that code the result should be

MEPS HC-220A: 2020 Prescribed Medicines File</option
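
To see how SCAN splits this line when '>' is the only delimiter, here is a minimal sketch. The variable names n and word are only for illustration; the string is the first record from the question.

```sas
data _null_;
    length string word $ 200;
    string = '<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>';
    /* Print the first two "words" when only '>' separates them */
    do n = 1 to 2;
        word = scan(string, n, '>');
        put n= word=;
    end;
run;
```

Word 2 is everything between the first '>' and the last '>', which is why the label text and the trailing </option come back together.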

If the value you want always starts with HC- (which your examples imply but you have not stated), I might look for the position of the second HC- (are there always two?) and then take the rest of the value from there. But I am not sure that you have provided a clear enough rule for determining what you are actually looking for.
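
A minimal sketch of that idea, assuming every line of interest contains HC- twice and that the wanted value ends at a blank, a colon, or a '<'. The names second_hc, p1, and p2 are illustrative and not from the original program.

```sas
data second_hc;
    length file_id_x $ 10;
    infile datalines truncover;
    input string $200.;
    p1 = find(string, 'HC-');                         /* first HC- (inside value="...") */
    if p1 > 0 then p2 = find(string, 'HC-', p1 + 3);  /* next HC- after the first one   */
    if p2 > 0 then do;
        /* Take the text from the second HC- up to the first blank, colon, or '<' */
        file_id_x = scan(substr(string, p2), 1, ' :<');
        output;
    end;
    keep file_id_x;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
;
run;
```

Lines with fewer than two occurrences of HC- are skipped, so the rule would need adjusting if some records carry the value only once.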

qoit · Posted 05-01-2023 03:09 AM

Kind of elementary, but it should do if it's just the first three lines. I also don't understand why you've used regex:

data test;
length file_id $ 10;  
infile datalines length = reclen; 
input string $varying32767. reclen;
/*position  = prxmatch('m/"HC-\w+/i',string); */
/*position2 = prxmatch('m/>PSHC-\w+ /i',string); */
/*	if position ^= 0 then do;*/
file_id = scan(string, 2, '"');
file_id_x = scan(string, 2, '>');
if file_id_x =: "ME" then file_id_x = substr(file_id_x,6,7);
else if file_id_x =: "H" then file_id_x = substr(file_id_x,1,7);
output;
drop string;

datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
;
run;

pkm_edu · Posted 05-01-2023 02:27 PM

Thanks to Qoit and Ballardw. I revised the solution suggested by Qoit to accommodate additional scenarios from the data (the last two records below). The revision uses LENGTH(file_id) as the third argument of SUBSTR() instead of a fixed length. Any other solutions (e.g., regex) are welcome.

data test;
length file_id $ 10;  
infile datalines length = reclen; 
input string $varying32767. reclen;
position  = prxmatch('m/"HC-\w+/i',string);
if position ^= 0 then do;
    file_id = scan(string, 2, '"');
    file_id_x = scan(string, 2, '>');
    if file_id_x =: "ME" then file_id_x = substr(file_id_x,6,LENGTH(file_id));
    else if file_id_x =: "H" then file_id_x = substr(file_id_x,1,LENGTH(file_id));
    output;
end;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
<option value="HC-036BRR">MEPS HC-036BRR: MEPS 1996-2020 Replicate File for BRR Variance Estimation</option>
<td width="76" height="0"><strong><label for="datatype2"><span class="small">File type:</label></strong></span></td>
;
run;
proc print data=test;
var file_id file_id_x;
run;

Obs file_id file_id_x

1  HC-220A        HC-220A
2  HC-224         HC-219
3  HC-010I        HC-010I
4  HC-036BRR     HC-036BRR


pkm_edu · Posted 05-01-2023 02:46 PM

How about replacing the following two statements

position  = prxmatch('m/"HC-\w+/i',string);
	if position ^= 0 then do;

with this statement below?

if find(string,'HC-')>0 then do;
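
The two tests are not quite the same, which the sketch below tries to show. The regex requires a double quote before HC- and ignores case, while FIND without modifiers is case-sensitive and matches HC- anywhere in the line. The second data line is made up for illustration, and rx_hit, find_hit, and find_i are hypothetical names.

```sas
data _null_;
    infile datalines truncover;
    input string $200.;
    rx_hit   = prxmatch('/"HC-\w+/i', string) > 0;  /* needs a double quote before HC-     */
    find_hit = find(string, 'HC-') > 0;             /* case-sensitive, anywhere in the line */
    find_i   = find(string, 'HC-', 'i') > 0;        /* 'i' modifier ignores case            */
    put rx_hit= find_hit= find_i=;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<p>See the hc-220A documentation</p>
;
run;
```

For the records shown in this thread both tests select the same lines, so the replacement is safe here. The difference only matters if the real data contains HC- in other contexts.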

pkm_edu · Posted 05-02-2023 02:28 PM

Thank you so much for explaining the code.

PGStats · Posted 05-01-2023 03:21 PM

Using regex functions:

data test;
length file_id file_id_x $ 10; 
infile datalines truncover; 
input string $32767.;
if not id then id + prxParse('|value="(HC-\w+)".*(HC-\w+).*</option>|i');
if prxmatch(id, string) then do;
    file_id = prxPosn(id, 1, string);
    file_id_x = prxPosn(id, 2, string);
    output;
    end;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
<option value="HC-036BRR">MEPS HC-036BRR: MEPS 1996-2020 Replicate File for BRR Variance Estimation</option>
<td width="76" height="0"><strong><label for="datatype2"><span class="small">File type:</label></strong></span></td>
;

prxPosn returns the substring from the nth capture buffer. A capture buffer is part of a match, enclosed in parentheses, that is specified in a regular expression.

The matching pattern could be made more (or less) stringent, according to your needs.
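
As one possible tightening, the sketch below anchors the pattern at the start of the line and expects the second id right after the '>' with an optional "MEPS " prefix. That rule is an assumption drawn from the sample records only, and the data set name strict is illustrative.

```sas
data strict;
    length file_id file_id_x $ 10;
    infile datalines truncover;
    input string $200.;
    retain id;
    /* Assumed rule: line starts with <option value="HC-...">, and the label */
    /* starts with an optional "MEPS " followed by the second id.            */
    if _n_ = 1 then id = prxparse('/^<option value="(HC-\w+)">(?:MEPS )?(HC-\w+)/i');
    if prxmatch(id, string) then do;
        file_id   = prxposn(id, 1, string);  /* first capture buffer  */
        file_id_x = prxposn(id, 2, string);  /* second capture buffer */
        output;
    end;
    keep file_id file_id_x;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
<option value="HC-036BRR">MEPS HC-036BRR: MEPS 1996-2020 Replicate File for BRR Variance Estimation</option>
<td width="76" height="0"><strong><label for="datatype2"><span class="small">File type:</label></strong></span></td>
;
run;
```

A stricter pattern rejects more unexpected lines but also silently drops records whose label has a different prefix, so it is worth checking the count of unmatched option lines.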

PG

pkm_edu · Posted 05-01-2023 04:42 PM

Thanks to PGStats! This is certainly an interesting solution.

Could you please explain the code below if possible?

if not id then id + prxParse('|value="(HC-\w+)".*(HC-\w+).*</option>|i');
if prxmatch(id, string) then do;
    file_id = prxPosn(id, 1, string);
    file_id_x = prxPosn(id, 2, string);

PGStats · Posted 05-02-2023 02:23 PM

I'll explain the first bit, since it is a bit tricky. The rest is well explained in SAS documentation for the PRX functions.

if not id then id + prxParse('...');

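This statement parses the regex pattern only once. Because the variable id appears on the left side of a sum statement, SAS initializes it to 0 (rather than missing) and retains its value across iterations. On the first iteration the test if not id is therefore true and the sum statement id + prxParse('...') runs, which sets id to the pattern identifier returned by prxParse. That nonzero value is retained, so the test is false on every later iteration and the pattern is not parsed again.

This statement is equivalent to: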

retain id;

if _n_ = 1 then id = prxParse('...');
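
A small sketch to check this behaviour in the log. The three data lines and the variable x are placeholders; only the value of id before and after the statement matters.

```sas
data _null_;
    input x;
    put 'before: ' _n_= id=;   /* id is 0 on the first iteration */
    if not id then id + prxparse('/HC-\w+/i');
    put 'after:  ' _n_= id=;   /* id now holds the pattern identifier */
datalines;
1
2
3
;
run;
```

On iterations 2 and 3 the "before" and "after" values should be identical, which shows that the parse ran only once and the identifier was retained.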

PG
