pkm_edu
Quartz | Level 8

The program below creates variables using PRXMATCH() and SCAN() in SAS 9.4 M6.

The output for the variable FILE_ID seems to be correct while the values for FILE_ID_X are incorrect.

data test;
length file_id $ 10;  
infile datalines length = reclen; 
input string $varying32767. reclen;
position  = prxmatch('m/"HC-\w+/i',string); 
position2 = prxmatch('m/[HC]-\w+ /i',string); 
	if position ^= 0 then do;
	    file_id = scan(string, 2, '"');
		file_id_x = scan(string, 2, '>');
	   output;
	end;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
;
run;

options nocenter ls=132;
options obs=max;
proc print data=test;
var file_id: ;
run;

 

The current PROC PRINT output:

Obs  file_id    file_id_x

1    HC-220A    MEPS HC-220A: 2020 Prescribed Medicines File</option
2    HC-224     MEPS HC-219: 2020 Full Year Population Characteristics File</option
3    HC-010I    HC-010I Appendix to MEPS 1996 Event Files</option

The desired PROC PRINT output:

Obs  file_id    file_id_x

1    HC-220A    HC-220A
2    HC-224     HC-219
3    HC-010I    HC-010I

Question: What changes should I make to the program so that FILE_ID_X matches the desired output?

Any help would be appreciated.

Thank you,

1 ACCEPTED SOLUTION

Accepted Solutions
pkm_edu
Quartz | Level 8

How about replacing the following two statements

position  = prxmatch('m/"HC-\w+/i',string);
	if position ^= 0 then do;

with this statement below?

if find(string,'HC-')>0 then do;
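
One caveat with this swap: the original pattern used the /i flag, while FIND is case-sensitive unless it is given the 'i' modifier. FIND also does not require the leading double quote that the regex looked for, so it is a slightly looser test. Below is a minimal sketch of the filter on two records from this thread; the dataset name check and the variable keep_row are made up for illustration.

```sas
data check;
  infile datalines truncover;
  input string $200.;
  /* 'i' makes FIND ignore case, mirroring the /i flag of the original regex */
  keep_row = (find(string, 'HC-', 'i') > 0);
  /* write the flag and the record to the log for inspection */
  put keep_row= string=;
datalines;
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<td width="76" height="0"><strong><label for="datatype2"><span class="small">File type:</label></strong></span></td>
;
run;
```

The first record should be flagged for keeping and the <td> record should not, since it contains no HC- text.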

 

 


8 REPLIES 8
ballardw
Super User

File_id_x is created and shows the values I would expect given the code and values that you show.

This says that SCAN is supposed to only consider the > character as a delimiter.

file_id_x = scan(string, 2, '>');

In your first value

<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
                       ^First delimiter                                     ^Second delimiter

So given that code the result should be

MEPS HC-220A: 2020 Prescribed Medicines File</option

 

If the value you want always starts with HC- (your examples imply this, but you have not stated it), I might look for the position of the second HC- in the string (are there always two?) and then take the rest of the value from there. But I am not sure you have provided a clear enough rule for determining what you are actually looking for.
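
A rough sketch of that idea, assuming the wanted ID is the first HC- that appears after the opening tag closes and that it ends at the next blank, colon, or < character. The dataset name test2 and the helper variables gt and pos are made up for illustration.

```sas
data test2;
  length file_id_x $ 10;
  infile datalines truncover;
  input string $200.;
  pos = 0;
  gt = find(string, '>');                        /* end of the opening <option ...> tag */
  if gt > 0 then pos = find(string, 'HC-', gt);  /* first HC- at or after that point */
  if pos > 0 then do;
    /* take everything from HC- up to the next blank, colon, or < */
    file_id_x = scan(substr(string, pos), 1, ' :<');
    output;
  end;
  drop gt pos;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
;
run;
```

This avoids hard-coded SUBSTR offsets, but it still depends on the assumed rule about where the ID starts and ends.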

 

qoit
Pyrite | Level 9

Kind of elementary, but this should do if it's just the first three lines. I also don't understand why you've used regex:

data test;
length file_id $ 10;  
infile datalines length = reclen; 
input string $varying32767. reclen;
/*position  = prxmatch('m/"HC-\w+/i',string); */
/*position2 = prxmatch('m/>PSHC-\w+ /i',string); */
/*	if position ^= 0 then do;*/
	    file_id = scan(string, 2, '"');
		file_id_x = scan(string, 2, '>');
		if file_id_x =: "ME" then file_id_x = substr(file_id_x,6,7);
		else if file_id_x =: "H" then file_id_x = substr(file_id_x,1,7);
	   output;
	   drop string;

datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
;
run;
pkm_edu
Quartz | Level 8

Thanks to Qoit and Ballardw. I revised the solution suggested by Qoit to accommodate additional scenarios from the data (the last two records). My revision uses LENGTH(file_id) as the third argument of SUBSTR(). Any other solutions (e.g., regex) are welcome.

data test;
length file_id $ 10;  
infile datalines length = reclen; 
input string $varying32767. reclen;
position  = prxmatch('m/"HC-\w+/i',string);
	if position ^= 0 then do;
	    file_id = scan(string, 2, '"');
		file_id_x = scan(string, 2, '>');
		if file_id_x =: "ME" then file_id_x = substr(file_id_x,6,LENGTH(file_id));
		else if file_id_x =: "H" then file_id_x = substr(file_id_x,1,LENGTH(file_id));
	 output;
	end;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
<option value="HC-036BRR">MEPS HC-036BRR: MEPS 1996-2020 Replicate File for BRR Variance Estimation</option>
<td width="76" height="0"><strong><label for="datatype2"><span class="small">File type:</label></strong></span></td>
;
run;
proc print data=test;
var file_id file_id_x;
run;

Obs  file_id      file_id_x

1    HC-220A      HC-220A
2    HC-224       HC-219
3    HC-010I      HC-010I
4    HC-036BRR    HC-036BRR


 

pkm_edu
Quartz | Level 8

Thank you so much for explaining the code.

PGStats
Opal | Level 21

Using regex functions:

 

data test;
length file_id file_id_x $ 10; 
infile datalines truncover; 
input string $32767.;
if not id then id + prxParse('|value="(HC-\w+)".*(HC-\w+).*</option>|i');
if prxmatch(id, string) then do;
    file_id = prxPosn(id, 1, string);
    file_id_x = prxPosn(id, 2, string);
    output;
    end;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
<option value="HC-036BRR">MEPS HC-036BRR: MEPS 1996-2020 Replicate File for BRR Variance Estimation</option>
<td width="76" height="0"><strong><label for="datatype2"><span class="small">File type:</label></strong></span></td>
;
run;

prxPosn returns the substring from the nth capture buffer. A capture buffer is part of a match, enclosed in parentheses, that is specified in a regular expression.

The matching pattern could be made more (or less) stringent, according to your needs.
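
As one possible illustration of a more stringent pattern, the sketch below assumes the second ID must come right at the start of the option text, optionally preceded by the word MEPS. That rule is an assumption drawn from the sample records, not something the data guarantees; the dataset name test3 is made up.

```sas
data test3;
  length file_id file_id_x $ 10;
  infile datalines truncover;
  input string $200.;
  /* Assumed rule: value="ID"> is followed directly by an optional "MEPS " and the second ID */
  if not id then id + prxparse('/value="(HC-\w+)">(?:MEPS\s+)?(HC-\w+)/i');
  if prxmatch(id, string) then do;
    file_id   = prxposn(id, 1, string);   /* first capture buffer */
    file_id_x = prxposn(id, 2, string);   /* second capture buffer; (?:...) does not capture */
    output;
  end;
  drop id;
datalines;
<option value="HC-220A">MEPS HC-220A: 2020 Prescribed Medicines File</option>
<option value="HC-224">MEPS HC-219: 2020 Full Year Population Characteristics File</option>
<option value="HC-010I">HC-010I Appendix to MEPS 1996 Event Files</option>
;
run;
```

Unlike the greedy .* version, this form will not pick up an HC- code that appears later in the description text.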

PG
pkm_edu
Quartz | Level 8

Thanks to PGStats! This is certainly an interesting solution.

Could you please explain the code below if possible?

if not id then id + prxParse('|value="(HC-\w+)".*(HC-\w+).*</option>|i');
if prxmatch(id, string) then do;
    file_id = prxPosn(id, 1, string);
    file_id_x = prxPosn(id, 2, string);
PGStats
Opal | Level 21

I'll explain the first bit, since it is a bit tricky. The rest is well explained in SAS documentation for the PRX functions.

 

if not id then id + prxParse('...');

 

This statement parses the regex pattern only once. Because id appears on the left side of a sum statement, SAS initializes it to 0 and retains it across iterations. On the first iteration the test if not id is therefore true and the sum statement id + prxParse('...') runs, which adds the pattern identifier returned by prxParse to id. On subsequent iterations id is nonzero, so the pattern is not parsed again.

 

This statement is effectively equivalent to:

 

retain id;

if _n_ = 1 then id = prxParse('...');
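
A small sketch to see this behavior in the log. The pattern and the three input values are placeholders chosen for illustration only.

```sas
data _null_;
  infile datalines truncover;
  input string $40.;
  if not id then do;
    /* runs only while id is still 0, i.e. on the first iteration */
    id + prxparse('/HC-\w+/i');
    put 'Pattern parsed on iteration ' _n_;
  end;
  /* id keeps the same value on every iteration because it is retained */
  put _n_= id=;
datalines;
HC-220A
HC-224
HC-010I
;
run;
```

The "Pattern parsed" message should appear once, while the id= value stays the same on all three iterations.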

PG
