pkm_edu
Quartz | Level 8
A call to PRXPOSN in the code snippet below gives me a START value of 30 for the
capture buffer (the position where the captured text begins) and an END value
of 4 (which is actually the length of the captured string); see the output below.

However, when I count the number of characters from left
to right in the output (html_line string), I find the START value to be 22.

Question 1: Why do these two numbers not match?
Thanks,



filename source temp;
proc http
     url = "https://meps.ahrq.gov/data_stats/download_data_files.jsp"
     out=source;
run;

data year_values;
  length year $4;
  infile source length = reclen lrecl = 32767 end=eof;
  re = prxparse('/<option value="\d{4}">(\d{4})<\/option>/');
  /* Read the HTML line by line */
  do while (not eof);
    input html_line $varying32767. reclen;
    Len_html_line = length(html_line);
    /* Match and extract the years using regular expressions */
    if prxmatch(re, html_line) > 0 then do;
      call prxposn(re, 1, start, end);
      year = substr(html_line, start, end);
      output year_values;
    end;
  end;
run;

proc print data=year_values noobs;
  var Len_html_line html_line start end year;
run;

The Output:

Len_html_line html_line start end year

42 <option value="2021">2021</option> 30 4 2021
42 <option value="2020">2020</option> 30 4 2020
42 <option value="2019">2019</option> 30 4 2019
42 <option value="2018">2018</option> 30 4 2018
42 <option value="2017">2017</option> 30 4 2017
42 <option value="2016">2016</option> 30 4 2016
42 <option value="2015">2015</option> 30 4 2015
42 <option value="2014">2014</option> 30 4 2014
42 <option value="2013">2013</option> 30 4 2013
42 <option value="2012">2012</option> 30 4 2012
42 <option value="2011">2011</option> 30 4 2011
42 <option value="2010">2010</option> 30 4 2010
42 <option value="2009">2009</option> 30 4 2009
42 <option value="2008">2008</option> 30 4 2008
42 <option value="2007">2007</option> 30 4 2007
42 <option value="2006">2006</option> 30 4 2006
42 <option value="2005">2005</option> 30 4 2005
42 <option value="2004">2004</option> 30 4 2004
42 <option value="2003">2003</option> 30 4 2003
42 <option value="2002">2002</option> 30 4 2002
42 <option value="2001">2001</option> 30 4 2001
42 <option value="2000">2000</option> 30 4 2000
42 <option value="1999">1999</option> 30 4 1999
42 <option value="1998">1998</option> 30 4 1998
42 <option value="1997">1997</option> 30 4 1997
42 <option value="1996">1996</option> 30 4 1996

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User

Probably because you printed the text using the $ format instead of the $CHAR format.

Or worse, you looked at the ODS output instead of the LISTING output.

Also watch out for TAB characters (your example code had tab characters in it too).

Try writing the values to a text file, then read it back in and display the lines using the LIST statement.


filename listing temp;
data _null_;
  set year_values ;
  file listing;
  put html_line $char42. +1 start= ;
run;

data _null_;
  infile listing;
  input;
  list;
run;

Results

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0

1   CHAR      ..  <option value="2021">2021</option> start=30 51
    ZONE  222200223677666276676323333233333326776663277677333
    NUMR  00009900CF049FE061C55D220212E2021CFF049FEE034124D30

2   CHAR      ..  <option value="2020">2020</option> start=30 51
    ZONE  222200223677666276676323333233333326776663277677333
    NUMR  00009900CF049FE061C55D220202E2020CFF049FEE034124D30

3   CHAR      ..  <option value="2019">2019</option> start=30 51
    ZONE  222200223677666276676323333233333326776663277677333
    NUMR  00009900CF049FE061C55D220192E2019CFF049FEE034124D30

4   CHAR      ..  <option value="2018">2018</option> start=30 51
    ZONE  222200223677666276676323333233333326776663277677333
    NUMR  00009900CF049FE061C55D220182E2018CFF049FEE034124D30

5   CHAR      ..  <option value="2017">2017</option> start=30 51
    ZONE  222200223677666276676323333233333326776663277677333
    NUMR  00009900CF049FE061C55D220172E2017CFF049FEE034124D30

6   CHAR      ..  <option value="2016">2016</option> start=30 51
    ZONE  222200223677666276676323333233333326776663277677333
    NUMR  00009900CF049FE061C55D220162E2016CFF049FEE034124D30

...
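
As a quick cross-check of the offsets above, here is a minimal sketch that rebuilds one line with the same leading bytes seen in the ZONE/NUMR rows (four spaces, two tabs, two spaces) and reports where the capture buffer starts. The sample line and the variable names pos, len, n_tabs and first_nonblank are illustrative assumptions, not part of the original program.

```sas
data _null_;
  /* Hypothetical sample line: 4 spaces, 2 tabs, 2 spaces, then the tag */
  html_line = cat('    ', '0909'x, '  ',
                  '<option value="2021">2021</option>');
  re = prxparse('/<option value="\d{4}">(\d{4})<\/option>/');
  if prxmatch(re, html_line) > 0 then do;
    /* Position and length of capture buffer 1 */
    call prxposn(re, 1, pos, len);
    /* Number of tab characters anywhere in the line */
    n_tabs = countc(html_line, '09'x);
    /* Position of the first character that is neither a space nor a tab */
    first_nonblank = verify(html_line, ' ' || '09'x);
    put pos= len= n_tabs= first_nonblank=;
    /* Show the first 12 bytes in hex so the tabs (09) become visible */
    put html_line $hex24.;
  end;
run;
```

The 8 invisible leading characters plus the 22 counted by eye would account for the reported position of 30.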

Tom
Super User

You could just let SAS do most of the work instead.  Just use the @'string' syntax of the INPUT statement to find where the next field is located.

data year_values;
  length year 8;
  infile source end=eof dlm='<';
  input @'<option value="' @'>' year ?? @@;
  html_line=_infile_;
  if year in (1800:2500) then output;
run;
Obs    year                  html_line

  1    2021    		  <option value="2021">2021</option>
  2    2020    		  <option value="2020">2020</option>
  3    2019    		  <option value="2019">2019</option>
  4    2018    		  <option value="2018">2018</option>
  5    2017    		  <option value="2017">2017</option>
  6    2016    		  <option value="2016">2016</option>
  7    2015    		  <option value="2015">2015</option>
  8    2014    		  <option value="2014">2014</option>
  9    2013    		  <option value="2013">2013</option>
 10    2012    		  <option value="2012">2012</option>
 11    2011    		  <option value="2011">2011</option>
 12    2010    		  <option value="2010">2010</option>
 13    2009    		  <option value="2009">2009</option>
 14    2008    		  <option value="2008">2008</option>
 15    2007    		  <option value="2007">2007</option>
 16    2006    		  <option value="2006">2006</option>
 17    2005    		  <option value="2005">2005</option>
 18    2004    		  <option value="2004">2004</option>
 19    2003    		  <option value="2003">2003</option>
 20    2002    		  <option value="2002">2002</option>
 21    2001    		  <option value="2001">2001</option>
 22    2000    		  <option value="2000">2000</option>
 23    1999    		  <option value="1999">1999</option>
 24    1998    		  <option value="1998">1998</option>
 25    1997    		  <option value="1997">1997</option>
 26    1996    		  <option value="1996">1996</option>
Tom
Super User

Another thing to consider is the EXPANDTABS option of the INFILE statement.

1035  options generic;
1036  data year_values;
1037    length year 8;
1038    infile source dlm='<' expandtabs;
1039    input @'<option value="' @'>' year ?? @@;
1040    if year in (1900:2100);
1041    list;
1042  run;

NOTE: The infile SOURCE is:
      (system-specific pathname),
      (system-specific file attributes)

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
741                         <option value="2021">2021</option> 52
744                         <option value="2020">2020</option> 52
747                         <option value="2019">2019</option> 52
750                         <option value="2018">2018</option> 52
753                         <option value="2017">2017</option> 52
756                         <option value="2016">2016</option> 52
759                         <option value="2015">2015</option> 52
762                         <option value="2014">2014</option> 52
765                         <option value="2013">2013</option> 52
768                         <option value="2012">2012</option> 52
771                         <option value="2011">2011</option> 52
774                         <option value="2010">2010</option> 52
777                         <option value="2009">2009</option> 52
780                         <option value="2008">2008</option> 52
783                         <option value="2007">2007</option> 52
786                         <option value="2006">2006</option> 52
789                         <option value="2005">2005</option> 52
792                         <option value="2004">2004</option> 52
795                         <option value="2003">2003</option> 52
798                         <option value="2002">2002</option> 52
801                         <option value="2001">2001</option> 52
804                         <option value="2000">2000</option> 52
807                         <option value="1999">1999</option> 52
810                         <option value="1998">1998</option> 52
813                         <option value="1997">1997</option> 52
816                         <option value="1996">1996</option> 52
NOTE: 3244 records were read from the infile (system-specific pathname).
      The minimum record length was 0.
      The maximum record length was 452.
NOTE: SAS went to a new line when INPUT @'CHARACTER_STRING' scanned past the end of a line.
NOTE: The data set WORK.YEAR_VALUES has 26 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds
pkm_edu
Quartz | Level 8
Hello Tom,
Mine is from the listing output, not the ODS output.
Do you see anything wrong with the INPUT statement below?
I have used similar syntax for hundreds of PROC HTTP response data files with no issues.
Thanks,

data year_values;
  length year $4;
  infile source length = reclen lrecl = 32767 end=eof;
  re = prxparse('/(\d{4})<\/option>/');
  /* Read the HTML line by line */
  do while (not eof);
    input html_line $varying32767. reclen;
    Len_html_line = length(html_line);
    /* Match and extract the years using regular expressions */
    if prxmatch(re, html_line) > 0 then do;
      call prxposn(re, 1, start, end);
      year = substr(html_line, start, end);
      output year_values;
    end;
  end;
run;
Tom
Super User

It is not the INPUT statement.  It is the fact that you did not attach the $CHAR format to the variable.  That will cause SAS to use the normal $ format, which left aligns the values.  The only reason it looked like there were some spaces is because of the TAB characters.

 

As to the INPUT statement there are two issues:

 

The input statement will only read the first 32,767 bytes of the line.

Since it does not have a trailing @, the logic will only work for HTML files that have each <option> value on a separate line.  In general that might not be true for an HTML file.
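
To illustrate the second issue, here is a hedged sketch of how a regular-expression approach could pick up several OPTION tags from one line by looping with CALL PRXNEXT. The sample line is hypothetical (the page in question appears to put one tag per line), and the variable names pos, len, start and stop are assumptions made for this example.

```sas
data _null_;
  length year $4;
  /* Hypothetical line holding two OPTION tags */
  html_line = '<option value="2021">2021</option><option value="2020">2020</option>';
  re = prxparse('/<option value="\d{4}">(\d{4})<\/option>/');
  start = 1;
  stop  = length(html_line);
  /* Find the first match; CALL PRXNEXT advances START past it */
  call prxnext(re, start, stop, html_line, pos, len);
  do while (pos > 0);
    /* Text of capture buffer 1 for the match just found */
    year = prxposn(re, 1, html_line);
    put pos= len= year=;
    call prxnext(re, start, stop, html_line, pos, len);
  end;
run;
```

Here pos and len describe the whole matched tag, while the PRXPOSN function returns the captured year text directly, so no SUBSTR call is needed.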

pkm_edu
Quartz | Level 8
format html_line $char50.;
Adding the above SAS statement did not change the output.

pkm_edu
Quartz | Level 8

Hi Tom,

Thank you so much for sending me two solutions to reading highly unstructured data into SAS.

 

Questions: Why does SAS issue the following note? Is this something that we need to be concerned about?

 

NOTE: SAS went to a new line when INPUT @'CHARACTER_STRING' scanned past the end of a line

 

pkm_edu
Quartz | Level 8

Sorry, I have three more clarifying questions.

1) In the DATA step below, I have changed the IF statement to 'if not missing (year);' in order to filter out the observations with missing values for the numeric variable year. Does this code change make sense to you?

2) The @'<option value="' will cause SAS to move its pointer past the '<option value="' string. Then the @'>' will make SAS move its pointer past the '>' character to read the numeric value, for example, 2021 (not "2021"). Is this a correct code explanation?

3) The ?? modifier suppresses notes such as the following (examples from the partial SAS log):


NOTE: Invalid data for year in line 738 39-57.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
738 <option value="All">All available years</option> 66
year=. _ERROR_=1 _N_=1
Is this a correct code explanation?

 

Thanks,

 

options generic;
data year_values3;
length year 8;
infile source dlm='<' expandtabs;
input @@'<option value="'   @'>' year ?? @@;
 if not missing (year);
list;
run;
pkm_edu
Quartz | Level 8
Correction:
input @'<option value="' @'>' year ?? @@;
Tom
Super User

1) Just testing for missing year will allow option values of 5 or 10 or 1.234E12 to be read as a year.   By using the IN operator with a range of integers I ensure that the value is reasonable.  Note that your regular expression would filter to just 4 digit numbers (or really the last 4 digits), so it will also find invalid year values, like 1000.

 

2)  Yes.  And that might not be the right way to do it.  It depends on what information you want from the HTML code. The string <option value="ABC">XYZ</option>  will show the XYZ in the browser but return the ABC as the result when selected.  In your example form both have the same string.  So the code I showed is taking the XYZ value.  You might want to re-work it to take the ABC value.   Can you tell what needs to change to do that?

Spoiler
Remove the extra @'>' from the INPUT statement and change the DLM= option on the INFILE statement to DLM='"'.

 

3) Yes, the ?? suppresses the error message and the setting of the _ERROR_ flag variable that cause those notes when the value being read, in this case the next "word" on the line, is not valid for the informat being used.

 

 

pkm_edu
Quartz | Level 8

 

Here is the revised code I tried. It has worked for me.

 

Question 1:  I did not use the @@  in the INPUT statement because the data file does not have multiple  YEAR records on the same line. Does not using the @@ make sense to you?

Question 2: Sorry I could not find the @Syntax of INPUT in the SAS(R) Documentation. Could you please send me a reference, if possible?

Question 3:  What does the generic option on the OPTIONS statement do? Any reference?

Thanks,

 

 

options generic;
data year_values;
    length year 8;
    infile source dlm='"' expandtabs;
    input @'<option value="' year  ??;
    if year in (1900:2023);
run;
proc print data=year_values;
run;
pkm_edu
Quartz | Level 8

Sorry, I  would like to add the following comments, which I missed earlier.

 

1)  It seems that the EXPANDTABS option on the INFILE statement is redundant. To avoid the following note in the SAS Log, I used the TRUNCOVER option on the same statement.

NOTE: SAS went to a new line when INPUT @'CHARACTER_STRING' scanned past the end of a line.

2) To get the desired results your earlier code  works fine.

 infile source dlm='<' truncover;
input @'<option value="' @'>' year ??;

 

filename source 'c:\Data\web_data';
proc http
     url = "https://meps.ahrq.gov/data_stats/download_data_files.jsp"
     out=source;
run;

options generic;
data year_values;
    length year 8;
    infile source dlm='<' truncover;
    input @'<option value="'   @'>' year ??;
if not missing (year); /* no invalid year values possible */
run;
Tom
Super User

You only need the EXPANDTABS option to replace the tabs with the appropriate number of spaces if you are planning to use the line from the file for something.  Like in your original program where you saved it and parsed it, or like in my program where I used the LIST statement to display it on the LOG so you can make sure it worked properly. 

 

Unless of course there is a tab in the middle of the value you are trying to read in the YEAR.  SAS will not input '<tab>1999' as a number but it would input '   1999' as a number.
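
A small sketch of that last point, using the INPUT function on two hypothetical strings (the variable names with_tab and with_spaces are made up for this example): one value is preceded by a tab, the other by spaces.

```sas
data _null_;
  /* A leading tab is not valid numeric data; ?? suppresses the log note */
  with_tab    = input('09'x || '1999', ?? 5.);
  /* Leading spaces are ignored by the numeric informat */
  with_spaces = input('   1999', ?? 7.);
  put with_tab= with_spaces=;
run;
```

The first value should come back missing and the second as 1999, which is why EXPANDTABS can matter even when the line itself is not kept.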

 

The secret GENERIC option was just so the log I posted wouldn't show the irrelevant details about where the file being read was located on my SAS server.

 

Adding the TRUNCOVER option means that the data step will have to iterate once for each line in the HTML file instead of just once for each occurrence of the OPTION tag.

 

Also your method has eliminated the possibility of detecting multiple OPTION tags on the same line of the HTML file.

pkm_edu
Quartz | Level 8

1) I agree with you  about  the use of  EXPANDTABS on the INFILE statement, which I have added to the code below. Thanks for clarification!

2) I want only one "year" value per observation from each record. That is why I did not use the @@ at the end of the INPUT statement in the code below.

3)  I have used the  TRUNCOVER option to eliminate the  note, "SAS went to a new line" message.

DM "Log; clear; output; clear; odsresults; clear";
filename source temp;
proc http
     url = "https://meps.ahrq.gov/data_stats/download_data_files.jsp"
     out=source;
run;

options generic;
data year_values;
    length year 8;
    infile source dlm='<' expandtabs truncover;
    input @'<option value="'   @'>' year ??;
if not missing (year);
run;

proc print data=year_values;
run;

 

 
