A call to PRXPOSN in the code snippet below returns a START value of 30 for the
capture buffer (the position where the captured text begins) and a second value of 4,
which I named END but which is really the length of the captured string (output below).
However, when I count characters from left to right in the printed html_line string,
the captured text begins at position 22.
Question 1: Why do these two numbers not match?
Thanks,
filename source temp;

proc http
  url = "https://meps.ahrq.gov/data_stats/download_data_files.jsp"
  out = source;
run;

data year_values;
  length year $4;
  infile source length = reclen lrecl = 32767 end = eof;
  re = prxparse('/<option value="\d{4}">(\d{4})<\/option>/');
  /* Read the HTML line by line */
  do while (not eof);
    input html_line $varying32767. reclen;
    Len_html_line = length(html_line);
    /* Match and extract the years using regular expressions */
    if prxmatch(re, html_line) > 0 then do;
      call prxposn(re, 1, start, end);
      year = substr(html_line, start, end);
      output year_values;
    end;
  end;
run;

proc print data=year_values noobs;
  var Len_html_line html_line start end year;
run;
The Output:
Len_html_line html_line start end year
42 <option value="2021">2021</option> 30 4 2021
42 <option value="2020">2020</option> 30 4 2020
42 <option value="2019">2019</option> 30 4 2019
42 <option value="2018">2018</option> 30 4 2018
42 <option value="2017">2017</option> 30 4 2017
42 <option value="2016">2016</option> 30 4 2016
42 <option value="2015">2015</option> 30 4 2015
42 <option value="2014">2014</option> 30 4 2014
42 <option value="2013">2013</option> 30 4 2013
42 <option value="2012">2012</option> 30 4 2012
42 <option value="2011">2011</option> 30 4 2011
42 <option value="2010">2010</option> 30 4 2010
42 <option value="2009">2009</option> 30 4 2009
42 <option value="2008">2008</option> 30 4 2008
42 <option value="2007">2007</option> 30 4 2007
42 <option value="2006">2006</option> 30 4 2006
42 <option value="2005">2005</option> 30 4 2005
42 <option value="2004">2004</option> 30 4 2004
42 <option value="2003">2003</option> 30 4 2003
42 <option value="2002">2002</option> 30 4 2002
42 <option value="2001">2001</option> 30 4 2001
42 <option value="2000">2000</option> 30 4 2000
42 <option value="1999">1999</option> 30 4 1999
42 <option value="1998">1998</option> 30 4 1998
42 <option value="1997">1997</option> 30 4 1997
42 <option value="1996">1996</option> 30 4 1996
Probably because you printed the text using the $ format instead of the $CHAR format. The $ format left-aligns the value, so leading spaces are not displayed.
Or worse, you looked at the ODS output instead of the LISTING output.
Also watch out for TAB characters (your example code had tab characters in it too).
Try writing the values to a text file, then read it back in and display the lines using the LIST statement.
filename listing temp;
data _null_;
set year_values ;
file listing;
put html_line $char42. +1 start= ;
run;
data _null_;
infile listing;
input;
list;
run;
Results
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1   CHAR      ..  <option value="2021">2021</option> start=30 51
    ZONE  222200223677666276676323333233333326776663277677333
    NUMR  00009900CF049FE061C55D220212E2021CFF049FEE034124D30
2   CHAR      ..  <option value="2020">2020</option> start=30 51
    ZONE  222200223677666276676323333233333326776663277677333
    NUMR  00009900CF049FE061C55D220202E2020CFF049FEE034124D30
3   CHAR      ..  <option value="2019">2019</option> start=30 51
    ZONE  222200223677666276676323333233333326776663277677333
    NUMR  00009900CF049FE061C55D220192E2019CFF049FEE034124D30
4   CHAR      ..  <option value="2018">2018</option> start=30 51
    ZONE  222200223677666276676323333233333326776663277677333
    NUMR  00009900CF049FE061C55D220182E2018CFF049FEE034124D30
5   CHAR      ..  <option value="2017">2017</option> start=30 51
    ZONE  222200223677666276676323333233333326776663277677333
    NUMR  00009900CF049FE061C55D220172E2017CFF049FEE034124D30
6   CHAR      ..  <option value="2016">2016</option> start=30 51
    ZONE  222200223677666276676323333233333326776663277677333
    NUMR  00009900CF049FE061C55D220162E2016CFF049FEE034124D30
...
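To see where the 30 comes from, here is a minimal sketch that rebuilds one record by hand. It assumes the line begins with four spaces, two tabs, and two more spaces, which is what the ZONE/NUMR hex in the LIST output shows (20 20 20 20 09 09 20 20). The variable names tag_start, len, and visible_start are made up for this illustration.

```sas
data _null_;
  length html_line $42;
  /* Assumed layout taken from the hex dump: 4 spaces, 2 tabs, 2 spaces, then the tag */
  html_line = '    ' || '0909'x || '  ' || '<option value="2021">2021</option>';
  re = prxparse('/<option value="\d{4}">(\d{4})<\/option>/');
  tag_start = prxmatch(re, html_line);      /* position of the "<" in the full value */
  if tag_start > 0 then do;
    call prxposn(re, 1, start, len);        /* 4th argument is a length, not an end position */
    visible_start = start - tag_start + 1;  /* position counted from the "<" */
    put tag_start= start= len= visible_start=;
  end;
run;
```

With that assumed layout the log line should read tag_start=9 start=30 len=4 visible_start=22. The eight leading characters are in the value and are counted by PRXPOSN, but they are not displayed when the value is printed left-aligned.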
You could just let SAS do most of the work instead. Use the @'string' syntax of the INPUT statement to move the pointer to just past the text that precedes the next field.
data year_values;
length year 8;
infile source end=eof dlm='<';
input @'<option value="' @'>' year ?? @@;
html_line=_infile_;
if year in (1800:2500) then output;
run;
Obs    year    html_line
  1    2021    <option value="2021">2021</option>
  2    2020    <option value="2020">2020</option>
  3    2019    <option value="2019">2019</option>
  4    2018    <option value="2018">2018</option>
  5    2017    <option value="2017">2017</option>
  6    2016    <option value="2016">2016</option>
  7    2015    <option value="2015">2015</option>
  8    2014    <option value="2014">2014</option>
  9    2013    <option value="2013">2013</option>
 10    2012    <option value="2012">2012</option>
 11    2011    <option value="2011">2011</option>
 12    2010    <option value="2010">2010</option>
 13    2009    <option value="2009">2009</option>
 14    2008    <option value="2008">2008</option>
 15    2007    <option value="2007">2007</option>
 16    2006    <option value="2006">2006</option>
 17    2005    <option value="2005">2005</option>
 18    2004    <option value="2004">2004</option>
 19    2003    <option value="2003">2003</option>
 20    2002    <option value="2002">2002</option>
 21    2001    <option value="2001">2001</option>
 22    2000    <option value="2000">2000</option>
 23    1999    <option value="1999">1999</option>
 24    1998    <option value="1998">1998</option>
 25    1997    <option value="1997">1997</option>
 26    1996    <option value="1996">1996</option>
Another thing to consider is the EXPANDTABS option of the INFILE statement.
1035  options generic;
1036  data year_values;
1037    length year 8;
1038    infile source dlm='<' expandtabs;
1039    input @'<option value="' @'>' year ?? @@;
1040    if year in (1900:2100);
1041    list;
1042  run;

NOTE: The infile SOURCE is:
      (system-specific pathname),
      (system-specific file attributes)

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
741                         <option value="2021">2021</option> 52
744                         <option value="2020">2020</option> 52
747                         <option value="2019">2019</option> 52
750                         <option value="2018">2018</option> 52
753                         <option value="2017">2017</option> 52
756                         <option value="2016">2016</option> 52
759                         <option value="2015">2015</option> 52
762                         <option value="2014">2014</option> 52
765                         <option value="2013">2013</option> 52
768                         <option value="2012">2012</option> 52
771                         <option value="2011">2011</option> 52
774                         <option value="2010">2010</option> 52
777                         <option value="2009">2009</option> 52
780                         <option value="2008">2008</option> 52
783                         <option value="2007">2007</option> 52
786                         <option value="2006">2006</option> 52
789                         <option value="2005">2005</option> 52
792                         <option value="2004">2004</option> 52
795                         <option value="2003">2003</option> 52
798                         <option value="2002">2002</option> 52
801                         <option value="2001">2001</option> 52
804                         <option value="2000">2000</option> 52
807                         <option value="1999">1999</option> 52
810                         <option value="1998">1998</option> 52
813                         <option value="1997">1997</option> 52
816                         <option value="1996">1996</option> 52
NOTE: 3244 records were read from the infile (system-specific pathname).
      The minimum record length was 0.
      The maximum record length was 452.
NOTE: SAS went to a new line when INPUT @'CHARACTER_STRING' scanned past the end of a line.
NOTE: The data set WORK.YEAR_VALUES has 26 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds
It is not the INPUT statement. It is the fact that you did not attach the $CHAR format to the variable, so SAS used the normal $ format, which left-aligns the values. The only reason it looked like there were some leading spaces is the TAB characters.
As for the INPUT statement, there are two issues:
1) The INPUT statement will only read the first 32,767 bytes of the line.
2) Since it does not have a trailing @, the logic will only work for HTML files that have each <option> value on a separate line. In general that might not be true for an HTML file.
Hi Tom,
Thank you so much for sending me two solutions to reading highly unstructured data into SAS.
Questions: Why does SAS issue the following note? Is this something that we need to be concerned about?
NOTE: SAS went to a new line when INPUT @'CHARACTER_STRING' scanned past the end of a line
Sorry, I have three more clarification questions.
1) In the DATA step below, I have changed the subsetting IF statement to 'if not missing (year);' in order to filter out the observations with missing values for the numeric variable year. Does this code change make sense to you?
2) The @'<option value="' will cause SAS to move its pointer past the '<option value="' text. Then the @'>' will make SAS move its pointer past the '>' character to read the numeric value, for example, 2021 (not "2021"). Is this a correct code explanation?
3) The ?? modifier suppresses the following notes (examples from the partial SAS Log),
NOTE: Invalid data for year in line 738 39-57.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
738 <option value="All">All available years</option> 66
year=. _ERROR_=1 _N_=1
Is this a correct code explanation?
Thanks,
options generic;
data year_values3;
  length year 8;
  infile source dlm='<' expandtabs;
  input @'<option value="' @'>' year ?? @@;
  if not missing (year);
  list;
run;
1) Just testing for a missing year will allow option values of 5 or 10 or 1.234E12 to be read as a year. By using the IN operator with a range of integers I ensure that the value is reasonable. Note that your regular expression would filter to just 4-digit numbers (or really the last 4 digits), so it would also accept invalid year values, like 1000.
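A small sketch of the difference between the two filters, using made-up in-stream values rather than the real page. The data set name year_check and the two flag variables are hypothetical.

```sas
data year_check;
  infile datalines truncover;
  input year ??;                                /* ?? keeps the log quiet for "All" */
  passes_missing_test = not missing(year);      /* 1 for any numeric value */
  passes_range_test   = (year in (1900:2100));  /* 1 only for integers from 1900 to 2100 */
  datalines;
2021
5
1.234E12
All
;

proc print data=year_check noobs;
run;
```

Here 5 and 1.234E12 are not missing, so they pass the first test, while only 2021 passes the range test. "All" is read as a missing value and fails both.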
2) Yes. And that might not be the right way to do it. It depends on what information you want from the HTML code. The string <option value="ABC">XYZ</option> will show the XYZ in the browser but return the ABC as the result when selected. In your example form both have the same string. So the code I showed is taking the XYZ value. You might want to re-work it to take the ABC value. Can you tell what needs to change to do that?
3) Yes, the ?? suppresses both the error message and the setting of the _ERROR_ flag variable, which are what cause those notes when the value being read (in this case the next "word" on the line) is not valid for the informat being used.
Here is the revised code I tried. It has worked for me.
Question 1: I did not use the @@ in the INPUT statement because the data file does not have multiple YEAR records on the same line. Does not using the @@ make sense to you?
Question 2: Sorry I could not find the @Syntax of INPUT in the SAS(R) Documentation. Could you please send me a reference, if possible?
Question 3: What does the generic option on the OPTIONS statement do? Any reference?
Thanks,
options generic;
data year_values;
  length year 8;
  infile source dlm='"' expandtabs;
  input @'<option value="' year ??;
  if year in (1900:2023);
run;

proc print data=year_values;
run;
Sorry, I would like to add the following comments, which I missed earlier.
1) It seems that the EXPANDTABS option on the INFILE statement is redundant. To avoid the following note in the SAS Log, I used the TRUNCOVER option on the same statement.
NOTE: SAS went to a new line when INPUT @'CHARACTER_STRING' scanned past the end of a line.
2) To get the desired results your earlier code works fine.
infile source dlm='<' truncover;
input @'<option value="' @'>' year ??;
filename source 'c:\Data\web_data';

proc http
  url = "https://meps.ahrq.gov/data_stats/download_data_files.jsp"
  out = source;
run;

options generic;
data year_values;
  length year 8;
  infile source dlm='<' truncover;
  input @'<option value="' @'>' year ??;
  if not missing (year); /* no invalid year values possible */
run;
You only need the EXPANDTABS option, which replaces tabs with the appropriate number of spaces, if you are planning to use the line from the file for something, as in your original program where you saved it and parsed it, or in my program where I used the LIST statement to display it in the log so you can make sure it worked properly.
Unless of course there is a tab in the middle of the value you are trying to read in the YEAR. SAS will not input '<tab>1999' as a number but it would input ' 1999' as a number.
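A quick way to check this on your own system is the INPUT function with the ?? modifier. This is only a sketch; the variable names and the 32. informat width are arbitrary choices for the illustration.

```sas
data _null_;
  /* '09'x is a TAB character; ?? suppresses the invalid-data note */
  with_tab   = input('09'x || '1999', ?? 32.);
  with_space = input(' 1999', ?? 32.);
  put with_tab= with_space=;
run;
```

Going by the description above, with_tab should come back missing and with_space should be 1999.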
The secret GENERIC option was just so the log I posted wouldn't show the irrelevant details about where the file being read was located on my SAS server.
Adding the TRUNCOVER option means that the data step will have to iterate once for each line in the HTML file instead of just once for each occurrence of the OPTION tag.
Also your method has eliminated the possibility of detecting multiple OPTION tags on the same line of the HTML file.
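The two points above can be tried on a made-up two-record sample where the first record holds two OPTION tags. The data set names held and one_per_line are hypothetical, and the in-stream lines stand in for the real HTML file.

```sas
/* Default FLOWOVER plus trailing @@: the pointer stays on the record between iterations */
data held;
  infile datalines dlm='<';
  input @'<option value="' @'>' year ?? @@;
  if year in (1900:2100);
  datalines;
<option value="2021">2021</option><option value="2020">2020</option>
<option value="2019">2019</option>
;

/* TRUNCOVER without @@: one iteration per record, so at most one year per record */
data one_per_line;
  infile datalines dlm='<' truncover;
  input @'<option value="' @'>' year ??;
  if year in (1900:2100);
  datalines;
<option value="2021">2021</option><option value="2020">2020</option>
<option value="2019">2019</option>
;

proc print data=held; run;
proc print data=one_per_line; run;
```

I would expect held to contain 2021, 2020, and 2019, and one_per_line to contain only 2021 and 2019, since the second tag on the first record is never examined. The first step should also write the "SAS went to a new line" note, which is harmless here.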
1) I agree with you about the use of EXPANDTABS on the INFILE statement, which I have added to the code below. Thanks for clarification!
2) I want only one "year" value per observation from each record. That is why I did not use the @@ at the end of the INPUT statement in the code below.
3) I have used the TRUNCOVER option to eliminate the "SAS went to a new line" note.
DM "Log; clear; output; clear; odsresults; clear";

filename source temp;

proc http
  url = "https://meps.ahrq.gov/data_stats/download_data_files.jsp"
  out = source;
run;

options generic;
data year_values;
  length year 8;
  infile source dlm='<' expandtabs truncover;
  input @'<option value="' @'>' year ??;
  if not missing (year);
run;

proc print data=year_values;
run;