mathias
Quartz | Level 8

 

Hello,

I'm getting a strange error when trying to scrape a website, and searching the web for the error message returns no results.

It was working fine a few weeks ago and my setup hasn't changed, so I can rule that out as the cause.

 

I guess they changed something in their page, but I can't pinpoint what exactly.

 

Since I have no control over the source website, I'm looking for an alternative way to make it work.

 

I'm just collecting the list of XLS file names linked on the page:

 

%let fileListURL = https://www.inami.fgov.be/fr/professionnels/etablissements-services/laboratoires/Pages/historique-labos-agrees-prestations.aspx;

filename listURL url "&fileListURL";

data INAMI_FILES;
    infile listURL length=len lrecl=32767;
    input line $varying32767. len;
    * filter all the lines with .xls;
    if find(line,".xls") then do;
        * this is a big line so we will split it into several lines with '<';
        count = countw(line,'<');
        do i=1 to count;
            tag = scan(line, i, '<');
            * now filter again only those containing ".xls";
            if find(tag,".xls") then do;
                * extract the url;
                file_url = scan(tag,2,'"');
                * extract the id;
                file_id = scan( scan( file_url, -1, '_'), 1, '.' );
                * extract the extension;
                file_ext = cats('.xl', scan( tranwrd( file_url, '.xl', '*' ), 2, '*' ) );
                output;
            end;
        end;
    end;
    drop line count i tag;
run;

filename listURL clear;

ERROR: SSL error is "The message received was unexpected or badly formatted. (0x80090326).".
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.INAMI_FILES may be incomplete. When this step was stopped there were 0
observations and 3 variables.
WARNING: Data set WORK.INAMI_FILES was not replaced because this step was stopped.

 

 

Any ideas?

3 REPLIES
RW9
Diamond | Level 26

Interestingly I get:

ERROR: Connection refused.

 

It's probably best to take it up with the owners of the website in question; they may have switched off this type of programmatic access (as protection against DDoS attacks, for example). It could also be behind a login wall or some other mechanism. Any reason why you would need code to do this?

 

mathias
Quartz | Level 8

hmm that's strange, this website should be public.

 

> Any reason why you would need code to do this? 

 

Because the list of files updates regularly and I don't want to watch it and download and import manually.

Just automating boring repetitive work.

 

 IDEA:

 

Let's say there is a syntax error in the page, or something in it is badly formatted.

 

Isn't there a way to still get all the text regardless?

Maybe something like cropping out the page's headers and footers first, since I'm never going to read them anyway?
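
One hedged note on this idea: the error is reported by the SSL layer, so it probably occurs while the secure connection is being set up, before any page text is read. If so, cropping the page would not help. A minimal sketch of an alternative, assuming PROC HTTP is available in your SAS release, is to download the raw page to a temporary file first and parse the saved copy afterwards. The fileref `resp` and the data set name `INAMI_LINES` are illustrative names, not part of the original code, and this may fail with the same SSL error if the cause is the TLS configuration on either side.

```sas
%let fileListURL = https://www.inami.fgov.be/fr/professionnels/etablissements-services/laboratoires/Pages/historique-labos-agrees-prestations.aspx;

/* Temporary file that receives the raw response body (illustrative fileref) */
filename resp temp;

/* Download the page as-is; no parsing happens at this stage */
proc http
    url="&fileListURL"
    method="GET"
    out=resp;
run;

/* Read the saved copy and keep only the lines that mention .xls */
data INAMI_LINES;
    infile resp length=len lrecl=32767;
    input line $varying32767. len;
    if find(line, ".xls");
run;

filename resp clear;
```

Separating the download from the parsing makes it easier to see which of the two steps fails: if PROC HTTP itself stops with an SSL error, the problem lies in the connection rather than in the page content.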

 

 

 

RW9
Diamond | Level 26

"hmm that's strange, this website should be public." - it is: I can go to the web address in a browser, click on a file, and download it. I can't access the site programmatically, however. That could be due to my virtual setup, or it could be the site.

 

"Because the list of files updates regularly and I don't want to watch it and download and import manually.

Just automating boring repetitive work." - could you get access to the source data rather than a bunch of Excel files?

 

As for your other question, I'm afraid I don't know the answer.

 

 
