Re: Loop from URLs in SAS dataset

jjb123 · Posted 12-01-2017 06:26 PM

Hi everyone,

I'll try to be as thorough as I can, and I apologize if I omit any necessary information. I'm basically trying to do a word count (for a couple of different phrases) within quite a few web sites. The sites are "txt" sites so I can bring them into SAS and do the counts fairly easily, like so:

filename file1 url 'http:......txt';

data trythis; infile file1; informat all $3000.; input@1 all $3500.; run;

data try2; set trythis; wordcounta = count(all, "a b c"); wordcountx = count(all, "x y z"); rownum = _n_; actualfile=http:......txt; run;

data try2; set try2; if wordcounta = 0 and wordcountx = 0 then delete; run;

proc sql; create table try3 as select *, sum(wordcounta) as acount, sum(wordcount) as bcount from try2; quit;

data perm.bigfile; set try3 perm.bigfile; by actualfile; if last.actualfile then output; run;

I've omitted the actual sites and phrases and replaced them with pseudo names, but this code works fine for what I'm doing. Although I'm definitely open to suggestions for improvements to this setup, that's not the main help I seek.

Each of the sites I am using (http:....txt) is coming from a column that is in a SAS dataset. I am seeking to write a loop to automatically replace the url with the value from that dataset and then loop and do the next url the next time, and so on. I've experimented with several macros and I've been unable to get anything workable. Although I'd like to dothis automatically in the original dataset (with the urls), I currently have the code set up where it will just add to a running dataset with all of the results. Please let me know if this is doable, or if you need any more information. I'm using SAS 9.4. Thanks!

Reeza · Posted 12-01-2017 06:30 PM

Look up a paper called Don't be loopy by David Cassell. It will help show you how to structure your program the way you'd like and what's the most efficient.

In general, a macro approach should work. Please feel free to post what you've tried and explain what doesn't work. Starting out from working code is a good place to be.

You should be able to create your macro and then use CALL EXECUTE to call it multiple times. The documentation for CALL EXECUTE has a good example of how that can work.

@jjb123 wrote:

Hi everyone,

I'll try to be as thorough as I can, and I apologize if I omit any necessary information. I'm basically trying to do a word count (for a couple of different phrases) within quite a few web sites. The sites are "txt" sites so I can bring them into SAS and do the counts fairly easily, like so:

filename file1 url 'http:......txt';

data trythis; infile file1; informat all $3000.; input@1 all $3500.; run;

data try2; set trythis; wordcounta = count(all, "a b c"); wordcountx = count(all, "x y z"); rownum = _n_; actualfile=http:......txt; run;

data try2; set try2; if wordcounta = 0 and wordcountx = 0 then delete; run;

proc sql; create table try3 as select *, sum(wordcounta) as acount, sum(wordcount) as bcount from try2; quit;

data perm.bigfile; set try3 perm.bigfile; by actualfile; if last.actualfile then output; run;

I've omitted the actual sites and phrases and replaced them with pseudo names, but this code works fine for what I'm doing. Although I'm definitely open to suggestions for improvements to this setup, that's not the main help I seek.

Each of the sites I am using (http:....txt) is coming from a column that is in a SAS dataset. I am seeking to write a loop to automatically replace the url with the value from that dataset and then loop and do the next url the next time, and so on. I've experimented with several macros and I've been unable to get anything workable. Although I'd like to dothis automatically in the original dataset (with the urls), I currently have the code set up where it will just add to a running dataset with all of the results. Please let me know if this is doable, or if you need any more information. I'm using SAS 9.4. Thanks!

ballardw · Posted 12-01-2017 07:09 PM

Since you have the URL in a data set then a data _null_ step that uses call execute is one way to solve this.

Call execute creates SAS code lines that in effect stack up and execute after the data _null_ executes. A line like

Call Execute(' any text'); creates a section of code. So a single call execute statement may create multiple lines, an entire procedure call or just one word (likely a value from a variable in the data set) at that position in the code stack.

There was no reason at all to use 3 data steps in reading and manipulating the data. I am assuming that your were attempting one step at time.

We can reuse the data set name while switching which URL is read as you can then summarize into a new set. From the code you showed it looked like your final goal only involved one line from each URL with the word counts so I modified your Proc SQL (I probably would have used Proc Means/Summary as if you have many wordcount variables the syntax gets shorter:

Proc summary data=temp nway;

class actualfile;

var wordcount: ; /* the : here says to use all variables whose names start with wordcount*/

output out= dn (drop= _:) sum= ; /* instead of creating a bunch of different named variables put the count into a variable named as you created in the data set on reading*/

run;

I would suggest testing the example code below with a data set containing maybe 2 urls as that will be much easier to see an error. Likely if you have an error it will be repeated for each line in the input set.

Check the log!

Note to your future self: Habitually using

data thisname;

set thisname;

run;

will eventually lead to pulling of hair trying to figure out "where did that value or variable come from" or "where did that value or variable go" or even "why is my data set empty".

data _null_;
   set <yourdatesetwithurls>;
   length dn $ 8 longstr $ 150;
   dn = cats('try',_n_);
   /* assign filename*/
   longstr = catx(' ','filename file1',cats("'",url,"';"));
   call  execute (longstr);
   /* read the data
    set the length for ACTUALfile to that of the longest URL 
    you have to prevent some issues later with possible differing
    lengths of that variable
   */
   call execute ('data temp;');
   call execute ('infile file1; informat all $3000. ; 
      input@1 all $3500.; 
      wordcounta = count(all, "a b c"); 
      wordcountx = count(all, "x y z"); 
      rownum = _n_;
      length actualfile $ 200;
   ');
   call execute ('actualfile="' ||url||'";'); 
   call execute ('if wordcounta = 0 and wordcountx = 0 then delete; run;');
   /* clear the filename just in case*/
   call execute ('filename file1;');
   /* summarize the temp data into a numbered dataset*/
   call execute ('Proc sql;');
   call execute ('create table '||dn||'as');
   call execute ('select actualfile,sum(wordcounta) as acount, sum(wordcount) as bcount ');
   call execute ('from temp
                  group by actualfile
                  ;
                  quit;');
run;
/* Following your pattern it creates a bunch of datasets named try1 that
will have one record with the Actualfile, acount and bcount variables.

one way to stack them up is:
*/
 
data perm.bigfile;
   set try: ;
run;

The SAS variable list operator : also works on dataset names.

jjb123 · Posted 12-02-2017 12:52 AM

Thanks for the quick response. So I'm trying your code, and I have the following now:

data _null_;
set trying;
length dn $ 8 longstr $ 150;
dn = cats('try',_n_);
longstr = catx(' ','filename file1',cats("'",url,"';"));
call execute (longstr);
call execute ('data temp;');
call execute ('infile file1; informat all $3000. ;
input@1 all $3500.;
wordcounta = count(all, "a b c");
wordcountx = count(all, "x y z");
rownum = _n_;
length actualfile $ 100;
');
call execute ('actualfile="' ||url||'";');
call execute ('if wordcounta = 0 and wordcountx = 0 then delete; run;');
call execute ('filename file1;');
call execute ('Proc sql;');
call execute ('create table '||dn||'as');
call execute ('select actualfile,sum(wordcounta) as acount, sum(wordcountx) as bcount ');
call execute ('from temp
group by actualfile
;
quit;');
run;

The "trying" dataset I am using has 2 observations. I get the following results:

NOTE: There were 2 observations read from the data set WORK.TRYING.
NOTE: DATA statement used (Total process time):
real time 0.05 seconds
cpu time 0.06 seconds

NOTE: CALL EXECUTE generated line.
1 + filename file1 'urlhere.txt';
2 + data temp;
3 + infile file1; informat all $3000. ; input@1 all $3500.; wordcounta = count(all,
"a b c"); wordcountx = count(all, "x y z"); rownum = _n_; length
actualfile $ 100;
4 + actualfile="urlhere.txt";
5 + if wordcounta = 0 and wordcountx = 0 then delete; run;

ERROR: A component of
C:\Users\myname\urlhere.txt /*for some reason, it's adding this to the beginning of the url
is not a directory.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.TEMP may be incomplete. When this step was stopped there were 0
observations and 5 variables.
WARNING: Data set WORK.TEMP was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.04 seconds

6 + filename file1;
NOTE: Fileref FILE1 has been deassigned.
7 + Proc sql;
8 + create table try1 as
9 + select actualfile,sum(wordcounta) as acount, sum(wordcountx) as bcount
10 + from temp group by actualfile ;
NOTE: Table WORK.TRY1 created, with 0 rows and 3 columns.

10 + quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds

11 + filename file1 'url2here.txt';
12 + data temp;
13 + infile file1; informat all $3000. ; input@1 all $3500.; wordcounta = count(all,
"a b c"); wordcountx = count(all, "x y z"); rownum = _n_; length
actualfile $ 100;
14 + actualfile="url2here.txt";
15 + if wordcounta = 0 and wordcountx = 0 then delete; run;

ERROR: A component of
C:\Users\myname\url2here.txt
is not a directory.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.TEMP may be incomplete. When this step was stopped there were 0
observations and 5 variables.
WARNING: Data set WORK.TEMP was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds

16 + filename file1;
NOTE: Fileref FILE1 has been deassigned.
17 + Proc sql;
18 + create table try2 as
19 + select actualfile,sum(wordcounta) as acount, sum(wordcountx) as bcount
20 + from temp group by actualfile ;
NOTE: Table WORK.TRY2 created, with 0 rows and 3 columns.

20 + quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.01 seconds

It looks like I'm really close. Thanks again for your help, and please let me know if you can help me troubleshoot to get everything working completely.

jjb123 · Posted 12-04-2017 01:35 AM

Hi again ballardw and Reeza,

I have the code working how I like it now, but I'm having trouble with efficiency at this point. I'm adding my code below. The main hangup appears to be around the last two call execute lines. I assume the way I'm joining the datasets every time is problematic. However, I want this code to work for several thousand sites so efficiency is very important. Please let me know if you can help with that. If I need to start a new thread, I'd be open to that as well.

data _null_;

set doingit;

length dn $ 8 longstr $ 150;

dn = cats('try',_n_);

dn2 = cats('try',_n_ - 1);

longstr = catx(' ','filename file1 url', cats("'",fullfilename,"';"));

call execute (longstr);

call execute ('data temp;');

call execute ('infile file1; informat all $3000. ;

input@1 all $3500.;

wordcounta = count(all, "a b c");

wordcountx = count(all, "x y z");

rownum = _n_;

length actualfile $ 100;

');

call execute ('actualfile="' ||fullfilename||'";');

call execute ('if wordcounta = 0 and wordcountx = 0 then delete; run;');

call execute ('filename file1;');

call execute ('Proc sql;');

call execute ('create table '||dn||' as');

call execute ('select actualfile,sum(wordcounta) as acount, sum(wordcountx) as xcount ');

call execute ('from temp

group by actualfile

;

quit;');

call execute ('data perm.tryingloop ; set '||dn||' perm.tryingloop ; run;');

call execute ('proc delete data= '||dn||' ; run;');

run;

Please let me know of any improvements that you recommend.

ballardw · Posted 12-04-2017 10:55 AM

1: Post code on this site in code boxes. The main message windows reformat text, sometimes making things unreadable.

Any process involving 1,000s of data sets will take some time.

Another approach is to write thousands of proc append statements. This could be added as the last bit

replacing

call execute ('data perm.tryingloop ; set '||dn||' perm.tryingloop ; run;');

with

call execute ('proc append base=perm.tryingloop data=' ||dn||';run;');

But I doubt there will be much improvement in time if any as the sheer amount to IO start and stop is likely to more time intensive that the actual combination of the data.

Another option might be to FTP or download the entire contents of the source URLS to a common local folder and use a data step to read all the txt files at one time.

Filename myfile "c:\path\*.txt";

for example on a Windows system will read ALL of the files with TXT as the extension in a given folder when used in an Infile statement.

There are options to keep the source filename (Filename= ) on the infile if you need to know the source. That could be used for by group processing. Note that the Filename= variable is temporary so would need to assign the value to a data step variable with sufficient length to hold the path and file name for use as a by variable.

Loop from URLs in SAS dataset