BookmarkSubscribeRSS Feed
Ksharp
Super User
Define a LENGTH statement might save you some time.

data have2(drop=count i);
set have;
LENGTH WORDS $ 40;
do i=1 to countw(comments," ");
words = scan(comments,i ," ");
output;
end;
run;


ScottBass
Rhodochrosite | Level 12

@kumarK wrote:

i have successfully able to import the file. now i want to strip the words in text. below code works fine for small datset.

 

however i have large datset it's taking too much time.. anyone have any suggestions to improve the code. Thanks

 

data have1;
set have(keep=comments);
count =countw(comments," ");
run;

 

data have2(drop=count i);
set have1;
do i=1 to count;
words = scan(comments,i ," ");
output;
end;
run;

 


Well, sure, there's plenty of ways to make this run faster:

 

  • Get a faster server
  • Use MP_Connect to process your data in multiple SAS processes
  • If you're using EG, use EG's parallel processing to process your data in multiple SAS processes
  • Use PROC DS2's threading capability to process your data in multiple threads
  • As @Ksharp said, don't do a double pass over your data.  Your first step reads comments into the PDV - do your processing on it then.
  • You might get a slight performance gain by using the SPDE engine for your source and target datasets.

However, it would be a huge help to know WHY you're doing what your doing?  Clearly, you're transforming your data, writing each word of the comments field as a separate record in your final output table, where "word" is a token delimited by a space.

 

What do you then want to do with the data?  If we knew that, perhaps there would be other approaches that wouldn't involve parsing your comments field into one record per word.  For example, if you need to get a count of each word, you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller.

 

If your source data has 1M records, and your comments field averages 100 words, you've just converted your 1M records into 100M. 

 

For small datasets, inefficient approaches are ok - quick and dirty rules the day.  But, if your data is "large", and you're experiencing performance issues, you may need to apply more creative approaches.

 

So, forget the technicalities...what do you want to DO?


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
kumarK
Quartz | Level 8

"What do you then want to do with the data?  If we knew that, perhaps there would be other approaches that wouldn't involve parsing your comments field into one record per word.  For example, if you need to get a count of each word, you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller.

 

If your source data has 1M records, and your comments field averages 100 words, you've just converted your 1M records into 100M." 

 

 

Exactly you got my problem. the program i written unnessarly creating 100M dataset like the above example.

 

My objective is to  to get a count of each word in comments variable. 

 

"you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller."  -- can you please help me with code how we can get this using hash object.

Tom
Super User Tom
Super User

Try using a data step view to split the words. Then at least you do not need to save the table.

 

data words / view=words ;
  set have (keep=comment);
  do i=1 to countw(comment,' ');
     length word $100 ;
     word = scan(comment,i,' ');
     output;
  end;
  keep word ;
run;
proc summary nway data=words;
   class word ;
   output out=word_freqs ;
run;
ScottBass
Rhodochrosite | Level 12

@kumarK wrote:

"you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller."  -- can you please help me with code how we can get this using hash object.


 

I don't have time right now to write example code, but the logic would go like this:

 

You would have a data _null_ step

Your set statement would have "end=eof"

Create a hash object when _n_=1

The hash object key is "word", the hash object var is "count"

Parse your words as you are doing now

I honestly can't remember if the keys in a hash object are case sensitive, and I'm not in front of SAS right now to test.  If so, convert the case of your words.

For each word (i.e. within your parsing loop):

Explicitly set count to missing

Find the existing word in your hash object

If the find is successful, count will be set to the current count from the hash object

If the find is unsuccessful, count will remain missing

Set count=sum(count,1).  This will set count to either 1, or count+1

Update the hash object.  This will either update the existing word, or add the new word

When end=eof, write the hash object to an output dataset

 

 

Read the SAS doc on the hash object: http://support.sas.com/documentation/cdl/en/lecompobjref/67221/HTML/default/viewer.htm#p0zjo5imt7rb9...

 

See the Add, Declare, Find, Output, Replace, and Sum methods.  See the examples.  If you don't understand the examples, run them until you do.  They can be confusing when you're first learning hash objects, but once you understand them, they are a powerful tool in your SAS kit bag.

 

I know the above approach would work.  However, the Sum method may be a an alternative approach, which may also work for you.

 

HTH...

 

Edit:  See http://support.sas.com/documentation/cdl/en/lecompobjref/67221/HTML/default/viewer.htm#p00ilfw5pzcjv... (Example 5).  This should solve your your issue.

 

Edit #2:  But you should also try Tom's approach and compare the performance.  If his approach performs well for you, it would be a simpler approach requiring less code.  Regardless of your final approach, use this as an opportunity to learn more about hash objects.


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 2387 views
  • 9 likes
  • 5 in conversation