BookmarkSubscribeRSS Feed
Ksharp
Super User
Define a LENGTH statement might save you some time.

data have2(drop=count i);
set have;
LENGTH WORDS $ 40;
do i=1 to countw(comments," ");
words = scan(comments,i ," ");
output;
end;
run;


ScottBass
Rhodochrosite | Level 12

@kumarK wrote:

i have successfully able to import the file. now i want to strip the words in text. below code works fine for small datset.

 

however i have large datset it's taking too much time.. anyone have any suggestions to improve the code. Thanks

 

data have1;
set have(keep=comments);
count =countw(comments," ");
run;

 

data have2(drop=count i);
set have1;
do i=1 to count;
words = scan(comments,i ," ");
output;
end;
run;

 


Well, sure, there's plenty of ways to make this run faster:

 

  • Get a faster server
  • Use MP_Connect to process your data in multiple SAS processes
  • If you're using EG, use EG's parallel processing to process your data in multiple SAS processes
  • Use PROC DS2's threading capability to process your data in multiple threads
  • As @Ksharp said, don't do a double pass over your data.  Your first step reads comments into the PDV - do your processing on it then.
  • You might get a slight performance gain by using the SPDE engine for your source and target datasets.

However, it would be a huge help to know WHY you're doing what your doing?  Clearly, you're transforming your data, writing each word of the comments field as a separate record in your final output table, where "word" is a token delimited by a space.

 

What do you then want to do with the data?  If we knew that, perhaps there would be other approaches that wouldn't involve parsing your comments field into one record per word.  For example, if you need to get a count of each word, you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller.

 

If your source data has 1M records, and your comments field averages 100 words, you've just converted your 1M records into 100M. 

 

For small datasets, inefficient approaches are ok - quick and dirty rules the day.  But, if your data is "large", and you're experiencing performance issues, you may need to apply more creative approaches.

 

So, forget the technicalities...what do you want to DO?


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
kumarK
Quartz | Level 8

"What do you then want to do with the data?  If we knew that, perhaps there would be other approaches that wouldn't involve parsing your comments field into one record per word.  For example, if you need to get a count of each word, you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller.

 

If your source data has 1M records, and your comments field averages 100 words, you've just converted your 1M records into 100M." 

 

 

Exactly you got my problem. the program i written unnessarly creating 100M dataset like the above example.

 

My objective is to  to get a count of each word in comments variable. 

 

"you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller."  -- can you please help me with code how we can get this using hash object.

Tom
Super User Tom
Super User

Try using a data step view to split the words. Then at least you do not need to save the table.

 

data words / view=words ;
  set have (keep=comment);
  do i=1 to countw(comment,' ');
     length word $100 ;
     word = scan(comment,i,' ');
     output;
  end;
  keep word ;
run;
proc summary nway data=words;
   class word ;
   output out=word_freqs ;
run;
ScottBass
Rhodochrosite | Level 12

@kumarK wrote:

"you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller."  -- can you please help me with code how we can get this using hash object.


 

I don't have time right now to write example code, but the logic would go like this:

 

You would have a data _null_ step

Your set statement would have "end=eof"

Create a hash object when _n_=1

The hash object key is "word", the hash object var is "count"

Parse your words as you are doing now

I honestly can't remember if the keys in a hash object are case sensitive, and I'm not in front of SAS right now to test.  If so, convert the case of your words.

For each word (i.e. within your parsing loop):

Explicitly set count to missing

Find the existing word in your hash object

If the find is successful, count will be set to the current count from the hash object

If the find is unsuccessful, count will remain missing

Set count=sum(count,1).  This will set count to either 1, or count+1

Update the hash object.  This will either update the existing word, or add the new word

When end=eof, write the hash object to an output dataset

 

 

Read the SAS doc on the hash object: http://support.sas.com/documentation/cdl/en/lecompobjref/67221/HTML/default/viewer.htm#p0zjo5imt7rb9...

 

See the Add, Declare, Find, Output, Replace, and Sum methods.  See the examples.  If you don't understand the examples, run them until you do.  They can be confusing when you're first learning hash objects, but once you understand them, they are a powerful tool in your SAS kit bag.

 

I know the above approach would work.  However, the Sum method may be a an alternative approach, which may also work for you.

 

HTH...

 

Edit:  See http://support.sas.com/documentation/cdl/en/lecompobjref/67221/HTML/default/viewer.htm#p00ilfw5pzcjv... (Example 5).  This should solve your your issue.

 

Edit #2:  But you should also try Tom's approach and compare the performance.  If his approach performs well for you, it would be a simpler approach requiring less code.  Regardless of your final approach, use this as an opportunity to learn more about hash objects.


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 19 replies
  • 3060 views
  • 9 likes
  • 5 in conversation