Define a LENGTH statement might save you some time. data have2(drop=count i); set have; LENGTH WORDS $ 40; do i=1 to countw(comments," "); words = scan(comments,i ," "); output; end; run;
@kumarK wrote:i have successfully able to import the file. now i want to strip the words in text. below code works fine for small datset.
however i have large datset it's taking too much time.. anyone have any suggestions to improve the code. Thanks
data have1;
set have(keep=comments);
count =countw(comments," ");
run;
data have2(drop=count i);
set have1;
do i=1 to count;
words = scan(comments,i ," ");
output;
end;
run;
Well, sure, there's plenty of ways to make this run faster:
However, it would be a huge help to know WHY you're doing what your doing? Clearly, you're transforming your data, writing each word of the comments field as a separate record in your final output table, where "word" is a token delimited by a space.
What do you then want to do with the data? If we knew that, perhaps there would be other approaches that wouldn't involve parsing your comments field into one record per word. For example, if you need to get a count of each word, you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller.
If your source data has 1M records, and your comments field averages 100 words, you've just converted your 1M records into 100M.
For small datasets, inefficient approaches are ok - quick and dirty rules the day. But, if your data is "large", and you're experiencing performance issues, you may need to apply more creative approaches.
So, forget the technicalities...what do you want to DO?
"What do you then want to do with the data? If we knew that, perhaps there would be other approaches that wouldn't involve parsing your comments field into one record per word. For example, if you need to get a count of each word, you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller.
If your source data has 1M records, and your comments field averages 100 words, you've just converted your 1M records into 100M."
Exactly you got my problem. the program i written unnessarly creating 100M dataset like the above example.
My objective is to to get a count of each word in comments variable.
"you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller." -- can you please help me with code how we can get this using hash object.
Try using a data step view to split the words. Then at least you do not need to save the table.
data words / view=words ;
set have (keep=comment);
do i=1 to countw(comment,' ');
length word $100 ;
word = scan(comment,i,' ');
output;
end;
keep word ;
run;
proc summary nway data=words;
class word ;
output out=word_freqs ;
run;
@kumarK wrote:"you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller." -- can you please help me with code how we can get this using hash object.
I don't have time right now to write example code, but the logic would go like this:
You would have a data _null_ step
Your set statement would have "end=eof"
Create a hash object when _n_=1
The hash object key is "word", the hash object var is "count"
Parse your words as you are doing now
I honestly can't remember if the keys in a hash object are case sensitive, and I'm not in front of SAS right now to test. If so, convert the case of your words.
For each word (i.e. within your parsing loop):
Explicitly set count to missing
Find the existing word in your hash object
If the find is successful, count will be set to the current count from the hash object
If the find is unsuccessful, count will remain missing
Set count=sum(count,1). This will set count to either 1, or count+1
Update the hash object. This will either update the existing word, or add the new word
When end=eof, write the hash object to an output dataset
Read the SAS doc on the hash object: http://support.sas.com/documentation/cdl/en/lecompobjref/67221/HTML/default/viewer.htm#p0zjo5imt7rb9...
See the Add, Declare, Find, Output, Replace, and Sum methods. See the examples. If you don't understand the examples, run them until you do. They can be confusing when you're first learning hash objects, but once you understand them, they are a powerful tool in your SAS kit bag.
I know the above approach would work. However, the Sum method may be a an alternative approach, which may also work for you.
HTH...
Edit: See http://support.sas.com/documentation/cdl/en/lecompobjref/67221/HTML/default/viewer.htm#p00ilfw5pzcjv... (Example 5). This should solve your your issue.
Edit #2: But you should also try Tom's approach and compare the performance. If his approach performs well for you, it would be a simpler approach requiring less code. Regardless of your final approach, use this as an opportunity to learn more about hash objects.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.