Re: Import not cleaned text string column into a dataset

Ksharp · Posted 09-15-2016 11:28 PM

Define a LENGTH statement might save you some time.

data have2(drop=count i);
set have;
LENGTH WORDS $ 40;
do i=1 to countw(comments," ");
words = scan(comments,i ," ");
output;
end;
run;

ScottBass · Posted 09-16-2016 12:49 AM

@kumarK wrote:
i have successfully able to import the file. now i want to strip the words in text. below code works fine for small datset.

however i have large datset it's taking too much time.. anyone have any suggestions to improve the code. Thanks

data have1;
set have(keep=comments);
count =countw(comments," ");
run;

data have2(drop=count i);
set have1;
do i=1 to count;
words = scan(comments,i ," ");
output;
end;
run;

Well, sure, there's plenty of ways to make this run faster:

Get a faster server
Use MP_Connect to process your data in multiple SAS processes
If you're using EG, use EG's parallel processing to process your data in multiple SAS processes
Use PROC DS2's threading capability to process your data in multiple threads
As @Ksharp said, don't do a double pass over your data. Your first step reads comments into the PDV - do your processing on it then.
You might get a slight performance gain by using the SPDE engine for your source and target datasets.

However, it would be a huge help to know WHY you're doing what your doing? Clearly, you're transforming your data, writing each word of the comments field as a separate record in your final output table, where "word" is a token delimited by a space.

What do you then want to do with the data? If we knew that, perhaps there would be other approaches that wouldn't involve parsing your comments field into one record per word. For example, if you need to get a count of each word, you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller.

If your source data has 1M records, and your comments field averages 100 words, you've just converted your 1M records into 100M.

For small datasets, inefficient approaches are ok - quick and dirty rules the day. But, if your data is "large", and you're experiencing performance issues, you may need to apply more creative approaches.

So, forget the technicalities...what do you want to DO?

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

kumarK · Posted 09-17-2016 12:15 PM

"What do you then want to do with the data? If we knew that, perhaps there would be other approaches that wouldn't involve parsing your comments field into one record per word. For example, if you need to get a count of each word, you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller.

If your source data has 1M records, and your comments field averages 100 words, you've just converted your 1M records into 100M."

Exactly you got my problem. the program i written unnessarly creating 100M dataset like the above example.

My objective is to to get a count of each word in comments variable.

"you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller." -- can you please help me with code how we can get this using hash object.

Tom · Posted 09-17-2016 03:02 PM

Try using a data step view to split the words. Then at least you do not need to save the table.

data words / view=words ;
  set have (keep=comment);
  do i=1 to countw(comment,' ');
     length word $100 ;
     word = scan(comment,i,' ');
     output;
  end;
  keep word ;
run;
proc summary nway data=words;
   class word ;
   output out=word_freqs ;
run;

ScottBass · Posted 09-17-2016 06:51 PM

@kumarK wrote:
"you could use a hash object to sum for each word, and your final dataset (and therefore I/O) would be much smaller." -- can you please help me with code how we can get this using hash object.

I don't have time right now to write example code, but the logic would go like this:

You would have a data _null_ step

Your set statement would have "end=eof"

Create a hash object when _n_=1

The hash object key is "word", the hash object var is "count"

Parse your words as you are doing now

I honestly can't remember if the keys in a hash object are case sensitive, and I'm not in front of SAS right now to test. If so, convert the case of your words.

For each word (i.e. within your parsing loop):

Explicitly set count to missing

Find the existing word in your hash object

If the find is successful, count will be set to the current count from the hash object

If the find is unsuccessful, count will remain missing

Set count=sum(count,1). This will set count to either 1, or count+1

Update the hash object. This will either update the existing word, or add the new word

When end=eof, write the hash object to an output dataset

Read the SAS doc on the hash object: http://support.sas.com/documentation/cdl/en/lecompobjref/67221/HTML/default/viewer.htm#p0zjo5imt7rb9...

See the Add, Declare, Find, Output, Replace, and Sum methods. See the examples. If you don't understand the examples, run them until you do. They can be confusing when you're first learning hash objects, but once you understand them, they are a powerful tool in your SAS kit bag.

I know the above approach would work. However, the Sum method may be a an alternative approach, which may also work for you.

HTH...

Edit: See http://support.sas.com/documentation/cdl/en/lecompobjref/67221/HTML/default/viewer.htm#p00ilfw5pzcjv... (Example 5). This should solve your your issue.

Edit #2: But you should also try Tom's approach and compare the performance. If his approach performs well for you, it would be a simpler approach requiring less code. Regardless of your final approach, use this as an opportunity to learn more about hash objects.

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.