Hi @Venkat4,
Glad to see that mkeintz's suggestion worked for you. So, my thoughts below might be of interest for later readers with a similar problem.
@Venkat4 wrote:
(...) I split the data into three datasets as 50+50+40 mil. But still I can't run proc freq (...)
Whether you use PROC FREQ or PROC SQL or a DATA step with a hash object, splitting the input dataset should help if the subsets don't overlap. To ensure that, don't split by observation number, but by other criteria such as length(word), lowcase(first(word)), etc.
Example:
data h1 h2 h3 h4 h5;
set inputdata;
select(length(word));
when(1,2,3) output h1;
when(4,5) output h2;
when(6,7) output h3;
when(8,9,10) output h4;
otherwise output h5;
end;
run;
(Consider using the COMPRESS=YES dataset option on h1, h2, ...)
If you're unsure about suitable cut points, you can draw a reasonably sized random sample ...
proc surveyselect data=inputdata
method=srs n=100000
seed=2718 out=samp;
run;
[edit: use a DATA step for this if you don't have SAS/STAT]
and analyze that quickly to get an idea of the distributions of lengths, first letters, etc. and even relative word frequencies. A frequency count of the combinations of the first two letters like
proc freq data=samp noprint;
format word $2.;
tables word / out=cnt;
run;
might be feasible even on the full dataset.
Using SPDE should be faster than using proc sort.
And SAS is not very efficient when dealing with 10k-char strings.
Do you want to try this?
data WORDS; * Generate sample data;
length WORD $10000;
do I=1 to 100e6;
WORD=repeat( cats(int(ranuni(1)*100)), abs(int(rannor(1)*99)) );
output;
end;
run;
data _null_; * find longest string;
set WORDS ;
retain LEN;
LEN=max(LEN,length(WORD));
if LASTOBS then call symputx('len',LEN);
run;
%let wdir=%sysfunc(pathname(WORK));
libname W spde "&wdir" partsize=1T compress=binary;
data W.WORDS1; * Copy data to SPDE and trim variable;
length WORD $&len.;
proc append data=WORDS base=W.WORDS1 force;
run;
data COUNTS; * Derive word cardinality;
set W.WORDS1;
by WORD;
CNT+1;
if last.WORD then do;
output;
CNT=0;
end;
run;
This uses 35 minutes:
37 data _null_; 38 set WORDS end=LASTOBS; 39 retain LEN; 40 LEN=max(LEN,length(WORD)); 41 if LASTOBS then call symputx('len',LEN); 42 run; NOTE: DATA statement used (Total process time):
real time 4:09.76
user cpu time 3:53.46
system cpu time 15.75 seconds 43 44 data W.WORDS1; 45 length WORD $&len.; 46 proc append data=WORDS base=W.WORDS1 force; 47 run; NOTE: PROCEDURE APPEND used (Total process time): real time 4:55.10 user cpu time 7:17.37 system cpu time 36.93 seconds
48 49 data COUNTS; 51 set W.WORDS1; 52 by WORD; 53 CNT+1; 54 if last.WORD then do; 55 output; 56 CNT=0; 57 end; 58 run; NOTE: DATA statement used (Total process time): real time 26:15.31 user cpu time 13:07.46 system cpu time 14:12.42
on my old server. So old I cannot use rand('uniform') ! 🙂
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.