I have a very big data set of 500 million observations. When I use the complete data set and count the number of unique observations by different groups (e.g., age, race, sex, state) I get the following error: "unable to insert into temporary index while processing summary functions". When I use 10% of the data set my code runs without errors. Here is my code:
proc sql;
create table want as select age, count(distinct unique_id) as count
from have
group by age;
quit;
proc sort
data=have (keep=age unique_id)
out=sorted
nodupkey
;
by age unique_id;
run;
proc freq data=sorted;
tables age;
run;
As an alternative.
Or run this step instead of PROC FREQ:
data want;
set sorted;
by age;
if first.age
then count = 1;
else count + 1;
if last.age;
drop unique_id;
run;
proc sort data=have;
by age;
run;
proc freq data=have nlevels;
ods output nlevels=nlevels;
by age;
tables name/noprint;
run;
@PaigeMiller do you know why proc sql doesn't work with the big data set?
@PaigeMiller I don't understand how the code you are proposing gives just the number of unique observations as a new variable. I get an error that variable name cannot be found.
Probably should say
tables unique_id/noprint;
Are you working on a PC, or big server? 500M records is getting big for a PC, but probably feasible if you've got enough disk space and/or memory.
PROC SQL is probably running out of resources (disk space or memory). PROC FREQ might as well.
Depending on your memory and your data, you could try a hash table approach. If you make a hash table with age and unique_ID as keys, and just add your data to it, you'll get a table with all the distinct values (because hash tables de-dup by default), and you could output it to a dataset and then count it. But if you have LOTS of unique_id's and not enough memory you could hit memory problems.
If you're very patient and have the resources, you could try proc sort nodupkey out=... with the TAGSORT option, and see if that blows up.
proc sort
data=have (keep=age unique_id)
out=sorted
nodupkey
;
by age unique_id;
run;
proc freq data=sorted;
tables age;
run;
As an alternative.
Or run this step instead of PROC FREQ:
data want;
set sorted;
by age;
if first.age
then count = 1;
else count + 1;
if last.age;
drop unique_id;
run;
Henderson & Dorfman have a great paper on using hash functions to chunk your data, and then process each chunk and aggregate:
Even without hashing, you could try a simple chunk approach by writing a macro (or whatever code generation technique you like) to get the count once per age group:
proc sql;
create table want_18 as select age, count(distinct unique_id) as count
from have
where age=18;
quit;
proc sql;
create table want_19 as select age, count(distinct unique_id) as count
from have
where age=19;
quit;
*etc;
data want;
set want_: ;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.