I have a very big data set of 500 million observations. When I use the complete data set and count the number of unique observations by different groups (e.g., age, race, sex, state), I get the following error: "unable to insert into temporary index while processing summary functions". When I use 10% of the data set, my code runs without errors. Here is my code:
proc sql;
create table want as select age, count(distinct unique_id) as count
from have
group by age;
quit;
proc sort
data=have (keep=age unique_id)
out=sorted
nodupkey
;
by age unique_id;
run;
proc freq data=sorted;
tables age;
run;
As an alternative, run this data step instead of PROC FREQ:
data want;
set sorted;
by age;
if first.age
then count = 1;
else count + 1; /* sum statement, so COUNT is retained across observations */
if last.age;    /* output one row per age, holding the final count */
drop unique_id;
run;
proc sort data=have;
by age;
run;
proc freq data=have nlevels;
ods output nlevels=nlevels;
by age;
tables name/noprint;
run;
@PaigeMiller do you know why proc sql doesn't work with the big data set?
@PaigeMiller I don't understand how the code you are proposing gives just the number of unique observations as a new variable. I get an error that variable name cannot be found.
Probably should say
tables unique_id/noprint;
Are you working on a PC, or big server? 500M records is getting big for a PC, but probably feasible if you've got enough disk space and/or memory.
PROC SQL is probably running out of resources (disk space or memory). PROC FREQ might as well.
Depending on your memory and your data, you could try a hash table approach. If you make a hash table with age and unique_ID as keys, and just add your data to it, you'll get a table with all the distinct values (because hash tables de-dup by default), and you could output it to a dataset and then count it. But if you have LOTS of unique_id's and not enough memory you could hit memory problems.
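A minimal sketch of that hash approach, assuming the dataset and variable names from your question (HAVE, AGE, UNIQUE_ID); the output dataset name DISTINCT_PAIRS is made up for illustration:

```sas
/* Load age + unique_id as a composite hash key; the hash object
   rejects duplicate keys, so only distinct pairs are kept.
   Caveat: the whole set of distinct pairs must fit in memory. */
data _null_;
  if _n_ = 1 then do;
    declare hash h ();
    h.defineKey('age', 'unique_id');
    h.defineDone();
  end;
  set have end=done;
  rc = h.add();   /* nonzero rc for a duplicate pair; safe to ignore here */
  if done then rc = h.output(dataset: 'distinct_pairs');
run;

/* One distinct pair per row, so the AGE frequency is the distinct
   unique_id count per age */
proc freq data=distinct_pairs;
  tables age;
run;
```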
If you're very patient and have the resources, you could try proc sort nodupkey out=... with the TAGSORT option, and see if that blows up.
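That would look like the earlier PROC SORT step with TAGSORT added; TAGSORT sorts only the "tags" (BY variables plus record IDs), which cuts utility-file disk usage at the cost of a slower sort:

```sas
/* Same NODUPKEY de-dup as before, but with TAGSORT to reduce
   temporary disk space at the cost of runtime */
proc sort
  data=have (keep=age unique_id)
  out=sorted
  nodupkey tagsort
  ;
  by age unique_id;
run;
```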
Henderson & Dorfman have a great paper on using hash functions to chunk your data, then process each chunk and aggregate.
Even without hashing, you could try a simple chunk approach by writing a macro (or whatever code generation technique you like) to get the count once per age group:
proc sql;
create table want_18 as select age, count(distinct unique_id) as count
from have
where age=18;
quit;
proc sql;
create table want_19 as select age, count(distinct unique_id) as count
from have
where age=19;
quit;
*etc;
data want;
set want_: ;
run;
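A sketch of one way to generate those per-age steps with a macro loop instead of writing them out by hand; the macro name and the age bounds (18 to 80 here) are assumptions you would adjust to your data:

```sas
%macro count_by_age(min_age, max_age);
  %local a;
  %do a = &min_age %to &max_age;
    /* One small query per age keeps each COUNT(DISTINCT ...) cheap */
    proc sql;
      create table want_&a as
        select age, count(distinct unique_id) as count
        from have
        where age = &a;
    quit;
  %end;
%mend count_by_age;

%count_by_age(18, 80)

/* Stack the per-age results */
data want;
  set want_: ;
run;
```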