BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
trevand
Obsidian | Level 7

I have a very big data set of 500 million observations. When I use the complete data set and count the number of unique observations by different groups (e.g., age, race, sex, state) I get the following error: "unable to insert into temporary index while processing summary functions". When I use 10% of the data set my code runs without errors. Here is my code:

 

proc sql;
create table want as select age, count(distinct unique_id) as count 
from have 
group by age;
quit;
1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User
proc sort
  data=have (keep=age unique_id)
  out=sorted
  nodupkey
;
by age unique_id;
run;

proc freq data=sorted;
tables age;
run;

As an alternative.

Or run this step instead of PROC FREQ:

data want;
set sorted;
by age;
if first.age
then count = 1;
else count + 1;
if last.age;
drop unique_id;
run;

View solution in original post

9 REPLIES 9
PaigeMiller
Diamond | Level 26
proc sort data=have;
    by age;
run;
proc freq data=have nlevels;
    ods output nlevels=nlevels;
    by age;
    tables name/noprint;
run;
--
Paige Miller
trevand
Obsidian | Level 7

@PaigeMiller do you know why proc sql doesn't work with the big data set?

PaigeMiller
Diamond | Level 26

Maxim 10

--
Paige Miller
trevand
Obsidian | Level 7

@PaigeMiller I don't understand how the code you are proposing gives just the number of unique observations as a new variable. I get an error that variable name cannot be found.

PaigeMiller
Diamond | Level 26

Probably should say

 

tables unique_id/noprint;
--
Paige Miller
Quentin
Super User

Are you working on a PC, or big server?  500M records is getting big for a PC, but probably feasible if you've got enough disk space and/or memory. 

 

PROC SQL is probably running out of resources (disk space or memory).  PROC FREQ might as well.

 

Depending on your memory and your data, you could try a hash table approach.  If you make a hash table with age and unique_ID as keys, and just add your data to it, you'll get a table with all the distinct values (because hash tables de-dup by default), and you could output it to a dataset and then count it.  But if you have LOTS of unique_id's and not enough memory you could hit memory problems.

 

If you're very patient and have the resources, you could try proc sort nodupkey out=... with the TAGSORT option, and see if that blows up.

Kurt_Bremser
Super User
proc sort
  data=have (keep=age unique_id)
  out=sorted
  nodupkey
;
by age unique_id;
run;

proc freq data=sorted;
tables age;
run;

As an alternative.

Or run this step instead of PROC FREQ:

data want;
set sorted;
by age;
if first.age
then count = 1;
else count + 1;
if last.age;
drop unique_id;
run;
Ksharp
Super User
1) You could try to build an index before runing this SQL.
proc sql;
create index age on have;
create index unique_id on have;
quit;

2)You could try to runing this code under BATCH mode to request more memory of PC.
"D:\SASHome\SASFoundation\9.4\sas.exe" -nosplash -sysin "c:\temp\temp.sas" -log "c:\temp\temp#Y#m#d-#H-#M-#s.log" -logparm "rollover=auto" -memsize 90G


3) Split you big table into many small sub-tables and make a macro for this PROC SQL.
Quentin
Super User

Henderson & Dorfman have a great paper on using hash functions to chunk your data, and then process each chunk and aggregate:

https://communities.sas.com/t5/SAS-Global-Forum-Proceedings/Uniform-Hashing-of-Arbitrary-Input-Into-...

 

Even without hashing, you could try a simple chunk approach by writing a macro (or whatever code generation technique you like) to get the count once per age group:

 

 

proc sql;
   create table want_18 as select age, count(distinct unique_id) as count 
   from have 
   where age=18;
quit;

proc sql;
   create table want_19 as select age, count(distinct unique_id) as count 
   from have 
   where age=19;
quit;

*etc;

data want;
  set want_: ;
run;

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2564 views
  • 5 likes
  • 5 in conversation