BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
trevand
Obsidian | Level 7

I have a very big data set of 500 million observations. When I use the complete data set and count the number of unique observations by different groups (e.g., age, race, sex, state) I get the following error: "unable to insert into temporary index while processing summary functions". When I use 10% of the data set my code runs without errors. Here is my code:

 

proc sql;
create table want as select age, count(distinct unique_id) as count 
from have 
group by age;
quit;
1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User
proc sort
  data=have (keep=age unique_id)
  out=sorted
  nodupkey
;
by age unique_id;
run;

proc freq data=sorted;
tables age;
run;

As an alternative.

Or run this step instead of PROC FREQ:

data want;
set sorted;
by age;
if first.age
then count = 1;
else count + 1;
if last.age;
drop unique_id;
run;

View solution in original post

9 REPLIES 9
PaigeMiller
Diamond | Level 26
proc sort data=have;
    by age;
run;
proc freq data=have nlevels;
    ods output nlevels=nlevels;
    by age;
    tables name/noprint;
run;
--
Paige Miller
trevand
Obsidian | Level 7

@PaigeMiller do you know why proc sql doesn't work with the big data set?

PaigeMiller
Diamond | Level 26

Maxim 10

--
Paige Miller
trevand
Obsidian | Level 7

@PaigeMiller I don't understand how the code you are proposing gives just the number of unique observations as a new variable. I get an error that variable name cannot be found.

PaigeMiller
Diamond | Level 26

Probably should say

 

tables unique_id/noprint;
--
Paige Miller
Quentin
Super User

Are you working on a PC, or big server?  500M records is getting big for a PC, but probably feasible if you've got enough disk space and/or memory. 

 

PROC SQL is probably running out of resources (disk space or memory).  PROC FREQ might as well.

 

Depending on your memory and your data, you could try a hash table approach.  If you make a hash table with age and unique_ID as keys, and just add your data to it, you'll get a table with all the distinct values (because hash tables de-dup by default), and you could output it to a dataset and then count it.  But if you have LOTS of unique_id's and not enough memory you could hit memory problems.

 

If you're very patient and have the resources, you could try proc sort nodupkey out=... with the TAGSORT option, and see if that blows up.

Kurt_Bremser
Super User
proc sort
  data=have (keep=age unique_id)
  out=sorted
  nodupkey
;
by age unique_id;
run;

proc freq data=sorted;
tables age;
run;

As an alternative.

Or run this step instead of PROC FREQ:

data want;
set sorted;
by age;
if first.age
then count = 1;
else count + 1;
if last.age;
drop unique_id;
run;
Ksharp
Super User
1) You could try to build an index before runing this SQL.
proc sql;
create index age on have;
create index unique_id on have;
quit;

2)You could try to runing this code under BATCH mode to request more memory of PC.
"D:\SASHome\SASFoundation\9.4\sas.exe" -nosplash -sysin "c:\temp\temp.sas" -log "c:\temp\temp#Y#m#d-#H-#M-#s.log" -logparm "rollover=auto" -memsize 90G


3) Split you big table into many small sub-tables and make a macro for this PROC SQL.
Quentin
Super User

Henderson & Dorfman have a great paper on using hash functions to chunk your data, and then process each chunk and aggregate:

https://communities.sas.com/t5/SAS-Global-Forum-Proceedings/Uniform-Hashing-of-Arbitrary-Input-Into-...

 

Even without hashing, you could try a simple chunk approach by writing a macro (or whatever code generation technique you like) to get the count once per age group:

 

 

proc sql;
   create table want_18 as select age, count(distinct unique_id) as count 
   from have 
   where age=18;
quit;

proc sql;
   create table want_19 as select age, count(distinct unique_id) as count 
   from have 
   where age=19;
quit;

*etc;

data want;
  set want_: ;
run;

 

 

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 3178 views
  • 5 likes
  • 5 in conversation