I have a very big data set of 500 million observations. When I use the complete data set and count the number of unique observations by different groups (e.g., age, race, sex, state), I get the following error: "unable to insert into temporary index while processing summary functions". When I use 10% of the data set, my code runs without errors. Here is my code:
proc sql;
create table want as select age, count(distinct unique_id) as count
from have
group by age;
quit;
proc sort
data=have (keep=age unique_id)
out=sorted
nodupkey
;
by age unique_id;
run;
proc freq data=sorted;
tables age;
run;
As an alternative, run this data step instead of PROC FREQ:
data want;
set sorted;
by age;
if first.age
then count = 1;
else count + 1; /* sum statement, so COUNT is retained across observations */
if last.age;    /* output one row per age, holding the final count */
drop unique_id;
run;
proc sort data=have;
by age;
run;
proc freq data=have nlevels;
ods output nlevels=nlevels;
by age;
tables name/noprint;
run;
@PaigeMiller do you know why proc sql doesn't work with the big data set?
@PaigeMiller I don't understand how the code you are proposing gives just the number of unique observations as a new variable. I get an error that variable name cannot be found.
Probably should say
tables unique_id/noprint;
Are you working on a PC, or big server? 500M records is getting big for a PC, but probably feasible if you've got enough disk space and/or memory.
PROC SQL is probably running out of resources (disk space or memory). PROC FREQ might as well.
Depending on your memory and your data, you could try a hash table approach. If you make a hash table with age and unique_ID as keys, and just add your data to it, you'll get a table with all the distinct values (because hash tables de-dup by default), and you could output it to a dataset and then count it. But if you have LOTS of unique_id's and not enough memory you could hit memory problems.
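A minimal sketch of that hash approach, assuming the dataset and variable names from your question (HAVE, AGE, UNIQUE_ID); the output dataset name DISTINCT_PAIRS is made up for illustration:

```sas
/* Load age + unique_id as a composite hash key; the hash object
   rejects duplicate keys, so only distinct pairs are kept.
   Caveat: the whole set of distinct pairs must fit in memory. */
data _null_;
  if _n_ = 1 then do;
    declare hash h ();
    h.defineKey('age', 'unique_id');
    h.defineDone();
  end;
  set have end=done;
  rc = h.add();   /* nonzero rc for a duplicate pair; safe to ignore here */
  if done then rc = h.output(dataset: 'distinct_pairs');
run;

/* One distinct pair per row, so the AGE frequency is the distinct
   unique_id count per age */
proc freq data=distinct_pairs;
  tables age;
run;
```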
If you're very patient and have the resources, you could try proc sort nodupkey out=... with the TAGSORT option, and see if that blows up.
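That would look like the earlier PROC SORT step with TAGSORT added; TAGSORT sorts only the "tags" (BY variables plus record IDs), which cuts utility-file disk usage at the cost of a slower sort:

```sas
/* Same NODUPKEY de-dup as before, but with TAGSORT to reduce
   temporary disk space at the cost of runtime */
proc sort
  data=have (keep=age unique_id)
  out=sorted
  nodupkey tagsort
  ;
  by age unique_id;
run;
```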
Henderson & Dorfman have a great paper on using hash functions to chunk your data, then process each chunk and aggregate.
Even without hashing, you could try a simple chunk approach by writing a macro (or whatever code generation technique you like) to get the count once per age group:
proc sql;
create table want_18 as select age, count(distinct unique_id) as count
from have
where age=18;
quit;
proc sql;
create table want_19 as select age, count(distinct unique_id) as count
from have
where age=19;
quit;
*etc;
data want;
set want_: ;
run;
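A sketch of one way to generate those per-age steps with a macro loop instead of writing them out by hand; the macro name and the age bounds (18 to 80 here) are assumptions you would adjust to your data:

```sas
%macro count_by_age(min_age, max_age);
  %local a;
  %do a = &min_age %to &max_age;
    /* One small query per age keeps each COUNT(DISTINCT ...) cheap */
    proc sql;
      create table want_&a as
        select age, count(distinct unique_id) as count
        from have
        where age = &a;
    quit;
  %end;
%mend count_by_age;

%count_by_age(18, 80)

/* Stack the per-age results */
data want;
  set want_: ;
run;
```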