Finding super-utilizers in a data set of patient records seems similar to looking for housing units with lots of people in survey data.
I wrote the code below for the 2015 American Community Survey public use microdata files, available on the Census Bureau web site. The data set has 3,147,005 observations and 284 variables.
In the data set, SERIALNO is the identifier for the housing unit. The first PROC SQL counts the number of person records for each SERIALNO.
The PROC FREQ shows the number of housing units that have 1 person record, 2 person records, etc. I can look at the output from the PROC FREQ to find housing units containing relatively large numbers of people.
The second PROC SQL creates a macro variable that has the list of values of SERIALNO which have 20 person records.
The third PROC SQL creates a data set of the person records for people in housing units that have 20 person records.
libname acs '/folders/myfolders/ACS PUMS';
proc sql;
create table record_counts as
select serialno,
count(serialno) as number_of_records
from acs.population
group by serialno
order by serialno;
quit;
proc freq data=record_counts;
tables number_of_records;
title 'Number of person records by SERIALNO in 2015 American Community Survey Public Use Microdata';
run;
proc sql noprint;
select quote(serialno)
into :list_of_largest_housing_units separated by ', '
from record_counts
where number_of_records=20;
quit;
proc sql;
create table units_with_20_people as
select serialno,
sporder,
agep,
sex,
rac1p,
mar,
hisp,
relp
from acs.population
where serialno in (&list_of_largest_housing_units)
order by serialno, sporder;
quit;
PROC FREQ output showing number of person records per housing unit
... View more