cyrribrae
Calcite | Level 5

I am currently working on a dataset with 20 million entries. Needless to say, even simple things take a while to happen. As I've been getting deeper into the data, I've increasingly run into instances where the VM simply doesn't have the memory available to format and display the results. Even when using a much smaller subset of 200,000, I'm running into issues.

 

Each entry is tied to an ID number variable, and some IDs have multiple entries. I would like to get a list of the IDs that have the most entries (by count), say the top 20% or those with counts above 5. Ideally, I would like to have this set of entries available as a subset dataset to then do more work with. In my head, I'm just using the OUT= option, but I'm worried that if I end up with just a list of the most frequent IDs, I'd still have to manually create the subset. In my perfect headspace world, I'd just use a PROC FREQ / WHERE combination to find the highest counts. But I haven't found anything like that yet. So... can I do that?

 

I am on version 3.71 Basic. Sorry I don't have code to share, I honestly don't even know where to begin. Thanks!

6 REPLIES
ballardw
Super User

By VM, do you mean that you are using the SAS University Edition?

 

This may get you started:

proc freq data=have noprint order=freq;
   /* OUTCUM adds cumulative count and percent to the OUT= data set;
      the WHERE= option keeps only IDs in the first 20% of cases */
   tables id / out=work.cnt(where=(CUM_PCT le 20)) outcum;
run;

NOPRINT means the results window won't eat your memory resources, and ORDER=FREQ generates the output with the most common values first.

 

The OUTCUM option means that the output data set specified in the OUT= option will contain the cumulative count and percent. So WHERE=(cum_pct le 20) as the data set option gets the top 20 percent of cases.

Or you could use WHERE=(count ge 5) or similar to select cases with "large" counts.
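
If you then want the actual records for those IDs (not just the counts), one sketch, assuming your data set is named HAVE and the ID variable is named ID as in the example above, is to join the PROC FREQ output back to the original data:

/* Sketch: pull the original records for the IDs kept in WORK.CNT.
   HAVE, ID, and WORK.CNT are the names from the example above. */
proc sql;
   create table work.subset as
      select h.*
         from have as h
              inner join work.cnt as c
                on h.id = c.id;
quit;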

mkeintz
PROC Star

Is your data sorted by ID? If so, and if you can generate a sorted list of the most frequent IDs (say the top 20%), then a simple merge can solve the problem.

 

You should first generate a data set of IDs, by descending frequency (use a two-level name such as mylib.idfreqs if you want it to be permanent):

 

proc freq data=have (keep=id) noprint order=freq;
  /* one output row per ID; OUTCUM adds CUM_FREQ and CUM_PCT */
  tables id / out=idfreqs outcum;
run;

 

The data set IDFREQS has IDs listed from most frequent to least frequent (i.e., sorted by descending COUNT), with the variables ID, COUNT, PERCENT, CUM_FREQ, and CUM_PCT. Save it for all your subsequent subsetting tasks.

 

Now let's say you want the most frequent IDs accounting for 20% of the original data set.

data wantedids (keep=id);
  set idfreqs;
  output;
  /* stop reading once the cumulative percent first reaches 20 */
  if cum_pct>=20 and lag(cum_pct)<20 then stop;
run;
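
Since CUM_PCT never decreases down IDFREQS, the LAG() test just pins the first crossing; a possibly simpler equivalent sketch is:

/* Equivalent sketch: CUM_PCT is non-decreasing, so stopping at the
   first row that reaches 20 keeps the same set of IDs */
data wantedids (keep=id);
  set idfreqs;
  output;
  if cum_pct >= 20 then stop;
run;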

proc sort data=wantedids;
  by id;
run;

data want;
  /* HAVE and WANTEDIDS must both be sorted by ID for this merge */
  merge have wantedids (in=inwant);
  by id;
  if inwant;
run;
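
If HAVE is not already sorted by ID, it needs a sort first; on 20 million rows this is likely the most expensive step:

/* Only needed if HAVE is not already in ID order */
proc sort data=have;
  by id;
run;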

 

You could easily apply the above for any percent. Or you could create WANTEDIDS as, say, all IDs with 5 or more records, as in

 

data wantedids;
  set idfreqs;
  if count>=5;  /* subsetting IF: keep only IDs with 5 or more records */
run;

... remainder of program is the same ...
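
An alternative sketch, assuming the same HAVE and ID names, does the counting and the subsetting in one PROC SQL step, with no pre-sorting required:

/* One-step sketch: keep every record whose ID occurs 5 or more times.
   HAVE and ID are the example names used above. */
proc sql;
  create table want as
    select *
      from have
      where id in (select id
                     from have
                     group by id
                     having count(*) >= 5);
quit;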
cyrribrae
Calcite | Level 5

Oh boy. Thanks for the replies! Sorry for not being clear with my initial request; hard to know what I don't know, haha. I am indeed on the University Edition, using a VM. I tried both methods (probably incorrectly), but I couldn't really get anything from either: I was either getting errors thrown or it just wasn't behaving as expected.

 

For more clarity, I'm using the SID HCUP dataset, which tallies an anonymized list of every discharge from (in my case) CA hospitals in 2011. It's a lot. Not only are there a lot of rows, there are a lot of columns as well. The ID variable I mentioned is a partially re-identified unique patient number. I'm not sure what a dataset being organized by any one variable means per se, but this is actually not the key ID variable. If a patient returns for another stay at a CA hospital in 2011, that patient will have 2+ entries in the database with the same patient ID. The patients that come back most often are known as super-utilizers. This is a fairly small group, so I don't actually care about the 20% per se; it's just that 1) I need a table or report that suggests where I might draw a reasonable line of visit counts for defining a super-utilizer category (no exact science), and 2) once I have identified that group, I need a dataset that I can manipulate and use to compare to other groups from the original dataset.

 

I initially tried just using PROC FREQ with ORDER=FREQ and a TABLES patientID statement, but there were two problems. First, since there are almost as many patient IDs as discharges, we're talking potentially millions of rows. Obviously, there was no way I was going to display a table that large. When I used a subset of the dataset that looked only at discharges with mental health diagnoses, stays at the hospital for less than 60 days, and who were homeless (about 9,000 rows), I was able to run PROC FREQ, look at the table's cumulative percentages, and infer a reasonable cutoff point for visit count. But second, even once I chose a smaller subset, I realized I would have to manually make note of which IDs were coming up a lot, which, again, is an insurmountable task.

 

Thus, I understand the logic of 1) not printing the table and 2) outputting to a dataset. But when I tried to use the two methods presented above, I got back a table with all frequencies of 1 in the first case, and quite a few errors in the second. Even when I got some form of results back, the number of entries I was losing at each step was unexpectedly large. I feel like I understand the rationale behind the two methods generally, so I'm certain this is my fault in entering or adapting your examples. But again, I don't have the background to understand what I'm doing incorrectly.

 

I realize I'm asking a lot here haha, but here's to hoping. Thanks!

mkeintz
PROC Star

Please show the logs of the failed attempts.  That will facilitate a better diagnosis of what went wrong.

ballardw
Super User

@cyrribrae wrote: [reply quoted in full above]


Time to provide a small example data set with:
1) just enough variables to indicate what you need,
2) combinations of values that demonstrate the issues you have, and
3) [and this is pretty critical] what you expect the result should be from the given example data.

 

It sounds like the solution I provided before with that partial patient identifier may be all that you need. Can you describe in more detail why it was insufficient? If the COUNT for every ID is one, then there are no duplicates at the level you describe for your "super-utilizers". You might show the exact code that you ran to test this.
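
As a quick sanity check, a sketch like this (assuming your data set is named HAVE and the variable is named patientID) shows whether any IDs repeat at all; if the two counts match, every ID occurs exactly once and a frequency of 1 everywhere is the correct answer:

/* Sketch: compare total rows to distinct patientID values.
   HAVE and patientID are assumed names from this thread. */
proc sql;
  select count(*) as n_rows,
         count(distinct patientID) as n_ids
    from have;
quit;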

Here is something else you might try. This keeps only the patients that occur more than one time in the data.

proc summary data=have nway;
   class patientid;
   /* _FREQ_ holds the record count per PATIENTID;
      WHERE= keeps only the repeat patients */
   output out=patientcount(drop=_type_ where=(_freq_>1));
run;
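
PATIENTCOUNT then has one row per repeat patient (with the record count in _FREQ_); a sketch for pulling their original records back out, using the same assumed names, is:

/* Sketch: keep every HAVE record whose patientid is in PATIENTCOUNT */
proc sql;
   create table superutilizers as
      select h.*
         from have as h
              inner join patientcount as p
                on h.patientid = p.patientid;
quit;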

 

 

As a bare minimum, it never hurts to at least list variable names so that pseudo-code makes more sense.

 

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon, or attached as text, to show exactly what you have and so that we can test code against it.

I would suggest creating a subset of your data using something like the OBS= and KEEP= options, and then using the above macro with that data set.

 

data temp;
    set have (obs=15 keep=var1 var2 var3);
run;

SuzanneDorinski
Lapis Lazuli | Level 10

Finding super-utilizers in a data set of patient records seems similar to looking for housing units with lots of people in survey data.

 

I wrote the code below for the 2015 American Community Survey public use microdata files, available on the Census Bureau web site.  The data set has 3,147,005 observations and 284 variables.

 

In the data set, SERIALNO is the identifier for the housing unit.  The first PROC SQL counts the number of person records for each SERIALNO.

 

The PROC FREQ shows the number of housing units that have 1 person record, 2 person records, etc.  I can look at the output from the PROC FREQ to find housing units containing relatively large numbers of people.

 

The second PROC SQL creates a macro variable that has the list of values of SERIALNO which have 20 person records.  

 

The third PROC SQL creates a data set of the person records for people in housing units that have 20 person records.  

 

 

libname acs '/folders/myfolders/ACS PUMS';

proc sql;
  create table record_counts as
    select serialno,
           count(serialno) as number_of_records
      from acs.population
        group by serialno
          order by serialno;
quit;

proc freq data=record_counts;
  tables number_of_records;
  title 'Number of person records by SERIALNO in 2015 American Community Survey Public Use Microdata';
run;

proc sql noprint;
  select quote(serialno)
    into :list_of_largest_housing_units separated by ', '
      from record_counts 
        where number_of_records=20;
quit;

proc sql;
  create table units_with_20_people as
    select serialno,
           sporder,
           agep,
           sex,
           rac1p,
           mar,
           hisp,
           relp
      from acs.population
        where serialno in (&list_of_largest_housing_units)
          order by serialno, sporder;
quit;
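
One caution: a SAS macro variable holds at most 65,534 characters, so with enough qualifying SERIALNO values the quoted IN-list could be truncated. A join sketch using the same table and variable names avoids that limit:

/* Sketch: same result as the IN-list, joining on RECORD_COUNTS
   instead to avoid the 65,534-character macro variable limit */
proc sql;
  create table units_with_20_people as
    select p.serialno,
           p.sporder,
           p.agep,
           p.sex,
           p.rac1p,
           p.mar,
           p.hisp,
           p.relp
      from acs.population as p
           inner join record_counts as c
             on p.serialno = c.serialno
      where c.number_of_records = 20
      order by p.serialno, p.sporder;
quit;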

[PROC FREQ output showing number of person records per housing unit]

