Oh boy. Thanks for the replies! Sorry for not being clear with my initial request; it's hard to know what I don't know, haha. I am indeed on University Edition, running in a VM. I tried both methods (probably incorrectly), but I couldn't really get anything useful out of either one: I was either getting errors or results that didn't behave as expected.

For more context: I'm using the HCUP SID dataset, which is an anonymized record of every discharge from (in my case) CA hospitals in 2011. It's a lot of data, with many rows and many columns. The ID variable I mentioned is a partially re-identified unique patient number. I'm not sure exactly what it means for a dataset to be "organized by" any one variable, but this is not that key variable. If a patient returns for another stay at a CA hospital in 2011, that patient will have two or more rows in the data with the same patient ID. The patients who come back most often are known as super-utilizers. This is a fairly small group, so I don't actually care about the 20% per se. What I need is 1) a table or report that suggests where I might draw a reasonable line on visit counts for defining a super-utilizer category (no exact science), and 2) once I have identified that group, a dataset I can manipulate and compare against other groups from the original data.

I initially tried just running: proc freq order=freq; tables patientID; run; But there were two problems. First, since there are almost as many patient IDs as discharges, the output table would have potentially millions of rows, and there was no way I could display output that large. When I used a subset of the data that looked only at discharges with mental health diagnoses, stays of less than 60 days, and homeless patients (about 9,000 rows), I was able to run PROC FREQ, look at the cumulative percentages in the table, and infer a reasonable cutoff for visit count. But second, even with the smaller subset, I realized I would have to manually note which IDs were coming up a lot, which again is an insurmountable task. So I understand the logic of 1) not printing the table and 2) outputting the results to a dataset.

However, when I tried the two methods presented above, I got back a table where every frequency was 1 in the first case, and quite a few errors in the second. Even when I did get some results back, the number of records I was losing at each step was unexpectedly large. I feel like I understand the rationale behind both methods in general, so I'm sure the problem is in how I entered or adapted your examples, but I don't have the background to see what I'm doing wrong. I realize I'm asking a lot here, haha, but here's to hoping. Thanks!
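To make it concrete, here is a rough sketch of what I think the suggested approach looks like. The dataset name sid_ca_2011, the variable name patientID, and the cutoff of 4 visits are just placeholders for my actual names and whatever cutoff the report suggests:

/* Step 1: count discharges per patient without printing the huge  */
/* one-way table, and save the counts to a dataset.                */
proc freq data=sid_ca_2011 noprint;
    tables patientID / out=patient_counts(rename=(count=visits));
run;

/* Step 2: look at the distribution of visit counts. This table is */
/* small (one row per distinct visit count) and the cumulative     */
/* percentages should help pick a super-utilizer cutoff.           */
proc freq data=patient_counts;
    tables visits;
run;

/* Step 3: once a cutoff is chosen (4+ visits here, purely as an   */
/* example), keep those patients and pull all of their discharges  */
/* back out of the full dataset for comparison with other groups.  */
proc sql;
    create table super_utilizers as
    select a.*
    from sid_ca_2011 as a
    inner join patient_counts as b
        on a.patientID = b.patientID
    where b.visits >= 4;
quit;

Does that look like roughly what was intended, or am I still misreading the suggestions?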