Oh boy. Thanks for the replies! Sorry for not being clear with my initial request; it's hard to know what I don't know, haha. I am indeed on University Edition, running in a VM. I tried both methods (probably incorrectly), but I couldn't really get anything useful out of either one: I was either getting errors or results that didn't behave as expected.

For more context: I'm using the HCUP SID dataset, which is an anonymized record of every discharge from (in my case) CA hospitals in 2011. It's a lot of data, with many rows and many columns. The ID variable I mentioned is a partially re-identified unique patient number. I'm not sure exactly what it means for a dataset to be "organized by" any one variable, but this is not that key variable. If a patient returns for another stay at a CA hospital in 2011, that patient will have two or more rows in the data with the same patient ID. The patients who come back most often are known as super-utilizers. This is a fairly small group, so I don't actually care about the 20% per se. What I need is 1) a table or report that suggests where I might draw a reasonable line on visit counts for defining a super-utilizer category (no exact science), and 2) once I have identified that group, a dataset I can manipulate and compare against other groups from the original data.

I initially tried just running: proc freq order=freq; tables patientID; run; But there were two problems. First, since there are almost as many patient IDs as discharges, the output table would have potentially millions of rows, and there was no way I could display output that large. When I used a subset of the data that looked only at discharges with mental health diagnoses, stays of less than 60 days, and homeless patients (about 9,000 rows), I was able to run PROC FREQ, look at the cumulative percentages in the table, and infer a reasonable cutoff for visit count. But second, even with the smaller subset, I realized I would have to manually note which IDs were coming up a lot, which again is an insurmountable task. So I understand the logic of 1) not printing the table and 2) outputting the results to a dataset.

However, when I tried the two methods presented above, I got back a table where every frequency was 1 in the first case, and quite a few errors in the second. Even when I did get some results back, the number of records I was losing at each step was unexpectedly large. I feel like I understand the rationale behind both methods in general, so I'm sure the problem is in how I entered or adapted your examples, but I don't have the background to see what I'm doing wrong. I realize I'm asking a lot here, haha, but here's to hoping. Thanks!
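To make it concrete, here is a rough sketch of what I think the suggested approach looks like. The dataset name sid_ca_2011, the variable name patientID, and the cutoff of 4 visits are just placeholders for my actual names and whatever cutoff the report suggests:

/* Step 1: count discharges per patient without printing the huge  */
/* one-way table, and save the counts to a dataset.                */
proc freq data=sid_ca_2011 noprint;
    tables patientID / out=patient_counts(rename=(count=visits));
run;

/* Step 2: look at the distribution of visit counts. This table is */
/* small (one row per distinct visit count) and the cumulative     */
/* percentages should help pick a super-utilizer cutoff.           */
proc freq data=patient_counts;
    tables visits;
run;

/* Step 3: once a cutoff is chosen (4+ visits here, purely as an   */
/* example), keep those patients and pull all of their discharges  */
/* back out of the full dataset for comparison with other groups.  */
proc sql;
    create table super_utilizers as
    select a.*
    from sid_ca_2011 as a
    inner join patient_counts as b
        on a.patientID = b.patientID
    where b.visits >= 4;
quit;

Does that look like roughly what was intended, or am I still misreading the suggestions?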