Re: How to Create a New Variable Based on Frequency

joshknut · Posted 12-22-2016 01:35 PM

Hello,

I am currently using the National Cancer Database and I would like to use the variable PUF_FACILITY_ID to create a new variable based on the frequency a facility ID appears. I've attached a screen shot of the printout of part of the PROC FREQ procedure for this variable.

Using the frequency with which the unique facility ID shows up, I would like to create a new variable based on this frequency. This would be a way to create a "Facility Volume" variable. From this new variable I would like to create percentiles and ultimately a dichotomous variable based off of a percentile cutoff.

Thank you

Astounding · Posted 12-22-2016 02:43 PM

Since you are already familiar with PROC FREQ, that would be a good tool for the first step:

proc freq data=have noprint;

tables PUF_FACILITY_ID /

out=facility_counts (keep=PUF_FACILITY_ID rename=(count=Facility_Volume));

run;

That gives you a SAS data set holding the facilities, and their volume (first two columns of your picture). For creating percentiles, you can use PROC SUMMARY just to get a limited number of percentiles. But PROC UNIVARIATE gives you much more flexibility as to which percentiles you can calculate.

joshknut · Posted 12-22-2016 05:01 PM

Thank you for your quick reply!

I tried to use all of your code, but I could not get the following code to run:

(keep=PUF_FACILITY_ID rename= (count=Facility_Volume));

So I omitted this and it worked well at creating what I needed. See (Attachment labeled "Code")

However, it did create the variable that I desired (see attachment "Count Variable").

I then tried to merge this new dataset with the COUNT term in the Main dataset (see attachment "Main Dataset"). I would like to merge based on the Case Key (PUF_CASE_ID) but I cannot, since this variable is not included in the new dataset with the new COUNT variable.

How do I merge these two datasets by the Case Key?

Thank you very much! This has been incredibly helpful.

Josh

no attachments. -##

ballardw · Posted 12-22-2016 05:39 PM

Show the code you used where the OUT= options did not work and any messages from the log. I suspect you may have inserted a ; wher not needed.

You say you want to merge on PUF_CASE_ID. HOW is that related to PUF_FACILITY_ID? You did not mention that PUF_CASE_ID in the question at all. You will need to describe the relationship between the two variables or did you mean to count PUF_CASE_ID in the first place???

Astounding · Posted 12-22-2016 07:25 PM

You're right about the error. The KEEP= list should also include COUNT:

out=counts (keep=PUF_FACILITY_ID count rename= (count=Facility_Volume));

And yes, all you can merge by is PUF_FACILITY_ID. Would that not do what you need?

joshknut · Posted 12-23-2016 09:26 AM

Hello,

The new code you provided worked perfectly this time.

I'll try to explain my main goal better and maybe you will be able to guide me on how to do that.

In the main dataset, all of the variables are attached to a unique patient ID. Each patient was seen at a particular facility, indicated by their facility ID. I wanted to get the frequency of patients seen at each facility (the counts) to get a measure for the facility volume. Now I need to take that facility volume and make it a new variable in the main dataset, but linked to the unique patient ID. This way I'll be able to conduct a survival analysis based on facility volume.

There are some 48,000 patients in the dataset and about 1,200 facilities. When I merged the data over, by facility ID, it just had 1,200 observations. I need to new Facility_Volume variable to have 48,000 observations, for each of the patients based on the volume of their Facility_ID.

Is there a way to bring the patient ID over with the new data output when creating the Facility_Volume variable? That way I can merge it back by patient ID?

Thanks for your help with this! We are almost to what I need to do. I really appreciate it.

no attachments. -##

ballardw · Posted 12-23-2016 10:55 AM

You have a much better chance if link to the facility. I can see a strong likelihood of patients actually being seen at multiple facilities and so matching a single facility count to all records for the the patient could be terribly misleading.

If you actually want the count of the patient visits at that facility then that is something else and would involve, if using Proc Freq a tables statement with both variables.

You might want to post a small example of data, maybe 20 or 30 records. Dummy data is fine as long as it shows a reasonable combination of facility patient data AND what the desired result for that example data would be. You can get instructions here https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... on how to turn a SAS data set into code that we can execute to recreate your data and test code with the data.

Astounding · Posted 12-23-2016 02:30 PM

If you can rely on the connection of each patient to one particular facility ID, you only need to get counts for each patient. Similarly to what was done previously:

proc freq data=have;

tables patient_id / noprint out=counts (keep=patient_id count rename=(count=patient_volume));

run;

Then merge back into the original data by patient_id.

joshknut · Posted 12-23-2016 02:50 PM

I got it to work! Thank you so much for your help
All of your assistance is greatly appreciated

Happy Holidays

Josh

SAS Innovate 2025: Save the Date