I am using a dataset that merges cross-sectional data from 2004-2014. Physicians are noted by a unique phycode each year (although the phycode is repeated in subsequent years -- i.e. starts again with 1 each year -- even though we are talking about different physicians). Therefore, both the phycode and year are needed to identify each unique physician in the dataset.
I am interested in calculating the average number of patient visits (each unique visit is numbered within each physician -- "patcode") per physician and I created a new count variable to count the number of visits for each unique provider in each year:
proc sort data=test;
by phycode year;
run;
Data count;
Set test;
keep phycode patcode count insurance year;
by phycode year;
if first.phycode or first.year then count=1;
else count+1;
Run;
However, I had problems trying to then calculate the average number of visits per unique physician from this count. I had tried
Data count1;
Set count;
by phycode;
if count>=1;
if last.phycode then output;
Run;
proc means data=count;
var count;
run;
However, this misses physicians in some years with the same phycode but different year (this is only giving me last physician code number overall with highest count, which will not work since there are duplicate codes in some years). Any thoughts on how I could adjust the code above?
Thanks!
Can you use SQL? Count Distinct would be helpful here.
proc sql;
create table want as
select count(distinct age), sex
from sashelp.class
group by sex;
quit;
What would that look like instead here? Would it be:
proc sql;
create table want as
select count(distinct phycode), year
from test
group by phycode year;
quit;
@gonzy31 wrote:
What would that look like instead here? Would it be:
proc sql;
create table want as
select count(distinct phycode), year
from test
group by phycode year;
quit;
Depends on your data. Post sample data (as a data step) and expected output if you want code that works with your data. If you post generic questions, you get generic answers.
Here is a sample of the data with the variables I care about
proc sort data=test;
by phycode year;
run;
Data count;
Set test;
keep phycode patcode count year;
by phycode year;
if first.phycode or first.year then count=1;
else count+1;
Run;
11 | 1 | 2007 | 1 | |
11 | 2 | 2007 | 2 | |
11 | 3 | 2007 | 3 | |
11 | 4 | 2007 | 4 | |
11 | 5 | 2007 | 5 | |
13 | 1 | 2005 | 1 | |
13 | 2 | 2005 | 2 | |
13 | 3 | 2005 | 3 | |
13 | 4 | 2005 | 4 | |
13 | 1 | 2006 | 1 | |
13 | 2 | 2006 | 2 | |
I want the output to look like this though:
11 | 5 | 2007 | . | 5 |
13 | 4 | 2005 | 4 | |
13 | 2 | 2006 | 2 | |
You've overcomplicated this.
data want;
set have;
by phycode year;
if last.year;
run;
@gonzy31 wrote:
I am using a dataset that merges cross-sectional data from 2004-2014. Physicians are noted by a unique phycode each year (although the phycode is repeated in subsequent years -- i.e. starts again with 1 each year -- even though we are talking about different physicians). Therefore, both the phycode and year are needed to identify each unique physician in the dataset.
However, I had problems trying to then calculate the average number of visits per unique physician from this count. I had tried
However, this misses physicians in some years with the same phycode but different year (this is only giving me last physician code number overall with highest count, which will not work since there are duplicate codes in some years). Any thoughts on how I could adjust the code above?
Thanks!
I am not sure I am understaning your last problem. You said that to uniquely identify in an given year tha tyou need both the year and the phycode. And then you are looking only at phycode?!? If the physicican changes code from year to year how do you expect to recognize the same one from multiple years??? Or did you misstate the requirement that both year and phycode are required for unique identifcation and phycode alone identfies the physician and you just need to get counts per year?
It might help considerably to provide some dummy example data as far as what is needed to identify some records for one physician in multiple years and what the expected count should be. Include at least one where the problem you describe exists. The data should be in form of data step code.
I would thint that perhaps the first thing may be to actually create real unique per physician codes that work across years.
I think I misspoke. Physicians are unique each year, but you need both the phycode and year to identify a unique physician (this being because with each year's data, the phycode identified starts from 1 again for each unique physician in that year).
SAS Output -- phycode is unique to each individual physician in a specific year but the same phycode can be used in another year for a different physician as seen here; therefore, need bothyear and phycode as identifier. Patcode represents each unique patient visit to a physician. I created a count variable to ensure that I could count total number of visits per physician in each unique year (see line 89 when count restarts with repeat phycode but in different year)
11 | 1 | 2007 | 0 | 1 |
11 | 2 | 2007 | 1 | 2 |
11 | 3 | 2007 | 1 | 3 |
11 | 4 | 2007 | 1 | 4 |
11 | 5 | 2007 | 1 | 5 |
11 | 6 | 2007 | 1 | 6 |
11 | 7 | 2007 | 0 | 7 |
11 | 8 | 2007 | 1 | 8 |
11 | 9 | 2007 | 1 | 9 |
11 | 10 | 2007 | 1 | 10 |
11 | 11 | 2007 | 1 | 11 |
11 | 12 | 2007 | 0 | 12 |
11 | 13 | 2007 | 1 | 13 |
11 | 14 | 2007 | 1 | 14 |
11 | 15 | 2007 | 1 | 15 |
11 | 16 | 2007 | 1 | 16 |
11 | 17 | 2007 | 1 | 17 |
11 | 18 | 2007 | 1 | 18 |
11 | 19 | 2007 | 1 | 19 |
11 | 20 | 2007 | 1 | 20 |
11 | 21 | 2007 | 1 | 21 |
11 | 22 | 2007 | 1 | 22 |
11 | 23 | 2007 | 1 | 23 |
11 | 24 | 2007 | 1 | 24 |
11 | 25 | 2007 | 0 | 25 |
11 | 26 | 2007 | 1 | 26 |
11 | 27 | 2007 | 1 | 27 |
11 | 28 | 2007 | 0 | 28 |
11 | 29 | 2007 | 1 | 29 |
11 | 30 | 2007 | . | 30 |
13 | 1 | 2005 | 1 | 1 |
13 | 2 | 2005 | 1 | 2 |
13 | 3 | 2005 | 1 | 3 |
13 | 4 | 2005 | 0 | 4 |
13 | 5 | 2005 | 1 | 5 |
13 | 6 | 2005 | 1 | 6 |
13 | 7 | 2005 | 1 | 7 |
13 | 8 | 2005 | 1 | 8 |
13 | 9 | 2005 | 1 | 9 |
13 | 10 | 2005 | 1 | 10 |
13 | 11 | 2005 | 1 | 11 |
13 | 12 | 2005 | 1 | 12 |
13 | 13 | 2005 | 1 | 13 |
13 | 14 | 2005 | 1 | 14 |
13 | 15 | 2005 | 1 | 15 |
13 | 16 | 2005 | 1 | 16 |
13 | 17 | 2005 | 1 | 17 |
13 | 18 | 2005 | 1 | 18 |
13 | 19 | 2005 | 1 | 19 |
13 | 20 | 2005 | 1 | 20 |
13 | 21 | 2005 | 1 | 21 |
13 | 22 | 2005 | 0 | 22 |
13 | 23 | 2005 | 1 | 23 |
13 | 24 | 2005 | 1 | 24 |
13 | 1 | 2006 | 0 | 1 |
13 | 2 | 2006 | 0 | 2 |
13 | 3 | 2006 | 1 | 3 |
13 | 4 | 2006 | . | 4 |
13 | 5 | 2006 | 1 | 5 |
13 | 6 | 2006 | 1 | 6 |
13 | 7 | 2006 | 1 | 7 |
13 | 8 | 2006 | 0 | 8 |
13 | 9 | 2006 | 1 | 9 |
13 | 10 | 2006 | 0 | 10 |
13 | 11 | 2006 | . | 11 |
13 | 12 | 2006 | 1 | 12 |
13 | 13 | 2006 | . | 13 |
13 | 14 | 2006 | 1 | 14 |
13 | 15 | 2006 | 0 | 15 |
13 | 16 | 2006 | 1 | 16 |
13 | 17 | 2006 | . | 17 |
13 | 18 | 2006 | 1 | 18 |
13 | 19 | 2006 | 1 | 19 |
13 | 20 | 2006 | 1 | 20 |
13 | 21 | 2006 | 1 | 21 |
13 | 22 | 2006 | 0 | 22 |
13 | 23 | 2006 | 1 | 23 |
13 | 24 | 2006 | 1 | 24 |
13 | 25 | 2006 | 0 | 25 |
13 | 26 | 2006 | 1 | 26 |
13 | 27 | 2006 | 1 | 27 |
13 | 28 | 2006 | 1 | 28 |
13 | 29 | 2006 | . | 29 |
13 | 1 | 2007 | 0 | 1 |
13 | 2 | 2007 | 1 | 2 |
13 | 3 | 2007 | 1 | 3 |
13 | 4 | 2007 | 0 | 4 |
13 | 5 | 2007 | 1 | 5 |
13 | 6 | 2007 | 1 | 6 |
13 | 7 | 2007 | 1 | 7 |
13 | 8 | 2007 | 0 | 8 |
13 | 9 | 2007 | 1 | 9 |
13 | 10 | 2007 | 1 | 10 |
13 | 11 | 2007 | 0 | 11 |
13 | 12 | 2007 | 0 | 12 |
13 | 13 | 2007 | 0 | 13 |
13 | 14 | 2007 | 0 | 14 |
13 | 15 | 2007 | 0 | 15 |
13 | 16 | 2007 | 0 | 16 |
13 | 17 | 2007 | 1 | 17 |
After creating this count variable though for the number of visits per physician, I am not sure how to calculate the average number of visits per provider. I had tried to create a new dataset with just the last count per physician code and doing proc means on the count variable, but this loses the highest count for a physician code in earlier years when the same physician code is repeated over years.
SAS Output
11 | 30 | 2007 | . | 30 |
13 | 28 | 2007 | 1 | 28 |
14 | 20 | 2009 | 1 | 20 |
15 | 32 | 2010 | 1 | 32 |
16 | 13 | 2008 | 1 | 13 |
17 | 21 | 2008 | 1 | 21 |
18 | 30 | 2009 | 1 | 30 |
19 | 47 | 2009 | 1 | 47 |
20 | 25 | 2013 | 1 | 25 |
21 | 27 | 2008 | 1 | 27 |
22 | 35 | 2010 | 1 | 35 |
23 | 17 | 2012 | 1 | 17 |
24 | 13 | 2014 | 1 | 13 |
26 | 28 | 2012 | 1 | 28 |
27 | 9 | 2008 | 1 | 9 |
28 | 45 | 2007 | 1 | 45 |
29 | 30 | 2008 | 1 | 30 |
30 | 1 | 2014 | 1 | 1 |
31 | 33 | 2011 | 1 | 33 |
32 | 24 | 2013 | 1 | 24 |
33 | 40 | 2012 | 1 | 40 |
34 | 11 | 2014 | 1 | 11 |
35 | 3 | 2012 | 0 | 3 |
36 | 40 | 2014 | 1 | 40 |
How are you defining the average of the providers? You can't combine over years, so you can only count the number of distinct per year, per doc and then average that number.
Your patient code has no duplicates...so what are you actually trying to count here? Count is equal to patcode. This data doesn't make sense to me.
I want the average number of visits per provider. Each provider is unique in each year -- so you can take an overall average number of visits per provider over the ten years (as long as each provider is uniquely identified by year and phycode). Patcode, you are correct, is the same as count here --- but not later when I restrict the visits to only patients without insurance, but this is why I am creating a distinct count variable.
The last.year output worked -- I think you were right that I was overcomplicating it.
I then just got the average count from the output above --
proc means data=test;
var count;
run;
I think this solved the issue... thanks!
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.