I am brand new to SAS and still learning for work. I am currently assigned to work with the NETS database which includes every business in California from 1990-2012. We are trying to look at Kern County and compare it to six other benchmarks over time. We will be doing things like summing employment by industry for each year to see trends, looking at sales numbers, looking at average paydex scores by industry, mostly comparing Kern to the benchmark counties.
Is there a way to write a Macro that would allow me to do the same calculations for every county in every year without having to use long OR statements and multiple commands? I have included two of the commands I wrote out, but each one is an individual step.
The first is a program I wrote to pull any company that had ever been in a specific county using FIPS code. We want to be able to track movements so if a company was only in Kern county in 1997 we still want it in the larger dataset. The second program is the only way that I know how to sum employment by industry, but with 7 counties and 22 years I would have to rewrite this program with just minor changes 154 times....arghhhhh!
Thank you so much for any help.
*----To separate data by county-----*
Data Kern.KernCounty;
Set Kern.Alldata;
If FIPS90='6029' or FIPS91='6029' or FIPS92='6029' or FIPS93='6029' or FIPS94='6029' or FIPS95='6029' or FIPS96='6029' or FIPS97='6029' or FIPS98='6029' or FIPS99='6029' or FIPS00='6029' or FIPS01='6029' or FIPS02='6029' or FIPS03='6029' or FIPS04='6029' or FIPS05='6029' or FIPS06='6029' or FIPS07='6029' or FIPS08='6029' or FIPS09='6029' or FIPS10='6029' or FIPS11='6029' or FIPS12='6029';
Run;
Data Kern.FresnoCounty;
Set Kern.Alldata;
If FIPS90='6019' or FIPS91='6019' or FIPS92='6019' or FIPS93='6019' or FIPS94='6019' or FIPS95='6019' or FIPS96='6019' or FIPS97='6019' or FIPS98='6019' or FIPS99='6019' or FIPS00='6019' or FIPS01='6019' or FIPS02='6019' or FIPS03='6019' or FIPS04='6019' or FIPS05='6019' or FIPS06='6019' or FIPS07='6019' or FIPS08='6019' or FIPS09='6019' or FIPS10='6019' or FIPS11='6019' or FIPS12='6019';
Run;
data kern.Losangelescounty;
set kern.Alldata;
If FIPS90='6037' or FIPS91='6037' or FIPS92='6037' or FIPS93='6037' or FIPS94='6037' or FIPS95='6037' or FIPS96='6037' or FIPS97='6037' or FIPS98='6037' or FIPS99='6037' or FIPS00='6037' or FIPS01='6037' or FIPS02='6037' or FIPS03='6037' or FIPS04='6037' or FIPS05='6037' or FIPS06='6037' or FIPS07='6037' or FIPS08='6037' or FIPS09='6037' or FIPS10='6037' or FIPS11='6037' or FIPS12='6037';
Run;
data kern.Riversidecounty;
set kern.Alldata;
If FIPS90='6065' or FIPS91='6065' or FIPS92='6065' or FIPS93='6065' or FIPS94='6065' or FIPS95='6065' or FIPS96='6065' or FIPS97='6065' or FIPS98='6065' or FIPS99='6065' or FIPS00='6065' or FIPS01='6065' or FIPS02='6065' or FIPS03='6065' or FIPS04='6065' or FIPS05='6065' or FIPS06='6065' or FIPS07='6065' or FIPS08='6065' or FIPS09='6065' or FIPS10='6065' or FIPS11='6065' or FIPS12='6065';
Run;
data kern.SanBernadinocounty;
set kern.Alldata;
If FIPS90='6071' or FIPS91='6071' or FIPS92='6071' or FIPS93='6071' or FIPS94='6071' or FIPS95='6071' or FIPS96='6071' or FIPS97='6071' or FIPS98='6071' or FIPS99='6071' or FIPS00='6071' or FIPS01='6071' or FIPS02='6071' or FIPS03='6071' or FIPS04='6071' or FIPS05='6071' or FIPS06='6071' or FIPS07='6071' or FIPS08='6071' or FIPS09='6071' or FIPS10='6071' or FIPS11='6071' or FIPS12='6071';
Run;
data kern.SanJaoquincounty;
set kern.Alldata;
If FIPS90='6077' or FIPS91='6077' or FIPS92='6077' or FIPS93='6077' or FIPS94='6077' or FIPS95='6077' or FIPS96='6077' or FIPS97='6077' or FIPS98='6077' or FIPS99='6077' or FIPS00='6077' or FIPS01='6077' or FIPS02='6077' or FIPS03='6077' or FIPS04='6077' or FIPS05='6077' or FIPS06='6077' or FIPS07='6077' or FIPS08='6077' or FIPS09='6077' or FIPS10='6077' or FIPS11='6077' or FIPS12='6077';
Run;
data kern.Venturacounty;
set kern.Alldata;
If FIPS90='6111' or FIPS91='6111' or FIPS92='6111' or FIPS93='6111' or FIPS94='6111' or FIPS95='6111' or FIPS96='6111' or FIPS97='6111' or FIPS98='6111' or FIPS99='6111' or FIPS00='6111' or FIPS01='6111' or FIPS02='6111' or FIPS03='6111' or FIPS04='6111' or FIPS05='6111' or FIPS06='6111' or FIPS07='6111' or FIPS08='6111' or FIPS09='6111' or FIPS10='6111' or FIPS11='6111' or FIPS12='6111';
Run;
*----To sum employment by year, NAICS code, and county----*
data kern.KernEmp90 (drop=DunsNUmber Address10 City10 State10 Emp10 NAICS10 Multiplier Spending);
set kern.KernCounty;
by NAICS90; *----This is a single year example, I need to go 1990-2012 for all seven counties-----*
If first.NAICS90 then TotalEmp=0;
TotalEmp + Emp90;
if last.NAICS90;
Run;
When the data represents a time series then very likely you want one record per time period. That makes use of the actual time series procs much nicer.
The wide to long data form would be accomplished in a single data step with multiple arrays and if the variables are named with 1990 to 2012 the array code is cleaner AND easier to understand.
data long;
set wide;
array fip Fips1990-Fips2012;
Array dun Dun1990-Dun2012; /* one of these array definitions per time stamped variable*/;
do year = 1990 to 2012;
Fips = Fip[year];
Dun = Dun[year]; /* repeat through all of the arrays*/
output; /* Key to output one record for each year*/
end;
drop Fips1990-Fips2012 Dun1990-Dun2012; /* you don't want original varibles*/
run;
Ooops sorry ignore the drop statement, that was from an earlier program. I will be dropping different variables to clean up the file, but that is mostly to make it easier to read.
data kern.KernEmp90;
set kern.KernCounty;
by NAICS90; *----This is a single year example, I need to go 1990-2012 for all seven counties-----*
If first.NAICS90 then TotalEmp=0;
TotalEmp + Emp90;
if last.NAICS90;
Run;
So far you don't need macros, based on what you've shown.
Arrays, WhichC function and using proc means can simplify things.
data kern.County;
length region $20.;
set kern.Alldata;
array fips(23) fips:;
if whichc('6077', of fips(*))>0 then region="SanJoaquinCounty";
if whichc('6111', of fips(*))>0 then region = "VenturaCounty";
Run;
proc sort data=kern.county;
by region naics90;
run;
proc means data=kern.county;
by region naics90;
var emp90;
output out=summary sum(emp90)=TotEmp;
run;
Hi Reeza,
Thank you for your help. When I tried to run the first portion of the program:
data kern.County;
length region $20.;
set kern.Alldata;
array fips(23) fips:;
if whichc('6077', of fips(*))>0 then region="SanJoaquinCounty";
if whichc('6111', of fips(*))>0 then region = "VenturaCounty";
Run;
I received this error message:
ERROR: Too many variables defined for the dimension(s) specified for the array fips.
Any ideas?
Thank you again.
List out the fips variables you're using in the array statement or try the shorthand below.
data kern.County;
length region $20.;
set kern.Alldata;
array fips(23) fips90-fips99 fips00-fips012;
if whichc('6077', of fips(*))>0 then region="SanJoaquinCounty";
if whichc('6111', of fips(*))>0 then region = "VenturaCounty";
Run;
On second thought ...that may not work because a company can belong to multiple regions over time can't it?
I am attempting to run it again and no error messages so far! But yes, a company can move to different regions overtime. Would this command just replace the old region and move the company only into its most recent region?
Unfortunately yes
I'll think about it. Some options jump to mind but not sure how useful they are.
1. Output a record for each region - multiple records so have to be careful with calculations thereafter
2. Create indicator variables, one for each region that you flag as 1 or 0 if the company belongs to the region.
Can you elaborate more on what you're doing after you identify the regions?
Yes, of course. We are trying to create a county profile on Kern and compare it to 6 other benchmarks from 1990-2012. We have the NETS database which includes about over 200 variables on every business in California between 1990-2012 giving use over 6 million observations.
Originally I just created county specific datasets so that the information was more manageable and SAS took less time to run any of my commands. The issue with the different data sets is that then every single command needs to be run 7 times in each different data set. It also needs to be run for each different year because the data is formatted like NAICS90, NAICS91....EMP90, EMP91....FIPS90,FIPS91....SALES90.SALE91.
So some of what we want to do is at what and how many companies move in and out of different regions based on NAICS code each year. Then we want to see where companies are moving too and from for each of the counties .The other issue though is that not all businesses started in 1990 so we will also want to keep track of the new business not necessarily that moved to a county but that started in the county as well. We need to be able to distinguish between the two throughout all 23 years of the data.
We also want to be able to sum employment for every year by NAICS code and county so we can see employment shifts that occur in 4 digit NAICS code industries as well as total employment shifts at the county level. It is similar issues with everything we would want to look at from PayDex Score to Sales.
I don't mind having to run a program 7 times for each county dataset, but it is running for all 7 counties and all 23 years.
Like in the code elow, I am going to have to run it for each year since I need to proc sort the data. I would need to then do that for NAICS90. NAICS91, NAICS92.... since new companies are added and sometimes change NAICS numbers in different years.
proc sort data=kern.county;
by region naics90;
run;
proc means data=kern.county;
by region naics90;
var emp90;
output out=summary sum(emp90)=TotEmp;
run;
At least the code above wouldn't have to be run for each region so I would only need to run it for each year, but I need to be able to distinguish movements and FIPS code changes which as you said the region variable doesn't give me that variation .
Does all of that information makes sense? Sometimes it is difficult to describe data with words. I would attempt to send you the main file, but it is too big.
Thank you so much for your help.
Without lots more description of the data it may be that the data should be moved from what looks like a "wide" format with multiplve variables indicating the year(?) of data to one that has a year variable with only one of each of your current year.
Then you would be able to do analysis by FIP across your years of data.
And didn't Y2K teach people not to use 2 digit years crossing century bounds. If you had the variables as FIPS1990 - FIPS2012 some things would be much cleaner...
Here is a complete description of the database that we are using. I can easily change the variables to four digit years, the text file was just originally two digit. Do you think we would have an easier time working with the data if we shifted it to more of a panel data analysis stacked by company?
When the data represents a time series then very likely you want one record per time period. That makes use of the actual time series procs much nicer.
The wide to long data form would be accomplished in a single data step with multiple arrays and if the variables are named with 1990 to 2012 the array code is cleaner AND easier to understand.
data long;
set wide;
array fip Fips1990-Fips2012;
Array dun Dun1990-Dun2012; /* one of these array definitions per time stamped variable*/;
do year = 1990 to 2012;
Fips = Fip[year];
Dun = Dun[year]; /* repeat through all of the arrays*/
output; /* Key to output one record for each year*/
end;
drop Fips1990-Fips2012 Dun1990-Dun2012; /* you don't want original varibles*/
run;
Thank you so much!
I generally agree with this methodology, however given what the OP wants, a company to belong to different categories some of the original issues will still be there
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.