I need to classify employment by sector: Services or Goods. I need to fill in the NAICS_sector field by using word or words / phrases in the Place of employment field. The code is below. It "works" if it is classifying on one word like "disabled" or a phrase like "Not_working" in the Place of employment field. I used that to fill in the Other_categories field.
As I mentioned, I am starting to classify the responses in the Place of Employment field into either Services or Goods. I decided to start out with
Place of Employment responses related to Education. One thing to keep in mind is that this data is from contact tracing so if a high school student had covid or quarantined for covid the tracer would put Eagle Crest High School as shown in the following example as Place of Employment. Obviously high school isn't the place of employment for the person but it does relate back to the service sector. It is the sector that is being measured. I've shown some other variations under Place of Employment relating to different types of Educational settings.
The headers are
NAICS_sector Person_ID Place of Employment Sector_Type Type_Firm Other_categories
123456 Eagle Crest High School
555555 Little Preschool
332425 UO
1111111 Central Elementary School
345267 Student
and so on.
Rather than classifying manually I have been using coding like
Data SASCDC_2.Arias_NAICS_Classify_A; Length NAICS_Sector $20; Informat NAICS_Sector $20.; Format NAICS_Sector $20.; Set SASCDC_2.Arias_NAICS_Classify; If _n_ ge 3003; If Place_of_Employment in ("UNEMPLOYED", "NOT WORKING", "NO JOB", "NOT EMPLOYED") Then Other_Categories = "Unemployed___Not_working"; If Place_of_Employment in ("RETIRED", "REITRED") Then Other_Categories = "Retired"; If Place_of_Employment in ("CHILD", "MINOR CHILD", "MINOR", "INFANT", "TODDLER", "BABY", "(MINOR)") Then Other_Categories = "Child_minor"; If Place_of_Employment in ("WORK FROM HOME", "WORKS FROM HOME", "REMOTE", "AT HOME", "WORKING FROM HOME", "STAY AT HOME", "STAYS HOME", "WORKS REMOTELY", "STAY AT HOME", "STAY AT HOME MOM") Then Other_Categories = "Work_from_home_remotely"; If Place_of_Employment in ("SELF EMPLOYED", "SELF EMPLOYED(WORKS FROM HOME)", "SELF-EMPLOYED") Then NAICS_Sector = "Other_Services"; If NAICS_Sector in ("Other_Services") Then Sector_Type = "Self_employed"; If Place_of_Employment in ("DISABLED", "ON DISABILITY", "DISABILITY") Then Other_Categories = "Disabled"; If Place_of_Employment in ("SCHOOL", "HIGH SCHOOL", "ELEMENTARY", "STUDENT", "EDUCATION", "PRESCHOOL", "ELEMENTARY SCHOOL", "PRIMARY", "HS", "CHRISTIAN SCHOOL", "ACADEMY", "UO", "OSU", "COLLEGE", "UNIVERSITY", "LEARNING", "CHARTER SCHOOL") Then NAICS_Sector = "Services"; run;
This is will do okay if the match is on one word like Student. However, no matches result on phrases like: "Central Elementary School" even though Elementary and School are listed in the in clause.
How should the code be modified to capture the Place of Employment that are phrases like "Western State College" etc and not just one
word like "Student" or "UO"?
Thanks.
When looking for values where one word is often combined with others, such as "School" in "High School" "Grade School" "Elementary School" and such you may want to consider if FINDW or INDEX are more to your need:
if findw(Place_of_Employment,'SCHOOL','i')> 0 then NAICS_Sector = "Services";
for example.
And don't try to force all comparisons into a single statement. That way lies madness. When your IF statement values might overlap you may want to use "else if". Example a school named "Doctor Johnson Elementary School" could be confused with a search on "Doctor". You would likely not want to assign the category "Doctor" might be in when in a compound with School as above. Order of comparisons is important in such.
Having worked with data where people are referencing schools you may find a very iterative approach needed as they don't always spell out "school", may imply it with works like elementary, i.e. Lincoln Elementary, "junior high","senior high" or any number of abbreviations: HS, JHS, MS just a few.
When looking for values where one word is often combined with others, such as "School" in "High School" "Grade School" "Elementary School" and such you may want to consider if FINDW or INDEX are more to your need:
if findw(Place_of_Employment,'SCHOOL','i')> 0 then NAICS_Sector = "Services";
for example.
And don't try to force all comparisons into a single statement. That way lies madness. When your IF statement values might overlap you may want to use "else if". Example a school named "Doctor Johnson Elementary School" could be confused with a search on "Doctor". You would likely not want to assign the category "Doctor" might be in when in a compound with School as above. Order of comparisons is important in such.
Having worked with data where people are referencing schools you may find a very iterative approach needed as they don't always spell out "school", may imply it with works like elementary, i.e. Lincoln Elementary, "junior high","senior high" or any number of abbreviations: HS, JHS, MS just a few.
Thank you. The approach that you explain does help.
Appreciate it. I may have some other questions before this step is complete.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.