I have a dataset of 35,000 observations. These are self-reported places of employment from a contact-tracing database of persons who were either identified as having COVID and completed treatment/quarantine, or presumed to have been exposed to someone with COVID and quarantined. The self-reported value is free text such as XYZ Drug Store, State University, Central Manufacturing, Safeway, Bob's Honda, and so on.
I am classifying each place of employment according to the BLS NAICS industry classification system. So far I have manually classified 1,279 observations. My idea is to use those completed records as a kind of training set, then use SAS to loop through the remaining records and match identified keywords so that the basic Goods / Services designation can be made. From there, I hope to identify the sub_type of firm. The schematic would look like this:
NAICS / Industry-Sector: Goods
NAICS subsectors: Natural Resources, Mining; Construction; Manufacturing
NAICS / Industry-Sector: Services
NAICS subsectors: Trade, Transportation, Utilities; Information; Financial Activities; Professional & Business Services; Education_Health Services; Leisure & Hospitality; Other
In the 1,279 records I have completed manually, I have hit every one of these subsectors many times over.
I think (hope) SAS can help here, but I don't know the best method to use. Are there techniques in SAS by which identified keywords (e.g., Bank, Restaurant, Bar, Food, School) can be used to find similar keywords in the remaining 33,000-plus rows of places of employment, so that further classification can be automated?
I realize the match rate obviously won't be close to 100%, but even 50 or 60% would be great.
Thank you for your help.
wklierman
Are you asking for Machine Learning techniques?
"I have completed 1,279 obs" - What does this mean? If you have built some code that handles the first 1,279 rows in your data what is the problem with running it over the full 30,000 rows?
An initial exploration could be done by pulling the NAICS list(s) into a SAS dataset, then using that to search your data. Depending on how similar the data sources are, you might be surprised how good the match is. Start by importing the NAICS data into SAS and post an example of it here, along with some rows from your data. Then we can provide a search method for you.
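As a rough illustration of what such a search could look like, here is a minimal sketch. It assumes a small hand-built keyword lookup table rather than the real NAICS list. The dataset names (keywords, employers, classified), the variable names, and the keyword-to-subsector assignments are all hypothetical and would need to be replaced with your own. It uses FINDW for case-insensitive whole-word matching inside a PROC SQL join.

```sas
/* Hypothetical keyword lookup: each keyword maps to a sector and subsector. */
/* The rows below are illustrative only, not an official NAICS mapping.      */
data keywords;
  infile datalines dsd truncover;
  length keyword $30 sector $10 subsector $40;
  input keyword $ sector $ subsector $;
  datalines;
BANK,Services,Financial Activities
RESTAURANT,Services,Leisure & Hospitality
SCHOOL,Services,Education_Health Services
UNIVERSITY,Services,Education_Health Services
MANUFACTURING,Goods,Manufacturing
CONSTRUCTION,Goods,Construction
;
run;

/* Hypothetical sample of the free-text place-of-employment field. */
data employers;
  infile datalines truncover;
  length employer $60;
  input employer $60.;
  datalines;
XYZ Drug Store
State University
Central Manufacturing
First National Bank
;
run;

/* Left join keeps every employer; sector and subsector stay missing when */
/* no keyword is found. FINDW looks for the keyword as a whole word,      */
/* using a blank as the delimiter and the 'i' modifier to ignore case.    */
proc sql;
  create table classified as
  select e.employer, k.keyword, k.sector, k.subsector
  from employers as e
  left join keywords as k
    on findw(e.employer, strip(k.keyword), ' ', 'i') > 0
  order by e.employer;
quit;
```

An employer that contains more than one keyword will appear on more than one row of the result, so those cases would still need a tie-breaking rule or a manual look. Rows with a missing sector are the ones no keyword matched. The already-classified records could be used to grow the keyword list over time.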
Okay. I have transcribed the NAICS codes into a SAS data set. I have also included the place of employment field from my data that I manually classified.
I have attached the two files, both SAS datasets. The arias_economics_occupations dataset is the data after I did the manual classification on the 1,290 or so records. I eliminated rows with no usable response for place of employment (N/A, NA, n/a, na, etc.), so there are about 29,500 total observations.
I hope this is useful. Even being able to make a little progress on classification will be a big time saver.
Thank you.
wklierman