wlierman
Lapis Lazuli | Level 10

I have a dataset of 35,000 observations.  These are self-reported places of employment from a contact-tracing database of persons who were either identified as having COVID and completed treatment / quarantine, or presumed to have been exposed to someone with COVID and quarantined.  The self-reported value is free text like XYZ Drug Store, State University, Central Manufacturing, Safeway, Bob's Honda and so on.

 

I am classifying the place of employment according to the BLS NAICS Industrial Classification System.  I have completed 1,279 obs.  My idea is to use those completed records as a sort of training set and then apply SAS techniques to loop through the remaining records, using identified keywords to make the basic Goods / Services designation.  From there I would hopefully identify the sub-type of firm.  The schematic would look like this:

 

NAICS / Industry-Sector    Goods                        Services

NAICS subsector            Natural Resources, Mining    Trade, Transportation, Utilities
                           Construction                 Information
                           Manufacturing                Financial Activities
                                                        Professional & Business Services
                                                        Education_Health Services
                                                        Leisure & Hospitality
                                                        Other

 

In the 1,279 records I have completed manually I have hit every one of these sub-sectors many times over.

I think (hope) SAS can help here but I don't know the best method to use. Are there techniques in SAS so that identified keywords (e.g., Bank, Restaurant, Bar, Food, School) can be used to find similar keywords in the remaining 33,000-plus rows of places of employment, so that further identification can be automated?

 

I realize the match rate obviously won't be close to 100%, but even 50 or 60% would be great.

 

Thank you for your help.

 

wklierman
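The "training set" idea in this post can be sketched roughly as follows. This is a minimal illustration in Python rather than SAS (the poster says later in the thread that either is acceptable), and it is not a method proposed by anyone in the thread. The sample rows, the function names `learn_keywords` and `classify_by_votes`, and the `min_share` / `min_count` thresholds are all assumptions made up for the example.

```python
from collections import Counter, defaultdict
import re

def tokens(name):
    """Lowercase a free-text employer name and keep word tokens of 3+ letters."""
    return [t for t in re.findall(r"[a-z]+", name.lower()) if len(t) >= 3]

def learn_keywords(labeled, min_share=0.8, min_count=1):
    """Map a token to a subsector when the token is used often enough and
    at least min_share of its uses carry that one subsector."""
    counts = defaultdict(Counter)
    for name, subsector in labeled:
        for tok in set(tokens(name)):
            counts[tok][subsector] += 1
    keywords = {}
    for tok, by_subsector in counts.items():
        subsector, hits = by_subsector.most_common(1)[0]
        total = sum(by_subsector.values())
        if total >= min_count and hits / total >= min_share:
            keywords[tok] = subsector
    return keywords

def classify_by_votes(name, keywords):
    """Let each known token vote; return None when no token is known."""
    votes = Counter(keywords[t] for t in tokens(name) if t in keywords)
    return votes.most_common(1)[0][0] if votes else None

# Made-up rows standing in for the manually classified records.
labeled = [
    ("First National Bank", "Financial Activities"),
    ("Community Bank of Oregon", "Financial Activities"),
    ("Joe's Restaurant", "Leisure & Hospitality"),
    ("Lincoln Elementary School", "Education_Health Services"),
]
# Made-up rows standing in for the unclassified records.
unlabeled = ["Valley Bank", "Corner Restaurant", "Bob's Honda"]

keywords = learn_keywords(labeled)
results = {name: classify_by_votes(name, keywords) for name in unlabeled}
match_rate = sum(v is not None for v in results.values()) / len(results)
```

With a real labeled set, raising `min_count` above 1 would keep one-off words such as a town name from becoming keywords. Names with no learned keyword (here "Bob's Honda") come back as `None` and would still need manual review.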

10 REPLIES
wlierman
Lapis Lazuli | Level 10
Hello,
Anything (Python, SAS, even R). The 1,299 obs were done manually on an xlsx worksheet, and completing the other 33,000 would be a good half-month's work.  And of course the database increases every week.


SASKiwi
PROC Star

"I have completed 1,279 obs" - What does this mean? If you have built some code that handles the first 1,279 rows in your data, what is the problem with running it over the full 30,000 rows?

wlierman
Lapis Lazuli | Level 10
That was done manually on an xlsx worksheet.  To do 33,000 additional records will be oh, I'd say two weeks.  That's why I hope there is some coding alternative with SAS.
Thanks.

SASKiwi
PROC Star

An initial exploration could be done by pulling the NAICS list(s) into a SAS dataset then using that to search your data. Depending on how similar the data sources are you might be surprised how good the match is. Start by importing the NAICS data into SAS and post an example of it here along with some rows from your data. Then we can provide a search method for you.
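One way to picture the "search your data with the NAICS list" suggestion is the sketch below. It is written in Python only for illustration and is not SASKiwi's actual method. The four NAICS rows are a tiny illustrative sample that should be checked against the official list, and the helper names `words` and `best_naics` and the `cutoff` value are assumptions. It uses the standard-library `difflib.get_close_matches` so that "University" can still hit "Universities" and "Bank" can hit "Banking".

```python
import difflib
import re

# Illustrative sample only; a real run would load the full official NAICS list.
naics_titles = {
    "522110": "Commercial Banking",
    "722511": "Full-Service Restaurants",
    "611310": "Colleges, Universities, and Professional Schools",
    "441110": "New Car Dealers",
}

def words(text):
    """Lowercase the text and keep word tokens of 3+ letters."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= 3]

# Pre-split each title into its words once.
title_words = {code: words(title) for code, title in naics_titles.items()}

def best_naics(employer, cutoff=0.7):
    """Return the code whose title words fuzzily match the most employer
    words, or None when no word of the employer name matches any title."""
    best_code, best_score = None, 0
    for code, vocab in title_words.items():
        score = sum(
            1
            for w in words(employer)
            if difflib.get_close_matches(w, vocab, n=1, cutoff=cutoff)
        )
        if score > best_score:
            best_code, best_score = code, score
    return best_code

matches = {name: best_naics(name)
           for name in ["State University", "Valley Bank", "Bob's Honda"]}
```

Employer names that contain no industry word at all, such as "Bob's Honda" or "Safeway", would not match any title this way, so a search like this narrows the manual work rather than replacing it.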

wlierman
Lapis Lazuli | Level 10
Thanks. I will work to construct the NAICS.

wklierman
wlierman
Lapis Lazuli | Level 10

Okay.  I have transcribed the NAICS codes into a SAS data set. I have also included the place of employment field from my data that I manually classified.

 

I have attached the two files - both SAS datasets.  The arias_economics_occupations file is the data after I did the manual classification on the 1,290-some records.  I eliminated lines with no response for place of employment (N/A, NA, n/a, na, etc.), so there are about 29,500 or so total obs.

 

I hope this is useful.  Even being able to make a little progress on classification will be a big time saver.

 

Thank you.

 

wklierman

wlierman
Lapis Lazuli | Level 10
Any ideas on the matching/classification question?

I have come across some potentially good SAS papers. Nothing looks straightforward, but some combination of SELECT statements and maybe IF/THEN coding might work.

Thanks for looking into the problem.

wklierman
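The SELECT / IF-THEN idea amounts to an ordered list of keyword rules where the first rule that fires decides the class. A rough Python sketch of that logic follows; the same structure could be written as a SAS SELECT block. The rule list, the keywords in it, and the name `classify_by_rules` are illustrative assumptions, not rules taken from the poster's data.

```python
import re

# Ordered (pattern, sector, subsector) rules. The first matching rule wins,
# much like the first true WHEN branch in a SELECT block.
# \b word boundaries keep "bar" from firing on "Barber".
RULES = [
    (r"\b(bank|credit union)\b", "Services", "Financial Activities"),
    (r"\b(school|university|college|hospital|clinic)\b",
     "Services", "Education_Health Services"),
    (r"\b(restaurant|bar|grill|hotel)\b", "Services", "Leisure & Hospitality"),
    (r"\b(construction|roofing)\b", "Goods", "Construction"),
    (r"\b(manufacturing|mfg)\b", "Goods", "Manufacturing"),
]
COMPILED = [(re.compile(p, re.IGNORECASE), sector, sub)
            for p, sector, sub in RULES]

def classify_by_rules(employer):
    """Return (sector, subsector) from the first matching rule,
    or (None, None) when no rule matches."""
    for pattern, sector, subsector in COMPILED:
        if pattern.search(employer):
            return sector, subsector
    return None, None

examples = ["Central Manufacturing", "State University",
            "Barber Shop", "Bob's Honda"]
classified = {name: classify_by_rules(name) for name in examples}
```

Because rule order matters, more specific patterns would normally go above more general ones, and rows that come back as `(None, None)` are the ones left for manual input.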

Reeza
Super User
Not sure if this is helpful or not, but you can buy this data as well....it is pricey though.
wlierman
Lapis Lazuli | Level 10
I appreciate your input. The self-identified responses are pretty unique so to preserve that I think it'll be a combination of SAS search methods and more manual input.
Data...you got to love it...

Thanks
