BookmarkSubscribeRSS Feed
wlierman
Lapis Lazuli | Level 10

What I need to do is group similar occupations / places of employment together in a contact tracing dataset.

 

I have 124,803 rows and the field I want to "scrape" is Place of Employment.  What I am trying to do is get a picture of the impact of covid-19 on contacts from an "economic" POV.  I want to group together like work / occupations and then label them as one might see in Bureau of Labor Stats e.g., finance, trade, manufacturing, services and so on.  Then hopefully the development team can incorporate these occupational titles into subsequent surveys and make it easier for work / employment analyses in the future.

 

As usual not every obs has  response.  A small example of some of the responses include

 

row                                  Place of Employment

 

10000                                   unemployed

 

10210                                   XYZ elementary school

 

11800                                    Seven - 11

 

23453                                    retired

 

86754                                    Tri-state aviation

 

100256                                   student

 

111876                                   City of Richland

 

123245                                   St. Randall's Hospital

 

and so.  As I mentioned there there are a lot of missing /  no responses. Of the 124,803 obs there are around

35,000 that have something included.  So this will be a somewhat tedious to crawl through even 35,000 plus obs but it is a needed data cleaning exercise. Hopefully it will help provide a little more information  / knowledge for stakeholders and interested parties.

 

My question is: Are there any SAS tips, techniques, coding, and data tricks that could make the "scraping / aggregation" not quite as tedious but more accurate in the end? 

 

Thank you for any ideas, techniques, or processes in advance.

 

wklierman

2 REPLIES 2
PGStats
Opal | Level 21

Have a look at OpenRefine (AKA Google Refine). A free tool for dealing with messy data.

PG
wlierman
Lapis Lazuli | Level 10
Thank you for the suggestion. From what I can see it isn't what I need. It isn't so much that the data is messy - the data is what it is. This seems like it will take more of "brute force" approach (a piece-meal approach) with a lot of sub-setting If / then statements. Maybe some type of 'fuzzy-matching' in addition to spedis or soundex.

I will keep looking. Thanks for the OpenRefine suggestion.

wklierman

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 286 views
  • 0 likes
  • 2 in conversation