BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
wlierman
Lapis Lazuli | Level 10

I have been working on a contact tracing data set to provide any info that would be helpful to Public Health now that SARS-Cov-2 is becoming more endemic. I have classified about 29000 obs using the NAICS Industrial Classification used by the Bureau of Labor Statistics.  It is easy to separate on the basis of either a Good or a Service business.  There is a field called Type_Firm which is a bit more descriptive of the survey and the resultsand I have included just a small number of Type_Firm descriptions in the data window.  This is just for Goods Construction (and not all of it) - Services is even more varied.

CONSTRUCTION_AGRICULTURE_MINING 
CONSTRUCTION_BUILDINGS 
CONSTRUCTION_BUILDINGS 
CONSTRUCTION_BUILDINGS_CNTRACTOR 
CONSTRUCTION_BUILDINGS_COMMERCIAL 
CONSTRUCTION_BUILDINGS_COMMERCIAL_INDUSTRIAL 
CONSTRUCTION_BUILDINGS_CONTRACTOR 
CONSTRUCTION_BUILDINGS_CUSTOM_HOMES  
CONSTRUCTION_BUILDINGS_GENERAL_CONTRACTOR 
CONSTRUCTION_BUILDINGS_GENERAL_CONTRACTORS 
CONSTRUCTION_BUILDINGS_HEAVY_CONSTRUCTION 
CONSTRUCTION_BUILDINGS_HOMES 
CONSTRUCTION_BUILDINGS_HOME_BUILDERS 
CONSTRUCTION_BUILDINGS_MANUFACTURED_HOMES 
CONSTRUCTION_BUILDINGS_METAL_BUILDINGS_CONTRACTORS 
CONSTRUCTION_BUILDINGS_MOBILE_HOMES 
CONSTRUCTION_BUILDINGS_MODULAR_HOMES 
CONSTRUCTION_BUILDINGS_NEW_HOMES_REMODELING 
CONSTRUCTION_BUILDINGS_NEW_HOME_AND_REMODELING 
CONSTRUCTION_BUILDINGS_PREFABRICATED_HOMES 
CONSTRUCTION_BUILDINGS_PREFAB_HOMES 
CONSTRUCTION_BUILDINGS_PUBLIC_PRIVATE_CLIENTS 
CONSTRUCTION_BUILDINGS_REMODELING 
CONSTRUCTION_BUILDINGS_REMODELING_CONTRACTORS 
CONSTRUCTION_BUILDINGS_REMODELING_NEW_CONSTRUCTION 
CONSTRUCTION_BUILDINGS_REMODELS 
CONSTRUCTION_BUILDINGS_REMODELS_RENOVATIONS 
CONSTRUCTION_BUILDINGS_RESIDENTIAL 
CONSTRUCTION_BUILDINGS_RESIDENTIAL_COMMERCIAL 
CONSTRUCTION_BUILDINGS_RESIDENTIAL_COMMERCIAL_REMODELING 
CONSTRUCTION_BUILDINGS_RESIDENTIAL_HOMES 
CONSTRUCTION_BUILDINGS_RESIDENTIAL_MULTI_FAMILY 
CONSTRUCTION_BUILDINGS_RESIDENTIAL_NEW_REMODELS 
CONSTRUCTION_BUILDINGS_RESIDENTIAL_REMODELING 
CONSTRUCTION_BUILDINGS_RESTORATION_SERVICES_WATER_DAMAGE 
CONSTRUCTION_BUILDINGS_ROOFING_CONTRACTORS 
CONSTRUCTION_BUILDINGS_SOCIAL_ASSISTANCE 
CONSTRUCTION_BUILDINGS_STEEL_STRUCTURAL_CONCRETE 
CONSTRUCTION_BUILDING_COMMERCIAL_GENERAL_CONTRACTORS  
CONSTRUCTION_BUILDING_CUSTOM_HOMES 
CONSTRUCTION_BUILDING_MASONRY 
CONSTRUCTION_BUILDING_RESIDENTIAL_HOMES 
CONSTRUCTION_BUILKDINGS_RESIDENTIAL 
CONSTRUCTION_BUUILDINGS_MANUAFCTURED_HOMES 
CONSTRUCTION_CLEANING_SERVICES 
CONSTRUCTION_CONTRACTOR 
CONSTRUCTION_CONTRACTORS 
CONSTRUCTION_EXCAVATION 
CONSTRUCTION_GENERAL_CONTRACTOR 
CONSTRUCTION_LANDSCAPING 
CONSTRUCTION_MATERIALS_SERVICES 
CONSTRUCTION_MOBILE_HOMES 
CONSTRUCTION_NONRESIDENTIAL_BUILDINGS 
CONSTRUCTION_OF_BUILDINGS 
CONSTRUCTION_OTHER 
CONSTRUCTION_REMODELING 
CONSTRUCTION_REMODEL_SERVICES 
CONSTRUCTION_REPAIR_RETROFITTING_TUGS_BARGES 

What approach would you apply so that these titles could be put into only

3 maybe 4 containers.  Just hypothetically, may Construction_Heavy_Construction; Construction_Residential; Construction_Remodeling; Construction_Other?  Maybe a few more containers might be better - and Services would definitely require more than 4 or 5.  And there are typos in some of these descriptions I've noticed.

 

Your suggestions to hopefully making this categorization step less onerous than it looks will be very appreciated.

 

Thanks.

 

wlierman

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
wlierman
Lapis Lazuli | Level 10
Hello,
Thank you for the detailed response.  It is very helpful - it gets me thinking in a whole different direction.
Unfortunately, it is manually intensive but I think in the end will be worth the effort.
I will keep you posted.  I may (likely) have other questions.
wlierman



View solution in original post

4 REPLIES 4
ballardw
Super User

If this were my project I would be tempted to make a pair of custom Informats/Formats to read the data into the categories I defined and then use the format on the categories for most reporting.

I am not going to do a full blown version because that is a lot of stuff

Something like:

Proc format
invalue Firm_type (upcase)
"CONSTRUCTION_AGRICULTURE_MINING"   = 4
"CONSTRUCTION_BUILDINGS"            = 1
"CONSTRUCTION_BUILDINGS_CNTRACTOR"  = 1
"CONSTRUCTION_BUILDINGS_COMMERCIAL" = 2
"CONSTRUCTION_BUILDINGS_CONTRACTOR" = 1
"CONSTRUCTION_BUILDINGS_CUSTOM_HOMES" = 3
"CONSTRUCTION_BUILDINGS_GENERAL_CONTRACTOR" = 1
"CONSTRUCTION_BUILDINGS_GENERAL_CONTRACTORS" = 1
"CONSTRUCTION_BUILDINGS_HEAVY_CONSTRUCTION" = 2
"CONSTRUCTION_BUILDINGS_HOMES"      = 3
"CONSTRUCTION_BUILDINGS_HOME_BUILDERS" = 3
/* repeat as needed*/
;
value Firm_type
1 = "Construction: General"
2 = "Construction: Commercial"
3 = "Construction: Residential"
4 = "Construction: Other"
;

Then use the Informat Firm_type to read the text into new variable that you assign the Firm_type format to.

 

This has a minor advantage of documenting ALL the values actually occurring in your data in one place.

The use of a numeric value  behind the scenes can make some report formats easier to control for order.

 

I don't use NAICS so I'm not sure if you are wanting to assign their codes directly from this or not.

 

Likely more digits would be useful and have the first digits more significant such as 10000 as "Construction" in general and then 10001, 10002, 10003 etc for the sub classes. If you are going to report using the procedures that support MULTILABEL formats, Means/summary and Tabulate it may be worth the effort to make such.

 

Note: This is an intensive manual operation. Either build a data set(s) to use with Proc Format CNTLIN option or find a way to manage the text for Proc Format. One of the things I actually use spreadsheets for is to help write such code as I can enter (or paste) cells with the main value, have another cell on the same line to enter the code and then use a formula to make the column text such as "Value text" = number for the informat.

I would spend more time on the INFORMAT. Once you have that it should be relatively easy to come up with the format. You may find that your initial approach has more values than needed. It is very easy, meaning no change to the data set, to combine categories in the format definition. 

 

 

wlierman
Lapis Lazuli | Level 10
Hello,
Thank you for the detailed response.  It is very helpful - it gets me thinking in a whole different direction.
Unfortunately, it is manually intensive but I think in the end will be worth the effort.
I will keep you posted.  I may (likely) have other questions.
wlierman



ballardw
Super User

One of the things to consider if you use the INFORMAT approach is that you can designate an option of

Other = _error_;

Which will throw an Invalid data message which may be helpful when you have a majority of the fields with an initial code set. I find the option very helpful with a few of my projects where the data suppliers slip in new values occasionally. The option means the LOG shows me that there is now an issue with some values and what the values may be. So I can ask the supplier what is going on (i.e. up date the metadata related to the field).

 

wlierman
Lapis Lazuli | Level 10
Great thanks for that new tip.

wlierman

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 372 views
  • 1 like
  • 2 in conversation