I have been working on a contact tracing data set to provide any info that would be helpful to Public Health now that SARS-Cov-2 is becoming more endemic. I have classified about 29000 obs using the NAICS Industrial Classification used by the Bureau of Labor Statistics. It is easy to separate on the basis of either a Good or a Service business. There is a field called Type_Firm which is a bit more descriptive of the survey and the resultsand I have included just a small number of Type_Firm descriptions in the data window. This is just for Goods Construction (and not all of it) - Services is even more varied.
CONSTRUCTION_AGRICULTURE_MINING
CONSTRUCTION_BUILDINGS
CONSTRUCTION_BUILDINGS
CONSTRUCTION_BUILDINGS_CNTRACTOR
CONSTRUCTION_BUILDINGS_COMMERCIAL
CONSTRUCTION_BUILDINGS_COMMERCIAL_INDUSTRIAL
CONSTRUCTION_BUILDINGS_CONTRACTOR
CONSTRUCTION_BUILDINGS_CUSTOM_HOMES
CONSTRUCTION_BUILDINGS_GENERAL_CONTRACTOR
CONSTRUCTION_BUILDINGS_GENERAL_CONTRACTORS
CONSTRUCTION_BUILDINGS_HEAVY_CONSTRUCTION
CONSTRUCTION_BUILDINGS_HOMES
CONSTRUCTION_BUILDINGS_HOME_BUILDERS
CONSTRUCTION_BUILDINGS_MANUFACTURED_HOMES
CONSTRUCTION_BUILDINGS_METAL_BUILDINGS_CONTRACTORS
CONSTRUCTION_BUILDINGS_MOBILE_HOMES
CONSTRUCTION_BUILDINGS_MODULAR_HOMES
CONSTRUCTION_BUILDINGS_NEW_HOMES_REMODELING
CONSTRUCTION_BUILDINGS_NEW_HOME_AND_REMODELING
CONSTRUCTION_BUILDINGS_PREFABRICATED_HOMES
CONSTRUCTION_BUILDINGS_PREFAB_HOMES
CONSTRUCTION_BUILDINGS_PUBLIC_PRIVATE_CLIENTS
CONSTRUCTION_BUILDINGS_REMODELING
CONSTRUCTION_BUILDINGS_REMODELING_CONTRACTORS
CONSTRUCTION_BUILDINGS_REMODELING_NEW_CONSTRUCTION
CONSTRUCTION_BUILDINGS_REMODELS
CONSTRUCTION_BUILDINGS_REMODELS_RENOVATIONS
CONSTRUCTION_BUILDINGS_RESIDENTIAL
CONSTRUCTION_BUILDINGS_RESIDENTIAL_COMMERCIAL
CONSTRUCTION_BUILDINGS_RESIDENTIAL_COMMERCIAL_REMODELING
CONSTRUCTION_BUILDINGS_RESIDENTIAL_HOMES
CONSTRUCTION_BUILDINGS_RESIDENTIAL_MULTI_FAMILY
CONSTRUCTION_BUILDINGS_RESIDENTIAL_NEW_REMODELS
CONSTRUCTION_BUILDINGS_RESIDENTIAL_REMODELING
CONSTRUCTION_BUILDINGS_RESTORATION_SERVICES_WATER_DAMAGE
CONSTRUCTION_BUILDINGS_ROOFING_CONTRACTORS
CONSTRUCTION_BUILDINGS_SOCIAL_ASSISTANCE
CONSTRUCTION_BUILDINGS_STEEL_STRUCTURAL_CONCRETE
CONSTRUCTION_BUILDING_COMMERCIAL_GENERAL_CONTRACTORS
CONSTRUCTION_BUILDING_CUSTOM_HOMES
CONSTRUCTION_BUILDING_MASONRY
CONSTRUCTION_BUILDING_RESIDENTIAL_HOMES
CONSTRUCTION_BUILKDINGS_RESIDENTIAL
CONSTRUCTION_BUUILDINGS_MANUAFCTURED_HOMES
CONSTRUCTION_CLEANING_SERVICES
CONSTRUCTION_CONTRACTOR
CONSTRUCTION_CONTRACTORS
CONSTRUCTION_EXCAVATION
CONSTRUCTION_GENERAL_CONTRACTOR
CONSTRUCTION_LANDSCAPING
CONSTRUCTION_MATERIALS_SERVICES
CONSTRUCTION_MOBILE_HOMES
CONSTRUCTION_NONRESIDENTIAL_BUILDINGS
CONSTRUCTION_OF_BUILDINGS
CONSTRUCTION_OTHER
CONSTRUCTION_REMODELING
CONSTRUCTION_REMODEL_SERVICES
CONSTRUCTION_REPAIR_RETROFITTING_TUGS_BARGES
What approach would you apply so that these titles could be put into only
3 maybe 4 containers. Just hypothetically, may Construction_Heavy_Construction; Construction_Residential; Construction_Remodeling; Construction_Other? Maybe a few more containers might be better - and Services would definitely require more than 4 or 5. And there are typos in some of these descriptions I've noticed.
Your suggestions to hopefully making this categorization step less onerous than it looks will be very appreciated.
Thanks.
wlierman
If this were my project I would be tempted to make a pair of custom Informats/Formats to read the data into the categories I defined and then use the format on the categories for most reporting.
I am not going to do a full blown version because that is a lot of stuff
Something like:
Proc format invalue Firm_type (upcase) "CONSTRUCTION_AGRICULTURE_MINING" = 4 "CONSTRUCTION_BUILDINGS" = 1 "CONSTRUCTION_BUILDINGS_CNTRACTOR" = 1 "CONSTRUCTION_BUILDINGS_COMMERCIAL" = 2 "CONSTRUCTION_BUILDINGS_CONTRACTOR" = 1 "CONSTRUCTION_BUILDINGS_CUSTOM_HOMES" = 3 "CONSTRUCTION_BUILDINGS_GENERAL_CONTRACTOR" = 1 "CONSTRUCTION_BUILDINGS_GENERAL_CONTRACTORS" = 1 "CONSTRUCTION_BUILDINGS_HEAVY_CONSTRUCTION" = 2 "CONSTRUCTION_BUILDINGS_HOMES" = 3 "CONSTRUCTION_BUILDINGS_HOME_BUILDERS" = 3 /* repeat as needed*/ ; value Firm_type 1 = "Construction: General" 2 = "Construction: Commercial" 3 = "Construction: Residential" 4 = "Construction: Other" ;
Then use the Informat Firm_type to read the text into new variable that you assign the Firm_type format to.
This has a minor advantage of documenting ALL the values actually occurring in your data in one place.
The use of a numeric value behind the scenes can make some report formats easier to control for order.
I don't use NAICS so I'm not sure if you are wanting to assign their codes directly from this or not.
Likely more digits would be useful and have the first digits more significant such as 10000 as "Construction" in general and then 10001, 10002, 10003 etc for the sub classes. If you are going to report using the procedures that support MULTILABEL formats, Means/summary and Tabulate it may be worth the effort to make such.
Note: This is an intensive manual operation. Either build a data set(s) to use with Proc Format CNTLIN option or find a way to manage the text for Proc Format. One of the things I actually use spreadsheets for is to help write such code as I can enter (or paste) cells with the main value, have another cell on the same line to enter the code and then use a formula to make the column text such as "Value text" = number for the informat.
I would spend more time on the INFORMAT. Once you have that it should be relatively easy to come up with the format. You may find that your initial approach has more values than needed. It is very easy, meaning no change to the data set, to combine categories in the format definition.
One of the things to consider if you use the INFORMAT approach is that you can designate an option of
Other = _error_;
Which will throw an Invalid data message which may be helpful when you have a majority of the fields with an initial code set. I find the option very helpful with a few of my projects where the data suppliers slip in new values occasionally. The option means the LOG shows me that there is now an issue with some values and what the values may be. So I can ask the supplier what is going on (i.e. up date the metadata related to the field).
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.