07-16-2017 11:12 PM
I have 7 similar files (different years of a survey) that I'm working to clean up so I can merge them. Around 80% of the columns/variables present in each file are common across all files. All of the files are in STC format which all means they have their own formats catalogues. However, these files were originally assembled by different people, and so there's a ton of minor discrepancies I need to work out between the files before they'll be clean enough to merge. This includes some differences in the user defined format labeling that I will need to resolve, as well as a number of variables that are incorrectly formatted as character rather than numeric variables in some of the files.
So I need to figure out where the labeling and type discrepancies exist between the 7 files, and bring them all up to standard before I try and merge them. I've got some techniques worked out for this, but I'm wondering if they're really the most efficient.
My questions are as follows:
1. What's the easiest way to bring all 7 of my data files into a common library without the formats catalogue being overwritten by each new STC file I import? Is there a way to declare a different formats catalogue for each different dataset within the same library? Currently I'm using a proc format cntlout statement after each proc cimport to move the formats for each of the datasets into their own dataset, but this seems kind of clunky and unlinks the formats from the data itself.
2. What's the easiest way to compare formats catalogs across the seven files to see where the labeling discrepancies are present? Right now I'm using a very convoluted proc sql query that grabs all the variables present in all of the files into a table, and then joins on all 7 of my cntlout format datasets where the format name and value are the same but the label is different. I'm not convinced it's working correctly, and it also can't really help me when a variable should be numeric type in all files but is numeric in some and character in others (causing the formats to have different names). If needed, I can provide my Proc SQL code for this.
3. Using Proc SQL and dictionary.columns, I was able to generate a table showing me the variables that have mismatched types between the files. I'm noticing in some years, the survey actually used free text responses instead of numeric coding, while in most other cases of mismatch, the person who put the original files together just accidentally left numeric columns as character type. So what I need is a quick way to scan through each of the columns with type mismatches, and determine whether they actually have any non-numeric characters present or whether they can be recoded to numeric without worry. I know there's an ANYALPHA function, but it seems intended for a case-by-case basis, while I'd like to find a way to put it into a datastep so that SAS can determine whether or not a variable has any alphabetical characters in it and either alter it to a numeric format or append "_text" to its column name.
For the top two questions I'm looking more for general guidance of "that's how you'd do it" or "you should look into this procedure/function instead", while for the third I'm looking for a little more of a direct technical response.
I'm still fairly new to SAS so I appreciate the help. I understand that these questions may be somewhat abstract on their own, so I'm happy to provide more detail upon request.
07-17-2017 01:33 AM
You have a messy business to manage, and it's going to be messy.
You seem to be doing fine from what you describe, and it's difficult to give further advice while being so removed from the data or the process.
In general, I would compare the outputs of proc contents to assess the metadata, and use proc sql with summary functions to asess the data.
For example, to see "whether or not a variable has any alphabetical characters in it" you can use
select max(anyalpha(VAR)) from TAB;
Another (better) way to test for valid numbers would be to use an informat, something like:
if missing(input(VAR,?? dollar32.)) & not missing(VAR);