I am learning about leveraging a data dictionary to write SAS code for formats, labels, etc. Interesting stuff. There are two cases.
For one study (that I posted about on here not too long ago) the data are all character when we bring them in, although they won't end up that way since some are numbers or dates in spirit. So what we do is have info in the data dictionary telling what type it is for edit check purposes, as in "Option List" (1=Yes, 2=No), or "Free Text" (can be anything) or "Free Text Number" (can be any number but we can edit check it with the min and max allowable value that is part of the data dictionary), etc. There are more but you get the idea. And we also use the data dictionary to write labels and formats, etc.
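To give the idea, here is a rough sketch of how the option lists can become formats via CNTLIN (dict, vartype, varname, code, and decode are made-up stand-ins for our actual dictionary):

data ctrl;
  set dict(where=(vartype='Option List'));  /* one row per code per variable */
  type    = 'C';                            /* character formats, since the data come in as text */
  fmtname = cats(varname, '_F');            /* e.g. SEX -> SEX_F */
  start   = code;                           /* e.g. '1'   */
  label   = decode;                         /* e.g. 'Yes' */
  keep fmtname type start label;
run;

proc format cntlin=ctrl;
run;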
So that's all good. In that study, we bring the data in from a database named Maria (I don't know if that's our name for it or a brand name), and it is character from the start.
But in a new study we're doing we are bringing in data from .CSV files. As you know, SAS guesses about data attributes when you bring data in from .CSV files. And you can use the data dictionary to help SAS guess correctly. Okay, fine. But the thing is, I don't see how some edit checks will be possible that way.
For instance, if something is going to end up as a date variable and it contains an error, such as 06/22/217 (which obviously should be 06/22/2017), then when we bring it into SAS it will immediately become a missing value. We will not know that it was 06/22/217, and thus we won't be able to edit check it or tell the site to correct it. That is just one example, but there are others.
Of course one way to get around that is to bring it all in as character and then cope with it, as described for the first study in this post. But that raises the question of how to tell SAS to bring everything in from a .CSV file while forcing every variable to be character. I don't know how to do that. And more generally, considering how I've described things above, is that even the right thing to do?
It seems to me that the optimal thing would be to simply force the people entering the data to enter it in the correct type. So if something is a date, you have to enter it as a date, and if you make a typo so that it's not a date then the system won't take it. But I get the impression that that's not an option for our operation, and maybe it's not optimal anyway.
I hope this makes sense. Any help is greatly appreciated.
Do you have some sample data you can post? And are you using SAS Enterprise Guide? SAS Display Manager? SAS Studio? Are you coding your own data step or are you using an Import? Can you post your SAS code?
Jim
Does not sound that hard, as long as the rules for generating checks from the metadata are clear.
First thing to do is figure out what checks you want to do for the given metadata restrictions. Then figure out how to code those checks in SAS. Once that is done you can finally work on writing some SAS code that will generate that code from the metadata.
Note that SAS does not guess how to read a CSV unless you ask it to guess by using PROC IMPORT or some other guessing method. If you write a data step to read the CSV file, then how the data is read is totally determined by the code you write.
It sounds like you are using PROC IMPORT. If you want more control over how SAS reads in your data, try switching to a data step.
data want;
  infile 'path/to/your/file/file.csv' dsd dlm=',' truncover;  /* DSD handles quoted fields and empty values */
  input var1 $ date $;
run;
Regardless of which method you select, SAS will reject dates that it cannot interpret as dates. If you want to retain what was in the field, you're better off reading the data in as text and creating a new variable to hold the numeric date. You can then build if-then rules around the dates that need cleaning up: any row where the text is non-missing but the new numeric column is missing is a date that didn't read.
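A minimal sketch of that pattern, assuming the text was already read into a variable called date_txt (all names here are hypothetical; the ?? modifier on INPUT suppresses the invalid-data log notes, since the flagging happens in the data instead):

data dates_checked;
  set raw;                                    /* data already read in as text  */
  date_num = input(date_txt, ?? mmddyy10.);   /* ?? = no log notes, no _ERROR_ */
  format date_num date9.;
  /* text present but unreadable as a date: candidate for a site query */
  if not missing(date_txt) and missing(date_num) then bad_date_flag = 1;
run;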
I'm not using SAS Enterprise or SAS Studio or any of that. I can either use PROC IMPORT, in which case SAS will guess, or I can write code (or copy and paste the code from the Log window that SAS generates when you use PROC IMPORT and then tweak that as necessary).
I was doing the latter with the data dictionary info and then using that info, along with a DATA _NULL_ step, to write SAS code to bring in the actual study data. But the data dictionary says that "Date_of_Procedure," for instance, is a date (or numeric if you prefer), and if I bring it in like that then any responses of "June 2017" will get converted to missing instead of corrected to a good date.
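To illustrate what I mean, a rough sketch of the code generation (dict, varname, and informat are made-up names for our dictionary dataset and its columns):

filename gen temp;

data _null_;
  set dict end=last;
  file gen;
  if _n_ = 1 then put 'input';
  put '  ' varname ':' informat;   /* e.g. Date_of_Procedure :mmddyy10. */
  if last then put ';';
run;

The generated INPUT statement then gets pulled into the step that reads the CSV, and that's exactly where "June 2017" becomes missing.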
I guess we could do it like below but just have a $ after every single variable, although that's somewhat of a pain when there are a lot of variables; still, it gets the job done. If I do that, do I have to worry about the length? I mean, maybe if you're reading from a .CSV file and it's a character variable then you just need a $ and don't need a number after it, because it knows to take data up until it sees the next comma.
data want;
  infile 'path/to/your/file/file.csv' dsd dlm=',' truncover;
  input var1 $ date $;
run;
The $ in the INPUT statement is only of any value when you have not yet defined the variable to the data step compiler. In that case it will cause the variable to be created as character with a length of 8 instead of the default of numeric. If the variable is already defined before the INPUT statement, there is no need to add the $ in the INPUT statement.
As to what length to use for the variable or whether to read them as character, date, time, datetime or general numbers it depends on what types of values will be in the file.
For character variables if your metadata tells you the values will not be any longer than 20 bytes you can define your variable as $20. But if the metadata instead means that the values could be up to 20 CHARACTERS long and you are supporting UTF-8 encoding then you should define the variable as longer since some UTF-8 characters can use more than one byte.
For DATE or other special types, compare the types of values you will get to the available SAS informats (or perhaps make your own informat). It might be safest to read the value initially as character and then use the INPUT() function to convert the string into the value you want to store.
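For example, a hedged sketch of that pattern (the file path, names, and lengths are made up; in practice the data dictionary would supply them):

data want;
  infile 'path/to/your/file/file.csv' dsd dlm=',' truncover firstobs=2;
  length var1 $20 date_txt $20;               /* defined before INPUT, so no $ needed */
  input var1 date_txt;
  proc_date = input(date_txt, ?? mmddyy10.);  /* convert the string to a SAS date     */
  format proc_date date9.;
run;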
@n6 wrote:
I'm not using SAS Enterprise or SAS Studio or any of that. I can either use PROC IMPORT, in which case SAS will guess, or I can write code (or copy and paste the code from the Log window that SAS generates when you use PROC IMPORT and then tweak that as necessary).
I was doing the latter with the data dictionary info and then using that info, along with a DATA _NULL_ step, to write SAS code to bring in the actual study data. But the data dictionary says that "Date_of_Procedure," for instance, is a date (or numeric if you prefer), and if I bring it in like that then any responses of "June 2017" will get converted to missing instead of corrected to a good date.
And what does your data dictionary say the layout of that date variable should be?
If your data step provides an informat and the data does not match you will get an error message that will let you provide feedback to the source.
Consider:
data junk;
  infile datalines dlm=',' dsd;
  informat x best12. y date9.;
  format y date9.;
  input x y;

datalines;
1234,23Jan2017
444,June 2017
;
When I run that code then I get a log like this:
47   data junk;
48     infile datalines dlm=',' dsd;
49     informat x best12. y date9.;
50     format y date9.;
51     input x y;
52
53   datalines;

NOTE: Invalid data for y in line 55 5-13.
RULE:      ----+----1----+----2----+----3----+----4----+----5---
55         444,June 2017
x=444 y=. _ERROR_=1 _N_=2
NOTE: The data set WORK.JUNK has 2 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds
The line (or lines) after RULE: show the actual data line read, wrapped to fit the log if it is long enough, followed by the values of all the variables at the time the error occurred (at the INPUT statement).
So if X above had been an identifier, I would know to go to that identifier's source and tell them their date value is kafluey.
If you have LOTS of these errors you might want to increase the setting of the system option ERRORS to increase the number of error messages displayed.
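For example (ERRORS defaults to 20 in most installations):

options errors=100;   /* print the full invalid-data dump for up to 100 errors per step */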
And you almost certainly do not want to just put a bare $ after every single variable, since the maximum length any of those variables would get is 8 characters, which means your "June 2017" would come in as "June 201".
You've hit on the problem. In the data dictionary it's a date, but if there is an error in the data such that SAS can't make it a date, SAS will make it a missing value instead. Thus we will get a missing value and never know it was meant to be a date, unless of course we look in the SAS log to see that message. The problem is that in a big study, using the SAS log as an edit check mechanism doesn't work well. It's easier to have all the data in SAS and edit check it there, and that means you want to be sure you have all the truly raw data in SAS in the first place.
If I make them $ then they'll be length 8 by default. Okay, that's no good, but what if I make them all $500 (or some other large number that will be big enough for anything)? Having all those unnecessarily long variables will take up a lot of space, but at least we'd be sure we have all the data. And then we would clean those, changing the final length to 20, say, if we wanted that for a variable, changing another variable to numeric, changing another to a date if we like, etc.
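Maybe something like this (variable names made up; in practice we'd generate the list from the data dictionary):

data raw_all_char;
  infile 'path/to/your/file/file.csv' dsd dlm=',' truncover firstobs=2;
  informat var1-var50 $500.;   /* force every column in as long character */
  input var1-var50;
run;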
In summary, it seems like in order to edit check the raw data in SAS, you have to get the raw data into SAS. And if the raw data is character then we have to get all that raw character data into SAS. Does that sound sensible?
@n6 wrote:
But in a new study we're doing we are bringing in data from .CSV files. As you know, SAS guesses about data attributes when you bring data in from .CSV files. And you can use the data dictionary to help SAS guess correctly. Okay, fine. But the thing is, I don't see how some edit checks will be possible that way.
SAS does not guess; you tell it to guess by using Proc Import.
If the CSV file has any documentation at all then you write your own data step to read the file with intended characteristics. Once you have a working data step then you just change the name of the input file and output data set. It is possible to write a data step that will read all of the CSV files of a common name in a folder at one time as well.
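For instance, a sketch of a multi-file read, assuming the files share one layout and a one-line header row (the paths and variable names are made up):

data all_sites;
  length source_file source $256 subj $20 date_txt $20;
  infile 'path/to/data/site_*.csv' dsd dlm=',' truncover
         firstobs=2 filename=source_file eov=newfile;
  input @;               /* hold the line so it can be inspected first */
  if newfile then do;    /* first line of each file after the first... */
     newfile = 0;
     delete;             /* ...is a header row, so skip it             */
  end;
  input subj date_txt;
  source = source_file;  /* copy so the file name lands in the output  */
run;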
I read dozens of CSV files. Part of my data checking is proper use of informats. If the data is supposed to appear in the CSV as a date like 03/27/2020 then use an MMDDYY10 informat. If there is a problem with the data other than missing, such as a "month" of 27 then you get Invalid data messages.
I use lots of custom informats to check on variables that are supposed to have limited number of values. A dummy example:
proc format;
  invalue $somecode
    '01','02','03','04','99' = _same_
    other                    = _error_
  ;
run;

data read;
  infile <stuff>;
  informat codevar $somecode. <other informats>;
  input codevar <other variables on input statement>;
run;
In the above example, if the variable codevar encounters any value other than those in the list, then an invalid-data message will appear in the log.
If you have fields that are numeric but may have a code value such as NA or UNK, you can read the code values into either a specific numeric value, missing, or a special missing:
proc format;
  invalue numcodes
    'UNK' = .
    'NC'  = .N
    'MAX' = 100000
  ;
run;

data example;
  infile datalines dlm=',' dsd;
  informat v1 v2 v3 numcodes.;
  input v1 v2 v3;
datalines;
1,2,MAX
3,,24.56
UNK,18,NC
;
The example above reads the specified strings into the provided values and falls back to standard numeric reading for everything else. The N that you would see for the V3 value is a special missing that lets you know the value was intentionally not collected for some reason. Typically a matching custom format would be created to display more interesting text than N. The .N value, however, will not be used in any numeric calculation.
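A sketch of such a display format (the label text is made up):

proc format;
  value collfmt
    .N    = 'Not collected'
    other = [best12.]   /* print ordinary numbers as usual */
  ;
run;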
Lots of things can be done with INFORMATS, though fixing bad date formats isn't one of them.
Writing your own data step to read values is the proper thing to do, especially if you have multiple files of the same structure.
The data step also lets you do things like write notes, warnings, or errors to the log if a specific variable is missing, out of range, or inconsistent (purchase date after final payment date, for example).
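For example, a minimal PUTLOG sketch (variable names are hypothetical):

data checked;
  set payments;
  if purchase_date > final_payment_date then
     putlog 'ERROR: purchase after final payment: '
            subject_id= purchase_date= final_payment_date=;
run;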
The data step is also a good place to provide things like variable labels. I really do not like seeing a variable with a name like "ex_cot_bbdf_pdq" when the variable represents "Exchange date with Company T" or some such.
Okay, but it seems to me that if we do that then our edit checking is happening in the Log window at the point of bringing the data into SAS, and that seems very clunky. In the first study I mentioned, where it is all character data as soon as we bring it in, we look at each value; if it's clean we output it to a clean dataset, and if it's not then we output it to an error dataset and figure out why it's not clean and/or query the site. We can generate error reports this way.
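Roughly like this (names made up):

data clean errors;
  set raw_all_char;
  proc_date = input(date_txt, ?? mmddyy10.);  /* ?? suppresses the log notes */
  if missing(date_txt) or not missing(proc_date) then output clean;
  else output errors;                         /* text present but unreadable */
run;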
If we're bringing in a chunk of data one time and that's it then I can see using the Log window like you describe but we have larger, longer studies where we bring in data again and again over time.