Dear Community;
I need to read a text file that is a codebook to create an SAS dataset. This SAS dataset only contains 4 variables, var_name, Label, type, and Explanation. A short version of the text file is attached below (the number of variables is more than 300). The want dataset (shown in Excel, but .sas7bdat is desired) is also included. Thank you so much for your time.
Is your file representative of the actual file(s)? It does appear pretty organized to me.
If all of the variables have all of the elements (and the case is consistent) then you could just use @'text' input style to jump to the location of the next metadata field.
So you might try something like this:
data codebook;
infile 'c:\downloads\have.txt' lrecl=132 pad ;
if _n_=1 then do;
input @'Form name:' formname $80.
@'Form code:' formcode $10.
@'SAS datafile name:' formsasname $10.
;
retain form: ;
end;
varnum+1;
input @'var name:' varname $32.
@'Label:' label $100.
@'type:' type $20.
@'range:' range $30.
@'length:' length $20.
@'format:' format $32.
@'missing code:' missingcode $50.
;
run;
Result
If some of the elements (like LABEL for example) is not there then you might need to work a little harder. Basically read each line and figure out if it one of interest and then extract the value you want from it. Retain the values and then detect when a variable's definition ends (looks like that is signaled by a line of hyphens) and output the record and clear the retained variables.
This works but I had to use the ancient CAPS option, because I don't know how to get input @'String' input to ignore case.
options caps=1;
data want;
infile cards;
input @'VAR NAME: ' var_name:$32.
@'LABEL: ' label &$128.
@'TYPE: ' type :$8.
@'EXPLANATION: ' Explanation &$128.
;
cards4;
Clinical Trial - Data Base Codebook
-----------------------------------------------------
Form name:QUESTIONNAIRE- PART B
Form code: BL
SAS datafile name: bu
NOTE: First version - v40.0 - codebook
Versions:
v41.0 - Second version
Created:
Uploaded:
Codebook last updated:
---------------------------------------------------------------------------------
var name: HRBv1
Label: id
---------------------------------------------------------------------------------
type: numeric
range: none
length:
format:
missing code:
Explanation: a number given by the system.
Source: Automatically entered from computerized system
name of variable in Dump file: id
-------------------------------------------------------------------------------
var name: HRBv2
Label: qs_checklist_name
-------------------------------------------------------------------------------
type: string
length: 15
format: varchar
missing code:
explanation: information about the visit: bl, number.
Source: Automatically entered from computerized system
name of variable in Dump file: qs_checklist_name
-------------------------------------------------------------------------------
var name: HRBv3
Label: pdf_file_size
-------------------------------------------------------------------------------
type: numeric
range:
length: 20
format: varchar
missing code:
explanation: size of the pdf file in bytes.
Source: Automatically entered from computerized system
name of variable in Dump file: pdf_file_size
-------------------------------------------------------------------------------
var name: HRBv4
Label: pdf_file_status
-------------------------------------------------------------------------------
type: numeric (byte)
range: [0,2]
length: 1
format: char
missing code: 0
explanation: status of the pdf file: 0=blank, 1=draft, 2=final.
Source: Automatically entered from computerized system
name of variable in Dump file: pdf_file_status
-------------------------------------------------------------------------------
var name: HRBv5
Label: user_pin
-------------------------------------------------------------------------------
type: numeric (int)
range:
length: 4
format: integer
missing code:
explanation: the user's pin number.
Source: Automatically entered from computerized system
name of variable in Dump file: user_pin
;;;;
run;
Might explain to @CHL0320 that an INPUT statement using : Input @'some string value' moves the pointer in the input buffer to the next position where the value of 'some string value' exists and reads from there to get the value of the variable.
Do you need "option caps=1"? It looks like the @ strings are consistent throughout the document. So if you know the consistent case for each search string, couldn't it be as below?
input @'var name: ' var_name:$32.
@'Label: ' label &$128.
@'type: ' type :$8.
@'Explanation: ' Explanation &$128.
;
Without resorting to the CAPS option?
Something to handle case was needed. In the posted example the first two sets have "explanation" and "Explanation" .
Which would make me extremely cautious about the remainder of the file and the key words.
@mkeintz wrote:
Do you need "option caps=1"? It looks like the @ strings are consistent throughout the document. So if you know the consistent case for each search string, couldn't it be as below?
input @'var name: ' var_name:$32. @'Label: ' label &$128. @'type: ' type :$8. @'Explanation: ' Explanation &$128. ;
Without resorting to the CAPS option?
Is your file representative of the actual file(s)? It does appear pretty organized to me.
If all of the variables have all of the elements (and the case is consistent) then you could just use @'text' input style to jump to the location of the next metadata field.
So you might try something like this:
data codebook;
infile 'c:\downloads\have.txt' lrecl=132 pad ;
if _n_=1 then do;
input @'Form name:' formname $80.
@'Form code:' formcode $10.
@'SAS datafile name:' formsasname $10.
;
retain form: ;
end;
varnum+1;
input @'var name:' varname $32.
@'Label:' label $100.
@'type:' type $20.
@'range:' range $30.
@'length:' length $20.
@'format:' format $32.
@'missing code:' missingcode $50.
;
run;
Result
If some of the elements (like LABEL for example) is not there then you might need to work a little harder. Basically read each line and figure out if it one of interest and then extract the value you want from it. Retain the values and then detect when a variable's definition ends (looks like that is signaled by a line of hyphens) and output the record and clear the retained variables.
Thank you for all your help. There is a problem with the consistency in variable cases. For example, I need to have a variable "explanation" to be included but some of them are "Explantation".
@CHL0320 wrote:
Thank you for all your help. There is a problem with the consistency in variable cases. For example, I need to have a variable "explanation" to be included but some of them are "Explantation".
That is what @data_null__ 's use of the system option CAPS handle. It translates all characters to upper case so the comparison is consistent.
Or talk to whoever prepares that file and fix why the case changes.
@CHL0320 wrote:
Thank you for all your help. There is a problem with the consistency in variable cases. For example, I need to have a variable "explanation" to be included but some of them are "Explantation".
If you need to read those fields (why?) then you must go with the more complicate approach.
Something like this that reads the tag text and cleans it up before deciding what metadata variable it means.
data codebook;
infile 'c:\downloads\have.txt' dsd dlm=':' truncover end=eof;
length formname $20 formcode $10
varnum 8 varname $32
/* Define the other variables */
explanation $200
;
retain varnum 1 formname -- explanation;
input tag :$200. @ ;
tag=lowcase(compress(tag,'20A0090a0d'x));
drop tag ;
if index(_infile_,':') then select (tag);
when ('varname') do;
if varname ne ' ' then do;
output;
varnum+1;
call missing(of varname--explanation);
end;
input varname $200.;
end;
when ('formname') input formname $200.;
when ('formcode') input formcode $200.;
when ('explanation') input explanation $200.;
/* fill in the other possible variables */
otherwise ;
end;
if eof and varname ne ' ' then output;
run;
Result
If it is possible that a metadata field value could span multiple lines you will need more than just the simple INPUT statement in the WHEN clause for that metadata field.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.