I'm parsing a file and trying to associate the entries in it with one of two categories, complex or simple, and then ultimately save them off in separate output files.
My test data/script is as follows:
data have;
infile cards;
input msg $ 01-80;
cards;
- 23025 110400 LSCHD,LIST=SCHD,JOB=HAWKEYE
FRED NDAY=250
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
- 23025 110400 LSCHD,LIST=SCHD,JOB=HOTLIPS
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
- 23025 110400 LSCHD,LIST=SCHD,JOB=KLINGER
WILMA DAY=213
FRED DAY=051
FRED DAY=051
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
- 23025 110400 LSCHD,LIST=SCHD,JOB=BJ
BARNEY DAY=213
FRED DAY=051
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
- 23025 110400 LSCHD,LIST=SCHD,JOB=CHARLES
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
;
run;
data want;
set have;
if msg =: '-' then do;
var1 = (substr(msg, 41,08));
keep var1;
end;
else if msg =: 'FRED' or
msg =: 'WILMA' or
msg =: 'BARNEY' then do;
file complex;
put @1 var1;
end;
else do;
file simple;
put @1 var1;
end;
;
I use the lines starting with a dash (-) to get var1. If a subsequent line starts with one of the specific values (FRED, WILMA, BARNEY), then var1 gets classified as complex. Otherwise var1 gets classified as simple.
Although the log indicates that records are written to both files, they are blank records.
NOTE: 6 records were written to the file COMPLEX.
NOTE: 5 records were written to the file SIMPLE.
My Complex file should consist of:
HAWKEYE
KLINGER
BJ
While the Simple file should consist of:
HOTLIPS
CHARLES
Appreciate the assistance.
The way I see it, none of your strings is longer than 40 bytes, so your SUBSTR starting at position 41 will only fetch blanks.
Not sure if it's how I represented the data here, but the substr does work. Running this finds the entries I'm after:
data want;
set have;
if msg =: '-' then do;
var1 = (substr(msg, 41,08));
keep var1;
end;
;
proc print data=want;
returns this:
Obs    var1
  1    HAWKEYE
  2
  3
  4    HOTLIPS
  5
  6    KLINGER
  7
  8
  9
 10
 11    BJ
 12
 13
 14
 15    CHARLES
 16
It really helps if you LOOK AT your data to see what is happening. (Maxim 3, Know your Data)
For all records where NOT msg=:'-' (these are the ones that will potentially be written out, the records that begin with a dash never get to the rest of the code), var1 is always blank.
Right, and it's probably just my ignorance, but I'm not understanding why that is.
@serge68 wrote:
Right, and it's probably just my ignorance, but I'm not understanding why that is.
The simple answer is that you wrote code which produces blank VAR1. All records which begin with '-' get var1 computed, records that do not begin with '-' will not have a VAR1 computed and only the records that do not begin with '-' will be sent to the output files.
How to fix it? This is untested, I don't know if it gets the desired results, but you can test it ... add a RETAIN statement so that the value in VAR1 is carried forward to the next record. First few lines:
data want;
retain var1;
set have;
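A minimal sketch of what the RETAIN changes, assuming a cut-down copy of the HAVE data set. The data set name demo is made up for illustration, and SCAN is swapped in for the SUBSTR only so the example does not depend on the column where the job name sits. A variable created inside the data step (as opposed to one read by SET) is reset to missing at the start of every iteration unless it is retained:

```sas
/* Cut-down copy of the test data from the question */
data have;
  infile cards truncover;
  input msg $ 1-80;
cards;
- 23025 110400 LSCHD,LIST=SCHD,JOB=HAWKEYE
FRED NDAY=250
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
;
run;

data demo;
  length var1 $ 8;
  retain var1;   /* without this, var1 is reset to missing on every iteration */
  set have;
  /* Assumption: the job name is whatever follows the last '=' on a dash line */
  if msg =: '-' then var1 = scan(msg, -1, '=');
  put _n_= var1=;   /* write the carried-forward value to the log */
run;
```

With the RETAIN in place, the log line for each record should show the job name from the most recent dash line. Commenting out the RETAIN should leave var1 blank on the FRED and SLIG records again.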
Thanks, the retain is helping.
My complex file looks good.
However, the simple file gets all var1 entries added to it (both the simple and the complex ones). I'm struggling with how to keep the complex entries out of it.
Is the source the TEXT in your first data step? Or do you only have the actual DATASET as the source?
If it is TEXT then this looks like a simple data step to read. Just use a bare INPUT statement so you can check if the line starts with a hyphen.
data want;
infile text truncover ;
input @;
if _infile_ =: '-' then do;
* statements to handle the lines that start with hyphen;
end;
else do;
* statements to handle the other lines ;
end;
run;
Your rules don't really clearly describe what output should come from
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
- 23025 110400 LSCHD,LIST=SCHD,JOB=KLINGER
WILMA DAY=213
FRED DAY=051
FRED DAY=051
You have both WILMA and FRED without reading a new var1. Your rules did not state "first only following" or "last" or some other rule as to exactly which of these triggers the output.
This matches your desired output for the given example text:
data complex (keep=var1) simple (keep=var1);
  infile cards truncover;
  length var1 word $ 15;
  retain var1;
  input @;
  if _infile_ =: '-' then do;
    input @'JOB=' var1;
  end;
  else if not missing(var1) then do;
    word = scan(_infile_, 1);
    if word in ('FRED' 'WILMA' 'BARNEY') then do;
      output complex;
      call missing(var1);
    end;
    else do;
      output simple;
      call missing(var1);
    end;
  end;
cards;
- 23025 110400 LSCHD,LIST=SCHD,JOB=HAWKEYE
FRED NDAY=250
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
- 23025 110400 LSCHD,LIST=SCHD,JOB=HOTLIPS
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
- 23025 110400 LSCHD,LIST=SCHD,JOB=KLINGER
WILMA DAY=213
FRED DAY=051
FRED DAY=051
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
- 23025 110400 LSCHD,LIST=SCHD,JOB=BJ
BARNEY DAY=213
FRED DAY=051
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
- 23025 110400 LSCHD,LIST=SCHD,JOB=CHARLES
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
;
run;
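The first INPUT statement, the one with only the trailing @, is there to populate the automatic variable _infile_, which holds the current input line. The trailing @ also holds that line so the second INPUT statement can read from it.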
If you have not seen it before, the =: operator is "begins with", and the @'text string' on an INPUT statement says "go to the position just after this string and start reading". So there is no need to hard code the column where JOB= occurs.
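A small sketch of those two features in isolation, using two lines from the example text. The variable names starts_with_dash and job are made up for this illustration:

```sas
data _null_;
  infile cards truncover;
  length job $ 15;
  input @;                                    /* load the line into _infile_ and hold it */
  starts_with_dash = (_infile_ =: '-');       /* 1 when the line begins with a dash, else 0 */
  if starts_with_dash then input @'JOB=' job; /* move past the text JOB= and read one word */
  put starts_with_dash= job=;                 /* show both values in the log */
cards;
- 23025 110400 LSCHD,LIST=SCHD,JOB=HAWKEYE
FRED NDAY=250
;
run;
```

The dash line should report the flag as 1 along with the job name. The FRED line should report 0 and a blank job, because job is not retained here.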
This assumes that you want only the first value like FRED, WILMA or BARNEY after each dash line to trigger output. Setting var1 to missing after it is written once, and then testing whether a value is available to write, is one way to make sure each var1 is written only once and to skip the remaining lines of that block.
To send output to two different data sets you need both names on the DATA statement; an explicit OUTPUT <data set name> then writes only to that one set.
More complex values may require changes to the INPUT statements, as your example values all consist of one word. Values with more words will require additional code.
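If the goal is text files rather than data sets, as in the original question, the same logic can be sketched with FILE and PUT. This is a variant under assumptions, not a tested drop-in: the filerefs complex and simple are taken from the question, and the TEMP device is only a stand-in for real paths.

```sas
/* Stand-in output locations; point these at real paths as needed */
filename complex temp;
filename simple  temp;

data _null_;
  infile cards truncover;
  length var1 word $ 15;
  retain var1;
  input @;
  if _infile_ =: '-' then input @'JOB=' var1;
  else if not missing(var1) then do;
    word = scan(_infile_, 1);
    /* choose the destination from the first line that follows the dash line */
    if word in ('FRED' 'WILMA' 'BARNEY') then file complex;
    else file simple;
    put var1;
    call missing(var1);   /* ignore the remaining lines of this block */
  end;
cards;
- 23025 110400 LSCHD,LIST=SCHD,JOB=HAWKEYE
FRED NDAY=250
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
- 23025 110400 LSCHD,LIST=SCHD,JOB=HOTLIPS
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
- 23025 110400 LSCHD,LIST=SCHD,JOB=KLINGER
WILMA DAY=213
FRED DAY=051
FRED DAY=051
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
- 23025 110400 LSCHD,LIST=SCHD,JOB=BJ
BARNEY DAY=213
FRED DAY=051
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
- 23025 110400 LSCHD,LIST=SCHD,JOB=CHARLES
SLIG-00 REQUEST COMPLETED AT 11:04:01 ON 23.025
;
run;
```

Because var1 is cleared right after it is written, each job name goes to exactly one of the two files.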
Awesome. Thanks for this. It looks like what I'm after. Will continue to test here.