- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Greetings!
I'm trying to write code that will identify when a data step has gone awry. Essentially, I want to capture the number of records that are read from and INFILE and compare it to the resulting data set. I can get the number of observations in the SAS data set, but I am at a loss as to how to capture the number of observations in the INFILE (external dataset). I know that SAS has it in one of the many temporary variables, as it prints it out in the log. I want to be able to use that information to create a report, verifying that the number of records read in matches the number of recors written out.
"NOTE: 39917 records were read from the infile"
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
There is no value like that. It would be difficult for SAS to add since a single datastep could read many input files.
You will probably need to create your own process to check the files. To count the number of lines if you are using Unix then I would just use the wc command. You could do it outside of SAS or use SAS abiltiy to execute operating system commands to do it for you.
%let path=./raw/ ;
%let pattern=*.csv;
data files;
length fname $256 lines 8 cmd $300;
infile "ls &path.&pattern" pipe truncover ;
input fname $100.;
cmd = catx(' ','wc -l',fname);
infile cmd pipe filevar=cmd truncover;
input lines @;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This is indicative of a poor transfer process. If you setup the transfer correctly, i.e. Data Transfer specifications document - detailing structure and transfer schedule signed by both parties, then send data via a secure server, and this is the key point for you, have the sending person provide a proc contents of the data at their end. You can then validate the structure and the number of observations along with the data contained.
One other alternative is to use a self describing file format such as XML, you can have tags which describe how much data should be present, such as number of observations/columns.
Its pretty difficult to do with a plain text file if you have no specs/contents of what was sent. Your datastep may process the data, split wide variables, convert data etc. which can impact number of obs.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Be careful what you wish for!
- Many DATA steps delete observations.
- Some DATA steps read a single observation from multiple lines of raw data.
- Some DATA steps output multiple observations from a single line of raw data.
Any of these factors would cause the number of observations in the SAS data set to differ from the number of lines of raw data.
You may succeed in your task, but find that there are many "false positive" situations that you need to investigate.
Assuming you want to continue, are you willing to add a few lines of code to each DATA step to perform the counting/reporting?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I was hoping that the &sysnobs macro variable would do the trick. Unfortunately, it seems to capture the number of records in the output dataset, not the input dataset. When I add a subsetting IF statement, I only get the number of records that qualified for the IF statement.
Unfortunately, I don't have control over the way the files are sent to me, so I can't require the changing of input formats, and the like.
I'm open to other suggestions.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If your program has a single INPUT statement try adding:
Call symputx('n_count',put(_n_,best16.));
in the line immediately after the input.
Reference the value as &n_count later.
At least this works with one of my programs that reads multiple input files and totals to the "records read from file" notes in the log.
There are however lots of caveats to consider. If your input reads multiple lines at one time this won't work.
I'm not going to spend the time to test all of the interactions with missover, truncover and such where the program may or may not go the next line of the input file when there is insufficient data on a line.
You might need to consider very strongly, other than a subsetting if what the likely causes of the difference may be. If you incoming data has things like blank lines or extra header rows embedded in the body of the data you might consider something to specifically count those cases.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Basically your example is reporting the number of iterations of the data step.
In that case I would subtract one from _N_.
So in this example both SYSNOBS and DATA_STEP_LOOPS will by 19.
data class ;
call symputx('DATA_STEP_LOOPS',put(_n_-1,32.));
set sashelp.class;
run;
%put &=sysnobs &=data_step_loops;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
There is no value like that. It would be difficult for SAS to add since a single datastep could read many input files.
You will probably need to create your own process to check the files. To count the number of lines if you are using Unix then I would just use the wc command. You could do it outside of SAS or use SAS abiltiy to execute operating system commands to do it for you.
%let path=./raw/ ;
%let pattern=*.csv;
data files;
length fname $256 lines 8 cmd $300;
infile "ls &path.&pattern" pipe truncover ;
input fname $100.;
cmd = catx(' ','wc -l',fname);
infile cmd pipe filevar=cmd truncover;
input lines @;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Sometimes the answer is as plain as the nose on your face.
Thanks!