BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
GBL__
Quartz | Level 8

Hello All,

 

Obviously data validation is a very well covered topic, however, I have not been able to specifically find help for my problem.

 

My group receives raw data files ~100 different clients in all shapes and sizes.  The problem I am having is that some what frequently a client will change their data file layout, or add extra variables at the end of their file.

 

The first scenario is easy enough to identify and I have no problem programming to catch that (I currently use PROC COMPARE to validate against a "master table" that only contains variable information).

 

The problem I cannot seem to program for is when a client adds additional variables/data to the end of their raw data file.  If I create a standardized import using a DATA Step and INFILE/INPUT statements, how can I programically check to make sure there is no additional data?

 

As an example:

 

DATA t_&client_number. ;
    INFILE "&file_path.\&client_name.\*.txt" TRUNCOVER ;
    INPUT
        @1   class_number            $10.
        @11  account_number          $16.
        @27  total_current_balance    12.2
        @39  customer_name           $40. ;
    FORMAT
        total_current_balance        dollar12.2 ;
RUN ;

If I have a data import like the above, and since I am explicitly creating the variables, if there is extra data how am I supposed to know?

 

Another problem is that we receive fixed width files, tab and comma delimited files, and excel files, so I am not sure there would be a one stop shop to handle everything.

 

I apologize if I am not being clear, or have not provided enough background information for you to provide any feedback.

 

Appreciate any and all help!  Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

One way might be similar to this:

DATA t_&client_number. ;
    INFILE "&file_path.\&client_name.\*.txt" TRUNCOVER length=reclen;
    INPUT
        @1   class_number            $10.
        @11  account_number          $16.
        @27  total_current_balance    12.2
        @39  customer_name           $40. 
     ;
    if reclen > 78 then do;
      put "WARNING: Longer than expected data line";
      stop;
   end;
    FORMAT
        total_current_balance        dollar12.2 ;
RUN ;

The length= infile option creates a variable, in this case named Reclen that has the length of the current input record.

The variable is not added to the data set but you can test the value. Since you are using column input then that says you expect the maximum length of your data file to be 39 (starting column of the last variable) + the format length (40) which would end at column 78. So if the value of Reclen is longer than that then this writes a message to that effect and then stops executing the data step. You may want to consider using ABORT instead of STOP if running in a batch mode as they affect how later program code behaves.

 

You could include a PUT _infile_; as well to show the value of the data line.

This specific approach has a minor advantage that if the first record with the data appears part way through the file it stops there.

Also if there is a KNOWN column of data you are intentionally not reading (some data with end of record strings do this) you could set the Reclen comparison past that expected position.

 

However this has some weakness. For one if you have different file layouts then you have to calculate a specific length for each one. Also if any of your files are using delimited input with list then you likely don't have a nice reference column position to test from.

 

Another approach:

DATA t_&client_number. ;
    INFILE "&file_path.\&client_name.\*.txt" TRUNCOVER length=reclen;
    INPUT
        @1   class_number            $10.
        @11  account_number          $16.
        @27  total_current_balance    12.2
        @39  customer_name           $40. 
        testvar                      $varying50.;
     ;
    if length(testvar)>0 then do;
      put "WARNING: Longer than expected data line";
      stop;
   end;
   drop testvar;
    FORMAT
        total_current_balance        dollar12.2 ;
RUN ;

Check the documentation on the $varying format but basically you are attempting to read past the end of the expected data and if there is any then stop. Again you might want a Put _infile_ to show the whole line. This does create a variable that you likely do not want to see in your data so you need to drop it.

The advantage is this is easier to transfer to other data file layouts and will word with list input as well as column input.

A disadvantage is that if you are intentionally not reading part of the data then you will likely catch that here. So you would have to read the known data and intentionally discard that variable as well.

 

 

View solution in original post

7 REPLIES 7
Tom
Super User Tom
Super User

The best way is to make them tell you want they are sending you.  So in addition to the actual data file have them send you a data dictionary.  You can then compare the new data dictionary to the old data dictionary and know if there are any changes.

 

So for your fixed length example you might ask them to send you:

Name Type StartCol Length
class_number CHAR 1 10
account_number CHAR 11 16
total_current_balance NUM 27 12
customer_name CHAR 39 40 

PS You probably do NOT want to use 12.2 as then INFORMAT to read a number.  Unless you have been told that they will be removing the periods from the text version of the numbers.  So that a value like 1,234.56 will appear as "123456".  Otherwise SAS will divide any value that does not include an explicit decimal point by 10**2.

GBL__
Quartz | Level 8

Thanks for the reply, Tom!

 

While we do receive information regarding the data that is being submitted, as I stated in the OP, sometimes (unfortunately too frequently) a client will include extra variables/data to the end of the file.

 

The problem previously, and currently, is that the system we are on now cannot programically check in there is extra data.  So the only way for us to find is to manually inspect the raw data before running our import.

 

I have taken up the task of converting our team over to SAS and am looking for a way to have this checked when the import takes place, if possible.

 

I am reading as much as I can about LRECL and RECFM options and uses, because I feel like that is going to be the way to go??

Tom
Super User Tom
Super User

If you have text files (either delimited or not) just read the rest of the line into a variable. 

data t_&client_number. ;
    infile "&file_path.\&client_name.\*.txt" truncover ;
    input
        @1   class_number            $10.
        @11  account_number          $16.
        @27  total_current_balance   comma12.
        @39  customer_name           $40.
        extra $200.
    ;
    format
        total_current_balance        dollar12.2
    ;
run;

You could then check that variable to see if it is empty. Either in that step or in another step.

 

GBL__
Quartz | Level 8
Thank you @Tom! I really appreciate your feedback! I am going to implement this suggesting in to part of my of end of file testing. Thanks again!
ballardw
Super User

One way might be similar to this:

DATA t_&client_number. ;
    INFILE "&file_path.\&client_name.\*.txt" TRUNCOVER length=reclen;
    INPUT
        @1   class_number            $10.
        @11  account_number          $16.
        @27  total_current_balance    12.2
        @39  customer_name           $40. 
     ;
    if reclen > 78 then do;
      put "WARNING: Longer than expected data line";
      stop;
   end;
    FORMAT
        total_current_balance        dollar12.2 ;
RUN ;

The length= infile option creates a variable, in this case named Reclen that has the length of the current input record.

The variable is not added to the data set but you can test the value. Since you are using column input then that says you expect the maximum length of your data file to be 39 (starting column of the last variable) + the format length (40) which would end at column 78. So if the value of Reclen is longer than that then this writes a message to that effect and then stops executing the data step. You may want to consider using ABORT instead of STOP if running in a batch mode as they affect how later program code behaves.

 

You could include a PUT _infile_; as well to show the value of the data line.

This specific approach has a minor advantage that if the first record with the data appears part way through the file it stops there.

Also if there is a KNOWN column of data you are intentionally not reading (some data with end of record strings do this) you could set the Reclen comparison past that expected position.

 

However this has some weakness. For one if you have different file layouts then you have to calculate a specific length for each one. Also if any of your files are using delimited input with list then you likely don't have a nice reference column position to test from.

 

Another approach:

DATA t_&client_number. ;
    INFILE "&file_path.\&client_name.\*.txt" TRUNCOVER length=reclen;
    INPUT
        @1   class_number            $10.
        @11  account_number          $16.
        @27  total_current_balance    12.2
        @39  customer_name           $40. 
        testvar                      $varying50.;
     ;
    if length(testvar)>0 then do;
      put "WARNING: Longer than expected data line";
      stop;
   end;
   drop testvar;
    FORMAT
        total_current_balance        dollar12.2 ;
RUN ;

Check the documentation on the $varying format but basically you are attempting to read past the end of the expected data and if there is any then stop. Again you might want a Put _infile_ to show the whole line. This does create a variable that you likely do not want to see in your data so you need to drop it.

The advantage is this is easier to transfer to other data file layouts and will word with list input as well as column input.

A disadvantage is that if you are intentionally not reading part of the data then you will likely catch that here. So you would have to read the known data and intentionally discard that variable as well.

 

 

GBL__
Quartz | Level 8
Thank you @ballardw! Both of these work great! Much appreciated!
data_null__
Jade | Level 19

You could add a field after CUSTOMER_NAME to read THE_REST.  You have TRUNCOVER which will keep this from causing the input to go the next line.

THE_REST $128.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1086 views
  • 0 likes
  • 4 in conversation