DATA Step, Macro, Functions and more

Retrieve Source file count

Reply
Frequent Contributor
Posts: 90

Retrieve Source file count

Hi,

 

In my source data file, I have exactly 250000 lakhs records..Through infile statement ,we are reading the source data , then we are doing validation part ( created two temporary work dataset - one for error dataset and the other one is for target dataset) . Like the code as below

 

 


data work.RETAIL_CUST
/* Drop the temporary columns from the target dataset */
(drop = X_INSURED_NAME X_ADDRESSOFINSURED)
work.RETAIL_CUST_ERROR
/* Drop the actual columns from the ERROR Dataset */
(drop = INSURED_NAME ADDRESSOFINSURED)
;

infile 'C:\Official\RGIC\Data\retail_txt.txt'
lrecl = 1024
delimiter = '|'
dsd
missover
firstobs = 2;
;

/* Final Variables to be populated after the Structure Validation is done */
attrib TRAN_ID length = $7;
attrib INSURED_NAME length = $30;
attrib ADDRESSOFINSURED length = $200;
attrib PRODUCT_CODE length = $12;
attrib PRODUCT_NAME length = $30;
attrib BRANCH_CODE length = $11;
attrib BRANCH_NAME length = $50;
attrib CONTACTMOBILE length = $20;
attrib CONTACTLANDLINE length = $20;
attrib EMAIL length = $50;
attrib PREVIOUS_POLICY_NUMBER length = $30;
attrib AGENT_CODE length = $20;
attrib AGENT_BROKER_NAME length = $30;
attrib PAN length = $15;
attrib VEHICALNO length = $20;
attrib ENGINENUMBER length = $30;
attrib CHAISSENO length = $30;
attrib DOB length = $12;
attrib GENDER length = $6;
attrib POLICYNO length = $30;
attrib PASSPORT length = $20;

/* Temporary Variables to read the data from file */
attrib X_INSURED_NAME length = $300;
attrib X_ADDRESSOFINSURED length = $300;


/* Note - The Input below reads the data into temporary columns of bigger size */
input TRAN_ID X_INSURED_NAME X_ADDRESSOFINSURED PRODUCT_CODE PRODUCT_NAME
BRANCH_CODE BRANCH_NAME CONTACTMOBILE CONTACTLANDLINE EMAIL
PREVIOUS_POLICY_NUMBER AGENT_CODE AGENT_BROKER_NAME PAN VEHICALNO
ENGINENUMBER CHAISSENO DOB GENDER POLICYNO PASSPORT;

/* Apply the Structure Validation Rules on temporary columns */
ERR_REC = 0;
if length(X_INSURED_NAME) > 30 then ERR_REC = 1;
else INSURED_NAME = strip(X_INSURED_NAME);

if length(X_ADDRESSOFINSURED) > 200 then ERR_REC = 1;
else ADDRESSOFINSURED = strip(X_ADDRESSOFINSURED);


/* Check the ERR_REC value */
if ERR_REC = 0 then
output work.RETAIL_CUST; /* Output correct data in the target table */
else
output work.RETAIL_CUST_ERROR; /* Output Error data in the ERROR table */

run;

 

The RETAIL_CUST dataset will have only the proper records and RETAIL_CUST_ERROR dataset will have only the error records. 

But my requirement is to capture  the source row count and populate into my final table. Is there any option to capture the source  count( i.e 2500000) in a variable..

 

I know there is one approach..create a dataset for  the infile statement t,hen taking the output count of  the dataset using the proc sql, will give the source file count...But I dont want to create a extra dataset simply  for reading the data from file.

 

Request to guide is there any way to assign some variable to retrieve the source count from the file in the same part of the code as above.

 

Super User
Super User
Posts: 7,401

Re: Retrieve Source file count

Well, you could just start an incrementor variable, retain it, add 1 to it each time, then output that:

data x y;

  infile...;

  retain row_count;

  if _n_=1 then row_count=1;

run;

 

Then whichever is the highest row_count in the two datasets is the obs count.  However I personally wouldn't take this approach.  You will find code is better strcutured and more readable if you separate your code out. 

data imported_data;

  ...;

run;

 

data fail pass;

  set imported_data;

...

run;

 

You will see it is easier to read, and you have a nice clear path back to the original data - and of course can pull nobs out easily.

 

A coupl eof tips as wel.  Use consistent indentation and avoid mixed case/upcase.  Use length statement once:

  length vara $10 varb 8 varc 3.4...;

 

Respected Advisor
Posts: 3,777

Re: Retrieve Source file count


RW9 wrote:

Well, you could just start an incrementor variable, retain it, add 1 to it each time, then output that:

data x y;

  infile...;

  retain row_count;

  if _n_=1 then row_count=1;

run;

 

Then whichever is the highest row_count in the two datasets is the obs count.  However I personally wouldn't take this approach.  You will find code is better strcutured and more readable if you separate your code out. 

data imported_data;

  ...;

run;

 

data fail pass;

  set imported_data;

...

run;

 

You will see it is easier to read, and you have a nice clear path back to the original data - and of course can pull nobs out easily.

 

A coupl eof tips as wel.  Use consistent indentation and avoid mixed case/upcase.  Use length statement once:

  length vara $10 varb 8 varc 3.4...;

 


I would not read the data twice for the reasons you suggest.  Remember the OP has "exactly 250000 lakhs records".

Super User
Super User
Posts: 7,401

Re: Retrieve Source file count

Well, I don't know what "lakhs" is, but if we know there are exactly 250000 records, why are we even discussing it, the outcome would be 250000 records no?  As for the two steps, well the OP is reading in and splitting the data at the same step, which is fine, my preference would be to do that in two steps, keep the original data, and the two split ones, for validation purposes.  250k isn't much.

Trusted Advisor
Posts: 2,113

Re: Retrieve Source file count

You can also use the SAS Dictionary Tables to obtain the record count for a SAS Dataset without reading the entire dataset.

Super User
Super User
Posts: 7,401

Re: Retrieve Source file count

Yes, but that can only happen once the data is read into SAS and a dataset is created no?  Thats whay I suggest reading in the data into a dataset.  Then you can get the obs count from those tables.  Then do your processing.  The way its setup it reads each line of the file and then outputs to two different datasets, doesn't actually seem to have any checks to see if the if then is not fulfilled, hence why I would split the read in from the validation step.

Ask a Question
Discussion stats
  • 5 replies
  • 225 views
  • 0 likes
  • 4 in conversation