Kirito1
Quartz | Level 8

I was working in SAS, and I usually use PROC IMPORT to bring in all the data I need; I'd agree it's the best way to deal with most data. My data has about 950K rows and about 50 columns. PROC IMPORT was working fine until the row count hit 900K.

 

But the problem arises when my data exceeds 900K rows. The data is all good in the CSV (the source file has no discrepancies), in case anyone suspects the raw data is at fault — it is not.

When I now use PROC IMPORT, the data comes out wrong.

 

So I decided to use INFILE to import my data. It imports, but (once again a but) the data again comes out wrong. Can anyone help?

What's happening is that blanks in the data are getting filled with values or characters that should not be there.

(If there is any other way to import data into SAS without discrepancies, please let me know.)

Please help.

7 REPLIES
Kurt_Bremser
Super User

Please post your code, the log from it, some example lines which are read correctly, and some where the code does not create correct results.

Patrick
Opal | Level 21

"And I agree its the best way to deal with any data."

Using a SAS data step with INFILE/INPUT is what gives you full control, and is thus what most experienced developers would consider "the best way".

 

If your data step works for the first 900K rows, then something in your source data structure must be causing your INPUT statement to "fall over".

Are you using MISSOVER or TRUNCOVER in your INFILE statement?

 

Look into your .csv starting one line before the first unexpected data appears in the SAS table, and check whether you find anything out of the ordinary.

Also check your SAS log for any NOTE or WARNING that gives you a hint, like: "SAS went to a new line ...".
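One quick way to do that inspection from within SAS is to dump the raw records around the suspect line to the log. This is a sketch; the path is taken from the original post, and the FIRSTOBS/OBS values are placeholders you would adjust to bracket the line flagged in the "Invalid data ... in line 21" notes:

```sas
/* Dump raw CSV records around the suspect line to the log.
   With FIRSTOBS=18, _N_=1 corresponds to file line 18. */
data _null_;
    infile "XXXFILEXXX/MABC.csv" firstobs=18 obs=24;
    input;                 /* read the record without parsing it */
    put _n_= _infile_;     /* write the raw record to the log    */
run;
```

Comparing these raw records against the column positions in the invalid-data notes usually makes the offending value obvious.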

Kirito1
Quartz | Level 8

SAS code and log below:

 

data MERINFO;
    infile "XXXFILEXXX/MABC.csv"
	delimiter = ","
	missover 
	dsd
	firstobs=2;

informat 'Sr. NO.'n;	
informat 'Lead Id'n;	
informat 'Lead Source'n $30.;	
informat 'Customer Id'n;	
informat 'Merchant Name'n $1000.;	
informat 'QR Display Name'n $1000.;	
informat 'Acc Type'n $10.;	
informat 'Merchant Type'n $30.;	
informat 'Mobile Number'n;	
informat 'Email Id'n $2000.;	
informat 'PM Swanidhi Yojna'n $1.;	
informat 'Status'n $30.;	
informat 'VPA'n $200.;	
informat 'Remark'n $5000.;	
informat 'Address'n $1000.;	
informat 'Pin Code'n;	
informat 'Business Address'n $2000.;	
informat 'City'n $100.;	
informat 'State'n $100.;	
informat 'Pincode'n;	
informat 'Registered Address'n $2000.;	
informat 'City'n $100.;	
informat 'State'n $100.;	
informat 'Pincode'n;	
informat 'BankName'n $200.;	
informat 'Branch Name'n $200.;	
informat 'Account Name'n $400.;	
informat 'Account Number'n;	
informat 'Ifsc Code'n $50.;	
informat 'Pancard No.'n $70.;	
informat 'Micr Code'n;	
informat 'Profession Category'n $400.;	
informat 'Department Category'n $500.;	
informat 'Employer Category'n $200.;	
informat 'Employment Category'n $300.;	
informat 'Principle Segment'n $30.;	
informat 'GST Number'n $100.;	
informat 'So Name'n $200.;	
informat 'So Branch Name'n $400.;	
informat 'So Branch Code'n $40.;	
informat 'OP Name'n $40.;	
informat 'LG Code'n $20.;	
informat 'LC Code'n $20.;	
informat 'Date of onboarding'n datetime24.;	
informat 'Date Of approval'n datetime24.;	
informat 'Last Modified Date'n datetime24.;	
informat 'Consent'n $1.;	
informat 'Settlement Cycle'n $3.;	
informat 'Product Description'n $500.;	
informat 'Name Match Score'n;	
informat 'Product Code'n;	
informat 'Referral Code'n;	
informat 'CKYCNumber'n;



format 'Sr. NO.'n;	
format 'Lead Id'n;	
format 'Lead Source'n $30.;	
format 'Customer Id'n;	
format 'Merchant Name'n $1000.;	
format 'QR Display Name'n $1000.;	
format 'Acc Type'n $10.;	
format 'Merchant Type'n $30.;	
format 'Mobile Number'n;	
format 'Email Id'n $2000.;	
format 'PM Swanidhi Yojna'n $1.;	
format 'Status'n $30.;	
format 'VPA'n $200.;	
format 'Remark'n $5000.;	
format 'Address'n $1000.;	
format 'Pin Code'n;	
format 'Business Address'n $2000.;	
format 'City'n $100.;	
format 'State'n $100.;	
format 'Pincode'n;	
format 'Registered Address'n $2000.;	
format 'City'n $100.;	
format 'State'n $100.;	
format 'Pincode'n;	
format 'BankName'n $200.;	
format 'Branch Name'n $200.;	
format 'Account Name'n $400.;	
format 'Account Number'n;	
format 'Ifsc Code'n $50.;	
format 'Pancard No.'n $70.;	
format 'Micr Code'n;	
format 'Profession Category'n $400.;	
format 'Department Category'n $500.;	
format 'Employer Category'n $200.;	
format 'Employment Category'n $300.;	
format 'Principle Segment'n $30.;	
format 'GST Number'n $100.;	
format 'So Name'n $200.;	
format 'So Branch Name'n $400.;	
format 'So Branch Code'n $40.;	
format 'OP Name'n $40.;	
format 'LG Code'n $20.;	
format 'LC Code'n $20.;	
format 'Date of onboarding'n datetime24.;	
format 'Date Of approval'n datetime24.;	
format 'Last Modified Date'n datetime24.;	
format 'Consent'n $1.;	
format 'Settlement Cycle'n $3.;	
format 'Product Description'n $500.;	
format 'Name Match Score'n;	
format 'Product Code'n;	
format 'Referral Code'n;	
format 'CKYCNumber'n;

input
'Sr. NO.'n
'Lead Id'n	
'Lead Source'n $	
'Customer Id'n
'Merchant Name'n $	
'QR Display Name'n $	
'Acc Type'n $	
'Merchant Type'n $	
'Mobile Number'n
'Email Id'n $	
'PM Swanidhi Yojna'n $	
'Status'n $	
'VPA'n $	
'Remark'n 	
'Address'n $	
'Pin Code'n	
'Business Address'n $
'City'n $
'State'n $	
'Pincode'n
'Registered Address'n $	
'City'n $	
'State'n $
'Pincode'n
'BankName'n $	
'Branch Name'n $	
'Account Name'n $	
'Account Number'n	
'Ifsc Code'n $	
'Pancard No.'n $	
'Micr Code'n	
'Profession Category'n $	
'Department Category'n $	
'Employer Category'n $	
'Employment Category'n $	
'Principle Segment'n $	
'GST Number'n $	
'So Name'n $	
'So Branch Name'n $	
'So Branch Code'n $	
'OP Name'n $	
'LG Code'n $	
'LC Code'n $	
'Date of onboarding'n 	
'Date Of approval'n 	
'Last Modified Date'n 	
'Consent'n $
'Settlement Cycle'n $	
'Product Description'n $	
'Name Match Score'n	
'Product Code'n	
'Referral Code'n	
'CKYCNumber'n;

run;

 

 


 

NOTE: Invalid data for 'Sr. NO.'n in line 8 1-1.
NOTE: Invalid data for 'Account Number'n in line 21 387-398.
NOTE: Invalid data for 'Date of onboarding'n in line 21 600-618.
NOTE: Invalid data for 'Date Of approval'n in line 21 622-640.
NOTE: Invalid data for 'Last Modified Date'n in line 21 644-662.
WARNING: Limit set by ERRORS= option reached.  Further errors of this type will not be printed.

And I have only 920K records/rows/observations and 53 columns in my data, but SAS is reading it as shown in the attached screenshot:

[attached screenshot: Kirito1_0-1676460078036.png]

 

ballardw
Super User

Do not remove the details of the INVALID DATA messages. You should include all of that very ugly stuff that comes along, because it almost always answers the question. Consider your invalid data for 'Sr. NO.'. This brief step duplicates a very likely cause:

data example;
   infile datalines dlm=',' dsd;
   informat sn;
   informat leadid;
   input sn leadid;
datalines;
1,3
345,567
a,111
;

And the log:

47   data example;
48      infile datalines dlm=',' dsd;
49      informat sn;
50      informat leadid;
51      input sn leadid;
52   datalines;

NOTE: Invalid data for sn in line 55 1-1.
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+--
55         a,111
sn=. leadid=111 _ERROR_=1 _N_=3

Yes, the line number differs because my session is not synced with yours, BUT the invalid data is shown in the line under the RULE: the first character is an "a". The lack of an actual informat for the variable means it defaults to a numeric informat, quite likely BEST12. Any single non-digit, non-space character is going to be "invalid".

Your invalid data with the account number is very likely to be quite similar, though there I might expect a "-" or a space in the middle of a value consisting mostly of digits. (Note: if you don't do arithmetic with it, there usually isn't a reason to read it as numeric.)

 

I typically expect, when I see messages like these (note they are all on the same record),

NOTE: Invalid data for 'Date of onboarding'n in line 21 600-618.
NOTE: Invalid data for 'Date Of approval'n in line 21 622-640.
NOTE: Invalid data for 'Last Modified Date'n in line 21 644-662.

that someone entered the value in a different structure than the DATETIME24. your informat expects, such as no time component or a space between the date and time portions.

 

The RULE line is extremely helpful in identifying the columns of data. The numbers mark multiples of 10, the "+" marks are at 5-character intervals, and the starting character position is shown when a long record wraps, so you can see which column values were encountered.

 

Note also that the LINE in the invalid-data message is the line of the source file. So the problems you see occur well before 900K records, and the size of the file has nothing to do with it.

You may not be getting all of the invalid-data messages because you have not addressed things like the informats flagged on the earlier records.

 

Also, if your CSV originated as a spreadsheet and you are saving it then you may just be at the mercy of bad data entry practices.

I have received spreadsheets that would, partway through a file, change the column formatting from "general" to currency, or date to currency (or currency to dates as well). So where I was expecting to read values like 2022-10-15 for a date, the column started showing values of $44,123. I had to manually ensure the columns had the same formatting in the spreadsheet software before saving to CSV, to undo whatever the source was doing to mess things up.

Tom
Super User

@ballardw Very good response, except that BEST is the name of a FORMAT, not an INFORMAT.

 

When a numeric variable does not have an informat assigned, it uses the normal numeric informat. And since you are using list-mode input, it uses the width of the next delimited "word" on the line.

Reeza
Super User
The number of lines makes me guess that you have embedded EOL characters in the text file. I would read the file X observations at a time until you find the line where it starts to split, then manually check the file for EOL characters and ensure the correct EOL terminator is specified.

Personally, I would re-run PROC IMPORT to generate the code, using OPTIONS VALIDVARNAME=V7 to get names that are easier to type, and also including GUESSINGROWS=MAX (which will take a long time). Then you can compare the new version to the old one to see which fields are problematic.
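Reeza's suggestion could look something like the sketch below (the path is taken from the original post; the output dataset name is just an illustration):

```sas
options validvarname=v7;   /* maps "Sr. NO." etc. to valid names like Sr__NO_ */
proc import datafile="XXXFILEXXX/MABC.csv"
            out=merinfo_check
            dbms=csv
            replace;
    guessingrows=max;      /* scan every row before deciding types/lengths */
run;
```

The generated data step appears in the log, so it can be copied out and compared field by field against the hand-written version.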
Kurt_Bremser
Super User

BEFORE you do anything else, get rid of those extremely stupid name literals. Use valid SAS names (letters, digits, underscores, must not start with a digit); keep the funny strings in labels if you must preserve them.

 

The informats and formats can all be set in a single statement each, which makes your code easier to maintain.

 

Your first statement should be a LENGTH statement; after that, formats and informats are only needed for certain variables like dates, times or numbers.
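As a sketch of that pattern, here is what the start of the data step could look like with a handful of the columns (names shortened to valid SAS names as suggested above; the remaining ~50 columns would follow the same pattern, and the lengths are the ones from the original program):

```sas
data merinfo;
    infile "XXXFILEXXX/MABC.csv" dlm=',' dsd truncover firstobs=2;
    /* LENGTH first: fixes variable order, type, and storage in one place */
    length sr_no $20 lead_id $20 lead_source $30 merchant_name $1000
           date_of_onboarding 8;
    /* informats/formats only where the default would misread the value */
    informat date_of_onboarding e8601dt19.;
    format   date_of_onboarding datetime20.;
    label sr_no = 'Sr. NO.' lead_id = 'Lead Id';
    input sr_no $ lead_id $ lead_source $ merchant_name $ date_of_onboarding;
run;
```

Note that sr_no and lead_id are read as character here, per the advice below about serial and account "numbers", and the datetime informat is an assumption about the source layout that should be checked against the actual CSV.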

 

Serial and account "numbers" are never used for calculations, so read them as character.

 

Look at the format of the datetimes in your csv files. I seriously doubt that non-SAS sources use the SAS DATETIME format; a length of 19 makes me suspect that these datetimes are stored as YYYY-MM-DD HH:MM:SS or YYYY-MM-DDTHH:MM:SS (ISO 8601 compliant). Such values can be read with the E8601DT19. informat.
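A quick way to confirm the informat against a sample value pulled from the CSV (the literal here is an invented example, not a value from the poster's file):

```sas
/* E8601DT19. reads the ISO 8601 form YYYY-MM-DDTHH:MM:SS. */
data _null_;
    dt = input('2023-02-15T10:30:00', e8601dt19.);
    put dt= datetime20.;
run;
```

For the space-separated variant (YYYY-MM-DD HH:MM:SS), the more permissive ANYDTDTM19. informat will also read the value.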

 

Finally, scan for strings containing hex 0D, 0A or 0D0A characters (CR, LF, CRLF), which make SAS think that the current record ends right there. You will find numerous threads here on the community dealing with such issues, and ways to remedy them from within SAS.
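A minimal sketch of such a scan, assuming the file uses LF as its record terminator so that any remaining CR (hex 0D) inside a record is suspect; for a CRLF-terminated file you would flip the logic (TERMSTR=CRLF and scan for lone LFs instead):

```sas
/* Count records that still contain a CR after reading with LF terminators -
   a common sign of line endings embedded inside quoted fields. */
data _null_;
    infile "XXXFILEXXX/MABC.csv" termstr=lf end=done;
    input;
    if index(_infile_, '0D'x) then bad + 1;
    if done then put 'Records containing a stray CR: ' bad;
run;
```

If the count is nonzero, the usual remedies discussed on the community involve rewriting the file (e.g. with RECFM=N) to strip or replace the embedded characters before the real import.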


Discussion stats
  • 7 replies
  • 1963 views
  • 0 likes
  • 6 in conversation