BookmarkSubscribeRSS Feed
Jagadishkatam
Amethyst | Level 16

This is with regard to a strange issue i observed when i run the below code

proc import datafile='~path\sample.csv' out=ex dbms=csv replace;

DELIMITER=',';

getnames=yes;

DATAROW=2;

GUESSINGROWS=5000;

run;  

a dataset EX is created with only 678 observations. There are actually 5000 rows. The remaining observation are not read into the dataset.

could you please let me know the reason for this and any possible options to avoid. i am running this code in sas 9.2.

also tried to execute the above code with

options obs=max;

Thanks,

Jag

Thanks,
Jag
8 REPLIES 8
ballardw
Super User

After running proc import there should be a log entry. Did you get any warnings or notes in the log?

Also have you opened the file in a plain text editor? If you looked at the file in many programs lines may have wrapped appearing to be more lines that are actually in the file.

One clue to that: How many variables did you get and how many did you expect to have? If you have many more variables it indicates you may have an issue with line ending.

Also, if you are reading this file in Windows and it originated in a Unix system then the line end characters are different and you may need to modify the code generated by Proc Import and add the option TERMSTR=LF to properly get rows.

Also, look at the lengths of variables. If you have unclosed quotes in the data some odd things may happen.

ErikLund_Jensen
Rhodochrosite | Level 12

The problem might be caused by a hex 1A (DOS eof characher) somewhere in the file.

If this is the problem, then the option ignoredoseof on the infile-statement will solve it.

Jagadishkatam
Amethyst | Level 16

Than you Erik, it worked. I was able to get the remaining rows after using the option ignoredoseof in the infile statement. However one problem i noticed was in one of the variable there were text data and in between these text data there were commas. so even these commas were considered and the data of this variable was split and used in the next variable. because of which i was getting more than the expected records.

i understand it is because of the presence of commas in between the text data, the text data is also considered as separate due to the function of dlm=',' while reading the data from csv file.

Is there any way to avoid this and improve the import from csv file through the text data has commas.

Thanks,
Jag
ErikLund_Jensen
Rhodochrosite | Level 12

Hi Jag

In a comma separated file, text fields ought to be quoted, at least columns that contain spaces or the separator character, and there should be the same number of separators in all rows, even if the fields are empty. In that case the file can imported in excel and also read by SAS with correct result.

So I presume that the text columns containing commas are not in quotes, and then SAS (or EXCEL) cannot distinguish between commas in fields and commas between fields. But IF there is only one text field that might contain commas, AND IF there is the same number of "real" separators in all rows, it is possible to solve the problem by coding in the data step. One trick is to treat the whole row as a string and scan variables from start up to the problematic field, and then from end backwards down to the field. The remaining string must be the problem-field content, commas or not. Tedious to write, but it works.

Example records - the name field has a embedded comma in second row (Anderson, Peter) - note the termination comma.
28,12,Sunil,programmer,Lahore,employed,
33,5,Anderson, Peter,teacher,New Delhi,dismissed,

Code example - this code does not work properly if there are empty fields (2 concecutive commas with a zero-length value between). - It could be done too, with a little more effort.
data test (drop=nextstart nextend namelength);
infile 'e:\work\test.csv' ignoredoseof lrecl=32000;
length age month 8 name jobtitle town jobstatus $30;
input;

* scan from start before the name-field, keep track of next position;
age = input(scan(_infile_,1,','),8.);
nextstart = find(_infile_,',',1)+1;

month = input(scan(_infile_,2,','),8.);
nextstart = find(_infile_,',',nextstart+1);

* from end after the name-field, keep track of next position;
nextend = find(_infile_,',',-32000); * position of terminating comma;

jobstatus = scan(_infile_,-1,',');
nextend = find(_infile_,',',-(nextend-1));

town = scan(_infile_,-2,',');
nextend = find(_infile_,',',-(nextend-1));

jobtitle = scan(_infile_,-3,',');
nextend = find(_infile_,',',-(nextend-1));

* extract the remainder;
* nextstart and nextend are now the positions of the commas surrounding the name;
namelength = nextend - nextstart;
name = substr(_infile_,nextstart+1,namelength-1);
run;

result:
age=28 month=12 name=Sunil jobtitle=programmer town=Lahore jobstatus=employed
age=33 month=5 name=Anderson, Peter jobtitle=teacher town=New Delhi jobstatus=dismissed

Jagadishkatam
Amethyst | Level 16

i checked the proc import log file and it was executed well without any errors, the variables were as expected and only issue was with the observations, which was rectified ignoredoseof.


Thank you @ballardw i tried the TERMSTR=LF, but for my scenario it did not work.

Thanks,
Jag
data_null__
Jade | Level 19

Does that mean that now one or more variable values contain the value '1A'X?  If so could that be a problem going forward?

Jagadishkatam
Amethyst | Level 16

i dont think so, as we used option ignoredoseof, however the problem is with the commas that are present in the one of the variables data. Because of which the data is being carried to other variables and there by there is increase in records.

Thanks,
Jag
Reeza
Super User

There really isn't a way to get around the comma's embedded in comments field.

It usually requires some manual cleaning or some clean up process after the import but its usually very data specific. If you're lucky it's the last field and that can make it easier.

Usually when such a file is exported it has quotation marks around it, and then using the DSD option will help.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 2987 views
  • 3 likes
  • 5 in conversation