Solved: Re: Import xlsx - weird formatting on mixed columns - Page 2

andreas_lds · Posted 06-11-2021 12:27 AM

@lbarwick wrote:
Understood but the 'creator' of the spreadsheet in this case is Jira, I get the .csv as an export and then save it as .xlsx. I tried converting all of the values in this column to text and Excel will turn 100% into 1 and 50% into 0.5. The problem with this is that there could be other rows with a legitimate value of 1 (1%) so in this conversion I cannot distinguish between what was originally 100% and what was originally 1%.

So, don't open the csv-file with excel, write a data step to read it directly. The code depends on the contents of the file, so it is hardly possible to suggest something useful.

Tom · Posted 06-10-2021 02:33 PM

Please go support this now 3+ year old SASWARE Ballot Idea for SAS to make enhancement to allow easy processing of such text files.

https://communities.sas.com/t5/SASware-Ballot-Ideas/Enhancements-to-INFILE-FILE-to-handle-delimited-...

ErikLund_Jensen · Posted 06-10-2021 03:31 PM

Hi @lbarwick

I have encountered the same problem several times, and I have found the following approach useful. There is no conversion outside SAS.

1. Ged rid of the embedded line shifts. Use a data step to read the CSV file as text lines. Store the line in a retained output-line variable. Count the number of delimiters in the current line and add to the output line until it has the expected number of deliniters. Then write the output-line variable to a temporary text file, initiate the output-line variable and counter and read next line etc.

2.Use another data step to read the temporary file into SAS as a delimited file. I prefer to read all variables as text and convert them to numeric or dates or whatever where appropriate.

lbarwick · Posted 06-10-2021 04:10 PM

I don't suppose you have example code for this solution?

Tom · Posted 06-10-2021 04:33 PM

@lbarwick wrote:
I don't suppose you have example code for this solution?

There are literally dozens if not hundreds of examples on this site alone.

Try a simple google search:

https://www.google.com/search?q=%40sas.com+embedded+line+breaks+in+csv+file

Here are a couple of solutions:

https://communities.sas.com/t5/New-SAS-User/Reading-CSV-file-with-double-quotes-line-breaks-and-spac...

https://communities.sas.com/t5/SAS-Programming/How-to-import-line-feed-or-carriage-return-values-fro...

SAS has even posted a note about this, but watch out as that method modifies the original file.

https://support.sas.com/kb/26/065.html

ErikLund_Jensen · Posted 06-11-2021 05:16 AM

Sure, here we go..

* Write a test CSV file with embedded line breaks in fields;
data _null_;
	infile cards truncover;
	file 'c:\temp\csvtest.csv';
	input;
	put _infile_;
	cards4;
101;3/11/2020;"text field";something;217.3;another field
102;;"text field with 
line break";more;;another field
103;12/1/2021;;something;25.2.3;another field
104;2/2/2021;"text field 
with line break";something;10000000;another 
field with 
line break
105;3/11/2020;"text field";something;217.3;another field
;;;;
run;

* Filename for temp file;
filename temp 'c:\temp\csvtest_cleaned.txt';

* Read CSV file and remove embedded line breaks - we expect 5 separators in each line;
data _null_;
	infile 'c:\temp\csvtest.csv' truncover lrecl=255 end=eof;
	file temp lrecl=255;
	length outstr $255;
	retain outstr;
	input;
	l = count(outstr,';');
	if l = 5 and count(_infile_,';') > 0 then do;
		put outstr;
		outstr = '';
	end;
	outstr = catx(' ',outstr,_infile_);
	if eof then put outstr;
run;

* Read cleaned CSV file - the "hard way" with full control;
data want;
	infile temp lrecl=255;
	length ID Date 8 Text1 text2 $200 Value 8 Text3 $200;
	format Date date9. Value 12.3;
	input;
	ID = input(scan(_infile_,1,';','m'),??8.);
	Date = input(scan(_infile_,2,';','m'),??mmddyy10.);
	Text1 = dequote(scan(_infile_,3,';','m'));
	Text2 = scan(_infile_,4,';','m');
	Value = input(scan(_infile_,5,';','m'),??12.3);
	Text3 = scan(_infile_,6,';','m');
run;

lbarwick · Posted 06-11-2021 10:03 AM

Here is the solution that seems to work best for my needs:

%let dl=C:\Users\lbarwick\Desktop;
filename fixed temp;
data _null_;
  infile "&dl/Jira_Data.csv" end=eof  LRECL=32767 truncover dsd;
  file "&dl/Jira_Data_cleaned.csv" ;
  input;
  nq+countc(_infile_,'"');
  nq=mod(nq,2);
  put _infile_ @;
  if eof or not nq then put; 
  else put '|' @;
run;

One issue I noticed is that after 693 columns or so, SAS starts to not assign names to columns so I end up with a column of data in the 'cleaned' csv with no variable name in the header.

Tom · Posted 06-11-2021 10:19 AM

Sounds like the header row is longer than 32 K bytes. For 2400 names that would be an average of about 12 characters per name.

You will need to switch to one of the methods that does not use _INFILE_ automatic variable when pre-processing the file. There are versions out that that process the file byte by byte so they should work for any recrod length.

If you are using PROC IMPORT to guess at how to name the variables from the header row you might still have a problem. I seem to remember it also could not really handle more than 32K for the length of the header row.

lbarwick · Posted 06-15-2021 11:09 AM

Here is the solution I ended up going with:

*read-in raw Jira csv data WITHOUT names;
%let dl=C:\Users\lbarwick\Desktop;

filename csv "&dl/Jira_Data.csv" lrecl=2000000;
proc import datafile=csv out=jiranames replace dbms=csv;
 getnames=no;
 datarow=2;
run;

*transpose generic VAR names from raw Jira import from above;
proc transpose data=jiranames(obs=0) out=names ;
 var _all_;
run;


*now that we have variable names set, read-in raw Jira data once again and this time correct issues with carriage returns and line feeds so we get
	all records/values lined up with header.;
%let dl=C:\Users\lbarwick\Desktop;
filename fixed temp;
data _null_;
  infile "&dl/Jira_Data.csv" end=eof;
  file "&dl/Jira_Data_cleaned.csv" ; *output file with carriage return/line feed issues fixed;
  input;
  nq+countc(_infile_,'"');
  nq=mod(nq,2);
  put _infile_ @;
  if eof or not nq then put; 
  else put '|' @;
run;

*import the cleaned Jira data but again, we're not going to import variable names;
filename csv 'C:\Users\lbarwick\Desktop\Jira_Data_cleaned.csv';
proc import out=prr /*(where=(substr(issue_key,1,3)='PRR'))*/
		datafile = "C:\Users\lbarwick\Desktop\Jira_Data_cleaned.csv"
		dbms=csv replace;
		delimiter=',';
		guessingrows=1000;
	getnames=no; 
	datarow=2;
run;

*create temp file which will create code to rename generic variables to the cleaned up variables;
filename code temp;
data _null_;
  file code;
  set names end=eof;
  if not eof2 then set new_names1 end=eof2;
  if _n_=1 then put 'rename';
  put _name_ '=' final_new  ;
  if eof then put ';';
run;

*replace generic variable names with cleaned up variable names;
proc datasets nolist lib=work;
  modify prr;
  %include code;
run;
quit;

Kurt_Bremser · Posted 06-11-2021 12:53 PM

I have written a little C utility that reads a file byte-by-byte, counts the quotes, and if mod(count,2) = 1 and a linefeed is encountered, inserts <BR> instead. Is blindingly fast and can be used as filter in a FILENAME PIPE.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Registration is open

SAS Training: Just a Click Away