Solved: Re: Single column to multiple column of comma separated

jimbobob · Posted 03-28-2023 02:37 PM

My situation is similar, but the output needs to be rows. All my data is in one giant line where each value is separated by a comma, and I should end up with only one column of data, where each value is its own row. Also, my one line is extremely long, greater than 32K in length.

["T_283_20220605_73_283A0000006757_830010", "T_283_20220605_75_283A0000012029_830010", "T_283_20220605_77_283A0000017945_Ach_OPull", "T_283_20220605_79_283A0000011229_Ach_OPull"]

How do I get it to transpose like this when I import it?

ColumnA

T_283_20220605_73_283A0000006757_830010

T_283_20220605_75_283A0000012029_830010

T_283_20220605_77_283A0000017945_Ach_OPull

T_283_20220605_79_283A0000011229_Ach_OPull

Attached is full sample.

ballardw · Posted 03-28-2023 02:54 PM

First thing: it is a good idea to start your own thread. If you think that your question is related to another, then post a link to the related thread(s). As the starter of a thread you have the ability to mark responses as an accepted solution.

You do not explain how the [ or ] characters in your example are to be treated.

This example treats the [ and ] as delimiters. The IF is there because the structure of that line means you get blank values as read. You would replace INFILE DATALINES with INFILE "your-file-name-goes-here".

data example;
   infile datalines dlm=',[]' dsd;
   informat value $50.;
   input value @@;
   if not missing(value) then output;
datalines;
["T_283_20220605_73_283A0000006757_830010", "T_283_20220605_75_283A0000012029_830010", "T_283_20220605_77_283A0000017945_Ach_OPull", "T_283_20220605_79_283A0000011229_Ach_OPull"]
;

The @@ on the input statement says hold the line and keep reading until you run out of information. You will likely see a note in the Log about reading to next line. That is normal.
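
As a minimal sketch of the difference (the dataset names WIDE and LONG and the sample values are made up for illustration): naming several variables on one INPUT statement gives one row with several columns, while a single variable with a trailing @@ gives one row per value.

```sas
/* One INPUT statement naming four variables:
   the line becomes one observation with four columns */
data wide;
  infile datalines dlm=',';
  length a b c d $10;
  input a b c d;
datalines;
red,green,blue,black
;

/* One variable plus a trailing @@: the line is held across iterations
   of the data step, so each iteration reads the next value from it */
data long;
  infile datalines dlm=',';
  length value $10;
  input value @@;
  if not missing(value) then output;
datalines;
red,green,blue,black
;
```

So it is the combination of a single variable on the INPUT statement and the @@ that turns the values into rows.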

jimbobob · Posted 03-28-2023 03:07 PM

Thanks @ballardw. Each file has an open and a close bracket, one at the beginning and one at the end; I was just going to replace these if I found them in the output. So that I understand the code better: how or where in the code does it know these are to be rows of data and not 4 different columns? Is it the double at sign?

jimbobob · Posted 03-28-2023 03:43 PM

Also, how do I deal with a file that has a line length greater than 32,768? When I apply your logic to a file it stops at 750 records, where the last one is partial, and I've been told the file should have 10,000 records in it.

Reeza · Posted 03-28-2023 03:55 PM

Do you have a JSON or XML file you're attempting to parse here?

As mentioned, please start your own thread.

Kurt_Bremser · Posted 03-28-2023 04:20 PM

Increase the LRECL=. On UNIX, the maximum value for this is 1G.
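
For example, the data step posted earlier in the thread could be pointed at an external file with a larger LRECL=. This is only a sketch: the file name 'myfile.txt' and the LRECL= value are placeholders, and the value needs to be at least as long as the longest line in the file.

```sas
data example;
  /* LRECL= raises the maximum line length SAS will read from the file */
  infile 'myfile.txt' dlm=',[]' dsd lrecl=1000000;
  length value $50;
  input value @@;
  if not missing(value) then output;
run;
```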

Tom · Posted 03-28-2023 04:07 PM

You need to provide a better example of the file you are reading. Make sure to use the Insert Code button to get a pop-up window to paste in the example text.

Is the file all one line or are there line breaks that indicate when a "ROW" ends? Do you want to treat the closing bracket as marking the end of a "ROW"?

You can use the LRECL= option on the INFILE statement to tell SAS that the lines are longer than 32767 bytes. If the lines are larger than the LRECL= option supports, then use RECFM=N instead and treat the file as one long line. In that case you might need to remove any LF and/or CR characters that you end up reading in.

So something like this will read all of the "words" out of the file.

data want;
  infile 'myfile.txt' recfm=n dlm=' ,"' ;
  wordno+1;
  input word :$80. @@ ;
run;
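
If CR or LF characters do end up inside the values when reading with RECFM=N, one possible way to drop them is the COMPRESS function with the 'c' modifier, which removes control characters. This is a sketch under the same assumptions as the step above (same placeholder file name and delimiters), not something tested against the attached file.

```sas
data want;
  infile 'myfile.txt' recfm=n dlm=' ,"';
  length word $80;
  input word @@;
  /* the 'c' modifier makes COMPRESS remove control characters,
     which includes CR and LF */
  word = compress(word, , 'c');
  /* skip values that are empty once the control characters are gone */
  if not missing(word) then output;
run;
```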

If you want to remember what LINE the WORD was read from, then read the file as variable length instead (so without RECFM=N). You can use the LENGTH= and COLUMN= options to tell when you reach the end of a line.

data want;
  /* no RECFM=N here, so the file is read as ordinary variable-length lines */
  infile 'myfile.txt' dlm=' ,"' truncover lrecl=10000000 length=ll column=cc;
  row+1;
  /* keep reading words until the column pointer passes the line length */
  do wordno=1 by 1 until(cc > ll);
    input word :$80. @ ;
    output;
  end;
run;

If you need to check which of the different [ ] blocks the words are in, then perhaps add something like:

data want;
  /* no RECFM=N here, so the file is read as ordinary variable-length lines */
  infile 'myfile.txt' dlm=' ,"' truncover lrecl=10000000 length=ll column=cc;
  row+1;
  do block=1 by 1 until(cc>ll);
    /* a closing bracket read as its own word ends the current block */
    do wordno=1 by 1 until(cc>ll or word=']');
      input word :$80. @ ;
      output;
    end;
  end;
run;

jimbobob · Posted 03-28-2023 04:33 PM

I've attached one of the text files. Does it look like XML or JSON, @Kurt_Bremser? If so, is there another way to import this?

jimbobob · Posted 03-28-2023 05:02 PM

Adding lrecl=10000000 recfm=n truncover seems to work. @Tom @Kurt_Bremser @ballardw

jimbobob · Posted 03-29-2023 11:06 AM

I wanted to loop this through multiple files, so I took some existing code I had that reads in multiple files. However, it stops after reading just one file. Looking at the log I see this: NOTE: Unexpected end of file for binary input. I'm thinking that is why it stops. Is there an option I'm missing to keep reading through my file list?

DATA dsn;
  length location $1000;
  set NEW_FILES;
  location=cats("&path/", FILE_NAME);
  infile dummy filevar=location end=done DLM=',[]' DSD lrecl=10000000 recfm=n;
  informat TRANSACTION_ID  $50.;
  Do while (not done);
    input TRANSACTION_ID @@;
    if not missing(TRANSACTION_ID) then output;
  end;
  drop file_date file_name;
RUN;

Thanks, any help is appreciated.

Tom · Posted 03-29-2023 11:19 AM

You cannot use the END= option with RECFM=N. (You also don't need the LRECL= option with RECFM=N.)

So you will need another test to see if you have reached the end of the file.

You can try just stopping when you get a missing value, but that might not really stop in time.

do while (missing(TRANSACTION_ID));
  input TRANSACTION_ID @@;
  if not missing(TRANSACTION_ID) then output;
end;

Otherwise you might need to try the EOF= option instead. That wants a LABEL to jump to and not a VARIABLE as the value. You can probably just jump past the end of the DO loop. Now the DO loop is really an infinite loop. But perhaps instead you should put some upper limit on the looping just in case.

So perhaps something like:

data dsn;
  length location $1000;
  set new_files;
  location=cats("&path/", file_name);
  /* EOF= takes a statement label (DONE), not a variable */
  infile dummy filevar=location dlm=',[]' dsd recfm=n eof=done;
  length transaction_id  $50;
  /* the upper limit is only a safety net; EOF= should exit the loop first */
  do _n_=1 to 1E8;
    input transaction_id @@;
    if not missing(transaction_id) then output;
  end;
done:
  drop file_date file_name;
run;

PS There is no need to attach an INFORMAT to a simple character variable. If you want to tell SAS how the variable should be DEFINED then just do that with the LENGTH statement. The same as you did for the LOCATION variable.
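
A small sketch of that point, using inline data with made-up values and a made-up dataset name: the LENGTH statement alone defines the variable as character with 50 bytes, and list input then reads it with no informat attached.

```sas
data example;
  infile datalines dlm=',';
  /* LENGTH defines TRANSACTION_ID as a 50-byte character variable */
  length transaction_id $50;
  input transaction_id @@;
  if not missing(transaction_id) then output;
datalines;
T_283_A,T_283_B,T_283_C
;

/* PROC CONTENTS lists the variable's type and length */
proc contents data=example;
run;
```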

Otherwise, since you don't seem to care which file the transaction id came from, you can just use your list of files to define ONE fileref and then use that in the data step.

filename code temp;
data _null_;
  file code ;
  length location $1000;
  set new_files end=eof;
  location=cats("&path/", file_name);
  /* write a FILENAME statement that lists every file in quotes */
  if _n_=1 then put 'filename allfiles (' ;
  put location :$quote. ;
  if eof then put ');' ;
run;

/* run the generated FILENAME statement so the ALLFILES fileref exists */
%include code;

data dsn;
  infile allfiles dlm=',[]' dsd recfm=n ;
  length transaction_id  $50;
  input transaction_id @@;
  if not missing(transaction_id) then output;
run;
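
For illustration, if NEW_FILES held two rows, the first step would write a statement like the one below into the temporary file CODE. The two paths are made up and simply stand in for whatever &path and FILE_NAME resolve to.

```sas
filename allfiles (
"/data/in/file1.txt"
"/data/in/file2.txt"
);
```

A fileref defined this way points at the listed files as one concatenated input, which is why the second data step needs no FILEVAR= option.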

jimbobob · Posted 03-29-2023 04:36 PM

Awesome, thanks @Tom, that works. Appreciate your help.
