I am importing an unstructured text data file whose lines exceed SAS's maximum length for a single character variable.
After fudging around a bit, I have been able to import some of the data, but it is not an ideal situation and it doesn't seem to pull in the entire file.
Is there a way, perhaps through the INFILE statement, to say: if the string exceeds 32,000 characters, start variable2 and continue to drop the data into successive variables until all text is stored?
Example: a data file of 158,000 characters, split across variables:

var1 | var2 | var3 | var4
tons of characters - aopndvikadfngnf;afngaogkma'psvdmax;clvkdasf gh, q-vw35iv,a -s8eri0t a,zs[0ro86ybuqi4,]-=l "*SIDgkj'pac';ew4l6kgvjw-4n5c | picks up where var1 left off - aonwerdvp maq9nyrgha;oriltluakce;sort8ryacfusfe /;.5gqa7w4pj5ca.seurjcioanuyrec | and so on | and so forth... till the end of the text file
Example 5 in the following SAS Global Forum paper describes how you can split your variable that is longer than 32K into more than one variable.
http://support.sas.com/resources/papers/proceedings11/256-2011.pdf
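I don't have the paper's code in front of me, but a minimal sketch of that general idea, reading an over-long record into successive fixed-size pieces, might look like the step below. The file name, the piece count of 7, the 32,000-byte piece length, and the LRECL value are all assumptions you would adjust to your file.

data wide;
  infile "longfile.txt" truncover length=ll column=cc lrecl=200000;
  array var {7} $32000;                   /* assumed number and length of pieces */
  do i = 1 to dim(var) until (cc > ll);   /* stop once the column pointer passes the line length */
    input var{i} $char32000. @;           /* read the next chunk and hold the line */
  end;
  drop i;
run;

Each iteration of the data step reads one line of the file and fills var1, var2, ... until the line is used up; any remaining pieces stay blank.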
If by "import" you mean Proc Import or any of the wizards that call Proc Import, then your data is not appropriate for that tool. Proc Import is intended for structured data with a fairly regular layout.
A data step can do what you want, I think:
data read;
  infile "some text file name.txt" lrecl=320000 flowover;
  input (v1 - v10) ($32000.);
run;
The LRECL (logical record length) would otherwise default to something much too short. FLOWOVER on the INFILE statement should handle any line-break characters. I picked the LRECL and variable lengths so they complement each other: 10 variables of 32,000 characters exactly cover the 320,000-character record length.
The parentheses in the INPUT statement group the variable list so the informat is applied to each variable in turn. So this reads the first 32,000 characters into V1, the next 32,000 into V2, and so on.
Note: some forms of apparently "unstructured" text, such as report files, actually have a complex structure; these are also not suitable for Proc Import but can still be read in a meaningful way.
Use the $CHAR informat instead to preserve leading spaces at the boundary points.
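For instance, the earlier step could be amended like this (same assumed file name and lengths as above):

data read;
  infile "some text file name.txt" lrecl=320000 flowover;
  input (v1 - v10) ($char32000.);   /* $CHAR keeps leading blanks that the plain $w. informat would trim */
run;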
If it is unstructured, then how do you know where the "cells" start and end?
If the goal is to read lines of text into multiple shorter character variables then something like this should work.
Let's use 10 as the maximum length for one value so our example can be small.
First let's make a sample file.
options parmcards=txt;
filename txt temp;
parmcards;
one,two,three
1,2,3
1234567890,12345678901,123456789012
1234567890123456789012345678901234567890
;
Now let's read it in. We can use the LRECL= option to tell SAS that the lines are longer than the default 32K bytes. Note there is a limit to how large LRECL can be, but it is probably more than 1 million bytes on most installations of SAS.
We can use the LENGTH= and COLUMN= options to determine when we have reached the end of the line.
So let's just make one observation for each 10 bytes read. We can add some additional variables to indicate where the 10 bytes came from.
data tall ;
  infile txt truncover length=ll column=cc lrecl=1000000 ;
  length row col 8 value $10;
  row+1;                          /* count the input lines */
  do col=1 by 1 until(cc>ll);     /* CC is the column pointer, LL the line length */
    input value $char10. @ ;      /* read the next 10 bytes and hold the line */
    output;                       /* one observation per 10-byte piece */
  end;
run;
Result:
Obs    row    col    value

  1     1      1     one,two,th
  2     1      2     ree
  3     2      1     1,2,3
  4     3      1     1234567890
  5     3      2     ,123456789
  6     3      3     01,1234567
  7     3      4     89012
  8     4      1     1234567890
  9     4      2     1234567890
 10     4      3     1234567890
 11     4      4     1234567890
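If you then want one observation per input line with the pieces laid out in successive variables, as in the original question, one option is to transpose the tall data set. This is just a sketch; the output name and the VAR prefix are illustrative choices:

proc transpose data=tall out=wide (drop=_name_) prefix=var;
  by row;       /* one output observation per input line */
  id col;       /* piece number becomes the suffix: var1, var2, ... */
  var value;
run;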