I have several large CSV files (~1,000,000 rows, ~30 columns) that I need to sort by an id variable. I want to be very cautious about not modifying the data at all: just sorting the lines of text and writing them back out exactly as they were. The way I am currently doing it is extremely slow (sometimes more than half an hour to sort one file). Does anyone have ideas on a faster way to do this? Perhaps a way to find the length of the longest line, so I can make the variable lengths shorter?
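On the longest-line idea, here is a minimal sketch of what I mean (the path 'big.csv' is just a placeholder for one of my files):

data _null_;
  /* Read each record verbatim and track the longest one seen */
  infile 'big.csv' lrecl=32767 end=done;
  input;
  retain maxlen 0;
  maxlen = max(maxlen, length(_infile_));
  if done then put 'Longest record: ' maxlen;
run;

Would sizing LRECL and the line variable to that value make a meaningful difference?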
Here is my current code with some simplified example data:
filename test temp;
filename test1 temp;

/* Write a small example CSV, including a quoted field that contains commas */
data _null_;
  file test;
  put 'var1,var2,id,var4,var5';
  array id{9} _temporary_ (9 5 6 7 4 8 3 1 2);
  do i = 1 to 9;
    line = cats('Test data,more data,ID-', id[i],
                ',fourth column,"user input, of variable length."');
    put line;
  end;
run;
/* Read each line verbatim; extract the id (third field, honoring quotes) as the sort key */
data read;
  infile test lrecl=13106;
  length id $36 newline $13106;      /* newline must be able to hold a full record */
  input;
  id = scan(_infile_, 3, ',', 'qm');
  newline = _infile_;
  if _n_ = 1 then call missing(id);  /* blank key keeps the header row first */
run;
proc sort data=read;
  by id;
run;
/* Write the original lines back out in sorted order
   (data _null_ avoids materializing another copy of the data) */
data _null_;
  set read;
  file test1;
  put newline;
run;
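One thing I have been wondering about, but have not benchmarked: would the TAGSORT option help here, since each observation carries one long character variable? My understanding is that it sorts only the BY values plus record tags in the sort work area, at the cost of extra passes over the data. A sketch against the READ dataset above:

/* Sketch only: TAGSORT trades sort work space for extra I/O,
   which may or may not be a win for records this wide */
proc sort data=read tagsort;
  by id;
run;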