mtnbikerjoshua
Obsidian | Level 7

I have several large CSV files (~1,000,000 rows, ~30 columns) that I need to sort by an id variable. I want to be very cautious about not modifying the data at all: just sorting the lines of text and putting them back exactly as they were. The way I am currently doing it is extremely slow (it sometimes takes more than half an hour to sort one file). Does anyone have ideas on a faster way to do this? Maybe a way to find the length of the longest line so I can make the variable length shorter?


Here is my current code with some simplified example data:


filename test temp;
filename test1 temp;

/* Build a small example CSV with the id (column 3) out of order */
data _null_;
  file test;
  put 'var1,var2,id,var4,var5';
  array id{9} _temporary_ (9 5 6 7 4 8 3 1 2);
  do i=1 to 9;
    line = cats('Test data,more data,ID-', id[i], ',fourth column,"user input, of variable length."');
    put line;
  end;
run;

/* Read each line whole, pulling out the id (3rd field) as the sort key */
data read;
  infile test delimiter=',' missover dsd lrecl=13106;
  length id $36 newline $5000;
  input;
  id = scan(_infile_, 3, ',', 'qm');  /* q modifier ignores commas inside quotes */
  newline = _infile_;                 /* keep the raw line exactly as read */
  if _n_ = 1 then call missing(id);   /* blank key sorts the header row first */
run;

proc sort data=read;
  by id;
run;

/* Write the sorted lines back out unchanged */
data write;
  set read;
  file test1;
  put newline;
run;
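
As for the longest-line idea, one way to measure it is a one-pass scan, as a minimal sketch (assuming no line exceeds a 32767-character buffer; MAXLEN is a scratch variable):

data _null_;
  infile test lrecl=32767 end=eof;
  input;
  retain maxlen 0;
  maxlen = max(maxlen, lengthn(_infile_));  /* length of the current record */
  if eof then put 'Longest line: ' maxlen;  /* reported in the log at end of file */
run;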
4 REPLIES
AhmedAl_Attar
Ammonite | Level 13

@mtnbikerjoshua 

For flat files, you may be better off using the OS sort command; it will be faster than reading the file in and writing it back out.

Both Linux and Windows have a sort command.

Just my 2 cents

ballardw
Super User

It is usually helpful to provide an example of the file being read.

I suggest showing us the LOG from running this code (the read, sort, and write) on one of your files, as your approach looks about right. If it is taking this long, the log will show where the time is going.

If you are reading data over a network you may just be having issues with bandwidth.

Note: you do not need to create the data set WRITE, which may account for part of the time. Use DATA _NULL_ instead of DATA WRITE; that way no secondary write to a data set happens alongside the write to the text file.
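
Applied to the posted code, that last step becomes:

data _null_;   /* no output data set is created, only the text file */
  set read;
  file test1;
  put newline;
run;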

Tom
Super User

A million lines is not very much. Assuming each line is short enough to fit into a single character variable, you can just do something like this.

Let's assume that the 4th variable has the ID value you want to sort by.

filename old 'original_file.csv';
filename new 'new_file.csv';

data lines;
  infile old dsd truncover firstobs=2;  /* FIRSTOBS=2 skips the header row */
  length id 8 line $32767;
  input (4*id) (??);  /* reads fields 1-4 into ID in turn, so ID keeps the 4th;
                         ?? suppresses invalid-data notes */
  line = _infile_;    /* the whole raw line, unmodified */
run;

proc sort data=lines;
  by id;
run;

data _null_;
  file new;
  set lines;
  if _n_=1 then do;
    infile old;       /* copy the header line from the original file */
    input;
    put _infile_;
  end;
  put line;
run;

Basically, the first data step skips the header row and then creates two variables for every line: one with the ID value and one with the whole line from the CSV file.

It then sorts.  You might try using the TAGSORT option to see if the sort can run faster.
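
For example:

proc sort data=lines tagsort;  /* keep only the BY values and observation numbers in the sort work files */
  by id;
run;

TAGSORT can reduce temporary disk usage when the observations are much wider than the key, though retrieving the sorted observations adds its own I/O.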

To write the new file, it first reads and writes the header line, then writes the sorted lines from the dataset.

SASKiwi
PROC Star

What's the reason for sorting the CSV file in the first place? If a different order is required for all users, it would be a lot easier to change the original CSV export process than to change the sort order afterwards.

If a different order is required only for a particular purpose, then just read the data in as is and sort it after importing.
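
A minimal sketch of that approach (assuming PROC IMPORT can guess the column types correctly; the file name and output data set are placeholders):

proc import datafile='original_file.csv' out=work.mydata
    dbms=csv replace;
  guessingrows=max;  /* scan every row before guessing column types */
run;

proc sort data=work.mydata;
  by id;
run;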

