How to sort a large csv file by an id variable

mtnbikerjoshua · Posted 04-21-2023 03:16 PM

I have several large csv files (~1,000,000 rows, ~30 columns) that I need to sort by an id variable. I want to be very cautious about not modifying the data at all, just sorting the lines of text and putting them back exactly how they were. The way I am currently doing it is extremely slow (it takes sometimes more than half an hour to sort one file). Does anyone have ideas on a faster way to do this? Maybe a way to find the length of the longest line so I can make the variable length shorter?

Here is my current code with some simplified example data:

filename test temp;
filename test1 temp;

data _null_;
  file test;
  put 'var1,var2,id,var4,var5';
  array id{9} _temporary_ (9 5 6 7 4 8 3 1 2);
  do i=1 to 9;
    line = cats('Test data,more data,ID-', id[i], ',fourth column,"user input, of variable length."');
    put line;
  end;
run;

data read;
  infile test delimiter = ',' MISSOVER DSD lrecl=13106;
  length id $36 newline $5000;
  input;
  id = scan(_infile_, 3, ',', 'qm');
  newline = _infile_;
  if _n_ = 1 then call missing(id);
run;
proc sort data=read;
  by id;
run;
data write;
  set read;
  file test1;
  put newline;
run;

AhmedAl_Attar · Posted 04-21-2023 03:28 PM

@mtnbikerjoshua

For flat files, you maybe better off using the OS sort command. they will be faster than reading in then writing out.

Linux & Windows both have sort command

Just my 2 cents

ballardw · Posted 04-21-2023 03:35 PM

It is usually helpful to provide an example of the file being read.

I suggest showing us the LOG from running this code, the read, sort and write on one of your files as your approach looks about right. So if this is taking time it will help to see where the time is used.

If you are reading data over a network you may just be having issues with bandwidth.

Note: you do not need to create the data set WRITE, which may be part of the time. Use DATA _NULL_ instead of DATA WRITE. That way no secondary writing to the data set happens along with the write to the text file.

Tom · Posted 04-21-2023 09:58 PM

A million lines is not very much. Assuming the file is short enough to fit into a single character variable you can just do something like this.

Let's assume that the 4th variable has the ID value you want to sort by.

filename old 'original_file.csv';
filename new 'new_file.csv';
data lines ;
   infile OLD dsd truncover firstobs=2 ;
   length id 8 line $32767 ;
   input (4*id) (??) ;
   line = _infile_;
run;
proc sort data=lines ;
  by id;
run;
data _null_;
  file new ;
  set lines ;
  if _n_=1 then do;
     infile old;
     input;
     put _infile_;
  end;
  put line ;
run;

Basically the first data step skips the header row and then creates two variables for every line. One with the ID value and one with the whole line from the CSV file.

It then sorts. You might try using the TAGSORT option to see if the sort can run faster.

To write the new file it will first read in and write the header line and then write the lines from the dataset.

SASKiwi · Posted 04-22-2023 07:26 PM

What's the reason for sorting the CSV file in the first place? If a different order is required for all users it would be a lot easier just to change the original CSV export process than to change the sort order afterwards.

If a different order is required for a particular purpose, then just read the data in as is then sort it after importing.

How to sort a large csv file by an id variable

Re: How to sort a large csv file by an id variable

Re: How to sort a large csv file by an id variable

Re: How to sort a large csv file by an id variable

Re: How to sort a large csv file by an id variable

How to sort a large csv file by an id variable

Re: How to sort a large csv file by an id variable

Re: How to sort a large csv file by an id variable

Re: How to sort a large csv file by an id variable

Re: How to sort a large csv file by an id variable

Click image to register for webinar

Classroom Training Available!