BookmarkSubscribeRSS Feed
mtnbikerjoshua
Obsidian | Level 7

I have several large csv files (~1,000,000 rows, ~30 columns) that I need to sort by an id variable. I want to be very cautious about not modifying the data at all, just sorting the lines of text and putting them back exactly how they were. The way I am currently doing it is extremely slow (it takes sometimes more than half an hour to sort one file). Does anyone have ideas on a faster way to do this? Maybe a way to find the length of the longest line so I can make the variable length shorter?

 

Here is my current code with some simplified example data:

 

filename test temp;
filename test1 temp;

data _null_;
  file test;
  put 'var1,var2,id,var4,var5';
  array id{9} _temporary_ (9 5 6 7 4 8 3 1 2);
  do i=1 to 9;
    line = cats('Test data,more data,ID-', id[i], ',fourth column,"user input, of variable length."');
    put line;
  end;
run;

data read;
  infile test delimiter = ',' MISSOVER DSD lrecl=13106;
  length id $36 newline $5000;
  input;
  id = scan(_infile_, 3, ',', 'qm');
  newline = _infile_;
  if _n_ = 1 then call missing(id);
run;
proc sort data=read;
  by id;
run;
data write;
  set read;
  file test1;
  put newline;
run;
4 REPLIES 4
AhmedAl_Attar
Rhodochrosite | Level 12

@mtnbikerjoshua 

For flat files, you maybe better off using the OS sort command. they will be faster than reading in then writing out.

Linux & Windows both have sort command

Just my 2 cents

ballardw
Super User

It is usually helpful to provide an example of the file being read.

 

I suggest showing us the LOG from running this code, the read, sort and write on one of your files as your approach looks about right. So if this is taking time it will help to see where the time is used.

If you are reading data over a network you may just be having issues with bandwidth.

 

Note: you do not need to create the data set WRITE, which may be part of the time. Use  DATA  _NULL_ instead of DATA WRITE. That way no secondary writing to the data set happens along with the write to the text file.

Tom
Super User Tom
Super User

A million lines is not very much.  Assuming the file is short enough to fit into a single character variable you can just do something like this.

 

Let's assume that the 4th variable has the ID value you want to sort by.

filename old 'original_file.csv';
filename new 'new_file.csv';
data lines ;
   infile OLD dsd truncover firstobs=2 ;
   length id 8 line $32767 ;
   input (4*id) (??) ;
   line = _infile_;
run;
proc sort data=lines ;
  by id;
run;
data _null_;
  file new ;
  set lines ;
  if _n_=1 then do;
     infile old;
     input;
     put _infile_;
  end;
  put line ;
run;

Basically the first data step skips the header row and then creates two variables for every line. One with the ID value and one with the whole line from the CSV file.

It then sorts.  You might try using the TAGSORT option to see if the sort can run faster.

To write the new file it will first read in and write the header line and then write the lines from the dataset.

SASKiwi
PROC Star

What's the reason for sorting the CSV file in the first place? If a different order is required for all users it would be a lot easier just to change the original CSV export process than to change the sort order afterwards.

 

If a different order is required for a particular purpose, then just read the data in as is then sort it after importing.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 740 views
  • 3 likes
  • 5 in conversation