SAS sorting order is not preserved in Redshift tables

Mohan_Rang · Posted 02-11-2021 05:52 PM

When I load a SAS dataset sorted by a key(contains duplicate keys) using PROC SORT into Redshift, the same sorting order is not preserved in Redshift. For eg. When I compare the SAS dataset with Redshift table, they are not in the same sorting order, the data is really jumbled within the duplicates keys and very hard to find a pattern of how data stores in Redshift columns or why that's happening. Please find attached the sample example for reference.

Any suggestions or guidance on maintaining the same sorting after data load into Redshift from SAS would be very helpful. Thanks!

ballardw · Posted 02-11-2021 06:13 PM

Please share exactly how you moved the data to Redshift.

Were you creating a new file/table/ or what ever Redshift might call it? Or appending/inserting to an existing bit?

Putting "results" into Excel adds yet another place to corrupt data. Plus many of us won't open XLSX files from unknown sources because of organization policies or concerns over viruses and such.

Mohan_Rang · Posted 02-11-2021 06:28 PM

PROC APPEND BASE = Target_Table (bulkload=yes bl_compress=yes bl_bucket='xxx' bl_default_dir = 'xxx' bl_use_escape =YES) 
DATA = Source_Table force; 
RUN;

This is the code used to load the data from SAS to Redshift

ballardw · Posted 02-11-2021 07:55 PM

You did not "load" data, it appears that you appended it to an existing table.

So any data already in the table had its existing sort order and you can only expect the records added at the end of the table to maybe retain the order of the SAS data set.

I would suggest if you require a specific order to either use Redshift tools, what ever they may be, to resort the data after you append or to create a new table.

This is basically the equivalent of pasting lines from one document at the end of another.

Mohan_Rang · Posted 02-11-2021 11:11 PM

Thank you for the response. I would like to clarify. First, I delete all the data from the Redshift Target table and load the new SAS dataset into empty Redshift data structure. Actually the row count of both SAS table and Redshift data are the same but the sort order within the duplicate keys are random, not exactly how I would see in the SAS datatset before the data load into Redshift. It looks like SAS follows a certain sort sequence and Redshift follows another or maybe I'm wrong. If I can understand how the sort sequence works in Redshift, that would be really helpful in manipulating the sort order in Redshift to preserve it exactly like SAS.

Tom · Posted 02-11-2021 08:22 PM

Does Redshift actually have a concept of order? Most database systems do not return results in the order the observations were loaded. They return them in the order they are fetched from the storage which is generally totally independent of the order they were loaded. This is especially true of systems like Redshift that support multiple processors. If you want the observations returned in a specific order then use the ORDER BY clause in your SQL query. If you want to preserve the order they observations had in the SAS dataset then generated a new variable that you can use in the ORDER BY clause.

Mohan_Rang · Posted 02-11-2021 11:39 PM

Thanks for your response. I'm not sure if they have a concept of order but from my understanding Redshift uses sort keys to determine the order in which rows in a table are stored. I have no problems with Redshift returning the results from storage as well. They do return them in SAS in the same order they were loaded. The problem occurs during the data load from SAS to Redshift. Please see the sample screenshot below. It randomly stores some of the records in random order within the keys. I would like to mimic the same SAS sort order within the duplicate keys in Redshift too.

Kurt_Bremser · Posted 02-12-2021 05:54 AM

If you want to see a specific order from a database system, you have to force it in the query.

SAS does in fact do a query to the DBMS table when you open it in a viewtable window, and it will not specify an order during this. I suggest you define a view in Redshift with an ORDER BY clause, and use that to view your table.

As others said, DB systems will store rows in any order they think is optimal, and return them in that order. The only order they will assume automatically would be that of defined primary keys.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SAS sorting order is not preserved in Redshift tables

Re: SAS sorting order is not preserved in Redshift tables

Re: SAS sorting order is not preserved in Redshift tables

Re: SAS sorting order is not preserved in Redshift tables

Re: SAS sorting order is not preserved in Redshift tables

Re: SAS sorting order is not preserved in Redshift tables

Re: SAS sorting order is not preserved in Redshift tables

Re: SAS sorting order is not preserved in Redshift tables

Ready to join fellow brilliant minds for the SAS Hackathon?