<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to sort and restrict a big dataset in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969822#M376953</link>
    <description>&lt;P&gt;The PROC SUMMARY offers a neat compact single-pass solution.&amp;nbsp; It will certainly use a lot less disk input/output resources than the PROC SORT solution, and will probably be a lot faster - assuming there is no memory constraint.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;BUT ... does your data have the possibility of tied maximum time_flag values for a given ID/VAR1/VAR2?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If not, then ignore the rest of this comment.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But if it does, then the PROC SUMMARY might not likely give the same result as the PROC SORT ... if LAST.VAR2 solution.&amp;nbsp; It will choose different records (with possibly different VAR3/VAR4 values) among the tied records.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This is because the default behavior of PROC SORT is to preserve the original order (from the unsorted dataset) of tied records.&amp;nbsp; So that solution would always choose the latest of the tied records.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;A quick test of PROC SUMMARY with ties suggests it would always choose the first of such ties.&amp;nbsp; At least it did so in the test below:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data ties;
  set sashelp.class (keep=name sex age weight);
  order='First'; output;
  order='Last' ; output;
run;
proc summary nway data=ties;
  class sex age;
  output out=summ_want (drop=_:) idgroup (max(weight) out (name weight order)=);
run;
proc sort data=ties;
  by sex age weight;
run;
data sort_want;
  set ties;
  by sex age;
  if last.age;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Dataset summ_want has order='First' in every output record, but the PROC SORT approach always has ORDER='Last'.&lt;/P&gt;</description>
    <pubDate>Fri, 27 Jun 2025 00:14:07 GMT</pubDate>
    <dc:creator>mkeintz</dc:creator>
    <dc:date>2025-06-27T00:14:07Z</dc:date>
    <item>
      <title>How to sort and restrict a big dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969691#M376921</link>
      <description>Some advice on the code below please. Is this the right code for a dataset with millions of records where I want to restrict to the most recent record (maximum time_flag) and retain all the columns associated with it for each group ordered and defined by 3 variable? I fear using proc sort would take ages.&lt;BR /&gt; &lt;BR /&gt;* Take only the most recent record for each ID / var1 / var2 group;&lt;BR /&gt;PROC SQL;&lt;BR /&gt;create table spine as&lt;BR /&gt;select distinct&lt;BR /&gt;ID,&lt;BR /&gt;Var1,&lt;BR /&gt;Var2,&lt;BR /&gt;Var3,&lt;BR /&gt;Var4,&lt;BR /&gt;time_flag,&lt;BR /&gt;from mydataset as big&lt;BR /&gt;group by ID, var1, var2&lt;BR /&gt;having time_flag = max(time_flag)&lt;BR /&gt;order by ID, var1, var2;&lt;BR /&gt;quit;</description>
      <pubDate>Wed, 25 Jun 2025 17:24:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969691#M376921</guid>
      <dc:creator>Callam1</dc:creator>
      <dc:date>2025-06-25T17:24:39Z</dc:date>
    </item>
    <item>
      <title>Re: How to sort and restrict a big dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969728#M376925</link>
      <description>&lt;P&gt;Try this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort
  data=mydataset (
    keep=id var1 var2 var3 var4 time_flag
  )
  out=sorted
;
by id var1 var2 time_flag;
run;

data spine;
set sorted;
by id var1 var2;
if last.var2;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;SQL will also sort, and it is usually less efficient than PROC SORT.&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jun 2025 20:00:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969728#M376925</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2025-06-25T20:00:42Z</dc:date>
    </item>
    <item>
      <title>Re: How to sort and restrict a big dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969736#M376927</link>
      <description>&lt;P&gt;Did you try PROC SUMMARY?&lt;/P&gt;
&lt;P&gt;Use your primary keys as the CLASS (or if already sorted BY) variables.&lt;/P&gt;
&lt;P&gt;Use the IDGROUP and the MAX() type to select the observation with the maximum value of your time/ordering variable. And list the other variables you want copied from that maximum observation.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc summary nway data=mydataset ;
  class id var1 var2 ;
  output out=spine idgroup (max(time_flag) out (var3 var4 time_flag)=);
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;You might want to use the DROP= dataset option on the output dataset reference to remove the _TYPE_ _FREQ_ variables that PROC SUMMARY adds.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;  output out=spine(drop=_type_ _freq_) idgroup (max(time_flag) out (var3 var4 time_flag)=);&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Example:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
  input id var1 var2 flag ;
cards;
1 1 1 1
1 1 2 2
1 1 3 3
1 2 1 1
1 3 1 1
;

proc print;
run;

proc summary data=have nway;
 class id var1 ;
 output out=want idgroup (max(var2) out (var2 flag)=) ;
run;

proc print;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Tom_0-1750884457359.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/108023i8BDC251C36C9A6F2/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Tom_0-1750884457359.png" alt="Tom_0-1750884457359.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jun 2025 20:47:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969736#M376927</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2025-06-25T20:47:46Z</dc:date>
    </item>
    <item>
      <title>Re: How to sort and restrict a big dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969739#M376928</link>
      <description>Thank you that is very simple. I didn’t use it as I thought it would have been much slower than the sql code. But you are suggesting the opposite. So a simple sort with sas is actually more efficient when using a big dataset?</description>
      <pubDate>Wed, 25 Jun 2025 20:52:36 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969739#M376928</guid>
      <dc:creator>Callam1</dc:creator>
      <dc:date>2025-06-25T20:52:36Z</dc:date>
    </item>
    <item>
      <title>Re: How to sort and restrict a big dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969741#M376930</link>
      <description>Thank you. Is the proc summary quicker than proc sort?</description>
      <pubDate>Wed, 25 Jun 2025 21:05:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969741#M376930</guid>
      <dc:creator>Callam1</dc:creator>
      <dc:date>2025-06-25T21:05:37Z</dc:date>
    </item>
    <item>
      <title>Re: How to sort and restrict a big dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969744#M376932</link>
      <description>&lt;P&gt;What works faster will depend on your data and you can only prove which is faster by testing each method. Take a subset of your data and try it for yourself.&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jun 2025 21:26:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969744#M376932</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2025-06-25T21:26:14Z</dc:date>
    </item>
    <item>
      <title>Re: How to sort and restrict a big dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969753#M376937</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/460764"&gt;@Callam1&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;Thank you. Is the proc summary quicker than proc sort?&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;A likely issue with Proc Summary is getting the values for "other columns" as you said. Every variable would have to be referenced somewhere in the Proc Summary and might be a headache.&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jun 2025 23:47:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969753#M376937</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2025-06-25T23:47:19Z</dc:date>
    </item>
    <item>
      <title>Re: How to sort and restrict a big dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969760#M376943</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13884"&gt;@ballardw&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/460764"&gt;@Callam1&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;Thank you. Is the proc summary quicker than proc sort?&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;A likely issue with Proc Summary is getting the values for "other columns" as you said. Every variable would have to be referenced somewhere in the Proc Summary and might be a headache.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;You can use _ALL_ if you don't mind the warning messages.&lt;/P&gt;
&lt;PRE&gt;751
752  proc summary data=have nway;
753   class id var1 ;
754   output out=want idgroup (max(var2) out (_all_)=) ;
755  run;

WARNING: Variable id already exists on file WORK.WANT.
WARNING: Variable var1 already exists on file WORK.WANT.
WARNING: The duplicate variables will not be included in the output data set of the output statement number 1.
&lt;/PRE&gt;</description>
      <pubDate>Thu, 26 Jun 2025 02:33:08 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969760#M376943</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2025-06-26T02:33:08Z</dc:date>
    </item>
    <item>
      <title>Re: How to sort and restrict a big dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969803#M376949</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/460764"&gt;@Callam1&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;So a simple sort with sas is actually more efficient when using a big dataset?&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;See Maxim 4. Try It.&lt;/P&gt;</description>
      <pubDate>Thu, 26 Jun 2025 15:20:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969803#M376949</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2025-06-26T15:20:02Z</dc:date>
    </item>
    <item>
      <title>Re: How to sort and restrict a big dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969822#M376953</link>
      <description>&lt;P&gt;The PROC SUMMARY offers a neat compact single-pass solution.&amp;nbsp; It will certainly use a lot less disk input/output resources than the PROC SORT solution, and will probably be a lot faster - assuming there is no memory constraint.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;BUT ... does your data have the possibility of tied maximum time_flag values for a given ID/VAR1/VAR2?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If not, then ignore the rest of this comment.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But if it does, then the PROC SUMMARY might not likely give the same result as the PROC SORT ... if LAST.VAR2 solution.&amp;nbsp; It will choose different records (with possibly different VAR3/VAR4 values) among the tied records.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This is because the default behavior of PROC SORT is to preserve the original order (from the unsorted dataset) of tied records.&amp;nbsp; So that solution would always choose the latest of the tied records.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;A quick test of PROC SUMMARY with ties suggests it would always choose the first of such ties.&amp;nbsp; At least it did so in the test below:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data ties;
  set sashelp.class (keep=name sex age weight);
  order='First'; output;
  order='Last' ; output;
run;
proc summary nway data=ties;
  class sex age;
  output out=summ_want (drop=_:) idgroup (max(weight) out (name weight order)=);
run;
proc sort data=ties;
  by sex age weight;
run;
data sort_want;
  set ties;
  by sex age;
  if last.age;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Dataset summ_want has order='First' in every output record, but the PROC SORT approach always has ORDER='Last'.&lt;/P&gt;</description>
      <pubDate>Fri, 27 Jun 2025 00:14:07 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-sort-and-restrict-a-big-dataset/m-p/969822#M376953</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2025-06-27T00:14:07Z</dc:date>
    </item>
  </channel>
</rss>

