<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Topic: Work with large datasets in Statistical Procedures</title>
    <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949041#M47444</link>
    <description>Can people advise on the most efficient/quickest way to check whether a dataset with millions of records (millions of individuals, where each individual has multiple records) contains a list of unique individuals as specified in a small dataset (30,000 unique IDs)? So I have a dataset called ‘small’ with one variable, ID, containing 30,000 unique records. Then a dataset called ‘big’ with the variable ID plus other variables, and millions of records (ID not unique). I want to find how many individuals from ‘small’ I can find in ‘big’. It is taking me days to sort by ID and merge. There must be a more efficient way. Thanks in advance</description>
    <pubDate>Fri, 25 Oct 2024 14:52:35 GMT</pubDate>
    <dc:creator>Callam1</dc:creator>
    <dc:date>2024-10-25T14:52:35Z</dc:date>
    <item>
      <title>Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949041#M47444</link>
      <description>Can people advise on the most efficient/quickest way to check whether a dataset with millions of records (millions of individuals, where each individual has multiple records) contains a list of unique individuals as specified in a small dataset (30,000 unique IDs)? So I have a dataset called ‘small’ with one variable, ID, containing 30,000 unique records. Then a dataset called ‘big’ with the variable ID plus other variables, and millions of records (ID not unique). I want to find how many individuals from ‘small’ I can find in ‘big’. It is taking me days to sort by ID and merge. There must be a more efficient way. Thanks in advance</description>
      <pubDate>Fri, 25 Oct 2024 14:52:35 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949041#M47444</guid>
      <dc:creator>Callam1</dc:creator>
      <dc:date>2024-10-25T14:52:35Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949044#M47445</link>
      <description>&lt;P&gt;Something like this should work:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;proc sql;
   create table work.found as
   select distinct a.id
   from work.small as a
        left join
        (select id from work.big) as b
        on a.id = b.id
   where not missing(b.id)
   ;
quit;&lt;/PRE&gt;
&lt;P&gt;However, if the dataset is very large, the processing will still take some time.&lt;/P&gt;
&lt;P&gt;The output dataset work.found will only have the ID values that matched, each appearing once (the DISTINCT does that; normally a left join would have the ID appear once for each match found).&lt;/P&gt;
&lt;P&gt;A left join would normally include all the values from the A dataset, but the WHERE clause instructs SAS to include only the ones with a match.&lt;/P&gt;
&lt;P&gt;The (select id from work.big) means only that variable is read, so less data is moved around and the query may run faster.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Oct 2024 15:14:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949044#M47445</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2024-10-25T15:14:12Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949045#M47446</link>
      <description>Thank you, I was considering that as the next thing to try, but I am still concerned about run time.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 25 Oct 2024 15:22:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949045#M47446</guid>
      <dc:creator>Callam1</dc:creator>
      <dc:date>2024-10-25T15:22:11Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949050#M47447</link>
      <description>&lt;P&gt;Use a hash:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data check;
set big (keep=id);
if _n_ = 1
then do;
  declare hash s (dataset:"small");
  s.definekey("id");
  s.definedone();
end;
if s.check() = 0
then do;
  rc = s.remove();
 output;
end;
keep id;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Because of the REMOVE, only one observation per unique ID makes it into the output, so the final dataset has the exact count of unique found IDs.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Edit: note that no explicit sorting is needed, dataset "big" can be read as is.&lt;/P&gt;
&lt;P&gt;While loading "small" into the hash, a search tree is built, effectively sorting it on the fly.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Oct 2024 17:38:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949050#M47447</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2024-10-25T17:38:38Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949081#M47451</link>
      <description>&lt;P&gt;You could also add a test to see if the hash is empty so you could stop reading the BIG dataset if every ID from the small dataset was already found.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;if s.check() = 0 then do;
  output;
  rc = s.remove();
  if 0=s.num_items then stop;
end;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sat, 26 Oct 2024 06:21:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949081#M47451</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2024-10-26T06:21:34Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949091#M47452</link>
      <description>&lt;P&gt;Good idea. Given 30,000 vs. millions, the early stop will almost surely outweigh the additional IF, performance-wise.&lt;/P&gt;</description>
      <pubDate>Sat, 26 Oct 2024 12:14:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949091#M47452</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2024-10-26T12:14:21Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949099#M47453</link>
      <description>Thank you. It looks efficient and elegant. Are you familiar with the approach where you read the small dataset as a format and then apply that format to the big dataset?</description>
      <pubDate>Sat, 26 Oct 2024 17:32:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949099#M47453</guid>
      <dc:creator>Callam1</dc:creator>
      <dc:date>2024-10-26T17:32:31Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949100#M47454</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/460764"&gt;@Callam1&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;Thank you. It looks efficient and elegant. Are you familiar with the approach where you read the small dataset as a format then you apply that format to the big dataset.&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;You can do that, but a format is slower than the hash, which becomes important when working with large datasets.&lt;/P&gt;
&lt;P&gt;We used to do it a lot before the introduction of the hash.&lt;/P&gt;</description>
      <pubDate>Sat, 26 Oct 2024 18:13:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949100#M47454</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2024-10-26T18:13:02Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949101#M47455</link>
      <description>&lt;P&gt;HASH should be faster. (And the code is not that hard once you get used to working with HASH.)&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The idea of using a format would be to generate a CNTLIN dataset to define the FORMAT (or perhaps an INFORMAT if ID is character).&amp;nbsp; Then use PROC FORMAT to create the format.&amp;nbsp; Then, in a data step that reads the BIG dataset, check each ID to see if the formatted value is the result you defined the format to return.&amp;nbsp; This would require reading all of the BIG dataset, and it would return all of the matching observations from BIG, not just one per ID.&lt;/P&gt;
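&lt;P&gt;A minimal sketch of that format approach might look like this (assuming a numeric ID; the format name FLAGFMT and the variable names in the CNTLIN dataset are illustrative, not from the original post):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* build a CNTLIN dataset: each ID from small maps to 'Y', everything else to 'N' */
data cntlin;
  set small (rename=(id=start)) end=last;
  retain fmtname 'flagfmt' type 'N' label 'Y';
  output;
  if last then do;
    hlo = 'O';       /* OTHER: any ID not listed in small */
    label = 'N';
    output;
  end;
run;

proc format cntlin=cntlin;
run;

/* keep only the records of big whose ID is flagged by the format */
data matched;
  set big;
  if put(id, flagfmt.) = 'Y';
run;&lt;/CODE&gt;&lt;/PRE&gt;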
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Another way to handle it is to create an index on the BIG dataset.&amp;nbsp; That might take some time, but it should be faster than sorting the dataset, and the index might be useful for other things.&amp;nbsp; Then you could run a data step that reads the SMALL dataset and checks whether each ID is found in the BIG dataset.&amp;nbsp; This should be fastest (once you have paid the up-front cost of creating the index).&lt;/P&gt;
&lt;P&gt;&lt;A href="https://www.lexjansen.com/wuss/2003/SASSolutions/c-an_animated_guide_speed_merges__iorc_.pdf" target="_blank"&gt;https://www.lexjansen.com/wuss/2003/SASSolutions/c-an_animated_guide_speed_merges__iorc_.pdf&lt;/A&gt;&lt;/P&gt;
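&lt;P&gt;Creating such an index could be done along these lines (a sketch; it assumes BIG lives in the WORK library):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* build a simple index on ID once; subsequent key= lookups reuse it */
proc datasets library=work nolist;
  modify big;
  index create id;
quit;&lt;/CODE&gt;&lt;/PRE&gt;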
&lt;P&gt;So something like:&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want;
  set small;
  set big key=id/unique;
  if _iorc_ then do;
     _error_=0;
     delete;
  end;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sat, 26 Oct 2024 18:37:00 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949101#M47455</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2024-10-26T18:37:00Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949103#M47456</link>
      <description>Brilliant! Thank you so much!&lt;BR /&gt;</description>
      <pubDate>Sat, 26 Oct 2024 19:36:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949103#M47456</guid>
      <dc:creator>Callam1</dc:creator>
      <dc:date>2024-10-26T19:36:11Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949107#M47457</link>
      <description>&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
create table want as
select distinct id
 from big
  where id in (select distinct id from small);
quit;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sun, 27 Oct 2024 00:56:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949107#M47457</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2024-10-27T00:56:54Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949116#M47458</link>
      <description>Once I establish how many IDs are found in the big dataset, I might decide to analyse the data, in which case I will need to retain from the big dataset all of the records for the IDs (from small) that were found. Is the hash method still valid, or should I use the indexing approach suggested by Tom below?</description>
      <pubDate>Sun, 27 Oct 2024 08:55:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949116#M47458</guid>
      <dc:creator>Callam1</dc:creator>
      <dc:date>2024-10-27T08:55:12Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949124#M47459</link>
      <description>Just omit the REMOVE method and modify the KEEP statement to include all variables needed.</description>
      <pubDate>Sun, 27 Oct 2024 09:49:05 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949124#M47459</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2024-10-27T09:49:05Z</dc:date>
    </item>
  </channel>
</rss>

