<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Accessing a huge SAS data file in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737137#M229768</link>
    <description>&lt;P&gt;You are doing a PROC SORT with a BY _ALL_.&amp;nbsp; &amp;nbsp;Given that you are also specifying NODUP, I don't think you care about data order as much as you merely want to remove duplicates.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is your real goal just to eliminate duplicate records?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If so, there are ways (a hash object applied to MD5 or SHA256 hashes computed from a concatenation of all your variables) that can be used to eliminate duplicates without the burden of a sort.&lt;/P&gt;</description>
    <pubDate>Tue, 27 Apr 2021 00:36:37 GMT</pubDate>
    <dc:creator>mkeintz</dc:creator>
    <dc:date>2021-04-27T00:36:37Z</dc:date>
    <item>
      <title>Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737091#M229752</link>
      <description>&lt;P&gt;I am working with a huge SAS data file (~ 50M observations).&amp;nbsp; When I run it, it says I don't have space. Please see below the log message I got. Could anyone help me to resolve this issue? Thanks.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="b0guna01_1-1619463925574.png" style="width: 680px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/58701i9EDF625C21C19FFF/image-dimensions/680x306?v=v2" width="680" height="306" role="button" title="b0guna01_1-1619463925574.png" alt="b0guna01_1-1619463925574.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 26 Apr 2021 19:35:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737091#M229752</guid>
      <dc:creator>b0guna01</dc:creator>
      <dc:date>2021-04-26T19:35:13Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737096#M229755</link>
      <description>You need 3x the space of a data set to sort it. &lt;BR /&gt;So if you have a 10GB data set do you have 30GB free to sort it? If not you'll need to find a different option - split the file into smaller portions or consider an INDEX instead. &lt;BR /&gt;Sorting by _all_ is also incredibly time intensive and kind of a weird thing to do on such a large data set. &lt;BR /&gt;I would have expected a more specified sort...</description>
      <pubDate>Mon, 26 Apr 2021 19:41:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737096#M229755</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2021-04-26T19:41:23Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737101#M229756</link>
      <description>&lt;P&gt;I am not sure if there may not be a space limit because of operations behind the scenes but&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;proc sql;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; create table want as&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; select distinct *&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; from have&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; ;&lt;/P&gt;
&lt;P&gt;quit;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;has a small chance of working.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Post log and code as text by copying text from the log or editor, opening a text box using the forum &amp;lt;/&amp;gt; icon and then pasting.&lt;/P&gt;
&lt;P&gt;It is extremely difficult to code from pictures and I for one am too lazy to retype code from a picture.&lt;/P&gt;
&lt;P&gt;Sometimes code is close to working, but if I have to retype a lot of it to make one small change, I'm likely not to. If text is provided, it is easy to edit or to highlight exactly what needs to change, which isn't really easy with pictures.&lt;/P&gt;</description>
      <pubDate>Mon, 26 Apr 2021 20:23:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737101#M229756</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2021-04-26T20:23:13Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737132#M229764</link>
      <description>&lt;P&gt;Looks like you are running SAS locally on your PC so you can easily free up space on your C drive, if there are are a lot of files you don't want to keep including old SAS WORK folders. If D is also a local drive then you could consider using that for SAS WORK also. Don't use remote drives for SAS WORK folders as it will totally kill your performance.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 26 Apr 2021 23:05:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737132#M229764</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2021-04-26T23:05:32Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737135#M229766</link>
      <description>&lt;P&gt;If you need to free-up space on your disk, there are some really nice open source tools which help you understand what takes up space and what you could delete. I like&amp;nbsp;&lt;A href="https://windirstat.net/" target="_self"&gt;WinDirStat&lt;/A&gt; a lot.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Apr 2021 00:31:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737135#M229766</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2021-04-27T00:31:18Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737137#M229768</link>
      <description>&lt;P&gt;You are doing a PROC SORT with a BY _ALL_.&amp;nbsp; &amp;nbsp;Given that you are also specifying NODUP, I don't think you care about data order as much as you merely want to remove duplicates.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is your real goal just to eliminate duplicate records?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If so, there are ways (a hash object applied to MD5 or SHA256 hashes computed from a concatenation of all your variables) that can be used to eliminate duplicates without the burden of a sort.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Apr 2021 00:36:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737137#M229768</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2021-04-27T00:36:37Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737174#M229782</link>
      <description>&lt;P&gt;You need space for the whole&amp;nbsp;&lt;EM&gt;uncompressed&lt;/EM&gt; dataset in your WORK. So you need to know more about your dataset:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;observation count&lt;/LI&gt;
&lt;LI&gt;observation size&lt;/LI&gt;
&lt;LI&gt;compressed: yes/no&lt;/LI&gt;
&lt;LI&gt;physical file size of the dataset&lt;/LI&gt;
&lt;LI&gt;if compressed, compression rate&lt;/LI&gt;
&lt;/UL&gt;
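&lt;P&gt;One quick way to estimate that compression rate (a sketch; the BIG libref and HAVE dataset name are assumptions):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Copy a 1M-obs subset into a compressed WORK dataset;
   BIG.HAVE stands in for the real library and dataset. */
data work.compcheck (compress=yes);
  set big.have (obs=1000000);
run;
/* The log then shows a NOTE reporting the percent decrease in size. */&lt;/CODE&gt;&lt;/PRE&gt;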
&lt;P&gt;The latter can be determined by copying a sufficient subset (say, 1 million obs) to a compressed dataset in WORK and looking at the log.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Apr 2021 06:07:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737174#M229782</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2021-04-27T06:07:21Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737196#M229797</link>
      <description>&lt;P&gt;On top of the other valid suggestions (free space, use &lt;FONT face="courier new,courier"&gt;select distinct &lt;/FONT&gt;if you don't care about order), two more suggestions:&lt;/P&gt;
&lt;P&gt;- Maybe you don't need this step at all; what comes next?&lt;/P&gt;
&lt;P&gt;- Copy the table in SPDE format and it will be sorted on the fly.&lt;/P&gt;
&lt;P&gt;&amp;nbsp; This is very efficient and might require less space than PROC SORT; I have never looked into the space requirements.&lt;/P&gt;
&lt;P&gt;Something like&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data TEST SPEEDY.TEST(compress=binary);   
  retain A1-A99 0;
  do I=1e5 to 1 by -1; output; output; end; 
run;

proc sort data=TEST out=TEST1 nodup; by _ALL_; run; * current process;

data TEST2;                                         * SPDE process;
  set SPEEDY.TEST;
  by _ALL_;
  if md5(catx('|',of _ALL_)) ne lag( md5(catx('|',of _ALL_)) );
run;

&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;CPU usage will be much higher though.&lt;/P&gt;
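&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;(The SPEEDY libref in the step above assumes an SPDE library has already been assigned; the path is a placeholder:)&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Assign an SPDE library; replace the path with a real local directory */
libname SPEEDY spde "D:\spde_data";&lt;/CODE&gt;&lt;/PRE&gt;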
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 27 Apr 2021 09:04:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737196#M229797</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2021-04-27T09:04:59Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737197#M229798</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/31461"&gt;@mkeintz&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13976"&gt;@SASKiwi&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Just as a side note when using functions like catx() and sha256(): their results are often limited to 32KB (at least under SAS 9.4), so be careful with &lt;EM&gt;of _all_&lt;/EM&gt;.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Apr 2021 09:25:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737197#M229798</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2021-04-27T09:25:20Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737249#M229824</link>
      <description>Your dataset is too big for PROC SORT , try TAGSORT option.&lt;BR /&gt;&lt;BR /&gt;proc sort data=TEST out=TEST1 nodup  tagsort sortsize=max ;&lt;BR /&gt;by _ALL_; &lt;BR /&gt;run;</description>
      <pubDate>Tue, 27 Apr 2021 12:51:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737249#M229824</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2021-04-27T12:51:42Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737428#M229889</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/18408"&gt;@Ksharp&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;TAGSORT&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;will not work with&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;by _ALL_;&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 27 Apr 2021 20:56:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737428#M229889</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2021-04-27T20:56:55Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737431#M229892</link>
      <description>&lt;P&gt;TAGSORT is meant to reduce the size of the utility file by putting only the key variable(s) and the observation pointer into it. If all variables need to go into it anyway, TAGSORT has no effect.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Apr 2021 21:00:41 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737431#M229892</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2021-04-27T21:00:41Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737551#M229955</link>
      <description>1) Try PROC SQL&lt;BR /&gt;proc sql;&lt;BR /&gt;create table want as&lt;BR /&gt;select distinct * from have;&lt;BR /&gt;quit;&lt;BR /&gt;&lt;BR /&gt;2)Try batch process:&lt;BR /&gt;&lt;A href="https://communities.sas.com/t5/SAS-Programming/Insufficient-space-in-file-WORK-SASTMP-000000024-n-UTILITY/m-p/737449#M229910" target="_blank"&gt;https://communities.sas.com/t5/SAS-Programming/Insufficient-space-in-file-WORK-SASTMP-000000024-n-UTILITY/m-p/737449#M229910&lt;/A&gt;</description>
      <pubDate>Wed, 28 Apr 2021 12:01:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737551#M229955</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2021-04-28T12:01:17Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737588#M229973</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/379835"&gt;@b0guna01&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You have gotten a few suggestions on this topic.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But ... it would improve the quality and efficiency of responses if you told us whether your goal is only the removal of duplicates.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Do you really need a dataset ordered by _ALL_?&lt;/P&gt;</description>
      <pubDate>Wed, 28 Apr 2021 14:26:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737588#M229973</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2021-04-28T14:26:45Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738155#M230201</link>
      <description>&lt;P&gt;My computer has enough space but still takes around 6 hours.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 30 Apr 2021 13:50:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738155#M230201</guid>
      <dc:creator>b0guna01</dc:creator>
      <dc:date>2021-04-30T13:50:01Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738156#M230202</link>
      <description>&lt;P&gt;Yes, I am accessing the computer through a VPN.&lt;/P&gt;</description>
      <pubDate>Fri, 30 Apr 2021 13:53:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738156#M230202</guid>
      <dc:creator>b0guna01</dc:creator>
      <dc:date>2021-04-30T13:53:29Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738157#M230203</link>
      <description>&lt;P&gt;we already cleaned, but we don't see a huge difference in terms of timing.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 30 Apr 2021 13:55:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738157#M230203</guid>
      <dc:creator>b0guna01</dc:creator>
      <dc:date>2021-04-30T13:55:29Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738158#M230204</link>
      <description>&lt;P&gt;I want to remove the duplicates before doing the data analysis. It has 124 variables.&lt;/P&gt;</description>
      <pubDate>Fri, 30 Apr 2021 13:56:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738158#M230204</guid>
      <dc:creator>b0guna01</dc:creator>
      <dc:date>2021-04-30T13:56:51Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738290#M230268</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/379835"&gt;@b0guna01&lt;/a&gt;&amp;nbsp; - How you access the computer is irrelevant to your problem. If you are sending your data across a network to or from remote storage then it is definitely relevant to your problem&lt;/P&gt;</description>
      <pubDate>Fri, 30 Apr 2021 23:55:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738290#M230268</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2021-04-30T23:55:21Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738291#M230269</link>
      <description>&lt;P&gt;OK, it's just de-duping.&amp;nbsp; Then you can replicate the NODUP + BY _ALL_ results by using the MD5 function and a hash object:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql noprint;
  /* Make a csv list of PUT(X,rb8) for all x's that are numeric variables */
  select cats('put(',name,',rb8.)') into :num_to_rb8 separated by ','
  from dictionary.columns 
  where libname='PROJECT' and memname='MEDICAID_V01_2010' and type='num';

  /* Get the total length of a single observation */
  select obslen into :concat_len
  from dictionary.tables
  where libname='PROJECT' and memname='MEDICAID_V01_2010';
quit;
%put &amp;amp;=num_to_rb8 ;
%put &amp;amp;=obslen ;

data want (drop=_:) ;
  set PROJECT.MEDICAID_V01_2010;

  /* Concatenate all the data into a single string ("message") named _CONCAT */
  length _concat $&amp;amp;concat_len ;
  _concat=cat(&amp;amp;num_to_rb8,of _character_);

  /* Make a "unique" signature for the message */
  length _md5 $16;
  _md5=md5(_concat);

  if _n_=1 then do;
    declare hash md5 (hashexp:10);
      md5.definekey('_md5');
      md5.definedata('_md5');
      md5.definedone();
  end;

  if md5.find()^=0 then do;
    output;  /*Output first obs for a given signature*/
    md5.add();
  end;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;I do not directly list numeric variables as arguments of the CAT (or CATX) functions, &lt;EM&gt;&lt;STRONG&gt;because different numeric values can generate matching _concat values&lt;/STRONG&gt;&lt;/EM&gt; (in turn generating matching md5 values), destroying the whole point of de-duping here.&amp;nbsp; Consider the two concatenations below, where X ^= Y but cat(x,x)=cat(x,y):&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
  x=0.1234567890123456;
  y=0.1234567890123457;
  if x=y then put "X Equals Y" ;
  else put "X Does NOT Equal Y";

  cat_x_x = cat(x,x);
  cat_x_y = cat(x,y);
  if cat_x_x=cat_x_y then put "CAT(X,X) DOES Equal CAT(X,Y)";
  else put "CAT(X,X) does NOT = CAT(X,Y)";
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;which generates the log&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;X Does NOT Equal Y
CAT(X,X) DOES Equal CAT(X,Y)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In other contexts (for instance when using the MD5 function), this is known as a "collision" - where distinct values of the original data generate equivalent results.&amp;nbsp; That's because the CAT family of functions convert the numeric values into text prior to concatenation, which does not always represent the value to the needed precision.&amp;nbsp; You can avoid that by keeping the original numeric "real binary" representation by using:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;  cat_x_x = cat(put(x,rb8.),put(x,rb8.));
  cat_x_y = cat(put(x,rb8.),put(y,rb8.));&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;If you rerun the modified program the log will say:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;X Does NOT Equal Y
CAT(X,X) does NOT = CAT(X,Y)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;That's why you see my PROC SQL code generating the macrovar &amp;amp;num_to_rb8.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The usual concern about a collision risk is in using the MD5 function, but that risk is very very .... very low.&amp;nbsp; It is intended to generate distinct values.&amp;nbsp; Citing page 339 of&amp;nbsp;&lt;A href="https://support.sas.com/content/dam/SAS/support/en/books/data-management-solutions-using-sas-hash-table-operations/69153_excerpt.pdf" target="_self"&gt;Data Management Solutions Using SAS Hash Table Operations&lt;/A&gt;&amp;nbsp; by Paul Dorfman and Don Henderson:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;In the worst case scenario, the approximate number of items that need to be hashed to get a 50 percent chance of an MD5 collision is about 2**64≃2E+19. It means that to encounter just 1 collision, the MD5 function has to be executed against 100 quintillion distinct arguments the equal number of times, i.e., approximately 1 trillion times per second for 100 years. The probability of such an event is so infinitesimally negligible that one truly has an enormously greater chance of living through a baseball season where every single pitch is a strike and no batter ever gets on base. (Amusingly, some people who will confidently say that can never, ever happen may believe that an MD5 collision can happen.)&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Now if the number of true duplicates in the original data set is low, one could identify the records having duplicate MD5 values, and then confirm they all arise from true duplicate observations.&amp;nbsp; I'm not including such code here.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 01 May 2021 03:19:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738291#M230269</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2021-05-01T03:19:57Z</dc:date>
    </item>
  </channel>
</rss>

