<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Run compged by splitting dataset and then merge the output by using do loop in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425534#M104830</link>
    <description>&lt;P&gt;Even though you propose to break the comparison into chunks of N_start&amp;nbsp;by 1,000, you will still end up with N_start * N_end comparisons to do, so there is no saving in overall time.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;COMPGED on such a Cartesian comparison is expensive.&amp;nbsp; So I would suggest:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; (0) make sure all names are entirely in upper (or lower) case.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; (1) replace&amp;nbsp;all instances of duplicate names in each data set with a single record containing the name and pointers/record id's of the original observations.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; (2) look for cases of exact equality between start and end - save the matches and remove the matched names from start and end&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; (3) can you standardize names?&amp;nbsp; E.g. convert trailing "JUNIOR" to "JR." and "JR" also to "JR.", etc.&amp;nbsp; Then rerun step (2).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; (4) avoid making comparisons in which COMPGED is certain to be over 50.&amp;nbsp; For instance, establish the number of letters in each name.&amp;nbsp; Then compare names of length X in start only to names with lengths of (say) X-4 through X+4 in end, names of length X+1 in start only to lengths X-3 through X+5 in end, and so on.&amp;nbsp; Or, if your names have multiple words, you might additionally filter on the number of words in the name.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;These are the sorts of processes I used to match records for mutual funds by name.&amp;nbsp; Those names were almost all&amp;nbsp;multiple words (vanguard explorer admiral shares), often abbreviated in random ways (vgrd exp adm shrs), and sometimes had reordered words (vgrd expl shrs adm).&amp;nbsp; This was all done to generate best matches that would subsequently be inspected manually.&amp;nbsp; False positives were to be avoided at the expense of missed matches.&lt;/P&gt;</description>
    <pubDate>Sat, 06 Jan 2018 21:06:52 GMT</pubDate>
    <dc:creator>mkeintz</dc:creator>
    <dc:date>2018-01-06T21:06:52Z</dc:date>
    <item>
      <title>Run compged by splitting dataset and then merge the output by using do loop</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425406#M104797</link>
      <description>&lt;P&gt;Hi Guys,&lt;/P&gt;&lt;P&gt;I'd like to ask for some help with reading data and saving results using a do-loop.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have two datasets, "start" and "end", and I want to join them wherever the two names are similar (using compged).&amp;nbsp;&lt;/P&gt;&lt;P&gt;Because the number of observations is large, it takes too long, so I want to do it by splitting the dataset.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql noprint ;
 create table want as
 select *
 from start inner join end
 on (compged(start.companyname,end.name,'i') le 50)
 order by companyname;
quit ;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Above is the code I want to run by splitting the 'end' sample into multiple subsamples.&lt;/P&gt;&lt;P&gt;Is there a way to:&lt;/P&gt;&lt;P&gt;- use the 'start' data as a whole&lt;/P&gt;&lt;P&gt;- split the 'end' data into chunks of, say, 1000 observations&lt;/P&gt;&lt;P&gt;- run the above code once per chunk&lt;/P&gt;&lt;P&gt;- so that I end up with tens of 'want' tables&lt;/P&gt;&lt;P&gt;- and then combine them into one final 'want'?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I would really appreciate it if anyone has some suggestions!&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 05 Jan 2018 21:54:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425406#M104797</guid>
      <dc:creator>Sangho</dc:creator>
      <dc:date>2018-01-05T21:54:34Z</dc:date>
    </item>
    <item>
      <title>Re: Run compged by splitting dataset and then merge the output by using do loop</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425410#M104798</link>
      <description>&lt;P&gt;I'm not sure it is really what you ultimately want to (or should) do, but you can split end as follows:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
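&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* First partition: observations 1 through 1000 of END */
DATA end_partition1;
    SET end (firstobs=1 obs=1000);
RUN;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp; Here FIRSTOBS= is the first observation to read and OBS= is the last one (not a count).&amp;nbsp; So for sets of 1,000 observations, use firstobs=1 obs=1000, then firstobs=1001 obs=2000, and so on.&lt;/P&gt;</description>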
      <pubDate>Fri, 05 Jan 2018 22:05:07 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425410#M104798</guid>
      <dc:creator>HB</dc:creator>
      <dc:date>2018-01-05T22:05:07Z</dc:date>
    </item>
    <item>
      <title>Re: Run compged by splitting dataset and then merge the output by using do loop</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425416#M104801</link>
      <description>&lt;P&gt;You aren't really changing the number of comparisons though are you? If not, it's not going to be any faster.&lt;/P&gt;</description>
      <pubDate>Fri, 05 Jan 2018 22:10:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425416#M104801</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2018-01-05T22:10:34Z</dc:date>
    </item>
    <item>
      <title>Re: Run compged by splitting dataset and then merge the output by using do loop</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425421#M104803</link>
      <description>&lt;P&gt;But it was much faster when I tried one subsample after splitting it manually.&lt;/P&gt;&lt;P&gt;Also, when I run it as a whole, I get the message below.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The execution of this query involves performing one or more Cartesian product joins that can not be optimized.&lt;/P&gt;</description>
      <pubDate>Fri, 05 Jan 2018 22:21:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425421#M104803</guid>
      <dc:creator>Sangho</dc:creator>
      <dc:date>2018-01-05T22:21:15Z</dc:date>
    </item>
    <item>
      <title>Re: Run compged by splitting dataset and then merge the output by using do loop</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425425#M104804</link>
      <description>It's possible that the current query time is too long for a timeout or some other issue, where 10 queries of 1 hour each are acceptable but 1 query of 10 hours isn't.&lt;BR /&gt;</description>
      <pubDate>Fri, 05 Jan 2018 22:29:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425425#M104804</guid>
      <dc:creator>snoopy369</dc:creator>
      <dc:date>2018-01-05T22:29:47Z</dc:date>
    </item>
    <item>
      <title>Re: Run compged by splitting dataset and then merge the output by using do loop</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425426#M104805</link>
      <description>If you're not getting the cartesian join message on a subsample, then perhaps SAS is choosing to use a hash table in the smaller version rather than the bigger?</description>
      <pubDate>Fri, 05 Jan 2018 22:31:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425426#M104805</guid>
      <dc:creator>snoopy369</dc:creator>
      <dc:date>2018-01-05T22:31:52Z</dc:date>
    </item>
    <item>
      <title>Re: Run compged by splitting dataset and then merge the output by using do loop</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425429#M104806</link>
      <description>&lt;P&gt;If you split into subsets, are you still doing the same number of comparisons?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If the values are in a single data set and you're subsetting the big data set, you're reducing the number of comparisons until you run it against the full data set.&amp;nbsp; The only other thing I can think of is that your subsets have an additional criterion that you're not accounting for in the cross join.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You should be getting the same messages regardless of subset or the full data set.&lt;/P&gt;</description>
      <pubDate>Fri, 05 Jan 2018 22:34:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425429#M104806</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2018-01-05T22:34:48Z</dc:date>
    </item>
    <item>
      <title>Re: Run compged by splitting dataset and then merge the output by using do loop</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425444#M104808</link>
      <description>&lt;P&gt;In case you want it, this macro will make 4 data files (chunk1, chunk5, chunk9, and chunk13) out of the test file.&amp;nbsp; Maybe you could adapt it.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data end;
   input Plan;
datalines;
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
;
run;


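%macro datasplitter(chunksize);
   %local i j;
   /* 16 is the hardcoded number of observations in END */
   %do i=1 %to 16 %by &amp;amp;chunksize;
      /* last observation number of this chunk */
      %let j = %eval(&amp;amp;i + &amp;amp;chunksize - 1);
      data chunk&amp;amp;i;
         set end (firstobs=&amp;amp;i obs=&amp;amp;j);
      run;
   %end;
%mend;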

%datasplitter(4);
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;A better programmer would know how to replace the hardcoded 16 with the total number of observations and make things more flexible.&amp;nbsp; I don't. &lt;/P&gt;</description>
      <pubDate>Fri, 05 Jan 2018 23:28:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425444#M104808</guid>
      <dc:creator>HB</dc:creator>
      <dc:date>2018-01-05T23:28:43Z</dc:date>
    </item>
    <item>
      <title>Re: Run compged by splitting dataset and then merge the output by using do loop</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425534#M104830</link>
      <description>&lt;P&gt;Even though you propose to break the comparison into chunks of N_start&amp;nbsp;by 1,000, you will still end up with N_start * N_end comparisons to do, so there is no saving in overall time.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;COMPGED on such a Cartesian comparison is expensive.&amp;nbsp; So I would suggest:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; (0) make sure all names are entirely in upper (or lower) case.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; (1) replace&amp;nbsp;all instances of duplicate names in each data set with a single record containing the name and pointers/record id's of the original observations.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; (2) look for cases of exact equality between start and end - save the matches and remove the matched names from start and end&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; (3) can you standardize names?&amp;nbsp; E.g. convert trailing "JUNIOR" to "JR." and "JR" also to "JR.", etc.&amp;nbsp; Then rerun step (2).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; (4) avoid making comparisons in which COMPGED is certain to be over 50.&amp;nbsp; For instance, establish the number of letters in each name.&amp;nbsp; Then compare names of length X in start only to names with lengths of (say) X-4 through X+4 in end, names of length X+1 in start only to lengths X-3 through X+5 in end, and so on.&amp;nbsp; Or, if your names have multiple words, you might additionally filter on the number of words in the name.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
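&lt;P&gt;A minimal sketch of steps (0) and (4), not a tested solution.&amp;nbsp; It assumes the variables companyname (in start) and name (in end) from the original query, and that start and end share no other variable names.&amp;nbsp; The names start_std, end_std, s_name, s_len, e_name and e_len are made up for illustration, and the band of 4 is only the "say" value from step (4).&amp;nbsp; Every pair of rows is still visited, but COMPGED is called only when the two lengths are close:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Step (0): upper-case the names and record their lengths */
data start_std;
   set start;
   s_name = upcase(strip(companyname));
   s_len  = length(s_name);
run;

data end_std;
   set end;
   e_name = upcase(strip(name));
   e_len  = length(e_name);
run;

/* Step (4): for each START row, read END by direct access and */
/* call COMPGED only inside the (assumed) length band of 4     */
data want;
   set start_std;
   do p = 1 to n_end;
      set end_std point=p nobs=n_end;
      if abs(s_len - e_len) le 4 then
         if compged(s_name, e_name) le 50 then output;
   end;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Because the names are already upper-cased, the 'i' modifier from the original query is not needed here.&amp;nbsp; The band width should be checked against the COMPGED costs you actually rely on before trusting that nothing under 50 is excluded.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;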
&lt;P&gt;These are the sorts of processes I used to match records for mutual funds by name.&amp;nbsp; Those names were almost all&amp;nbsp;multiple words (vanguard explorer admiral shares), often abbreviated in random ways (vgrd exp adm shrs), and sometimes had reordered words (vgrd expl shrs adm).&amp;nbsp; This was all done to generate best matches that would subsequently be inspected manually.&amp;nbsp; False positives were to be avoided at the expense of missed matches.&lt;/P&gt;</description>
      <pubDate>Sat, 06 Jan 2018 21:06:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425534#M104830</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2018-01-06T21:06:52Z</dc:date>
    </item>
    <item>
      <title>Re: Run compged by splitting dataset and then merge the output by using do loop</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425613#M104846</link>
      <description>&lt;P&gt;Thanks all of you guys for the suggestions!&lt;/P&gt;</description>
      <pubDate>Sun, 07 Jan 2018 17:36:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Run-compged-by-splitting-dataset-and-then-merge-the-output-by/m-p/425613#M104846</guid>
      <dc:creator>Sangho</dc:creator>
      <dc:date>2018-01-07T17:36:06Z</dc:date>
    </item>
  </channel>
</rss>

