<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Efficient data base search in SAS Procedures</title>
    <link>https://communities.sas.com/t5/SAS-Procedures/Efficient-data-base-search/m-p/69463#M19941</link>
    <description>Hi - &lt;BR /&gt;
&lt;BR /&gt;
I've been puzzling over this for some months. I have a work-around that gets the job done, but there has to be a more efficient and slicker way to do this than what I do now (proc sql maybe?). &lt;BR /&gt;
&lt;BR /&gt;
A genomics application. I have dataset A with a number of observations (10^2 - 10^4). Each observation has a chromosome (1-22) plus starting base and ending base. Call these A, A1, and A2). I have a second dataset, B, with a different set of features but is much larger (10^6-10^7), each with a chromosome and starting and ending base (B, B1, and B2). &lt;BR /&gt;
&lt;BR /&gt;
Taking the first observation in A, find all rows in B with the same chromosome and with any overlap. There may be more than one. So, a row in B matches if A=B and (A1 le B1 le A2 or A1 le B2 le A2 or B1 le A1 le B2 or B1 le A2 le B2). Repeat for all observations in A.&lt;BR /&gt;
&lt;BR /&gt;
Is there a fast and elegant way to do this? Many thanks in advance for any hints you might have - PFS</description>
    <pubDate>Thu, 18 Mar 2010 21:51:55 GMT</pubDate>
    <dc:creator>deleted_user</dc:creator>
    <dc:date>2010-03-18T21:51:55Z</dc:date>
    <item>
      <title>Efficient data base search</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Efficient-data-base-search/m-p/69463#M19941</link>
      <description>Hi - &lt;BR /&gt;
&lt;BR /&gt;
I've been puzzling over this for some months. I have a work-around that gets the job done, but there has to be a more efficient and slicker way to do this than what I do now (proc sql maybe?). &lt;BR /&gt;
&lt;BR /&gt;
A genomics application. I have dataset A with a number of observations (10^2 - 10^4). Each observation has a chromosome (1-22) plus starting base and ending base. Call these A, A1, and A2). I have a second dataset, B, with a different set of features but is much larger (10^6-10^7), each with a chromosome and starting and ending base (B, B1, and B2). &lt;BR /&gt;
&lt;BR /&gt;
Taking the first observation in A, find all rows in B with the same chromosome and with any overlap. There may be more than one. So, a row in B matches if A=B and (A1 le B1 le A2 or A1 le B2 le A2 or B1 le A1 le B2 or B1 le A2 le B2). Repeat for all observations in A.&lt;BR /&gt;
&lt;BR /&gt;
Is there a fast and elegant way to do this? Many thanks in advance for any hints you might have - PFS</description>
      <pubDate>Thu, 18 Mar 2010 21:51:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Efficient-data-base-search/m-p/69463#M19941</guid>
      <dc:creator>deleted_user</dc:creator>
      <dc:date>2010-03-18T21:51:55Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient data base search</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Efficient-data-base-search/m-p/69464#M19942</link>
      <description>(Sorry - email got abbreviated)&lt;BR /&gt;
&lt;BR /&gt;
So, a row in B matches if A=B and (A1 le B1 le A2 or A1 le B2 le A2 or B1 le A1 le B2 or B1 le A2 le B2). Repeat for all observations in A. &lt;BR /&gt;
&lt;BR /&gt;
Is there a fast and elegant way to do this? Many thanks in advance for any hints you might have - PFS</description>
      <pubDate>Thu, 18 Mar 2010 21:53:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Efficient-data-base-search/m-p/69464#M19942</guid>
      <dc:creator>deleted_user</dc:creator>
      <dc:date>2010-03-18T21:53:38Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient data base search</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Efficient-data-base-search/m-p/69465#M19943</link>
      <description>2 ways to do this, using either sql or a data step:&lt;BR /&gt;
&lt;BR /&gt;
data A;&lt;BR /&gt;
  do I=1 to 1e4;&lt;BR /&gt;
    A=int(ranuni(0)*22);&lt;BR /&gt;
    A1=int(ranuni(0)*100000);&lt;BR /&gt;
    A2=A1+10;&lt;BR /&gt;
    output;&lt;BR /&gt;
  end;&lt;BR /&gt;
run;&lt;BR /&gt;
&lt;BR /&gt;
data B(index=(B));&lt;BR /&gt;
  do I=1 to 1e6;&lt;BR /&gt;
    B=int(ranuni(0)*22);&lt;BR /&gt;
    B1=int(ranuni(0)*100000);&lt;BR /&gt;
    B2=B1+10;&lt;BR /&gt;
    output;&lt;BR /&gt;
  end;&lt;BR /&gt;
run;&lt;BR /&gt;
&lt;BR /&gt;
proc sql;            * join using sql;&lt;BR /&gt;
  create table C as &lt;BR /&gt;
  select * &lt;BR /&gt;
  from A&lt;BR /&gt;
    left join B&lt;BR /&gt;
    on A=B and (A1&amp;lt;=B1&amp;lt;=A2 or A1&amp;lt;=B2&amp;lt;=A2 or B1&amp;lt;=A1&amp;lt;=B2 or B1&amp;lt;=A2&amp;lt;=B2);&lt;BR /&gt;
quit;&lt;BR /&gt;
&lt;BR /&gt;
sql is a lot faster in this case (30s on my PC) using my data sample. &lt;BR /&gt;
Hash tables might be faster but the key is not unique and I don't have 9.2 to test.&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
    &lt;BR /&gt;
Message was edited by: Chris@NewZealand&lt;BR /&gt;
&lt;BR /&gt;
Removed data step</description>
      <pubDate>Fri, 19 Mar 2010 00:59:36 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Efficient-data-base-search/m-p/69465#M19943</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2010-03-19T00:59:36Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient data base search</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Efficient-data-base-search/m-p/69466#M19944</link>
      <description>Fantastic - thanks Chris</description>
      <pubDate>Fri, 19 Mar 2010 12:13:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Efficient-data-base-search/m-p/69466#M19944</guid>
      <dc:creator>deleted_user</dc:creator>
      <dc:date>2010-03-19T12:13:54Z</dc:date>
    </item>
  </channel>
</rss>

