<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Match case control large data base in SAS Procedures</title>
    <link>https://communities.sas.com/t5/SAS-Procedures/Match-case-control-large-data-base/m-p/77394#M22413</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;&amp;nbsp; Dear all,&amp;nbsp; &lt;BR /&gt;I need help with programming. I am a new sas user.&lt;/P&gt;&lt;P&gt; I am matching 2 big data base for a case control study, with 4 controls for each cases and the control is only can be use once.&lt;/P&gt;&lt;P&gt;&amp;nbsp; &lt;/P&gt;&lt;P&gt;My case table consist of 65,000 observations with 3 variables&amp;nbsp; ID, sex and age. My control table consist of 7,000,000 observation and 4 variables ID, sex,age_hi and age_lo.&lt;/P&gt;&lt;P&gt; I use the procedure below:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;Proc sql;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; &lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;Create table caco as&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;Select a. ID as case_ID, b. ID as control_ID, a. sex as case_sex, b. sex as control_sex, &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; &lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;a.age as case_age, b. age as control_lopnr&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt; &lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;from case_table a, control_table b&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; &lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;where (a.sex=b.sex and (a.age between b.age_lo and b.age_hi))&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;order by case_ID, control_ID;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt; &lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;quit;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; &lt;/P&gt;&lt;P&gt;This part took almost 13 hours, and I have to do this part several times since 1 run only give me less than 400 controls. I think the randomization process that took&lt;BR /&gt;long time, since the match variable is only sex and age. I used the same program for the same tables before, but with many more match criteria and it took me only max&lt;BR /&gt;less than 2 hours.&lt;/P&gt;&lt;P&gt;&amp;nbsp; &lt;/P&gt;&lt;P&gt;I guess I need to use hash table, but to be honest I don’t know how to do it.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt; Can somebody help me? Any help will be appreciated.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;/cantika&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Wed, 26 Sep 2012 09:48:42 GMT</pubDate>
    <dc:creator>cantika</dc:creator>
    <dc:date>2012-09-26T09:48:42Z</dc:date>
    <item>
      <title>Match case control large data base</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Match-case-control-large-data-base/m-p/77394#M22413</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;&amp;nbsp; Dear all,&amp;nbsp; &lt;BR /&gt;I need help with programming. I am a new sas user.&lt;/P&gt;&lt;P&gt; I am matching 2 big data base for a case control study, with 4 controls for each cases and the control is only can be use once.&lt;/P&gt;&lt;P&gt;&amp;nbsp; &lt;/P&gt;&lt;P&gt;My case table consist of 65,000 observations with 3 variables&amp;nbsp; ID, sex and age. My control table consist of 7,000,000 observation and 4 variables ID, sex,age_hi and age_lo.&lt;/P&gt;&lt;P&gt; I use the procedure below:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;Proc sql;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; &lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;Create table caco as&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;Select a. ID as case_ID, b. ID as control_ID, a. sex as case_sex, b. sex as control_sex, &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; &lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;a.age as case_age, b. age as control_lopnr&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt; &lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;from case_table a, control_table b&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; &lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;where (a.sex=b.sex and (a.age between b.age_lo and b.age_hi))&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;order by case_ID, control_ID;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt; &lt;SPAN lang="EN-US" style="mso-ansi-language: EN-US; mso-ascii-font-family: Calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: Calibri; mso-bidi-font-family: Calibri; mso-fareast-language: SV;"&gt;quit;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; &lt;/P&gt;&lt;P&gt;This part took almost 13 hours, and I have to do this part several times since 1 run only give me less than 400 controls. I think the randomization process that took&lt;BR /&gt;long time, since the match variable is only sex and age. I used the same program for the same tables before, but with many more match criteria and it took me only max&lt;BR /&gt;less than 2 hours.&lt;/P&gt;&lt;P&gt;&amp;nbsp; &lt;/P&gt;&lt;P&gt;I guess I need to use hash table, but to be honest I don’t know how to do it.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt; Can somebody help me? Any help will be appreciated.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;/cantika&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 26 Sep 2012 09:48:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Match-case-control-large-data-base/m-p/77394#M22413</guid>
      <dc:creator>cantika</dc:creator>
      <dc:date>2012-09-26T09:48:42Z</dc:date>
    </item>
    <item>
      <title>Re: Match case control large data base</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Match-case-control-large-data-base/m-p/77395#M22414</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;13 seems like an awful lot of time, even given the "weak" join criteria.&lt;/P&gt;&lt;P&gt;Where does your data reside? In separate databases?&lt;/P&gt;&lt;P&gt;Optimizing the SQL can be hard (given the join criteria).&lt;/P&gt;&lt;P&gt;Indexing control_table &lt;EM&gt;might &lt;/EM&gt;help, especially if you have the option to store the table in SPDE.&lt;/P&gt;&lt;P&gt;Hashing can occur in SQL in certain circumstances, setting the PROC SQL option BUFFERSIZE higher than 64K &lt;EM&gt;might&lt;/EM&gt; help.&lt;/P&gt;&lt;P&gt;I think even rewriting the query to do a sub-query (on case_table) will be faster than 13 hrs.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I think writing a data step hash step will probably be very efficient, and I'm there are lot of people out there that can give you hints about that.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 26 Sep 2012 15:01:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Match-case-control-large-data-base/m-p/77395#M22414</guid>
      <dc:creator>LinusH</dc:creator>
      <dc:date>2012-09-26T15:01:13Z</dc:date>
    </item>
    <item>
      <title>Re: Match case control large data base</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Match-case-control-large-data-base/m-p/77396#M22415</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;What kind of data management system are your two source tables stored in? If it's a DBMS, are they both in the same database?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I suggest that you cut your test down to only a hundred or so randomly-selected case_table records, as a testbed while you try to improve the speed. With 100 records:&lt;/P&gt;&lt;P&gt;How long does the SQL Select take?&lt;/P&gt;&lt;P&gt;How many result records are created (how many control_table records match each case?)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Also, I notice you've posted the same question under "SAS Procedures". This is going to cause confusion.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You should also follow "Choosing the Earliest and Latest Dates" under SAS Procedures. It's an identical problem, I'm sure the same solution will solve both.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Tom&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Message was edited by: Tom Kari&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 26 Sep 2012 17:25:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Match-case-control-large-data-base/m-p/77396#M22415</guid>
      <dc:creator>TomKari</dc:creator>
      <dc:date>2012-09-26T17:25:30Z</dc:date>
    </item>
    <item>
      <title>Re: Match case control large data base</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Match-case-control-large-data-base/m-p/77397#M22416</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi TomKari,&lt;/P&gt;&lt;P&gt;I actually used the same program for many more matching criterias for the same databases and it worked perfectly fine.And it only took max less than 2 hours. I found almost all controls. For 70,00 cases I found 69,535 case with 4 control each. And the running time is not so long. That is why I think because of the matching criterias is only 2, so that the probability for cases to get the controls is bigger, so in 1 run only around around 400 cases which get controls. In this case, 1 case can get over 1000 controls.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Both case and control data tables are ini SAS format.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 26 Sep 2012 17:46:56 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Match-case-control-large-data-base/m-p/77397#M22416</guid>
      <dc:creator>cantika</dc:creator>
      <dc:date>2012-09-26T17:46:56Z</dc:date>
    </item>
    <item>
      <title>Re: Match case control large data base</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Match-case-control-large-data-base/m-p/77398#M22417</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi LinusH,&lt;/P&gt;&lt;P&gt;My data is 2 separate databases. I case db and 1 control DB, and I already reduce the variables with only matching variables.&lt;/P&gt;&lt;P&gt;Thank you for your suggestion.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 26 Sep 2012 17:52:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Match-case-control-large-data-base/m-p/77398#M22417</guid>
      <dc:creator>cantika</dc:creator>
      <dc:date>2012-09-26T17:52:54Z</dc:date>
    </item>
    <item>
      <title>Re: Match case control large data base</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Match-case-control-large-data-base/m-p/77399#M22418</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Cross posted from SAS Procedures:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The performance problem is caused by the combinatorials.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You have 7,000,000 control records:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;- for the sake of simplifying the math, let's assume they split evenly among age, so you'll have 70,000 records for each year of age (1 to 100);&lt;BR /&gt;- two sexes divides it by two, so for every year of age you'll have 35,000 females, and 35,000 males;&lt;BR /&gt;- and you want the controls that have the age of case, the age of case plus one, and the age of case minus one;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;It looks to me like you should be expecting 35,000 times 3, or 105,000 records, per case; in your example of 65,000 variables, you should be expecting 6.8 billion result records total.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;In your other requests, you had more variables, so you had fewer candidate controls per case.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I think that this may explain your terribly slow performance. Now, the question is what do you want to do to reduce the amount of processing?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Tom&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;P.S. I suggest you close one of the threads, and have everybody move to the other.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 26 Sep 2012 20:56:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Match-case-control-large-data-base/m-p/77399#M22418</guid>
      <dc:creator>TomKari</dc:creator>
      <dc:date>2012-09-26T20:56:27Z</dc:date>
    </item>
  </channel>
</rss>

