<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic SAS analyze data set for best variables index candidates in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/SAS-analyze-data-set-for-best-variables-index-candidates/m-p/501155#M133593</link>
    <description>&lt;P&gt;I have a number of datasets, ranging from 1,000 to 300,000,000 rows and from 10 to 256 variables.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In a proceedings paper I read that:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1% - 15%: An index will definitely improve program performance&lt;/P&gt;&lt;P&gt;16% - 20%: An index will probably improve program performance&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;With that in mind, I need to write some code which:&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;Returns a list of variables whose number of distinct values amounts to 20% or less of the total row count of the table&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm not sure how to approach this, as it will put some stress on the server if I just use PROC FREQ.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any advice?&lt;/P&gt;</description>
    <pubDate>Wed, 03 Oct 2018 14:28:00 GMT</pubDate>
    <dc:creator>teelov</dc:creator>
    <dc:date>2018-10-03T14:28:00Z</dc:date>
    <item>
      <title>SAS analyze data set for best variables index candidates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-analyze-data-set-for-best-variables-index-candidates/m-p/501155#M133593</link>
      <description>&lt;P&gt;I have a number of datasets, ranging from 1,000 to 300,000,000 rows and from 10 to 256 variables.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In a proceedings paper I read that:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1% - 15%: An index will definitely improve program performance&lt;/P&gt;&lt;P&gt;16% - 20%: An index will probably improve program performance&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;With that in mind, I need to write some code which:&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;Returns a list of variables whose number of distinct values amounts to 20% or less of the total row count of the table&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm not sure how to approach this, as it will put some stress on the server if I just use PROC FREQ.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any advice?&lt;/P&gt;</description>
      <pubDate>Wed, 03 Oct 2018 14:28:00 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-analyze-data-set-for-best-variables-index-candidates/m-p/501155#M133593</guid>
      <dc:creator>teelov</dc:creator>
      <dc:date>2018-10-03T14:28:00Z</dc:date>
    </item>
    <item>
      <title>Re: SAS analyze data set for best variables index candidates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-analyze-data-set-for-best-variables-index-candidates/m-p/501171#M133597</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/23443"&gt;@teelov&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;I have a number of datasets, ranging from 1,000 to 300,000,000 rows and from 10 to 256 variables.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In a proceedings paper I read that:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1% - 15%: An index will definitely improve program performance&lt;/P&gt;
&lt;P&gt;16% - 20%: An index will probably improve program performance&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With that in mind, I need to write some code which:&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;Returns a list of variables whose number of distinct values amounts to 20% or less of the total row count of the table&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'm not sure how to approach this, as it will put some stress on the server if I just use PROC FREQ.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Any advice?&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;I would be tempted to start with PROC FREQ and the NLEVELS option, which reports the number of missing and non-missing levels for each variable. Candidates would be those that have no missing levels and a relatively large number of non-missing levels.&lt;/P&gt;
&lt;P&gt;Example:&lt;/P&gt;
&lt;PRE&gt;ods select nlevels;
Proc freq data=sashelp.class nlevels;
run;&lt;/PRE&gt;
&lt;P&gt;The ODS SELECT statement limits the displayed output to the NLevels table. You can use ODS OUTPUT to save that table into a data set if needed.&lt;/P&gt;
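&lt;P&gt;As a minimal sketch of that ODS OUTPUT route (not tested against the large tables in question): the steps below save the NLevels table and keep the variables whose distinct-value count is 20% or less of the row count, as the question asks. The data set names work.nlev and work.index_candidates, the variable names total_rows and pct_distinct, and the 0.20 cutoff are illustrative assumptions; the sketch also assumes the analyzed table is not empty and has no variables named TableVar or NLevels.&lt;/P&gt;
&lt;PRE&gt;/* Save the NLevels table; NOPRINT on TABLES suppresses the frequency tables. */
ods output nlevels=work.nlev;
proc freq data=sashelp.class nlevels;
  tables _all_ / noprint;
run;

/* Keep variables whose distinct-value count is at most 20% of the rows. */
data work.index_candidates(keep=TableVar NLevels pct_distinct);
  /* NOBS= is filled in at compile time; this SET never actually executes. */
  if 0 then set sashelp.class nobs=total_rows;
  set work.nlev;
  pct_distinct = NLevels / total_rows;
  if pct_distinct le 0.20;  /* assumed cutoff, taken from the question */
run;&lt;/PRE&gt;
&lt;P&gt;PROC FREQ still has to read the whole table once to count the levels, so the cost of this step grows with the size of the table.&lt;/P&gt;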
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I would look for variables that are more categorical in nature, such as NAME in the above, rather than measured, such as HEIGHT.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Oct 2018 14:56:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-analyze-data-set-for-best-variables-index-candidates/m-p/501171#M133597</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2018-10-03T14:56:55Z</dc:date>
    </item>
  </channel>
</rss>

