<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Checking dupes in earliest row for an ID in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Checking-dupes-in-earliest-row-for-an-ID/m-p/266176#M52425</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have data at the person-month level, that is, each row is one person in one month. There are dupes. The variables I have are ID, month/year, and two categorical vars.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;ID &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; mth_year &amp;nbsp; &amp;nbsp; categ_1 &amp;nbsp; &amp;nbsp; categ_2&lt;/P&gt;&lt;P&gt;1 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 200901 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;abc &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;xyz&lt;/P&gt;&lt;P&gt;1 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 200901 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;abc &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;pqr&lt;/P&gt;&lt;P&gt;1 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 200902 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;abc &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;pqr&lt;/P&gt;&lt;P&gt;2 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 201001 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;def &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; xyz&lt;/P&gt;&lt;P&gt;2 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 201001 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;ghi &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; xyz&lt;/P&gt;&lt;P&gt;2 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 201002 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;ghi &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; xyz&lt;/P&gt;&lt;P&gt;3 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 200907 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;abc &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;pqr&lt;/P&gt;&lt;P&gt;3 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 200908 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;abc &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;pqr&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What I want is to do a data check: I want to know how many people have duplicate rows (as in, same ID and mth_year--not same categ_1 and categ_2) in their earliest mth_year. Then within that group, I want to know how many of them have different values of 1. categ_1 and 2. categ_2 across their dupe rows.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any help is much appreciated.&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 25 Apr 2016 19:51:24 GMT</pubDate>
    <dc:creator>Walternate</dc:creator>
    <dc:date>2016-04-25T19:51:24Z</dc:date>
    <item>
      <title>Checking dupes in earliest row for an ID</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Checking-dupes-in-earliest-row-for-an-ID/m-p/266176#M52425</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have data at the person-month level, that is, each row is one person in one month. There are dupes. The variables I have are ID, month/year, and two categorical vars.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;ID &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; mth_year &amp;nbsp; &amp;nbsp; categ_1 &amp;nbsp; &amp;nbsp; categ_2&lt;/P&gt;&lt;P&gt;1 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 200901 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;abc &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;xyz&lt;/P&gt;&lt;P&gt;1 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 200901 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;abc &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;pqr&lt;/P&gt;&lt;P&gt;1 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 200902 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;abc &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;pqr&lt;/P&gt;&lt;P&gt;2 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 201001 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;def &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; xyz&lt;/P&gt;&lt;P&gt;2 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 201001 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;ghi &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; xyz&lt;/P&gt;&lt;P&gt;2 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 201002 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;ghi &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; xyz&lt;/P&gt;&lt;P&gt;3 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 200907 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;abc &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;pqr&lt;/P&gt;&lt;P&gt;3 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 200908 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;abc &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;pqr&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What I want is to do a data check: I want to know how many people have duplicate rows (as in, same ID and mth_year--not same categ_1 and categ_2) in their earliest mth_year. Then within that group, I want to know how many of them have different values of 1. categ_1 and 2. categ_2 across their dupe rows.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any help is much appreciated.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 25 Apr 2016 19:51:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Checking-dupes-in-earliest-row-for-an-ID/m-p/266176#M52425</guid>
      <dc:creator>Walternate</dc:creator>
      <dc:date>2016-04-25T19:51:24Z</dc:date>
    </item>
    <item>
      <title>Re: Checking dupes in earliest row for an ID</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Checking-dupes-in-earliest-row-for-an-ID/m-p/266224#M52439</link>
      <description>&lt;P&gt;The first question is easy to address.&amp;nbsp; Here's one way:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;proc sort data=have;&lt;/P&gt;
&lt;P&gt;by id mth_year;&lt;/P&gt;
&lt;P&gt;run;&lt;/P&gt;
&lt;P&gt;data want;&lt;/P&gt;
&lt;P&gt;set have;&lt;/P&gt;
&lt;P&gt;by id mth_year;&lt;/P&gt;
&lt;P&gt;if first.id and last.mth_year=0;&lt;/P&gt;
&lt;P&gt;run;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This gives you a data set with all the offenders within (one observation apiece).&amp;nbsp; If you only want to know the count, the log will tell you how many are in the data set.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Many posters could have gone this far.&amp;nbsp; I think the reason you haven't gotten responses has to do with the second question.&amp;nbsp; There are some ambiguities about what you are asking for.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Are different values limited to the first month only, or could they appear in later months?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Are different values of categ_1 to be counted without inspecting categ_2?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With a little more clarification, I think that SQL could compare the original HAVE data set with the WANT data set above ... easily enough for a SQL programmer (though not for me).&amp;nbsp; I could do it in a DATA step if I had to, but it's a bit clumsy.&amp;nbsp; For example:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;proc sort data=have;&lt;/P&gt;
&lt;P&gt;by id mth_year categ_1;&lt;/P&gt;
&lt;P&gt;run;&lt;/P&gt;
&lt;P&gt;data want_categ1;&lt;/P&gt;
&lt;P&gt;merge want (in=keepme) have;&lt;/P&gt;
&lt;P&gt;by id mth_year;&lt;/P&gt;
&lt;P&gt;if keepme;&lt;/P&gt;
&lt;P&gt;if first.mth_year then counter = 0;&lt;/P&gt;
&lt;P&gt;if first.categ_1 then counter + 1;&lt;/P&gt;
&lt;P&gt;if last.mth_year and counter &amp;gt; 1;&lt;/P&gt;
&lt;P&gt;run;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then similarly for categ_2.&lt;/P&gt;</description>
      <pubDate>Mon, 25 Apr 2016 21:30:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Checking-dupes-in-earliest-row-for-an-ID/m-p/266224#M52439</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2016-04-25T21:30:01Z</dc:date>
    </item>
    <item>
      <title>Re: Checking dupes in earliest row for an ID</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Checking-dupes-in-earliest-row-for-an-ID/m-p/266332#M52481</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/37814"&gt;@Walternate﻿&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This should answer your questions:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
create table earliest as
select *
from have
group by id
having mth_year=min(mth_year);

create table dupes as
select id, mth_year,
       count(*) as n,                    /* # obs. in a duplicate group */
      (count(distinct categ_1)&amp;gt;1) as d1, /* 0/1 flag "different CATEG_1 values" */
      (count(distinct categ_2)&amp;gt;1) as d2  /* 0/1 flag "different CATEG_2 values" */
from earliest
group by id, mth_year
having n&amp;gt;1;
quit;

proc freq data=dupes;
tables d1*d2;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Dataset EARLIEST contains the observations of the earliest MTH_YEAR of each person from your existing dataset (called HAVE). For your sample data it would comprise 5 observations.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In dataset DUPES duplicates are aggregated to person level with three aggregated measures:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;n = number of obs. in the duplicate group&lt;/LI&gt;
&lt;LI&gt;d1 = 1 if there are at least two different &lt;EM&gt;non-missing&lt;/EM&gt; values of CATEG_1 in the duplicate group, otherwise 0.&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;d2 = 1 if there are at least two different &lt;/SPAN&gt;&lt;EM&gt;non-missing&lt;/EM&gt;&lt;SPAN&gt; values of CATEG_2 in the duplicate group, otherwise 0.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;IDs with only one observation in dataset EARLIEST are not contained in DUPES. Hence, the number of observations in DUPES (=2 for your sample data) is the number of people who have "duplicate rows."&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The PROC FREQ step determines the number (and percentage) of people with different (non-missing!) CATEG_1 values, ditto for CATEG_2 and also for each of the four possible combinations of 0/1 flags d1 and d2.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Apr 2016 11:41:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Checking-dupes-in-earliest-row-for-an-ID/m-p/266332#M52481</guid>
      <dc:creator>FreelanceReinh</dc:creator>
      <dc:date>2016-04-26T11:41:53Z</dc:date>
    </item>
  </channel>
</rss>

