- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
If I have this dataset (psedo, it is really boig instead)
data data;
input DATE date9. A $ B;
format Date date9.;
datalines;
19NOV2008 100000011 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
24NOV2008 100000022 10
;
run;
You see that they have the same status (10=deafult let's say) and ChangedDate is differ.. If I want to remove all the "duplicates" they're not duplicates beacuse ChangeDate differs for every row. But I want to have the minimnum date of ChangeDate beacuse it sounds more logial, since that company was already in the status deafult from the begging of my dataset but how can I only keep on row of the company = '100000022 ' without nodup/nodupkey/distinct ?
Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Why the restriction about "without nodup nodupkey distinct"? Why throw out simple approaches that would work?
For anyone else reading along who doesn't have these restrictions:
proc sql;
create table want as select distinct date,a,b
from data;
quit;
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@cleokatt wrote:
Beacuse I don't know what I'm deleing with nodupkey, I think it is risky. I think more of aggregate or such..
That's what the SAS documentation is for! Not risky, many people use it.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Your sample data doesn't really match your description because the date IS the same for all rows. If the date is not actually the same in the real data, then you can achieve what you're describing by sorting on company, then date and taking the 'first.company`' row:
proc sort data=have; by company date; run;
data want;
set have;
by company;
IF first.company;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@cleokatt wrote:
Ooh! No I copy pasterd the wrong sample data, but the ChgangeDate should be in differents time stamp. Does your code do that?
That is confusing. DATE values do not have a TIME component. Is your CHANGEDATE variable a DATE value or a DATETIME value (sometimes called a TIMESTAMP value)? If it is a DATE value, like in your example, then you need a separate variable that has the time of day component.
Either way the suggested method will work.
If you have the date and time stored in a single variable, lets call it CHANGEDATETIME, then use:
proc sort data=have;
by company changedatetime;
run;
If you have the date and time stored in separate variables then you just need to use both variables in the initial PROC SORT step.
proc sort data=have;
by company changedate changetime;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
proc sort data=have; by company date; run;
data want;
set have;
by company;
IF first.company;
run;
this one??
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Alternatively, you can use aggregate functions (MIN, MAX etc..) in proc sql to keep the required records BY required groups. Let's say, like in your example, we need a unique record by A (id) and B (status) with the earliest date.
data have;
input DATE date9. A $ B;
format date date9.;
datalines;
19NOV2008 1000011 10
20NOV2008 1000011 10
21NOV2008 1000011 10
22NOV2008 1000011 10
23NOV2008 1000011 10
14NOV2008 1000022 10
15NOV2008 1000022 10
16NOV2008 1000022 10
17NOV2008 1000022 10
18NOV2008 1000022 10
21NOV2008 1000021 10
21NOV2008 1000021 10
23NOV2008 1000021 10
24NOV2008 1000021 10
27NOV2008 1000012 10
28NOV2008 1000012 10
29NOV2008 1000012 10
24NOV2008 1000012 10
20NOV2008 1000012 10
25NOV2008 1000012 10
;
run;
proc sql;
create table want as
select min(date) as DATE format=date9., a, b from have
group by a, b;
quit;
This code selects the earliest date by A and B.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If the data are already sorted by company (as per your sample data), but not by changedate within company, then you can avoid the costs of running a PROC SORT:
data want (drop=_:);
set have (in=firstpass)
have (in=secondpass);
by company;
retain _mindate . ;
if first.company then call missing(_mindate);
if firstpass then _mindate=min(changedate,_mindate);
else if secondpass and changedate=_mindate then output;
run;
Of course, this assumes that each company has only a single instance of minimum changedate. If that's not the case, then you need to tell us whether to keep all such records, or how to choose among them.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------