Re: Sorting options

Aar684 · Posted 10-29-2010 11:15 AM

Hi all,

It has been a while since posting here, hopefully someone can help. I have a data set with something like one million records. Each record is has something like 40 variables. The variables are account numbers and things of that nature. Each record has a unique account number. However, while there might be a unique account number, there are records that have identical meter numbers, which is a problem for me. On top of that, there are meter sequence numbers that correspond to those meter numbers that are either the same or different.

Here is an example of the data set:

account# meter # metersequence#
9999999 1234567 105
8888888 1234567 104
2222222 4444444 103
1111111 4444444 108
3333333 0123456 109

For the project I am working on I need to keep in the data set only one entry of each meter# with the HIGHEST service account sequence#. For example I would want to keep the record for meter# 1234567 containing metersequence# 105.

I've tried something similar to:

proc sort data = b;
by meter # descending metersequence#;

How do I get from there to keeping the top record of each meter#?(assuming that the code does order the records properly) For example, after the file is sorted to show each of the highest metersequence#'s with their meter number, how do I go about keeping the highest metersequence# or just the top record for each ( which should be the highest metersequence#)

Sorry for the long post. I wanted to make sure people understood what I was saying.

One more question, if add the nodupkey option to the sort, which variable does it nodup on? One? Both?

sbb · Posted 10-29-2010 12:26 PM

Have a second SORT with NODUPKEY and EQUALS, but also you must only include the BY statement variables for which you want to consider unique (to have the SAS sort-operation) remove what are considered duplicates.

The other option is to use a DATA step with SET and a BY statement -- then also use IF FIRST. THEN OUTPUT.

And, there may also be a PROC SQL approach, possibly.

Scott Barry
SBBWorks, Inc.

Suggested Google advanced search arguments, this topic / post:

data step by group processing site:sas.com

proc sort nodupkey remove duplicates site:sas.com

Aar684 · Posted 10-29-2010 01:20 PM

Scott,

Thanks for the reply, I will see if I can figure it out with the information you provided. If not, I will post back if I get stuck.

Aar684 · Posted 10-29-2010 02:17 PM

Scott, question for you,

Does this code look ok to you ( Or anyone else who feels like answering me)?

It seems to do what I am asking, which is having only one meter# and the highest/top metersequence# kept while the other observations are tossed.

data doesthiswork dups;
set workingonit;
by groupformat meter# descending metersequence#;
if first.meter# = 1 and first.metersequence = 1 then output doesthiswork;
else output dups;
run;

ArtC · Posted 10-29-2010 02:36 PM

For your DATA step logic
[pre]if first.meter# = 1 and first.metersequence = 1 then output doesthiswork;[/pre]

when first.meter is true the first.metersequence necessarily must also be true.

You will have to have sorted first and the # sign cannot be used as a character in the variable name (numbers and letters only).

Testing only for FIRST.METERSEQUENCE is sufficient to give you a unique combination of BY variables. If you want only the highest metersequence number then test for FIRST.METERNUMBER. If the unique BY variable combinations are all that you are trying to extract then you can do this with the NODUPKEY options in the SORT step (assujming that you are using a SORT and that the data were not already sorted).

Try testing the options/code on the SASHELP.CLASS data set to get what you want.

[pre]
* untested code;
proc sort data=sashelp.class out=class nodupkey;
by sex descending age;
run;[/pre]

sbb · Posted 10-30-2010 11:20 AM

As I replied earlier, it requires two sorts, the first having the descending, then with the second sort having the NODUPKEY but only the primary de-dup key variable (not the descending numeric variable).

For desk-checking and self-debugging, add one or more statements to your program at various points (make each unique with a "nn" value for log identification).

PUTLOG ">diag-nn>" / _ALL_;

Lastly, when using BY GROUP processing, the FIRST. is either a "0" or a "1", so you need only code a simpler test, like:

IF FIRST.METER# THEN OUTPUT ;
ELSE OUTPUT ;

Scott Barry
SBBWorks, Inc.

Suggested Google advanced search argument, this topic / post:

proc sort remove duplicates nodupkey site:sas.com Message was edited by: sbb

Peter_C · Posted 11-01-2010 03:46 PM

> with their meter number, how do I go about keeping
> the highest metersequence# or just the top record for
> each ( which should be the highest metersequence#)

carry on with methods offered so far if the objective is to practise proc sort, etc.

but if you just want the highest metersequence for each meter number, then NO SORT is needed.
Use a hash table to collect the highest metersequence#
Input table for that data step can be in random order.
Needs memory for just one (highest) metersequence# for each meter, probably adding up to less than 20MB
A normally populated PC (memory) will support (up to 10 times) that.

good luck with proc sort if that is the preference

peterC

Daryl · Posted 11-03-2010 11:21 AM

Another possibility, this one is via SQL (as Scott suggested):

data test;
input meter seq;
datalines;
2 23
2 24
2 25
1 10
1 11
1 12
2 22
2 21
1 13
;
run;
proc sql;
create table uniques as
select distinct meter, max(seq) as seq from test group by meter;
quit; Message was edited by: Daryl

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away