It has been a while since posting here, hopefully someone can help. I have a data set with something like one million records. Each record is has something like 40 variables. The variables are account numbers and things of that nature. Each record has a unique account number. However, while there might be a unique account number, there are records that have identical meter numbers, which is a problem for me. On top of that, there are meter sequence numbers that correspond to those meter numbers that are either the same or different.
For the project I am working on I need to keep in the data set only one entry of each meter# with the HIGHEST service account sequence#. For example I would want to keep the record for meter# 1234567 containing metersequence# 105.
I've tried something similar to:
proc sort data = b;
by meter # descending metersequence#;
How do I get from there to keeping the top record of each meter#?(assuming that the code does order the records properly) For example, after the file is sorted to show each of the highest metersequence#'s with their meter number, how do I go about keeping the highest metersequence# or just the top record for each ( which should be the highest metersequence#)
Sorry for the long post. I wanted to make sure people understood what I was saying.
One more question, if add the nodupkey option to the sort, which variable does it nodup on? One? Both?
Have a second SORT with NODUPKEY and EQUALS, but also you must only include the BY statement variables for which you want to consider unique (to have the SAS sort-operation) remove what are considered duplicates.
The other option is to use a DATA step with SET and a BY statement -- then also use IF FIRST. THEN OUTPUT.
And, there may also be a PROC SQL approach, possibly.
Suggested Google advanced search arguments, this topic / post:
For your DATA step logic
[pre]if first.meter# = 1 and first.metersequence = 1 then output doesthiswork;[/pre]
when first.meter is true the first.metersequence necessarily must also be true.
You will have to have sorted first and the # sign cannot be used as a character in the variable name (numbers and letters only).
Testing only for FIRST.METERSEQUENCE is sufficient to give you a unique combination of BY variables. If you want only the highest metersequence number then test for FIRST.METERNUMBER. If the unique BY variable combinations are all that you are trying to extract then you can do this with the NODUPKEY options in the SORT step (assujming that you are using a SORT and that the data were not already sorted).
Try testing the options/code on the SASHELP.CLASS data set to get what you want.
* untested code;
proc sort data=sashelp.class out=class nodupkey;
by sex descending age;
As I replied earlier, it requires two sorts, the first having the descending, then with the second sort having the NODUPKEY but only the primary de-dup key variable (not the descending numeric variable).
For desk-checking and self-debugging, add one or more statements to your program at various points (make each unique with a "nn" value for log identification).
PUTLOG ">diag-nn>" / _ALL_;
Lastly, when using BY GROUP processing, the FIRST. is either a "0" or a "1", so you need only code a simpler test, like:
IF FIRST.METER# THEN OUTPUT ;
ELSE OUTPUT ;
Suggested Google advanced search argument, this topic / post:
> with their meter number, how do I go about keeping
> the highest metersequence# or just the top record for
> each ( which should be the highest metersequence#)
carry on with methods offered so far if the objective is to practise proc sort, etc.
but if you just want the highest metersequence for each meter number, then NO SORT is needed.
Use a hash table to collect the highest metersequence#
Input table for that data step can be in random order.
Needs memory for just one (highest) metersequence# for each meter, probably adding up to less than 20MB
A normally populated PC (memory) will support (up to 10 times) that.
good luck with proc sort if that is the preference