BookmarkSubscribeRSS Feed
deleted_user
Not applicable
I am working with a dataset containing customer invoice information. Some customers are represented mutiple times in the dataset on different dates/products. Every customer is identified with af unique numerical value. Now the problem is, I only need the latest customer information.

Lets say cutomer xxxx is represented 8 times in the dataset on different dates. I only need one instance of this particular customer, namely the latest.

I might add, that some times however, a customer will appear more than once on a given date. And in some cases, customers are even mistakenly registered more than once (duplicates).

So basically I am looking for some kind of logic that will remove 7 out of 8 rows containing a particular value (customer ID) preferable keeping the latest. Is that possible to do in EG?

Message was edited by: EGuser Message was edited by: EGuser
4 REPLIES 4
LinusH
Tourmaline | Level 20
This can be accomplished by using to sort steps. The first order you data by id and date (descending). The second will remove duplicate rows by id, which will keep the first row per id (with highest date value).

/Linus
Data never sleeps
Patrick
Opal | Level 21
That's an example how to do it in a code window.
HTH
Patrick

/* Create some test data */
data have;
format InvoiceDate ddmmyy10.;
do ID=1 to 400;
CustomerId=ceil(ranuni(1)*50);
InvoiceDate=ceil(ranuni(2)*50)+17000;
InvoiceNr=put(ID,z8.);
output;
end;
run;

/* Create a sorted view */
proc sql;
create view Vhave as
select *
from have
order by CustomerId,InvoiceDate;
quit;

/* Pick the obs with the highest (latest) InvoiceDate per customer */
/* It will be only one - even if the obs with the highest InvoiceDate is a duplicate */
data have;
set Vhave;
by CustomerId InvoiceDate;
if last.CustomerId;
run; Message was edited by: Patrick
deleted_user
Not applicable
Thanks! This was just what I needed. I guess this can't be done so easily in EG.
Patrick
Opal | Level 21
Hi EGuser
O.K: I took this "point and click" challenge. It wasn't too difficult in the end - but understanding the generated code when I got it wrong first helped a lot....

What I did:
1. I used the code example I've sent you already to generate the test data.
2. I applied "Filter and Query" (EG 4.1):
Base window, tab "Select Data":
- Check the box "Select distinct rows only"
- Click button "edit groups"
- Choose "CustomerId" as group by variable
Base window, tab "Filter Data":
- In the lower part of the window "Filter the summarized data" click the "advanced expression filter" and add the rule: "max(HAVE.InvoiceDate)=HAVE.InvoiceDate "

This "clicking" resulted in the following SQL code:
PROC SQL;
CREATE TABLE WORK.Query_for_HAVE AS SELECT DISTINCT HAVE.InvoiceDate FORMAT=DDMMYY10.,
HAVE.ID,
HAVE.CustomerId,
HAVE.InvoiceNr
FROM WORK.HAVE AS HAVE
GROUP BY HAVE.CustomerId
HAVING max(HAVE.InvoiceDate)=HAVE.InvoiceDate
ORDER BY HAVE.CustomerId;
QUIT;

This does the job! - I still don't know how I would have been able to "click" the right things without understanding the generated code - may be I have to accept that I'm a bit of an old fashioned coder.... 🙂

HTH
Patrick

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 1004 views
  • 0 likes
  • 3 in conversation