Solved: Re: In PROC SQL how to group and replace within a subgroup by the late...

k_k_ · Posted 11-07-2025 12:58 PM

Hi, I am new to SAS and hope for your help.

I am currently using a query which yields a table similar to this:

Attribute1 Attribute2 Attribute3 Attribute4 Count

A x k num1 1

A y k num4 1

A z m num4 1

B x k num2 2

B z k num9 2

B a m num4 2

B b m num7 2

C y k num2 2

...

The current query is approx. like this (I am not allowed to use original data, therefore the strange naming...):

proc sql;

create table .... as

select Attr1, Attr2, Attr3, Attr4, count(*) from Sourcetable

where (Attr4 between ... and ... or Attr 4 between ... and ...)

and Attr1 in (...)

group by Attr1, Attr2, Attr3, Attr4

order by Attr1, Attr2, Attr3, Attr4;

quit;

The Sourcetable also contains, amongst others, Attribute 5.

I would like to include Attribute 5, which is not necessarily a number, after Attribute 4 or after count(*). I do not want to group by Attribute 5 itself. In the grouping I want to see the value of Attribute 5 for which Attribute 4 (which is a number) is highest (and Attribute 1, Attribute 2, Attribute 3 are identical withing the grouping).

I tried to adapt the above query using row_number and partition by, but I learned that this is not possible in SAS:

I make up a specific example:

Let Sourcetable be as follows:

COLOR HOUSE TYPE DAY NAME NOMATTER1 NOMATTER2

2 3 garage 70 mickey blabla k

2 3 garage 71 minni ozk k

2 7 bungalow 15 nice -- l

2 7 bungalow 16 bad 1,1 l

2 8 bungalow 3 cold x l

5 2 garage 7 10 NULL r

5 80 bungalow 8 15 Hi k

5 80 garage 9 8 dog s

5 100 school 1 2 tree k

...

I then want to modify my above proc sql query in a way to obtain from it the following set of rows:

COLOR HOUSE TYPE DAY NAME COUNT

2 3 garage 71 minni 2

2 7 bungalow 16 bad 2

2 8 bungalow 3 cold 1

5 2 garage 7 10 1

5 80 bungalow 8 15 1

5 80 garage 9 8 1

...

Sorry for being complicated or even unclear, only just beginning.

Thanks a lot for any help.

Quentin · Posted 11-07-2025 01:49 PM

First, since you are new, some advice on your question.

It's great that you included sample data. When you do that, it's better to share the data as code, so that people can easily create your data. And when you share code, you should use the code format (see the running man icon). So to share your example data, you could share it like:

data have ;
  input Color House Type : $12. Day Name :$8. ;
  cards ;
2  3  garage     70    mickey
2  3  garage     71    minni 
2  7  bungalow   15    nice   
2  7  bungalow   16    bad    
2  8  bungalow    3    cold   
5  2  garage      7    10    
5 80  bungalow    8    15     
5 80  garage      9    8     
;

Now, since you kindly went to the trouble of making sample data you have, and the example of the data you want, your question would be clearer if the example SQL code you posted actually used that sample data.

That's all tips for future questions.

As for my suggestion, I think you actually only need to group by 3 columns, not 4.

I think in most SQL implementations to get what you want would require joining together two subqueries, one to calculate the counts, and one to select the record with the max(Day). But because of an oddity in SAS SQL (where it will "remerge" data if you include a non-aggregated column in a grouped query), I think you can get what you want in one step:

proc sql ;
  create table want as
  select Color,House,Type,Day,Name,count(*) as Count
  from have 
  group by Color,House,Type
  having Day=max(day)
  ;
quit ;

If there are ties for max(day) that will return multiple rows.

View solution in original post

ballardw · Posted 11-07-2025 01:18 PM

Looking at the data how do we know which is the "latest" entry?

Proc SQL in general doesn't normally do anything in the order of the data set so values of variables need to be defined to determine what "latest" might be.

If you need to process data in the order it appears in a data set then the DATA step does that.

Hint: code that does not do what you need is not sufficient description of the process. You need to include a bit more narrative as to what is required referencing variables and values.

k_k_ · Posted 11-07-2025 01:33 PM

Hi, thank you for your reply.

I wrote "for which Attribute 4 (which is a number) is highest ", so this is how I want to determine which is "latest". Is it clearer like this?

I only included "code that does not (fully) do what I want" since I am aiming at improving this existing code.

I also gave an example with real data and was hoping it became clear from it what I aim at.

Quentin · Posted 11-07-2025 01:49 PM

First, since you are new, some advice on your question.

It's great that you included sample data. When you do that, it's better to share the data as code, so that people can easily create your data. And when you share code, you should use the code format (see the running man icon). So to share your example data, you could share it like:

data have ;
  input Color House Type : $12. Day Name :$8. ;
  cards ;
2  3  garage     70    mickey
2  3  garage     71    minni 
2  7  bungalow   15    nice   
2  7  bungalow   16    bad    
2  8  bungalow    3    cold   
5  2  garage      7    10    
5 80  bungalow    8    15     
5 80  garage      9    8     
;

Now, since you kindly went to the trouble of making sample data you have, and the example of the data you want, your question would be clearer if the example SQL code you posted actually used that sample data.

That's all tips for future questions.

As for my suggestion, I think you actually only need to group by 3 columns, not 4.

I think in most SQL implementations to get what you want would require joining together two subqueries, one to calculate the counts, and one to select the record with the max(Day). But because of an oddity in SAS SQL (where it will "remerge" data if you include a non-aggregated column in a grouped query), I think you can get what you want in one step:

proc sql ;
  create table want as
  select Color,House,Type,Day,Name,count(*) as Count
  from have 
  group by Color,House,Type
  having Day=max(day)
  ;
quit ;

If there are ties for max(day) that will return multiple rows.

k_k_ · Posted 11-07-2025 01:55 PM

Dear Quentin,

thanks a lot for your reply and the advice for future questions. (It was a bit complicated for me, since I have to use two seperate PCs due to security issues 😕 I will try and follow your advice in future questions, though.)

As for your suggested solution: I wonder if this way I would not lose contributions to "count(*)"...?

k_k_ · Posted 11-07-2025 01:58 PM

Maybe I am wrong and I had a false understanding on "having...". I have understood it so far like a "where"-condition that throws away the other data.

I will try and find out if this works tomorrow, haven't got access to the data right now.

Quentin · Posted 11-07-2025 02:26 PM

@k_k_ wrote:

Maybe I am wrong and I had a false understanding on "having...". I have understood it so far like a "where"-condition that throws away the other data.

I will try and find out if this works tomorrow, haven't got access to the data right now.

I'm not really a big SQL guy... the HAVING clause is similar to WHERE clause, but HAVING filters data after it has been aggregated by the group by. So the counts are still based on all the data. See the docs: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/sqlproc/p1n8ntmltj8ct6n1hso7x2ys6sbw.htm

The code I posted returns your desired WANT dataset:

Obs    Color    House      Type      Day    Name     Count

 1       2         3     garage       71    minni      2
 2       2         7     bungalow     16    bad        2
 3       2         8     bungalow      3    cold       1
 4       5         2     garage        7    10         1
 5       5        80     bungalow      8    15         1
 6       5        80     garage        9    8          1

k_k_ · Posted 11-07-2025 03:48 PM

So cool, thank you! I will retest during weekend and confirm, but I am hopeful now! 🙂

k_k_ · Posted 11-09-2025 01:39 PM

This worked! Thanks again!

Kurt_Bremser · Posted 11-07-2025 04:51 PM

Whenever sequence comes into play, the DATA step is the tool of choice.

The code is much simpler and easier to read, and usually performs better.

data have;
input color house type $ day name $ nomatter1 $ nomatter2 $;
datalines;
2 3 garage 70 mickey blabla k
2 3 garage 71 minni ozk k
2 7 bungalow 15 nice -- l
2 7 bungalow 16 bad 1,1 l
2 8 bungalow 3 cold x l
5 2 garage 7 10 NULL r
5 80 bungalow 8 15 Hi k
5 80 garage 9 8 dog s
5 100 school 1 2 tree k
;

data want;
set have;
drop nomatter:;
by color house type;
if first.type
then count = 1;
else count + 1;
if last.type;
run;

Feel free to ask if there are elements in there which you do not yet understand.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

k_k_ · Posted 11-09-2025 01:40 PM

Thank you for your solution, aswell! I am still a total beginner and felt more comfortable with proc sql to start with 😉 However I will next understand your solution, too. Thanks for your effort and help!

Kurt_Bremser · Posted 11-16-2025 03:29 AM

@k_k_ wrote:

Thank you for your solution, aswell! I am still a total beginner and felt more comfortable with proc sql to start with 😉 However I will next understand your solution, too. Thanks for your effort and help!

Any time you spend mastering the DATA step will improve your SAS life greatly. I can guarantee that. The DATA step is the "Swiss Army Knife" of the SAS language.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

In PROC SQL how to group and replace within a subgroup by the latest entry

Re: In PROC SQL how to group and replace within a subgroup by the latest entry

Re: In PROC SQL how to group and replace within a subgroup by the latest entry

Re: In PROC SQL how to group and replace within a subgroup by the latest entry

Re: In PROC SQL how to group and replace within a subgroup by the latest entry

Re: In PROC SQL how to group and replace within a subgroup by the latest entry

Re: In PROC SQL how to group and replace within a subgroup by the latest entry

Re: In PROC SQL how to group and replace within a subgroup by the latest entry

Re: In PROC SQL how to group and replace within a subgroup by the latest entry

Re: In PROC SQL how to group and replace within a subgroup by the latest entry

Re: In PROC SQL how to group and replace within a subgroup by the latest entry

Re: In PROC SQL how to group and replace within a subgroup by the latest entry

Re: In PROC SQL how to group and replace within a subgroup by the latest entry

SAS Innovate 2026 Registration is Open