BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
k_k_
Fluorite | Level 6

Hi, I am new to SAS and hope for your help.

I am currently using a query which yields a table similar to this:

Attribute1 Attribute2   Attribute3  Attribute4   Count

A                 x                    k                     num1             1

A                 y                    k                     num4            1

A                 z                    m                    num4            1

B                 x                    k                     num2             2

B                 z                    k                     num9             2

B                 a                    m                   num4             2

B                 b                    m                   num7             2

C                 y                    k                     num2            2

...

The current query is approx. like this (I am not allowed to use original data, therefore the strange naming...):

proc sql;

create table .... as

select Attr1, Attr2, Attr3, Attr4, count(*) from Sourcetable

where (Attr4 between ... and ... or Attr 4 between ... and ...)

and Attr1 in (...)

group by Attr1, Attr2, Attr3, Attr4

order by Attr1, Attr2, Attr3, Attr4;

quit;

 

The Sourcetable also contains, amongst others, Attribute 5.

I would like to include Attribute 5, which is not necessarily a number, after Attribute 4 or after count(*). I do not want to group by Attribute 5 itself. In the grouping I want to see the value of Attribute 5 for which Attribute 4 (which is a number) is highest (and Attribute 1, Attribute 2, Attribute 3 are identical withing the grouping).

I tried to adapt the above query using row_number and partition by, but I learned that this is not possible in SAS:

 

I make up a specific example:

Let Sourcetable be as follows:

 

COLOR   HOUSE  TYPE              DAY  NAME          NOMATTER1        NOMATTER2

2               3             garage            70      mickey          blabla                    k

2               3             garage            71      minni             ozk                         k

2                7            bungalow      15       nice                --                            l

2                7            bungalow      16       bad                 1,1                           l

2                8            bungalow       3       cold                x                             l

5                2            garage             7         10                    NULL                   r

5                80         bungalow       8         15                    Hi                          k

5                80         garage             9         8                      dog                      s

5                100         school            1          2                      tree                        k

...

 

I then want to modify my above proc sql query in a way to obtain from it the following set of rows:

 

COLOR   HOUSE  TYPE              DAY  NAME          COUNT       

         

2               3             garage            71      minni              2

              

2                7            bungalow      16       bad                 2

2                8            bungalow       3       cold                1 

5                2            garage             7         10                  1

5                80         bungalow       8         15                   1

5                80         garage             9         8                    1

...

 

Sorry for being complicated or even unclear, only just beginning.

 

Thanks a lot for any help.

1 ACCEPTED SOLUTION

Accepted Solutions
Quentin
Super User

First, since you are new, some advice on your question.

 

It's great that you included sample data.  When you do that, it's better to share the data as code, so that people can easily create your data.  And when you share code, you should use the code format (see the running man icon).  So to share your example data, you could share it like:

 

data have ;
  input Color House Type : $12. Day Name :$8. ;
  cards ;
2  3  garage     70    mickey
2  3  garage     71    minni 
2  7  bungalow   15    nice   
2  7  bungalow   16    bad    
2  8  bungalow    3    cold   
5  2  garage      7    10    
5 80  bungalow    8    15     
5 80  garage      9    8     
;

Now, since you kindly went to the trouble of making sample data you have, and the example of the data you want, your question would be clearer if the example SQL code you posted actually used that sample data.  

 

That's all tips for future questions.

 

As for my suggestion, I think you actually only need to group by 3 columns, not 4.  

 

I think in most SQL implementations to get what you want would require joining together two subqueries, one to calculate the counts, and one to select the record with the max(Day).  But because of an oddity in SAS SQL (where it will "remerge" data if you include a non-aggregated column in a grouped query), I think you can get what you want in one step:

 

proc sql ;
  create table want as
  select Color,House,Type,Day,Name,count(*) as Count
  from have 
  group by Color,House,Type
  having Day=max(day)
  ;
quit ;

If there are ties for max(day)  that will return multiple rows.

 

View solution in original post

11 REPLIES 11
ballardw
Super User

Looking at the data how do we know which is the "latest" entry?

Proc SQL in general doesn't normally do anything in the order of the data set so values of variables need to be defined to determine what "latest" might be.

 

If you need to process data in the order it appears in a data set then the DATA step does that.

 

Hint: code that does not do what you need is not sufficient description of the process. You need to include a bit more narrative as to what is required referencing variables and values.

k_k_
Fluorite | Level 6

Hi, thank you for your reply.

I wrote "for which Attribute 4 (which is a number) is highest ", so this is how I want to determine which is "latest". Is it clearer like this?

I only included "code that does not (fully) do what I want" since I am aiming at improving this existing code.

I also gave an example with real data and was hoping it became clear from it what I aim at.

Quentin
Super User

First, since you are new, some advice on your question.

 

It's great that you included sample data.  When you do that, it's better to share the data as code, so that people can easily create your data.  And when you share code, you should use the code format (see the running man icon).  So to share your example data, you could share it like:

 

data have ;
  input Color House Type : $12. Day Name :$8. ;
  cards ;
2  3  garage     70    mickey
2  3  garage     71    minni 
2  7  bungalow   15    nice   
2  7  bungalow   16    bad    
2  8  bungalow    3    cold   
5  2  garage      7    10    
5 80  bungalow    8    15     
5 80  garage      9    8     
;

Now, since you kindly went to the trouble of making sample data you have, and the example of the data you want, your question would be clearer if the example SQL code you posted actually used that sample data.  

 

That's all tips for future questions.

 

As for my suggestion, I think you actually only need to group by 3 columns, not 4.  

 

I think in most SQL implementations to get what you want would require joining together two subqueries, one to calculate the counts, and one to select the record with the max(Day).  But because of an oddity in SAS SQL (where it will "remerge" data if you include a non-aggregated column in a grouped query), I think you can get what you want in one step:

 

proc sql ;
  create table want as
  select Color,House,Type,Day,Name,count(*) as Count
  from have 
  group by Color,House,Type
  having Day=max(day)
  ;
quit ;

If there are ties for max(day)  that will return multiple rows.

 

k_k_
Fluorite | Level 6

Dear Quentin,

thanks a lot for your reply and the advice for future questions. (It was a bit complicated for me, since I have to use two seperate PCs due to security issues 😕 I will try and follow your advice in future questions, though.)

As for your suggested solution: I wonder if this way I would not lose contributions to "count(*)"...?

k_k_
Fluorite | Level 6

Maybe I am wrong and I had a false understanding on "having...". I have understood it so far like a "where"-condition that throws away the other data. 

I will try and find out if this works tomorrow, haven't got access to the data right now.

Quentin
Super User

@k_k_ wrote:

Maybe I am wrong and I had a false understanding on "having...". I have understood it so far like a "where"-condition that throws away the other data. 

I will try and find out if this works tomorrow, haven't got access to the data right now.


I'm not really a big SQL guy... the HAVING clause is similar to WHERE clause, but HAVING filters data after it has been aggregated by the group by.  So the counts are still based on all the data.  See the docs: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/sqlproc/p1n8ntmltj8ct6n1hso7x2ys6sbw.htm

 

The code I posted returns your desired WANT dataset:

 

Obs    Color    House      Type      Day    Name     Count

 1       2         3     garage       71    minni      2
 2       2         7     bungalow     16    bad        2
 3       2         8     bungalow      3    cold       1
 4       5         2     garage        7    10         1
 5       5        80     bungalow      8    15         1
 6       5        80     garage        9    8          1
k_k_
Fluorite | Level 6

So cool, thank you! I will retest during weekend and confirm, but I am hopeful now! 🙂

k_k_
Fluorite | Level 6
This worked! Thanks again!
Kurt_Bremser
Super User

Whenever sequence comes into play, the DATA step is the tool of choice.

The code is much simpler and easier to read, and usually performs better.

data have;
input color house type $ day name $ nomatter1 $ nomatter2 $;
datalines;
2 3 garage 70 mickey blabla k
2 3 garage 71 minni ozk k
2 7 bungalow 15 nice -- l
2 7 bungalow 16 bad 1,1 l
2 8 bungalow 3 cold x l
5 2 garage 7 10 NULL r
5 80 bungalow 8 15 Hi k
5 80 garage 9 8 dog s
5 100 school 1 2 tree k
;

data want;
set have;
drop nomatter:;
by color house type;
if first.type
then count = 1;
else count + 1;
if last.type;
run;

Feel free to ask if there are elements in there which you do not yet understand.

k_k_
Fluorite | Level 6

Thank you for your solution, aswell! I am still a total beginner and felt more comfortable with proc sql to start with 😉 However I will next understand your solution, too. Thanks for your effort and help!

Kurt_Bremser
Super User

@k_k_ wrote:

Thank you for your solution, aswell! I am still a total beginner and felt more comfortable with proc sql to start with 😉 However I will next understand your solution, too. Thanks for your effort and help!


Any time you spend mastering the DATA step will improve your SAS life greatly. I can guarantee that. The DATA step is the "Swiss Army Knife" of the SAS language.