Solved: Re: sorting dataset following 4 rules

michtka · Posted 05-25-2013 07:50 AM

Hi everyone, I hope you can help me with this. Thanks in advance.V

I have this dataset HAVE:

data HAVE;

length nvalue nintensity nitem 8.;

length value intensity item $20.;

input value $ nvalue intensity $ nintensity item $ nitem;

datalines;

DRINK 2 high 3 orange 1

DRINK 2 medium 2 orange .

DRINK . low 1 green 2

DRINK . high 3 green .

DRINK . medium 2 green .

FOOD 5 low 2 cheese 3

FOOD 5 high 4 cheese .

FOOD . medium 1 meat 4

FOOD 5 low 2 meat 4

FOOD . high 4 meat .

FOOD 5 medium 1 vegetable .

;

run;

And I want to obtain the WANT dataset: As you can see the order are involved 4 rules:

, FOOD and DRINK are families > subfamilies.

FOOD (5)> DRINK(2)

FOOD: meat(4)>cheese(3)>Vegetanle (0)

DRINK:green(2)>orange(1)

High>medium>low

dataset WANT:

FOOD high 4

FOOD medium 1

FOOD low 2

meat high 4

meat medium 1

meat low 2

cheese high 4

cheese low 2

vegetable Medium 1

DRINK high 3

DRINK medium 2

DRINK low 1

green high 3

green medium 2

green low 1

orange high 3

orange medium 2

Tom · Posted 05-25-2013 11:30 PM

So use the PROC SQL as I explained to create both the detailed rows and the summary rows and UNION them into one table. Make sure to include proper values for the sorting variables so that the order is what you want. Because you have multiple observations per VALUE/INTENSITY combination you will need to so some sort of grouping/summary for those rows to prevent duplicates. Since you are sorting by descending FREQITEM use a large number for that variable on the records generated at the VALUE (family) level.

data prewant;

input freqvalue value $ flag item $ intensity $ nitem freqitem ;

cards;

5 FOOD 3 meat high 0 4

5 FOOD 2 meat medium 4 4

5 FOOD 1 meat low 4 4

5 FOOD 3 cheese high 0 3

5 FOOD 1 cheese low 3 3

5 FOOD 2 vegetable medium 0 0

2 DRINK 3 green high 0 2

2 DRINK 2 green medium 0 2

2 DRINK 1 green low 2 2

2 DRINK 3 orange high 1 1

2 DRINK 2 orange medium 0 1

run;

proc sql noprint ;

create table display as

select freqvalue, freqitem, flag

, item as level

, intensity

, nitem as count

from prewant

union

select freqvalue,99999 as freqitem, flag

, upcase(value) as level

, intensity

, max(nitem) as nitem /* This is the value that makes no sense */

from prewant group by 1,2,3,4,5

order by freqvalue descending

, freqitem descending

, flag descending

;

quit;

Obs freqvalue freqitem flag level intensity count

1 5 99999 3 FOOD high 0

2 5 99999 2 FOOD medium 4

3 5 99999 1 FOOD low 4

4 5 4 3 meat high 0

5 5 4 2 meat medium 4

6 5 4 1 meat low 4

7 5 3 3 cheese high 0

8 5 3 1 cheese low 3

9 5 0 2 vegetabl medium 0

10 2 99999 3 DRINK high 1

11 2 99999 2 DRINK medium 0

12 2 99999 1 DRINK low 2

13 2 2 3 green high 0

14 2 2 2 green medium 0

15 2 2 1 green low 2

16 2 1 3 orange high 1

17 2 1 2 orange medium 0

View solution in original post

art297 · Posted 05-25-2013 11:33 AM

: If one change can be made to one of your data records (namely removing the nvalue from

FOOD 5 low 2 meat 4 ), you could use something like the following to rearrange your data:

data HAVE;

length value intensity item $20.;

length nvalue nintensity nitem 8.;

input value $ nvalue intensity $ nintensity item $ nitem;

datalines;

DRINK 2 high 3 orange 1

DRINK 2 medium 2 orange .

DRINK . low 1 green 2

DRINK . high 3 green .

DRINK . medium 2 green .

FOOD 5 low 2 cheese 3

FOOD 5 high 4 cheese .

FOOD . medium 1 meat 4

FOOD . low 2 meat 4

FOOD . high 4 meat .

FOOD 5 medium 1 vegetable .

;

run;

proc format;

value $intensity

"high"=1

"medium"=2

"low"=3

;

run;

data want (keep=value intensity nintensity order);

set have;

if missing(nvalue) then do;

if value="DRINK" then do;

order=12+put(intensity,$intensity.);

output;

value=item;

order=15+put(intensity,$intensity.);

output;

end;

else do;

order=put(intensity,$intensity.);

output;

value=item;

order=3+put(intensity,$intensity.);

output;

end;

else do;

value=item;

select(item);

when ("cheese") order=6+put(intensity,$intensity.);

when ("vegetable") order=9+put(intensity,$intensity.);

when ("orange") order=18+put(intensity,$intensity.);

otherwise;

end;

output;

end;

run;

proc sort data=want;

by order;

run;

data want;

set want (drop=order);

run;

michtka · Posted 05-25-2013 11:48 AM

Thanks Arthur for this, you right the 4 value

is the right one.

Any possibility to do it using PROC SQL

using subqueries?

I am thinking about it too.

It would be great.

Thanks.

michtka · Posted 05-25-2013 02:17 PM

Hi guys, thank you very much for your contribution, but rather than specify the categories, I would like to do something like this:

because in my real problem I have 50 Families, and every individual family have 50 subfamilies...

Then, I am trying to do something like MAX() function as Tom Suggest:

proc sql noprint;

create table fillvalitem as

select max(nvalue) as freqvalue, max(nitem) as freqitem, value, flag, intensity, nintensity,item,nitem

from have

group by item

order by freqvalue descending, freqitem descending,flag descending;

run;

But still is not right, for example DRINK is missing the value 2, and is not sorting out in the descending order, should be 2, and 1,

I think the idea is to make a first query of VALUE (group by value), and subquery of ITEM (group by value,item), but still need

to think How I can do it. If you can help me with this: Thanks.

5 4 FOOD 3 high 4 meat .

5 4 FOOD 2 medium 1 meat 4

5 4 FOOD 1 low 2 meat 4

5 3 FOOD 3 high 4 cheese .

5 3 FOOD 1 low 2 cheese 3

5 . FOOD 2 medium 1 vegetable .

2 1 DRINK 3 high 3 orange 1

2 1 DRINK 2 medium 2 orange .

. 2 DRINK 3 high 3 green .

. 2 DRINK 2 medium 2 green .

. 2 DRINK 1 low 1 green 2

michtka · Posted 05-25-2013 02:22 PM

sorry, I missed the dataset with the flag variable, and the BY group (value,item) rather than item, to generate the above dataset:

data have;

set have;

if intensity="high" then flag=3;

else if intensity="medium" then flag=2;

else flag=1;

run;

proc sql noprint;

create table fillvalitem as

select max(nvalue) as freqvalue, max(nitem) as freqitem, value, flag, intensity, nintensity,item,nitem

from have

group by value, item

order by freqvalue descending, freqitem descending,flag descending;

run;

Tom · Posted 05-25-2013 02:59 PM

Since the output you displayed is two different summaries you could do it in SQL as two queries with a UNION.

In terms of the ordering you will need variables that you can use to sort. You could use FORMATs or CASE statements or have separate code list tables that you link in.

proc sql ;

create table want as

select .... from have group by value, intensity

UNION

select .... from have group by value, item, intensity

;

quit;

michtka · Posted 05-25-2013 06:07 PM

Thanks Tom,

I still thinking need to be a query and a subquery, because two queries that you mention will add more records that I really need I think.

michtka · Posted 05-25-2013 08:02 PM

at least I found the first step (first sorting) of my dataset WANT...I used three PROC SQL, still thinking that is possible to do it with only one PROC SQL, but

i dounno how...the thing is, that this works better...I need to think to get last final dataset WANT.

proc sql noprint;

create table fillvalue as

select max(nvalue) as freqvalue,value,flag,item,intensity

from have

group by value

order by freqvalue descending,flag descending;

quit;

proc sql noprint;

create table fillitem as

select max(nitem) as freqitem,value,flag,item,intensity

from have

group by value,item

order by freqitem descending,value,flag descending;

quit;

proc sql noprint;

create table prewant as

select a.*,b.freqitem

from fillvalue a left join fillitem b

on a.value=b.value and a.item=b.item and a.flag=b.flag

order by freqvalue descending,freqitem descending,flag descending;

quit;

PREWANT dataset:

freqvalue value flag item intensity freqitem

5 FOOD 3 meat high 4

5 FOOD 2 meat medium 4

5 FOOD 1 meat low 4

5 FOOD 3 cheese high 3

5 FOOD 1 cheese low 3

5 FOOD 2 vegetable medium 0

2 DRINK 3 green high 2

2 DRINK 2 green medium 2

2 DRINK 1 green low 2

2 DRINK 3 orange high 1

2 DRINK 2 orange medium 1

Finallly looking for this dataset WANT:

FOOD high 4

FOOD medium 1

FOOD low 2

meat high 4

meat medium 1

meat low 2

cheese high 4

cheese low 2

vegetable Medium 1

DRINK high 3

DRINK medium 2

DRINK low 1

green high 3

green medium 2

green low 1

orange high 3

orange medium 2

Tom · Posted 05-25-2013 08:21 PM

It would help a lot if you could explain what you are summarizing. I really have no idea where you are getting expected numbers from.

Plus I am not sure what the input records represent. The variable VALUE seems to the food/drink category. The variable ITEM seems to be type of food or drink. Intensity seems to be a ranking of something from low to high. But the numbers do not make sense. Perhaps NITEM is a a summary of the counts for that item already? Perhaps NVALUE is counts for that level of the VALUE? But then why are there missing values of NVALUE?

Why would the second row in your expected output have a 2? There were no foods with a value of NVALUE or NITEM in the input data with a value of 2.

michtka · Posted 05-25-2013 08:46 PM

Hi Tom, my apologize if I am not been clear here.

Nitem, Nvalue is just a proper summary of number of subjects. The missing values, is because no subjects.

This 2 you mean from PREWANT dataset?....5 FOOD 2 meat medium 4

it is a flag variable...corresponding to the medium intensity.

I get that from manipulating my original dataset have, before use it to PROC SQL:

data have;

set have;

if intensity="high" then flag=3;

else if intensity="medium" then flag=2;

else flag=1;

run;

This dataset finally go to the PROC SQL, I did that because with : descending frequency by this flag variable, I obtain the descending frequency in that order (High>Medium>Low)

The task of this problem is the SORTING following 4 rules:

I am not interesting to summarize anything, only to sorting out family (FOOD/DRINKS) and subfamily (items corresponding to every family) by descending frequency,

and finally sorting out by (high>medium>low).

I hope it is more clear now.

Tom · Posted 05-25-2013 09:20 PM

I think I know what you want to display, but I cannot see how the numbers you are providing map to the numbers you want. It sounds like you have ITEM (meat/cheese/green/...) nested within VALUE (FOOD/DRINK) and you want to display overall counts before you display the individual item counts. You have not given the variables in your output names so lets just call them ROW_LABEL, INTENSITY and COUNT. ROW_LABEL is either the value of VALUE or the value of ITEM depending on whether the row is for the overall summary or for the detailed summary.

It sounds like you have figured out how to sort the intensity levels by adding the variable FLAG. You can do something similar to sort the items. If you really only have two higher level categories of DRINK and FOOD then you can sort by that variable and use descending option to reverse it from the alphabetical order if you want.

But I still have no idea how you get the three numbers for the COUNT column in the overall summary rows? (FOOD -> 4,1,2 ; DRINK -> 3,2,1)

Where did the COUNT of 4 for HIGH level of FOOD come from in the source data? Is it just copied from one of the rows? Calculated? If so how? Similarly for the other 5 numbers.

Similarly how did you get the numbers for the detailed rows? Are they just copied forward? Summarized?

It would probably be easier if you displayed the raw values. It sounds like your raw values would look something like:

SUBJECT VALUE ITEM INTENSITY

1 FOOD meat high

1 DRINK green low

2 FOOD vegetable medium

....

Tom · Posted 05-25-2013 09:46 PM

Let's go back to your "RULES" and use the numbers in there to make some data that make sense?

, FOOD and DRINK are families > subfamilies.

FOOD (5)> DRINK(2)

FOOD: meat(4)>cheese(3)>Vegetanle (0)

DRINK:green(2)>orange(1)

High>medium>low

To put FOOD > DRINK we need to know the frequency (5) is larger than (2).

So you need a variable FAMILY with values of FOOD and DRINK and ANOTHER variable FAMILY_FREQ with values of 5 and 2, on every observation.

FAMILY FAMILY_FREQ

FOOD 5

DRINK 2

Now for the SUBFAMILY level since there is no overlap for drinks and foods it really doesn't matter if we calculate them within FAMILY or not, but it makes more sense for later to do it within the FAMILY.

FAMILY SUBFAMILY SUBFAMILY_FREQ

FOOD MEAT 4

FOOD CHEESE 3

FOOD VEGETABLE 0

DRINK GREEN 2

DRINK ORANGE 1

Combine these on value of FAMILY and you will get:

FAMILY FAMILY_FREQ SUBFAMILY SUBFAMILY_FREQ

FOOD 5 MEAT 4

FOOD 5 CHEESE 3

FOOD 5 VEGETABLE 0

DRINK 2 GREEN 2

DRINK 2 ORANGE 1

Now you can combine with your real data and use the FAMILY_FREQ and SUBFAMILY_FREQ (and your FLAG variable) to order the rows.

Does that help?

I still do not see how your are getting the NUMBERS your want the display.

michtka · Posted 05-25-2013 10:33 PM

Hi Tom, yes, forget about if the values of nitems are right or wrong, doesn't mattter (it become of a summary process that I just solved from the RAW of data, and I just got this numbers as the final count.)

then.. my final dataset WANT will be something like:

level intensity nitem

FOOD high x

FOOD medium x

FOOD low x

meat high x

meat medium x

meat low x

cheese high x

cheese low x

vegetable Medium x

DRINK high x

DRINK medium x

DRINK low x

green high x

green medium x

green low x

orange high x

orange medium x

Then,

STEP 1) Remember my dataset PREWANT:

freqvalue value flag item intensity nitem freqitem

5 FOOD 3 meat high 0 4

5 FOOD 2 meat medium 4 4

5 FOOD 1 meat low 4 4

5 FOOD 3 cheese high 0 3

5 FOOD 1 cheese low 3 3

5 FOOD 2 vegetable medium 0 0

2 DRINK 3 green high 0 2

2 DRINK 2 green medium 0 2

2 DRINK 1 green low 2 2

2 DRINK 3 orange high 1 1

2 DRINK 2 orange medium 0 1

Doing some transformation:

data base_want (keep=level);

set prewant;

by descending freqvalue descending freqitem;

if first.freqvalue then do;

level=upcase(VALUE);

count=nitem;

output;

end;

if first.freqitem then do;

level=item;

count=nitem;

output;

end;

run;

STEP 2) I get this one, called BASE_WANT:

level

FOOD

meat

cheese

vegetable

DRINK

green

orange

Mi idea is, combining this dataset 2) BASE_WANT with the previous datase t1 ) PREWANT to get the finalk dataset 3) WANT.

I think is is more clear now, btw (Thank you for your big interest)....I hope you can help me with this last step 3, FROM STEP 1 AND 2 , I still dont get it.

michtka · Posted 05-25-2013 11:11 PM

OK Tom, tomorrow I will define better my problem.

Thank you for your time.

Tom · Posted 05-25-2013 11:30 PM

So use the PROC SQL as I explained to create both the detailed rows and the summary rows and UNION them into one table. Make sure to include proper values for the sorting variables so that the order is what you want. Because you have multiple observations per VALUE/INTENSITY combination you will need to so some sort of grouping/summary for those rows to prevent duplicates. Since you are sorting by descending FREQITEM use a large number for that variable on the records generated at the VALUE (family) level.

data prewant;

input freqvalue value $ flag item $ intensity $ nitem freqitem ;

cards;

5 FOOD 3 meat high 0 4

5 FOOD 2 meat medium 4 4

5 FOOD 1 meat low 4 4

5 FOOD 3 cheese high 0 3

5 FOOD 1 cheese low 3 3

5 FOOD 2 vegetable medium 0 0

2 DRINK 3 green high 0 2

2 DRINK 2 green medium 0 2

2 DRINK 1 green low 2 2

2 DRINK 3 orange high 1 1

2 DRINK 2 orange medium 0 1

run;

proc sql noprint ;

create table display as

select freqvalue, freqitem, flag

, item as level

, intensity

, nitem as count

from prewant

union

select freqvalue,99999 as freqitem, flag

, upcase(value) as level

, intensity

, max(nitem) as nitem /* This is the value that makes no sense */

from prewant group by 1,2,3,4,5

order by freqvalue descending

, freqitem descending

, flag descending

;

quit;

Obs freqvalue freqitem flag level intensity count

1 5 99999 3 FOOD high 0

2 5 99999 2 FOOD medium 4

3 5 99999 1 FOOD low 4

4 5 4 3 meat high 0

5 5 4 2 meat medium 4

6 5 4 1 meat low 4

7 5 3 3 cheese high 0

8 5 3 1 cheese low 3

9 5 0 2 vegetabl medium 0

10 2 99999 3 DRINK high 1

11 2 99999 2 DRINK medium 0

12 2 99999 1 DRINK low 2

13 2 2 3 green high 0

14 2 2 2 green medium 0

15 2 2 1 green low 2

16 2 1 3 orange high 1

17 2 1 2 orange medium 0

Registration is open

SAS Training: Just a Click Away