Solved: Get Percent of Total from Long Dataset

A_Swoosh · Posted 03-17-2021 03:29 PM

Hello,

I'm looking for a way to get the percent of total from the total volume with a long dataset.

My dataset example is as follows:

Age Group	Gender	Month	Item	Count	WANT
0-13	FEMALE	Feb_2019_new	TELEVISION	0	#DIV/0!
0-13	FEMALE	Feb_2019_total	TELEVISION	0	
0-13	FEMALE	Feb_2019_new	FOOD	5	38.5%
0-13	FEMALE	Feb_2019_total	FOOD	13	
0-13	MALE	Feb_2019_new	TELEVISION	1	50.0%
0-13	MALE	Feb_2019_total	TELEVISION	2

I have a long dataset where one row has the new purchases and the second row has total purchases for any given month. I want to get the percent of total for each item by age group, gender, item, and monthly that are new; the WANT column. Would this be accomplished using multiple proc sql or is there an easier way to do this using a data step?

Thanks,

mkeintz · Posted 03-18-2021 02:38 AM

If your data reliably has a single month value followed by a total value, then you can use a merge of HAVE with itself offset by one record:

data have;
length name $20;
	infile datalines dsd missover delimiter=",";
	input age_grp $ gender $ name $ item $ count;
datalines;
0-13,FEMALE,FEB_2019_NEW                ,FOOD,12
0-13,FEMALE,FEB_2019_TOTAL              ,FOOD,23
0-13,FEMALE,FEB_2019_NEW                ,GAS,45
0-13,FEMALE,FEB_2019_TOTAL              ,GAS,76
0-13,FEMALE,FEB_2019_NEW                ,TELE,4
0-13,FEMALE,FEB_2019_TOTAL              ,TELE,5
0-13,MALE,FEB_2019_NEW                  ,TELE,3
0-13,MALE,FEB_2019_TOTAL                ,TELE,8
0-13,FEMALE,MAR_2019_NEW                ,TELE,0
0-13,FEMALE,MAR_2019_TOTAL              ,TELE,0
;
run;

data want (drop=_:);
  merge have 
        have (firstobs=2 keep=count rename=(count=_total_count));
  if _total_count^=0 and scan(name,-1,'_')='NEW' then newvar=count/_total_count;
run;

Now if once in a while there is no total following the new, then the self-merge with offset still works, but you have check that the name variables correspond:

data want (drop=_:);
  merge have 
        have (firstobs=2 keep=name count rename=(name=_total_name count=_total_count));
  if scan(name,-1,'_')='NEW' and _total_name=transtrn(name,'_NEW','_TOTAL') and _total_count^=0 
    then newvar=count/_total_count;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Reeza · Posted 03-17-2021 03:36 PM

I would recommend looking into PROC FREQ and the OUTPCT option on the TABLES statement.

@A_Swoosh wrote:

Hello,

I'm looking for a way to get the percent of total from the total volume with a long dataset.

My dataset example is as follows:
Age Group	Gender	Month	Metric	Item	Total	WANT
0-13	FEMALE	Feb-19	NEW PURCHASES	TELEVISION	0	#DIV/0!
0-13	FEMALE	Feb-19	TOTAL PURCHASES	TELEVISION	0	
0-13	FEMALE	Feb-19	NEW PURCHASES	FOOD	5	38.5%
0-13	FEMALE	Feb-19	TOTAL PURCHASES	FOOD	13	
0-13	MALE	Feb-19	NEW PURCHASES	TELEVISION	1	50.0%
0-13	MALE	Feb-19	TOTAL PURCHASES	TELEVISION	2	
I have a long dataset where one row has the new purchases and the second row has total purchases for any given month. I want to get the percent of total for each item by age group, gender, item, and month that are new; the WANT column. Would this be accomplished using multiple proc sql or is there an easier way to do this using a data step?

Thanks,

A_Swoosh · Posted 03-17-2021 04:01 PM

Sorry, I modified the example case slightly. Wouldn't that not work since I have these under the same variable therefore if I did a frequency table, my output would just sum them together when I'd really only want New row/Total row?

Reeza · Posted 03-17-2021 05:19 PM

You're correct but your question (and data) are awkwardly stated - your tables isn't aligned either.

If you can guarantee that you will always have New then Total you can look at the DIF and LAG() functions.
MOD() can be used to tell you if a value is odd/even.
_N_ will give you the row number.
or BY group processing.

From what you've shown this would work but this may not scale:

data want;
set have;
by item;
want=lag(total)/total;
if first.item then call missing(want);
run;

A_Swoosh · Posted 03-17-2021 06:52 PM

My apologies. Let me see if I'm able to state my question a little better.

A subset of my long dataset looks like this:

data have;
length name $20;
	infile datalines dsd missover delimiter=",";
	input age_grp $ gender $ name $ item $ count;
datalines;
0-13,FEMALE,FEB_2019_NEW                ,FOOD,12
0-13,FEMALE,FEB_2019_TOTAL              ,FOOD,23
0-13,FEMALE,FEB_2019_NEW                ,GAS,45
0-13,FEMALE,FEB_2019_TOTAL              ,GAS,76
0-13,FEMALE,FEB_2019_NEW                ,TELE,4
0-13,FEMALE,FEB_2019_TOTAL              ,TELE,5
0-13,MALE,FEB_2019_NEW                  ,TELE,3
0-13,MALE,FEB_2019_TOTAL                ,TELE,8
0-13,FEMALE,MAR_2019_NEW                ,TELE,0
0-13,FEMALE,MAR_2019_TOTAL              ,TELE,0
;
run;

I want to turn the dataset into something like this:

data have;
length name $20;
	infile datalines dsd missover delimiter=",";
	input age_grp $ gender $ name $ item $ count want;
datalines;
0-13,FEMALE,FEB_2019_NEW                ,FOOD,12,.5219
0-13,FEMALE,FEB_2019_TOTAL              ,FOOD,23,
0-13,FEMALE,FEB_2019_NEW                ,GAS,45,.592
0-13,FEMALE,FEB_2019_TOTAL              ,GAS,76, 
0-13,FEMALE,FEB_2019_NEW                ,TELE,4,.80
0-13,FEMALE,FEB_2019_TOTAL              ,TELE,5,
0-13,MALE,FEB_2019_NEW                  ,TELE,3,.375
0-13,MALE,FEB_2019_TOTAL                ,TELE,8, 
0-13,FEMALE,MAR_2019_NEW                ,TELE,0,.
0-13,FEMALE,MAR_2019_TOTAL              ,TELE,0, 
;
run;

My BY variables include: Age_grp, Gender, Name, and Item.

So, for each month, I want to report the percent of NEW/TOTAL for each age group, gender, and item combination. So, for food purchases among female 0-13 in February 2019, it was 23 total purchases but the percent of those that were NEW were 52.19% (.5219).

I think a lag function might work but wasn't entirely sure how to do the grouping using your example. As you mentioned, I can guarantee that each row will alternate between New then Total.

Then to take it one step further, I wanted to concatenate the values for TOTAL and NEW to produce a value for each age group, gender, and item combination to read out the

TOTAL value (NEW pct). So, in the example above, it would be 23(52.2%).

Reeza · Posted 03-17-2021 09:02 PM

Sort your data so that total is before new and then use the method I included initially, reversing the calculation obviously. SAS doesn't do LEAD well, but LAG is relatively easy. Then resort for you display.

A_Swoosh · Posted 03-17-2021 09:34 PM

So, I did what you suggested:

data want;
        set monthly_results;
        by  item age_new gender;
        want=total/lag(total);
        if first.item then call missing(want);
run;

I get the value for new/total in the NEW row; however, I get a value in the TOTAL row instead of a MISSING value.

Additionally, if I then wanted to concatenate the TOTAL (want) together, how would I go about this?

Would I just be better suited to do this using proc sql or transpose the data?

mkeintz · Posted 03-18-2021 02:38 AM

If your data reliably has a single month value followed by a total value, then you can use a merge of HAVE with itself offset by one record:

data have;
length name $20;
	infile datalines dsd missover delimiter=",";
	input age_grp $ gender $ name $ item $ count;
datalines;
0-13,FEMALE,FEB_2019_NEW                ,FOOD,12
0-13,FEMALE,FEB_2019_TOTAL              ,FOOD,23
0-13,FEMALE,FEB_2019_NEW                ,GAS,45
0-13,FEMALE,FEB_2019_TOTAL              ,GAS,76
0-13,FEMALE,FEB_2019_NEW                ,TELE,4
0-13,FEMALE,FEB_2019_TOTAL              ,TELE,5
0-13,MALE,FEB_2019_NEW                  ,TELE,3
0-13,MALE,FEB_2019_TOTAL                ,TELE,8
0-13,FEMALE,MAR_2019_NEW                ,TELE,0
0-13,FEMALE,MAR_2019_TOTAL              ,TELE,0
;
run;

data want (drop=_:);
  merge have 
        have (firstobs=2 keep=count rename=(count=_total_count));
  if _total_count^=0 and scan(name,-1,'_')='NEW' then newvar=count/_total_count;
run;

Now if once in a while there is no total following the new, then the self-merge with offset still works, but you have check that the name variables correspond:

data want (drop=_:);
  merge have 
        have (firstobs=2 keep=name count rename=(name=_total_name count=_total_count));
  if scan(name,-1,'_')='NEW' and _total_name=transtrn(name,'_NEW','_TOTAL') and _total_count^=0 
    then newvar=count/_total_count;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

A_Swoosh · Posted 03-18-2021 12:05 PM

This appears to work exactly how I described.

Thank you and @Reeza

Get Percent of Total from Long Dataset

Re: Get Percent of Total from Long Dataset

Re: Get Percent of Total from Long Dataset

Re: Get Percent of Total from Long Dataset

Re: Get Percent of Total from Long Dataset

Re: Get Percent of Total from Long Dataset

Re: Get Percent of Total from Long Dataset

Re: Get Percent of Total from Long Dataset

Re: Get Percent of Total from Long Dataset

Re: Get Percent of Total from Long Dataset

Registration is open

SAS Training: Just a Click Away