Hi all,
I have the following situation.
I have some huge underlying data set - imagine it like orders.
RowID Product Subproduct Counterparty Amount
1 A M X 10
2 A M Y 5
3 B M Z 7
.... and so on.
I need to generate a report something like this:
Category Amount
1 amount1 (is a sum of category 1.1 and 1.2)
1.1 amount1_1 (is a sum of category 1.1.1 and 1.1.2)
1.1.1 amount1_1_1
1.1.2 amount1_1_2
1.2 amount1_2 (equals amount from underlying category 1.2.1)
1.2.1 amount1_2_1
I read in the specs of the report from an XLS where I specify the where conditions for the lowest level conditions (meaning the Categories 1.1.1 and 1.1.2) and for the higher level conditions the category IDs of the sub-sums.
For example, for category 1.1.1. i read in "Product in (A,B) and Counterparty not in (Y,Z) and for category 1.1 i read in that is a sum of 1.1.1 and 1.1.2.
The XLS is something like this:
Category Type Content
Category 1 Sum Category 1.1, Category 1.2
Category 1.1 Sum Category 1.1.1, Category 1.1.2
Category 1.1.1 Formula Product = A and Subproduct = X
Category 1.1.2 Formula Product in (A,B) and Counterparty in (X,Z)
Category 1.2 Sum Category 1.2.1
Category 1.2.1 Forumula Product = B
.... and so on
I use a macro to go through the underlying data set and assign on a separate column the lowest level category where in the specs table the type is Formula (e.g 1.1.1 and 1.1.2). In case a row is assigned to more categories, I duplicate it and output it with each subcategory.
The output of the macro would be in a data set like this
RowID Category
1 1.1.1
1 1.1.2
2 No category
3 1.1.2
3 1.2.1
The report i need to produce should be at the end
Category Amount
Category 1 34
Category 1.1 27
Category 1.1.1 10
Category 1.1.2 17
Category 1.2 7
Category 1.2.1 7
Now my question - what would be the most efficient way of processing all the sums-ups and creatng the report I need considering that the underlying data set is quite huge and the report specification (the number of categories) is also quite big. I don't want to do this hardcoded, I want to be driven by the XLS read-in.
The most stupid way I could think about is to update the macro so that whenever I find a subcategory and I output the row (e.g. 1.1.1), I also read in which upper categories the row is included (in this particular case 1.1) and ooutput the row again, then repeat and output the row also for 1. This will surely increase the volume of data but then would be easier to just join these two data sets and sum the amounts group by category ID.
Do you have any other ideas?
Thanks 🙂
SAS is very fast while dealing with SAS tables;
You can import the excel file into SAS;
Add your categories and subcategories as new variables accroding to the definitions
and finally create the report using PROC TABULATE;
If you have dificulties with the coding the forum will help you.
Size of data is pretty irrelevant, it will just take longer to run if there is more data - this shouldn't be a problem. Why do you have logic in an Excel file? That to me sounds like the biggest problem here. Why not simply transfer the logic from that file into a programmed SAS format? As for your logic, I am afraid I don't see how that matches your data, there is no "category" variable given, so what is 1 or 1.1 etc.
As for the question, the most efficient method of processing data is to use SAS procedures, be that means, freq, or any of the other thousands of stats procedures given. Structure your data in a good way, then use by groups and such like to arrive at the results you need. I would suggest firstly, dropping all the macro, and the Excel file, and working closely with Base SAS (which is the programming language) to get a decent program working first before you try to stretch it to include other items.
I would hope that each row in your underlying data set is assigned to exactly one lowest level category (so it can't be in two).
If so then I believe what you're in the end dealing with is a fact table (that's your "underlying data" and a dimension table - your categories). So what you need to build up now are the hierarchies.
The ugly thing is that you're just getting the logic for creating the dimension and that this is in Excel. You can try to process that programatically - but if the Excel is not very clean then I fear this will become a manual excercise.
I assume someone had already a similar report. If you can then try and get hold of the actual dimensional data as well.
What I would be doing once you've got the dimensions: Take your fact table and add as many columns as you've got hierarchies (level1, level2, level3,....).
Level1 would be the top level containing "1,2,3", level2 would be "1.1, 1.2, 2.1,..."
Once you're there you can use Proc Tabulate, Proc Report or Proc Means have them do the job for you.
Provide an example input data set and the desired result.
This
Category Amount
Category 1 34
Category 1.1 27
Category 1.1.1 10
Category 1.1.2 17
Category 1.2 7
Category 1.2.1 7
without input basically tells us next to nothing and the example:
RowID Product Subproduct Counterparty Amount
1 A M X 10
2 A M Y 5
3 B M Z 7
has nothing related to your "category" listings. Categories shoud be defined in terms of variables and values in the actual data.
The instructions at https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn your input data set (or an example small enough to demostrate your issue) into data step code that can be pasted to the forum or attached as a TXT file.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.