DATA Step, Macro, Functions and more

Generating a report from huge data set

Reply
Contributor
Posts: 24

Generating a report from huge data set

Hi all,

 

I have the following situation.

 

I have some huge underlying data set - imagine it like orders.

 

RowID          Product          Subproduct          Counterparty          Amount

1                   A                    M                          X                            10

2                   A                    M                          Y                             5

3                   B                    M                          Z                             7

.... and so on.

 

I need to generate a report something like this:

 

Category          Amount

1                      amount1 (is a sum of category 1.1 and 1.2)

1.1                   amount1_1 (is a sum of category 1.1.1 and 1.1.2)

1.1.1                amount1_1_1

1.1.2                amount1_1_2

1.2                   amount1_2 (equals amount from underlying category 1.2.1)

1.2.1                amount1_2_1

 

 

I read in the specs of the report from an XLS where I specify the where conditions for the lowest level conditions (meaning the Categories 1.1.1 and 1.1.2) and for the higher level conditions the category IDs of the sub-sums.

For example, for category 1.1.1. i read in "Product in (A,B) and Counterparty not in (Y,Z) and for category 1.1 i read in that is a sum of 1.1.1 and 1.1.2. 

 

The XLS is something like this:

 

Category             Type             Content

Category 1          Sum             Category 1.1, Category 1.2

Category 1.1       Sum             Category 1.1.1, Category 1.1.2

Category 1.1.1    Formula       Product = A and Subproduct = X

Category 1.1.2    Formula       Product in (A,B) and Counterparty in (X,Z)

Category 1.2       Sum             Category 1.2.1

Category 1.2.1    Forumula      Product = B

.... and so on

 

I use a macro to go through the underlying data set and assign on a separate column the lowest level category where in the specs table the type is Formula (e.g 1.1.1 and 1.1.2). In case a row is assigned to more categories, I duplicate it and output it with each subcategory.

 

The output of the macro would be in a data set like this

 

RowID          Category

1                  1.1.1

1                  1.1.2

2                  No category

3                  1.1.2

3                  1.2.1

 

The report i need to produce should be at the end 

 

Category                Amount

Category 1             34

Category 1.1          27

Category 1.1.1       10

Category 1.1.2       17

Category 1.2          7

Category 1.2.1       7

 

Now my question - what would be the most efficient way of processing all the sums-ups and creatng the report I need considering that the underlying data set is quite huge and the report specification (the number of categories) is also quite big. I don't want to do this hardcoded, I want to be driven by the XLS read-in.

 

The most stupid way I could think about is to update the macro so that whenever I find a subcategory and I output the row (e.g. 1.1.1), I also read in which upper categories the row is included (in this particular case 1.1) and ooutput the row again, then repeat and output the row also for 1. This will surely increase the volume of data but then would be easier to just join these two data sets and sum the amounts group by category ID. 

 

Do you have any other ideas?

 

Thanks Smiley Happy

Trusted Advisor
Posts: 1,553

Re: Generating a report from huge data set

SAS is very fast while dealing with SAS tables;

 

You can import the excel file into SAS;

Add your categories and subcategories as new variables accroding to the definitions

and finally create the report using PROC TABULATE;

 

If you have dificulties with the coding the forum will help you.

Super User
Super User
Posts: 7,942

Re: Generating a report from huge data set

Size of data is pretty irrelevant, it will just take longer to run if there is more data - this shouldn't be a problem.  Why do you have logic in an Excel file?  That to me sounds like the biggest problem here.  Why not simply transfer the logic from that file into a programmed SAS format?  As for your logic, I am afraid I don't see how that matches your data, there is no "category" variable given, so what is 1 or 1.1 etc.

As for the question, the most efficient method of processing data is to use SAS procedures, be that means, freq, or any of the other thousands of stats procedures given.  Structure your data in a good way, then use by groups and such like to arrive at the results you need.  I would suggest firstly, dropping all the macro, and the Excel file, and working closely with Base SAS (which is the programming language) to get a decent program working first before you try to stretch it to include other items.

Respected Advisor
Posts: 4,173

Re: Generating a report from huge data set

I would hope that each row in your underlying data set is assigned to exactly one lowest level category (so it can't be in two). 

 

If so then I believe what you're in the end dealing with is a fact table (that's your "underlying data" and a dimension table - your categories). So what you need to build up now are the hierarchies.

 

The ugly thing is that you're just getting the logic for creating the dimension and that this is in Excel. You can try to process that programatically - but if the Excel is not very clean then I fear this will become a manual excercise. 

I assume someone had already a similar report. If you can then try and get hold of the actual dimensional data as well.

 

What I would be doing once you've got the dimensions: Take your fact table and add as many columns as you've got hierarchies (level1, level2, level3,....).

Level1 would be the top level containing "1,2,3", level2 would be "1.1, 1.2, 2.1,..."

Once you're there you can use Proc Tabulate, Proc Report or Proc Means have them do the job for you.

Super User
Posts: 11,343

Re: Generating a report from huge data set

Provide an example input data set and the desired result.

This

Category                Amount

Category 1             34

Category 1.1          27

Category 1.1.1       10

Category 1.1.2       17

Category 1.2          7

Category 1.2.1       7

without input basically tells us next to nothing and the example:

RowID          Product          Subproduct          Counterparty          Amount

1                   A                    M                          X                            10

2                   A                    M                          Y                             5

3                   B                    M                          Z                             7

has nothing related to your "category" listings. Categories shoud be defined in terms of variables and values in the actual data.

 

The instructions at https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn your input data set (or an example small enough to demostrate your issue) into data step code that can be pasted to the forum or attached as a TXT file.

Ask a Question
Discussion stats
  • 4 replies
  • 267 views
  • 0 likes
  • 5 in conversation