03-07-2017 12:04 PM
Dear SAS Experts,
We have a design question regarding substring storage, analysis, and reporting.
The incoming variable has the format 7051-3984,7514-7594,6843-6848,6677-6685,1212-6561 and can be longer.
The codes (4-digit numbers) identify individual parts of animals and how those are packed or treated.
We want to optimize the analysis of the data and make it as fast as possible.
New code combinations are delivered to us on a daily basis.
The data is stored on our SAS server, but all in one variable in one dataset.
How do we separate the codes to ensure a fast analysis, for example by 7051 "chicken"
and 6843-6848 "meat, packed"?
Our idea was to leave it all in one variable and use string searching to look through the
codes, but we believe that will be too slow for about 100 million data rows.
We can't predict how long the code string will be, so separating it into individual variables is not straightforward.
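For illustration, here is a minimal sketch of the kind of separation we mean; the string and all names are made up:

```sas
/* Illustrative only: split one code string on commas and dashes */
data codes;
   CharacterString = '7051-3984,7514-7594,6843-6848';
   do i = 1 to countw(CharacterString, ',-');
      Code = scan(CharacterString, i, ',-');   /* one 4-digit code */
      output;                                  /* one row per code */
   end;
   drop i;
run;
```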
03-07-2017 12:55 PM
It isn't quite clear what you mean by "in one variable". Do you mean that for a single record you may have both the "chicken" and "meat, packed" codes in that one variable?
Could you show a few example rows of your existing data to clarify?
If so, it may well be worth the not-too-complex effort to restructure the data to a single code value per row. Subsetting the data may then be quicker, and a WHERE clause with the IN operator, together with an index, may be a good idea.
Secondly, indicator variables to flag whether a record is in some group or groups may help, but maintaining them may be a bit time consuming if the data values change as frequently as you say.
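A minimal sketch of that idea, assuming the data has been restructured into a table CODES with one 4-digit code per row in a variable XCode (all names are illustrative):

```sas
/* Build an index on the code column, then subset with WHERE ... IN */
proc datasets library=work nolist;
   modify codes;
   index create XCode;
quit;

data subset;
   set codes;
   where XCode in ('7051', '6843');   /* e.g. chicken, meat packed */
run;
```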
03-07-2017 12:56 PM
As you can imagine, this leads to a few questions before any useful advice can be provided:
In one case, you use a four-digit number as the item of interest; in the other, two numbers separated by a dash. Do you need to analyze one, the other, or both?
Is the data always in the form "9999-9999,9999-9999...", or can it vary?
You mention "100 million data rows". Is that the total since the beginning? Over what period? How many rows do you receive each day?
Can you elaborate a little on how you're storing it now? Expand on the "one variable in one dataset".
And, most important, what kinds of analytical results do you want from the data? What questions are you trying to answer?
03-08-2017 03:22 AM
Many thanks for your reply.
With one variable I mean one column.
It is one big character/string column that holds the codes.
"If so it may well be worth the not-too-complex effort to restructure the data to a single code value"
- but we never know how long the code string will be. It could be 100 columns or just 10.
We would need to loop through each string and create columns dynamically.
Thanks for your advice!
"In one case, you use a four digit number as the item of interest. In the other one, you use the two numbers separated by dashes"
Sometimes they are separated by dashes, sometimes by commas. What is certain is that each code is 4 digits.
"Is the data always in the form "9999-9999,9999-9999...", or can it vary?" -> It can vary.
I am not sure how many rows we receive per day, but it's around 10 million per year.
Right now we don't have that big column with codes in it. We have a much simpler structure and
it is stored in one SAS dataset.
Good question about the reporting. We want to analyse data according to the individual
codes in the big code string. For example, give me all data with "chicken+packed+meat".
I can build dimension tables with those codes and their text values. No problem, but
how do I efficiently look through the facts that have their dimension keys
all in a single column?
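For reference, the string-searching approach we worry about would look roughly like this; FINDW treats each code as a word delimited by commas and dashes, and the dataset and variable names are made up:

```sas
/* Full scan of the big code column - likely slow on ~100m rows */
data chicken;
   set facts;                                /* facts holds the code string */
   if findw(CodeString, '7051', ',-') > 0;   /* whole-word match for one code */
run;
```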
03-08-2017 05:09 AM
How about something like the attached image to store the data, and an OLAP cube to analyse it?
Putting the information horizontally, i.e. the codes (chicken, packed, meat, frozen, etc.), could be endless and not foreseeable.
However, a vertical table (as attached) would solve the storage but not the reporting/analysis.
For reporting/analysis reasons it is critical to have all the information available horizontally. I was wondering if an OLAP cube is the right thing to do the analysis later on and display the information next to each other on a horizontal axis.
03-08-2017 07:34 AM
This problem represents what my former organization called a "multiple response" variable, where a single respondent could have one or more valid responses (in our case it was to questions like "with which languages are you fluent").
We had a number of purpose-built analytical tools to use with these variables, but the only general-purpose approach that I found useful was to create a long, thin table with a respondent ID and each of the one to many responses on a separate row.
The following code should create something similar for you, with some simulated data that I created.
You can then use SQL to find the ID values that match combinations, like "6677", or "7012 and 3576", or "0625 and 4824 but not 3367".
Give it a try with a small-scale example, and see if it's going in the right direction.
data have;
   infile cards dlm='09'x;   /* tab-delimited simulated data */
   length ID 8 CharacterString $500;
   input ID CharacterString;
cards;
1	7051-3984,7514-7594,6843-6848
;
data want(keep=ID XCode);
   set have;
   length XCode $4;
   WordNumber = 1;
   XCode = scan(CharacterString, WordNumber, ",-", "o");
   do while (XCode ne ' ');
      output;
      WordNumber + 1;
      XCode = scan(CharacterString, WordNumber, ",-", "o");
   end;
run;
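The combination queries mentioned above could then be sketched in PROC SQL against the long WANT table (ID, XCode); the specific codes are just examples:

```sas
proc sql;
   /* IDs having both 7012 and 3576 */
   select ID from want where XCode = '7012'
   intersect
   select ID from want where XCode = '3576';

   /* IDs having 0625 and 4824 but not 3367 */
   (select ID from want where XCode = '0625'
    intersect
    select ID from want where XCode = '4824')
   except
   select ID from want where XCode = '3367';
quit;
```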
03-08-2017 07:03 AM - edited 03-08-2017 07:08 AM
From what I understand of your data, the organization @PhilipH suggests looks promising to me for permanent data storage and data maintenance.
As for reporting & analysis, I'd say that OLAP cubes as a storage concept are outdated. The reporting & analysis requirements sound to me very much like a use case for VA (SAS Visual Analytics).
I haven't done enough with VA to advise how your data should best be structured there, but I'm sure others will be able to give you guidance in this area if VA is an option for you.
The process as such would look like:
1. ETL into a data structure as suggested by @PhilipH ("data warehouse")
2. Incremental or full load from "data warehouse" into VA into a denormalized structure optimized for VA reporting & analysis
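A rough sketch of step 2, assuming a long warehouse table WANT (ID, XCode) sorted by ID; PROC TRANSPOSE turns it into one row per ID with one flag column per code (all names are illustrative):

```sas
/* Denormalize long to wide: one indicator column per code */
data long;
   set want;     /* long "data warehouse" table, sorted by ID */
   Flag = 1;
run;

proc transpose data=long out=wide(drop=_name_) prefix=Code_;
   by ID;
   id XCode;
   var Flag;
run;
/* WIDE now has columns like Code_7051, Code_6843, ... (1 or missing) */
```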