Efficient “One-Row-per-Subject” Data Mart Construction for Data Mining
Gerhard Svolba, SAS Austria
Abstract
Creating a "one-row-per-subject" data mart is a fundamental task when preparing data for data mining. To answer
the underlying business question, the analyst or data-mart programmer is challenged to distill the relevant
information from various data sources.
Creating a data mart involves more than reading columns from a source table to the data mart. It also includes the
aggregation or transposition of observations from "multiple-row-per-subject" tables like transactional tables and time
histories. This process is a critical success factor for being able to answer the business question or to have good
predictors available for a target event or target value.
This paper describes the input data structures for a "one-row-per-subject" data mart. It also discusses the
"one-row-per-subject" paradigm from a technical and a business point of view, and shows how data, some of which
include hierarchical dependencies, are aggregated into a single row per subject. A comprehensive example shows
how a "one-row-per-subject" data mart is created from various data sources.
Watch the presentation
This SAS Global Forum paper has also been published as a number of presentations in my Data Preparation for Data Science webinar series.
DOWNLOAD THE FULL PAPER
https://support.sas.com/resources/papers/proceedings/proceedings/sugi31/078-31.pdf
DOWNLOAD THE SLIDES
- Slide Collection: Slide Decks #121 and #122 at https://github.com/gerhard1050/DataScience-Presentations-By-Gerhard
- The original presentations from SUGI 31 (2006) are in the attachment. The reworked presentation is available at the link above.
CONCLUSION
This paper shows that data preparation is a key success factor for analytical projects. Besides multiple-row-per-subject data marts and longitudinal data marts, the one-row-per-subject data mart is frequently used and is central to statistical and data mining analyses.
In many cases the available source data have a one-to-many relationship to the subject itself, so the data need to be transposed or aggregated. The choice of transposition and aggregation methods is driven primarily by business considerations, namely which derived variables will be meaningful for the analysis.
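To make the distinction between aggregation and transposition concrete, here is a minimal sketch in Python (standard library only). It is not taken from the paper, which works in SAS. The table `transactions`, its columns `cust_id`, `month`, and `amount`, and the function `build_one_row_per_subject` are hypothetical names chosen for illustration, and the sketch assumes at most one source row per customer and month.

```python
from collections import defaultdict

# Hypothetical multiple-row-per-subject table: one row per customer and month.
transactions = [
    {"cust_id": 1, "month": "2006-01", "amount": 100.0},
    {"cust_id": 1, "month": "2006-02", "amount": 50.0},
    {"cust_id": 2, "month": "2006-01", "amount": 20.0},
    {"cust_id": 2, "month": "2006-02", "amount": 40.0},
    {"cust_id": 2, "month": "2006-03", "amount": 60.0},
]


def build_one_row_per_subject(rows):
    """Return one dictionary per customer with aggregated and transposed columns."""
    # All months that occur anywhere, so every subject gets the same columns.
    months = sorted({r["month"] for r in rows})

    # Group the source rows by subject.
    by_subject = defaultdict(list)
    for r in rows:
        by_subject[r["cust_id"]].append(r)

    mart = []
    for cust_id in sorted(by_subject):
        subject_rows = by_subject[cust_id]
        amounts = [r["amount"] for r in subject_rows]

        # Aggregation: condense many rows into a few summary values.
        record = {
            "cust_id": cust_id,
            "n_trans": len(amounts),
            "amount_sum": sum(amounts),
            "amount_mean": sum(amounts) / len(amounts),
        }

        # Transposition: turn each month's row into its own column.
        # Assumes at most one row per customer and month.
        per_month = {r["month"]: r["amount"] for r in subject_rows}
        for m in months:
            record[f"amount_{m}"] = per_month.get(m)  # None marks a missing month

        mart.append(record)
    return mart


mart = build_one_row_per_subject(transactions)
for row in mart:
    print(row)
```

Which summary statistics to compute, and whether a monthly column is worth keeping at all, is the business decision described above; the sketch only shows the mechanics.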
This paper shows you ways to create a one-row-per-subject data mart from a conceptual and a coding point of view.
Readers interested in more detail on Data Preparation for Analytics are referred to my book of the same name, which will be published by SAS Press (Book# 60502) in September 2006.
Recommended Reading
- Feature Engineering 1 - Using Correlation Analysis to Describe Behavior over Time
- Feature Engineering #2 - Accordance with predefined pattern
- Feature Engineering #3 – Describing the Trend over Time
- 3 ways to consider movable holidays in SAS
ASK THE EXPERT SEMINAR (in German)
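- Womit Sie „DATA=“ in den analytischen Procedures von SAS am besten füttern - Teil 1 (What to best feed "DATA=" in the SAS analytical procedures - Part 1)
- Womit Sie „DATA=“ in den analytischen Procedures von SAS am besten füttern - Teil 2 (What to best feed "DATA=" in the SAS analytical procedures - Part 2)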