🔒 This topic is solved and locked.
seamoh
Obsidian | Level 7

Hi to all the members of the community, I hope you all have the best week. I have a question: why is there so much duplication in SAS datasets (Excel and ...)? Why don't we apply normalization rules before we begin gathering the data?

Thank you very much.

8 REPLIES
Reeza
Super User

Not sure what you mean by duplicates, but regarding the lack of 'normalization': it's typically because SAS was originally Statistical Analysis Software, which takes a different view of data, and normalized data isn't usually good for modelling.


smantha
Lapis Lazuli | Level 10

Base SAS is not a database in the strictest sense, though it comes close to replicating one. Hence the same rules cannot be applied, such as integrity checks on columns (for example, a column being NOT NULL). This is in addition to Reeza's answer that SAS is primarily for modeling data and has a different take on what repeated records mean.

Reeza
Super User
I'll just add on to this: DI Studio is designed to support pipelines and normalization, and to handle more of the data management side. Base SAS, SAS Studio and SAS Foundation are not the same tools, but they can do it regardless.

It's more of a build-your-own approach, though, compared to using an out-of-the-box tool.
seamoh
Obsidian | Level 7

I have been in the SAS area for one month, so you and the other friends here definitely have more experience than me. I was practicing SAS Essentials report 1, and in one exercise there were unnecessary duplicates in the report. Of course, there is plenty of syntax that can remove duplication from reports, and that is how this question came to my mind.
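A minimal sketch of one such technique, PROC SORT with the NODUPKEY option (the dataset and variable names here are hypothetical, purely for illustration):

```sas
/* Keep only the first row for each key value.           */
/* WORK.ORDERS and CUSTOMER_ID are made-up names.        */
proc sort data=work.orders out=work.orders_dedup nodupkey;
   by customer_id;  /* rows sharing the same customer_id collapse to one */
run;
```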

 

1. Is it better to design good, normalized datasets for gathering the data from scratch?

2. Do data mining and feature engineering have a role in data analysis in SAS, or in Python or R?

 

I hope I have explained what is on my mind; perhaps you, as experts, can clarify further.

Thank you very much,

I am looking forward to your answers.

Deeply appreciated

SASKiwi
PROC Star

Normalisation is a concept commonly applied to relational database design to enforce data relationships, and optimise disk storage and database transactional performance. It doesn't have the same relevance for analytical data. 

PGStats
Opal | Level 21

You will notice that the vast majority of SAS analytical procedures expect a single dataset as input. A strategy I have often used is to keep the data in (quasi-)normalized tables and to define SQL views that expand the data on the fly for analysis. For non-trivial datasets, this is the only robust way to ensure data integrity.

PG
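A minimal sketch of this strategy in PROC SQL, assuming two hypothetical normalized tables, WORK.CUSTOMERS (one row per customer) and WORK.ORDERS (one row per order):

```sas
/* Define a view that denormalizes the two tables on demand. */
/* All table and variable names below are invented.          */
proc sql;
   create view work.orders_wide as
   select o.order_id,
          o.order_date,
          o.amount,
          c.customer_id,
          c.region
   from work.orders as o
        inner join work.customers as c
        on o.customer_id = c.customer_id;
quit;
```

Because it is a view, the join is re-evaluated each time an analytical procedure reads WORK.ORDERS_WIDE, so the normalized tables remain the single source of truth.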
seamoh
Obsidian | Level 7

Hello, and thank you very much for the reply. It sounds interesting. Would you please point me to a tutorial that describes your method, or perhaps make a tutorial about it? It would help a lot.

Thank you again.


Discussion stats
  • 8 replies
  • 1137 views
  • 1 like
  • 5 in conversation