02-15-2009 10:14 AM

Hello,

I need to analyze data fields in some large flat files, and I was thinking about using PROC FREQ to get a list of unique values for every column for starters.

We have the Enterprise version of SAS, but I haven't used SAS in several years, so I'm wondering what SAS tools may be available to get a "bird's-eye view" of the data, or at least a list of unique values for every column in these files?

Right now, I'm thinking about running a PROC FREQ for every column and comparing the unique values in those outputs to determine data relationships. Before doing that, I wanted to check with y'all to see whether you could suggest a more effective approach to this kind of data analysis.

Thanks very much for any insights!

BobK

02-16-2009 09:07 AM

Enterprise Guide uses SAS in a client/server configuration, so somewhere you have the complete SAS. SAS has several products that include the word "Enterprise" as an adjective.

In Enterprise Guide, you can use the "Characterize Data" task to get a quick view of a data set. It won't do everything you want, but it will give you a high-level overview of each file.

Doc Muhlbaier

Duke

02-16-2009 04:26 PM

I don't see much value in obtaining a "bird's-eye view" that consists of the unique values of every variable. I doubt it would help me in this situation.

Usually, for us to provide advice, and for you to make meaningful progress in understanding what the data is telling you, you need to formulate actual questions that you want to answer from the data.

Data analysis is pointless without some clear question(s) to answer. It's like computing the average phone number from a list of phone numbers: of course you can do the calculation, but why bother?

02-16-2009 06:57 PM

Hi, Paige:

Long ago, in a galaxy far, far away, when I was a young SAS newbie (no comments from the peanut gallery, those of you who know how -long- ago that was when I was "young"), "characterize your data" meant doing a bunch of PROC FREQs on all the character variables in the dataset and a PROC UNIVARIATE on all the numeric variables (to find the extreme obs, the mean and the median values).

And, sure enough, if you check out the EG "Characterize Data" task, it's doing just that -- the equivalent of PROC CONTENTS, PROC FREQ and PROC UNIVARIATE -- along with some GCHARTS thrown in to graphically show you the data and some PROC PRINTS of the data.

When I worked for lawyers, we'd give them that big stack of paper and then they'd formulate the questions (let's see the salary history of these 20 people; who makes over the median salary in these job categories, etc, etc.) but they'd always want the reams and reams of initial "paper" first.

cynthia
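
For anyone without Enterprise Guide, the by-hand approach described above might look roughly like the following. This is only a minimal sketch: WORK.LOANS is a made-up data set name standing in for one of the imported flat files, and the step assumes the data set has at least one character and one numeric variable. It is not the code the "Characterize Data" task actually generates.

```sas
/* Hypothetical input: WORK.LOANS stands in for one imported flat file. */
%let ds = work.loans;

/* Variable names, types and lengths, listed in creation order. */
proc contents data=&ds varnum;
run;

/* One-way frequency table for every character variable,
   counting missing values as a level of their own. */
proc freq data=&ds;
  tables _character_ / missing;
run;

/* Moments, quantiles (including the median) and extreme
   observations for every numeric variable. */
proc univariate data=&ds;
  var _numeric_;
run;
```

On a wide file this produces a lot of output, so it may be worth restricting the TABLES and VAR lists to the columns of interest.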

02-17-2009 09:13 AM

Cynthia

I have been there and done that myself.

My recommendation above is because I have never found any value in the "PROC FREQ/PROC MEANS on everything" approach. If the researcher cannot specify in advance some things he might want to know about the data, if it's just one big fishing expedition, then I don't expect much success other than by random chance.

So, if someone wants to compute means, or frequencies, on phone numbers, please go right ahead. I'll pass ...

02-18-2009 07:27 AM

Thanks Cynthia. I'm having the Enterprise Guide software installed this week for the first time, so I'm looking forward to giving it a try, especially the "Characterize Data" feature. I'm not expecting any miracles, but if I can just get lists of the finite, higher-level control-type values, that would be a good start. For example, from a list of mortgage types I would be able to say that this file only contains ALT-A, JUMBO and SUBPRIME data, etc.

Cynthia, I do remember running a PROC FREQ on a dollar value in a billion-record file, and the output was huge. So I was wondering whether there is a feature in PROC FREQ or "Characterize Data" that would let me set a limit of, let's say, 10000, so that if the number of unique values found by PROC FREQ or Characterize Data exceeds 10000, it would abort?

Thanks Cynthia and all for your insights!

BobK
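
On the limit question, one possible workaround is to count the distinct values first and only print tables for the low-cardinality columns. The sketch below uses PROC FREQ's NLEVELS option for that. The data set name WORK.LOANS, the cutoff of 10000 and the macro variable names are all assumptions for illustration. Note that this does not abort early: PROC FREQ still reads the whole file and has to track every distinct value, so on a billion-record file the counting step itself can be expensive.

```sas
/* Hypothetical names: WORK.LOANS is the data set, 10000 is the cutoff. */
%let ds = work.loans;
%let maxlevels = 10000;

/* NLEVELS reports the number of distinct values per variable.
   NOPRINT on the TABLES statement suppresses the frequency tables
   themselves; ODS OUTPUT saves the level counts to a data set. */
ods output nlevels=work.level_counts;
proc freq data=&ds nlevels;
  tables _all_ / noprint;
run;

/* Build a space-separated list of variables at or under the cutoff. */
proc sql noprint;
  select TableVar into :lowcard separated by ' '
  from work.level_counts
  where NLevels <= &maxlevels;
quit;

/* Print frequency tables only for those variables. */
proc freq data=&ds;
  tables &lowcard / missing;
run;
```

If no variable falls under the cutoff, the macro variable LOWCARD is never created and the last step fails, so a real job would want to check for that case before running it.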

02-18-2009 09:20 AM

"Characterize data" does allow you to set a limit for the number of distinct values that it shows (the default is 30). It displays the **first** k distinct values, which can be misleading if there is some inherent order in the data that you don't know about. A separate limitation to "Characterize data" is that it only does the frequencies on character variables; it does mean/medians/etc. on numerics. Maybe EGuide 4.2 will be smarter.