bnarang
Calcite | Level 5

Hi All,

I work on banking data for customers and need to create a data mart for different purposes. We read data from different sources, and it might not be properly cleaned or inserted accurately.

I also derive variables from existing ones.

My question is about reading the data in a way that makes it as accurate as possible.

For example, using the UPCASE function when reading character variables that may arrive in upper or lower case, before deriving new variables from them.
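Something like this sketch, where raw_accounts, acct_type and the value 'SAVINGS' are made-up names just to illustrate the idea:

data derived;
   set raw_accounts;                      /* source data with mixed-case values    */
   if upcase(acct_type) = 'SAVINGS' then
      savings_flag = 1;                   /* derivation no longer depends on case  */
   else savings_flag = 0;
run;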

What would be your recommendations?


6 REPLIES
PGStats
Opal | Level 21

Look at functions COMPLEV, COMPGED, SPEDIS and SOUNDEX as ways to find approximate matches.
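For example, a rough sketch of how they could be used; the data set customer_names and the variables cust_name / ref_name are made up, and the cut-offs are arbitrary:

data fuzzy_check;
   set customer_names;                                      /* assumed input data set      */
   lev  = complev(upcase(cust_name), upcase(ref_name));     /* Levenshtein edit distance   */
   ged  = compged(upcase(cust_name), upcase(ref_name));     /* generalized edit distance   */
   sped = spedis(upcase(cust_name), upcase(ref_name));      /* spelling distance           */
   same_sound = (soundex(cust_name) = soundex(ref_name));   /* 1 if phonetic codes match   */
   if ged <= 100 or sped <= 15 or same_sound;               /* keep likely matches only    */
run;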

PG

newbie_ari
Fluorite | Level 6

Hi Bhupinder,

I also work with big data. To read raw data I always use the following approach:

data test;
   infile "file_name.csv" delimiter=',' missover dsd lrecl=32767 firstobs=2;
   input
      name       : $10.
      occupation : $50.
      item       : $25.
   ;
run;

This way you can always be sure that all the data are read with the appropriate informats.

Rgds.

Patrick
Opal | Level 21

The questions I would ask:

- What is the impact if data is not "clean"?

- Do I have a requirement to cleanse the data (and is there time/money planned for this)?

- Where should/can the data cleansing happen (as early as possible - but what is realistic)?

The first things I would look at are the business keys. If these are not clean then you must find a way to cleanse them (or get the source systems to cleanse them). If they are character fields I normally COMPRESS and UPCASE such keys for joining (and then add a surrogate key to the target table so I don't have to care about mismatched business keys later on); see the sketch below.
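A minimal sketch of what I mean; src_customers and cust_id are made-up names:

data stage_customers;
   set src_customers;                        /* assumed source table                 */
   length cust_key $20;
   cust_key = compress(upcase(cust_id));     /* standardized business key for joins  */
run;

data dim_customer;
   set stage_customers;
   surrogate_key = _n_;    /* simple surrogate key; a real load would use a sequence */
run;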

If things are really bad then you can also consider using match codes.

If you need to cleanse the data: Be prepared that this can be quite a bit of work.

There is Ron Cody's book "Data Cleaning Techniques", which might be helpful (SAS Press - Ron Cody Author Page), and there is of course also DataFlux for such tasks.

And for your ETL design: design data validation steps that create error and exception tables to capture issues. Also plan the actions to be taken when issues are found (e.g. stop the load and send emails/reports on issues, or continue the load, or ...). So you need to plan not only for technical processes but also for business processes (who monitors these tables, who needs to take action, and which action).
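For example, a validation step could route suspect rows into an exception table, something like this sketch (stage_accounts, the variables and the rules are invented for illustration):

data clean_accounts exceptions;
   set stage_accounts;                       /* assumed staging table        */
   length reason $40;
   if missing(cust_key) then do;
      reason = 'Missing business key';
      output exceptions;
   end;
   else if balance < 0 then do;
      reason = 'Balance out of valid range';
      output exceptions;
   end;
   else output clean_accounts;
run;

The exception table can then drive the reports/emails and the stop-or-continue decision.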

bnarang
Calcite | Level 5

Thanks Patrick. Basically, I work on creating a SAS process for data mart production which fetches historical data of up to 100 million records, and sometimes more, from different sources (never txt or csv, always a DBMS). But I have realized there are loopholes and data issues.

While reading the data we must apply some checks to make sure that whatever we read or derive from it cannot go wrong. The UPCASE example I mentioned is a realization I made recently when I found an error in the output; I need to use the UPCASE function as well.

But I would need more examples like this to avoid committing similar errors in the future.

I can possibly look into the books you have referred to.

Bhupinder

Patrick
Opal | Level 21

The most basic things you might want to check (a sketch of a few of these checks follows the list):

- missings

- data out of a valid range (if there is a valid range defined)

- invalid characters (e.g. control characters)

- duplicate keys or rows (only if this is still possible even though you are sourcing from a DBMS, e.g. when combining tables from two different DBMS).

And for look-ups:

- no match on the look-up key (if there must be a match, e.g. to a row in a reference table).
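A rough sketch of a few of these checks; stage_accounts, ref_products and the variable names are invented, and the valid range is just an assumption:

/* missings, range and control-character checks */
data check_flags;
   set stage_accounts;
   flag_missing_key = missing(cust_key);
   flag_bad_range   = (balance < 0 or balance > 1e9);    /* assumed valid range            */
   flag_ctrl_char   = (notprint(strip(cust_key)) > 0);   /* non-printable characters found */
run;

/* duplicate business keys */
proc sort data=stage_accounts out=unique_keys nodupkey dupout=dup_keys;
   by cust_key;
run;

/* look-up with no match in the reference table */
proc sql;
   create table no_match as
   select a.*
   from stage_accounts a
   where a.product_code not in (select product_code from ref_products);
quit;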

To get a feel for the data quality, PROC FREQ and PROC UNIVARIATE are quite helpful, e.g. a simple frequency count over your variables followed by "eyeballing" the cases with low frequency (say product names, where you find names that only occur a few times in a data set with millions of rows).
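E.g. a minimal profiling sketch (stage_accounts, product_name, acct_type and balance are assumed names):

proc freq data=stage_accounts order=freq;
   tables product_name acct_type / missing;   /* low counts often point at typos   */
run;

proc univariate data=stage_accounts;
   var balance;                               /* extremes, missings, distribution  */
run;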

"But I would need more examples like this to avoid any future possibilities to commit any errors."

If data quality is not good then it's very hard to foresee all possible future cases. I would concentrate on the business keys so that bad data there can't mess up relationships (and you suddenly end up with many-to-many joins).

bnarang
Calcite | Level 5

Thanks for the reply. It makes a lot of sense to first analyse the input data with basic procedures like FREQ and UNIVARIATE before blindly processing it.

Thanks


Discussion stats
  • 6 replies
  • 1037 views
  • 4 likes
  • 4 in conversation