SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Data Cleaning

Reply
Contributor
Posts: 23

Data Cleaning

1) This ist the description of the current situation:

The input data sources for the analysis is:

  • Contract_Config file (contains info about access lines)
  • Client (contains company names and company addresses)
  • AdressMaster - contains the addresses and serves as a reference for the correct address data
  • H_File: file from the external data provider; serves as a reference for the correct company names and addresses

The goal is to:

  • connect Contract_Config with AdressMaster in order to find the correct adressess of the access lines
    • Available Fileds for connection in Contract_Config: City, Door, Housenumber, Street, ZIP_Code
    • Available Fileds for connection in AdressMaster: City_Name, Door_Number, Building_Number, Street_Name, ZIP_Code
  • connect Client and H_File to find the correct comapany names and company addresses
    • Available Fileds for connection in Client (company Names): Client_Company, Client_Company2, Client_Firstname, Client_Lastname
    • Available Fileds for connection in H_File (Company Names): Firstname, Lastname, Comapny_Name
    • Available Fileds for connection in Client (company addresses): Client_Street, Client_ZIP_Code, Client_City
    • Available Fileds for connection in H_File (Company adresses): federal_state, door_number, place, ZIP, street, street_number
  • Connect AdressMaster and H_File to connect the addresses

 

2) Since I am new to the software, I find it hard. In fact, I have already spent few days to understand this software. Based on the information above, how can I make the corresponding data jobs? Please send me more detailed information and list the necessary job items/building blocks.

 

3) Additional dfficulty is to find very good tutorials that describe in easy (not in dull and tedious way) the purpose and the correct usae of the individual building blocks in the data job. Are you familiar maybe with any such tutorial?

Occasional Contributor
Posts: 9

Re: Data Cleaning

 

Hi,

 

I think what you are looking to do (rather how I understand it) is to link the different datasets to have a combined dataset.

 

Easiest way that I can think of is to use PROC SQL for this as there seems to be couple of fields with different names that needs to be joined. If you are new to SQL, please do some reading on the topic because the different joins will produce different results. (This paper should help http://www.lexjansen.com/nesug/nesug09/bb/BB03.pdf)

 

If for example you want to join/link Client and H_file information and you assume that Client has all the information you could do the following.

proc sql;
create table client_HComb as
select c.Client_Company, c.Client_Company2, c.Client_Firstname, c.Client_Lastname,
h.Firstname, h.Lastname, h.Comapny_Name
from client c left outer join h_file h on
c.Client_Firstname=h.firstname and c.Client_Lastname=h.Lastname and c.Client_Company = a.Comapny_Name
;
quit;

 

With the left join, if no matches from the H_file (right table) is found the fields from H_file will be missing. 

 

Similar results can be accomplished with Datastep merge but there's a couple of extract hoops that you'll need to jump through.

 

But important read up on SQL before hand. 

 

Best site I can suggest to find tutorials are www.lexjansen.com. Most of the papers presented at conferences are there and if covers from basics to advance. That is my one of my first stops when I need to find something.

 

HTH

 

KR

H

 

 

 

Ask a Question
Discussion stats
  • 1 reply
  • 222 views
  • 0 likes
  • 2 in conversation