DATA Step, Macro, Functions and more

Many datasets with variables of different FORMATS and LENGTHS--"Homogenise" FORMATS & LENGTHS

Reply
Contributor
Posts: 40

Many datasets with variables of different FORMATS and LENGTHS--"Homogenise" FORMATS & LENGTHS

Dear all,

 

I have a library with many datasets. I am only considered in two special sets of datasets

1st group ADS_HLK_date1, PLKS_ERT_HLK_date2, HIO_TRE_HLK_OPT_date3 etc..
(all of these datasets of interest have in common _HLK_ )

 

2nd group TUI_FRT_GOO_date1, HGK_ERT_GOO_999_date2, HIO_TRE_GOO_LIT_date3 etc..
(all of these datasets of interest have in common _GOO_ )

 

The real problem is although both sets of tables they have identical variables

NUMVAR1, CHARVAR1,NUMVAR2,NUMVAR3,NUMVAR4,NUMVAR5,CHARVAR2,CHARVAR3 etc.

 

in the first set (i.e._HLK_) some numeric variables NUMVAR(i) have CHARACTER FORMATS while
in the second set they have NUMERIC FORMATS


the same applies for the character variables CHARVAR(i) , in one set (i.e. _GOO_)
have character formats of certain type i.e. $CHAR20. but in the other set they appear to haave different character formats i.e. $15.

 

I would like to "homogenise":

 

1st-- the numeric formats across ALL sets (both _GOO_ and _HLK_)
by changing the LENGTH and CHARACTER FORMAT to the ones with the NUMERIC represantation

 

2nd-- the various character formats and LENGTHS across ALL datasets.

 

in order to CONCATENATE all datasets in one table.

 

I would more than welcome any hints/suggestions.

 

Thank you in advance.

Super User
Posts: 17,840

Re: Many datasets with variables of different FORMATS and LENGTHS--"Homogenise" FORMATS &a

How are these datasets created? Is there an option to fix them before you append so you don't have to do this in the first place?

Contributor
Posts: 40

Re: Many datasets with variables of different FORMATS and LENGTHS--"Homogenise" FORMATS &a

Hi Reeza,

 

They are csv. files and I use "Import Data" with Enterprise Guide..Unfortunately it would take me ages to correct the csv. files..

 

Thank you.

Super User
Posts: 17,840

Re: Many datasets with variables of different FORMATS and LENGTHS--"Homogenise" FORMATS &a

Is it the CSV files or the IMPORT that's incorrect?

Are the files the same?

Contributor
Posts: 40

Re: Many datasets with variables of different FORMATS and LENGTHS--"Homogenise" FORMATS &a

The original csv files are the same..

 

but the EG IMPORT produces the "inconsistencies"...differences..

Super User
Posts: 17,840

Re: Many datasets with variables of different FORMATS and LENGTHS--"Homogenise" FORMATS &a

[ Edited ]

Look at this data step. It will read all files in a single folder at once, if they have the same layout. First, use PROC IMPORT on a single file. Get the code from the log - primarily the FORMAT/INFORMAT/INPUT statement. 

 

Then, use the instructions to import all at once. This is less work than the solution to 'homogenize' will be, but at the end of the day it will be your choice. One question - especially if there are character variables, how sure are you SAS read all of them correctly and that you didn't miss anything? 

 

https://communities.sas.com/t5/SAS-Communities-Library/How-do-I-write-a-macro-to-import-multiple-tex...

Contributor
Posts: 40

Re: Many datasets with variables of different FORMATS and LENGTHS--"Homogenise" FORMATS &a

Although I would not know how to fix that within the csv...

 

I am afraid I have to come up with a process in SAS

 

Super User
Posts: 10,500

Re: Many datasets with variables of different FORMATS and LENGTHS--"Homogenise" FORMATS &a

If these files are supposed to be the same then you do not allow any of the import tasks to guess for each input set. Use a data step to read each one and you can specify the variable names, types, lengths, formats and such.

 

A not uncommon practice is to use an import task (or proc import code) once and then to capture the Log generated. The task actually creates data step code that can be saved and editted with the changes you want. Once you have the data behaving as desired keep the code and change the input file name and the output dataset name.

 

If you have multiple file layouts then you do this once for each layout.

 

Some things such a variable names or formats can be changed using proc datasets on existing data but to change variable type requires either re-reading the data or a separate data step operation on each data set to rename old variable and recreate with the new name. Which may be more code work than modifying a program to read a single file layout and rereading all the similar sets.

Ask a Question
Discussion stats
  • 7 replies
  • 93 views
  • 1 like
  • 3 in conversation