Solved: Re: Column name contains / like => adaptive data acceptance

TeddoVanMierle · Posted 07-28-2019 06:38 AM

Collecting data from several resources in different countries, in order to create an uniform data set, the following issue occurs.

Although using data templates for the country uploaders we still keep having small differences in the uploaded column names.

e.g. Sales_price_-/Ton; sales_price_€/Ton, Customer_type, customer type etc.

In total there are 10 unique columns per uploaded dataset. There are 9 datasets to be append after uniformation.

Although every country has it's own unformation datastep, still the program stoppes when an unexpected column name occurs.

e.g. Expected = Customer_type, recieved = Customer type.

What I want to achieve is to create an adaptive data acceptance model.

If a column name contains '%price%' then column name = Sales_Price;

If a column name contains '%type%' then column name = Customer_Type.

etc.

Having 9 countries and several materials, we are facing a 75 uniformation processes.

Therefor I think the best option would be to declare the variables in advance, covering all the processes and possible differences.

How could we get this code working?

LIBNAME TMP00001 "E:\SAS94\Lev1\Data\SASVisualAnalytics\ETPmanualUpload";
options validvarname=any;

%let Sales_Price = '%EXW%';
%let Customer_type = '%type%';

data UPLOAD_VIEW;
set TMP00001.'etp_upl_fr_wood_2019'n;

&Sales_Price;
&Customer_Type;
if 'Volume (in Ton)'n > 0;
run;

Thanks in advance!

TeddoVanMierle · Posted 07-31-2019 07:25 AM

LIBNAME TMP00001 "E:\SAS94\Lev1\Data\SASVisualAnalytics\ETPmanualUpload";

*POLAND;
data ETP_UPL_PL_PLASTICS_2019;
set TMP00001.'etp_upl_pl_PLASTICS_2019'n;

if 'Volume (in Ton)'n > 0;

run;

proc transpose data=ETP_UPL_PL_PLASTICS_2019(obs=0) out=vars;
var _all_;
run;

filename PTPL2019 temp;
data _null_;
file PTPL2019;
set vars;
put 'Rename "' _name_ '"n=Col' _n_ ';';

run;

proc datasets;
modify ETP_UPL_PL_PLASTICS_2019;
%inc PTPL2019;
run;
quit;

data ETP_UPL_PL_PLASTICS_2019 (drop=Col1 Col2 Col3 Col4 Col5 Col6 Col7 col8 Col9 Col10 Col11 Col12);
set ETP_UPL_PL_PLASTICS_2019;

format Upload_Date date9.;

Upload_Date = Col1;
Country_ISO = 'PL';
Month_No = input(Col3, best.);
Year = input(Col4, best.);
Customer_Name_Local = put(Col5, $55.);
Customer_type = put(substr(Col6,1,1),$25.);
Plastic_Grade_Name = input(Col7, $50.);
Plastic_Group_Grade = input(Col8, $50.);
Volume_Ton = input(Col9, best.);
EXW_SalesPrice_Ton = input(Col10, best.);
Sales_Area = input(Col11, $10.);
'EU/EXP/VAR'n = input(Col12, $15.);
run;

Solved!! Thank you all!!!

View solution in original post

ChrisNZ · Posted 07-28-2019 07:14 PM

Have you looked at using SQL union operator?

If the order of the columns is reliable, that's a good option as the names are ignored and only the column order matters.

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 07-28-2019 07:27 PM

If the column order is unreliable, maybe a prior step such as this one would suit your needs.

It arranges the column order without fully naming the variables, provided variable prefixes are sufficient to differentiate variables.

data CLASS;
   merge SASHELP.CLASS(keep=AG:);
   merge SASHELP.CLASS(keep=NA:);
   merge SASHELP.CLASS(keep=HE:);
   merge SASHELP.CLASS(keep=WE:);
   merge SASHELP.CLASS(keep=SE:);
run;

What I am trying to do here is avoid macro code as your question shows a lack of understanding of the basics of macro language.

High-Performance SAS Coding - Third Edition

TeddoVanMierle · Posted 07-30-2019 04:28 AM

Triggered by the =AG: option, I tried to combine this with a replace statement.Without success.

A macro culd be the best solution I think.

The reason why we need to recognize the column name by a part of the name is due to the uniformation process.

A few columns are supposed to be numerical, but due to used excel formats (as result from ERP export) some of them are categorical.

In the code are lines that convert these specific columns to numeric when they are categorical.

Further there's an upload date which is in some cases not recognized as date. This is also converted in the process.

There are 75 files in a scheduled process running every 15 minutes. Goal is to have this fully AI driven.

Evaluating several options it seems to me that recognizing column names by part of the name is key to a solution.

What do you think?

Could it be possible? (can't find much information about this)

Thanks in advance!

Reeza · Posted 07-30-2019 12:25 PM

Change your process to have an online form that collects structured data instead of cleaning up or attempting this after the fact. AirTable is a great service that does a great job at this.

Now, assuming that's not possible for whatever reason, you still have to clean this data. Assuming this is a process that is ongoing, I'd probably not generalize it - there are too many things you would need to account for. Instead, hopefully, you can assume that the file will always come in from that country/team in the same format and you can process each file independently and streamline it. I would recommend converting the file to a text file and importing it, which gives you more control than Excel - VBS can do that easily, there's a code on my GitHub profile that demonstrates this. Then once the datasets are each in clean, you can combine and process them to a combined data set for analysis. A painful data engineering task, but probably likely to give you the highest quality data.

ChrisNZ · Posted 07-30-2019 07:34 PM

>Triggered by the =AG: option, I tried to combine this with a replace statement.

I am unsure what this means

> Evaluating several options it seems to me that recognizing column names by part of the name is key to a solution.

Is the prefix enough?

> In the code are lines that convert these specific columns to numeric when they are categorical.

Further there's an upload date which is in some cases not recognized as date. This is also converted in the process.

That's a much different -larger- requirement, when the data types change rather than just the names.

Is the order or of the columns constant?

High-Performance SAS Coding - Third Edition

TeddoVanMierle · Posted 07-31-2019 03:45 AM

Below a picture of one of the processes and snippets of code.

The order of the columns is constant. Ironically, in Python would this be a simple task. Just rename the Dataset[ ,2] second column.

Yesterday evening I sutmbled over this code which should rename the column names to numerical indexes.

Didn't try this yet.

data reference;
    set sashelp.heart;
    array _n
   _numeric_;
*Reference to numeric column 3;
    if _n[3] lt 50;
run;

Regarding to =AG: => I hoped that replacling Name for Na: would do the trick, but it doesn't. Having a replace statement on a 'joker' variable would answer the question.

data newclass (drop=Name);
   *length poprank 8;
   set CLASS(rename=(Name=tempvar));
   *Name=input(tempvar,8.);
run;

LIBNAME TMP00001 "E:\SAS94\Lev1\Data\SASVisualAnalytics\ETPmanualUpload";

data ETP_UPL_FR_WOOD_2019 (drop='Volume (in Ton)'n 'EXW Sales price (€/Ton)'n 'EXW Sales price (¬/Ton)'n 'Types of Customer'n 'Month #'n);
	set TMP00001.'etp_upl_fr_wood_2019'n;

		EXW_SalesPrice_Ton = input('EXW Sales price (€/Ton)'n, best.);
		Volume_Ton = input('Volume (in Ton)'n, best.);
		Country_ISO = 'FR';

		if Volume_Ton > 0;
		if year = 2019;
run;

data ETP_UPL_CZ_WOOD_2019 (drop='Volume (in Ton)'n 'EXW Sales price (€/Ton)'n);
	set TMP00001.'etp_upl_cz_wood_2019'n;

		EXW_SalesPrice_Ton = input('EXW Sales price (€/Ton)'n, best.);
		Volume_Ton = input('Volume (in Ton)'n, best.);
		Country_ISO = 'CZ';

		if Volume_Ton > 0;
		if year = 2019;
run;

data ETP_UPL_BE_WOOD_2019 (drop='Volume (in Ton)'n 'EXW Sales price (€/Ton)'n);
	set TMP00001.'etp_upl_be_wood_2019'n;

		EXW_SalesPrice_Ton = input('EXW Sales price (€/Ton)'n, best.);
		Volume_Ton = input('Volume (in Ton)'n, best.);
		Country_ISO = 'BE';

		if Volume_Ton > 0;
		if year = 2019;
run;

data WOOD_2019;
  set	ETP_UPL_FR_WOOD_2019
			ETP_UPL_BE_WOOD_2019
				ETP_UPL_CZ_WOOD_2019
					ETP_UPL_LU_WOOD_2019
						ETP_UPL_GB_WOOD_2019
							ETP_UPL_PL_WOOD_2019
								ETP_UPL_NL_WOOD_2019
									ETP_UPL_DE_WOOD_2019
										ETP_UPL_SE_WOOD_2019;
										


run;

TeddoVanMierle · Posted 07-31-2019 04:36 AM

This could actually work! Tried it and will modify it to our data.

data heart;
	set sashelp.heart (obs=10);
run;

proc transpose data=heart(obs=0 /*keep=_numeric_*/) out=vars;
	var _all_;
run;

filename FT76F001 temp;
data _null_;
	file FT76F001;
	set vars;
	put 'Rename ' _name_ '=Col' _n_ ';';
run;

proc datasets;
	modify heart;
	%inc FT76F001;
run;
quit;

TeddoVanMierle · Posted 07-31-2019 07:11 AM

Modified the code to the dataset.

It works only on column names without a space.

In reality people tend to forget adding underscores. It means that if there's a option to modify this code to accept column names with spaces, it would completely suite the task.

LIBNAME TMP00001 "E:\SAS94\Lev1\Data\SASVisualAnalytics\ETPmanualUpload";

data UPLOAD_VIEW;
	set TMP00001.'etp_upl_pl_wood_2019'n;

 if 'Volume (in Ton)'n > 0;
run;

proc transpose data=UPLOAD_VIEW(obs=0) out=vars;
	var _all_;
run;

filename FT76F001 temp;
data _null_;
	file FT76F001;
	set vars;
	put 'Rename ' _name_ '=Col' _n_ ';';
run;

proc datasets;
	modify UPLOAD_VIEW;
	%inc FT76F001;
run;
quit;

ChrisNZ · Posted 07-31-2019 07:20 AM

put 'Rename "' _name_ '"n=Col' _n_ ';';

High-Performance SAS Coding - Third Edition

TeddoVanMierle · Posted 07-31-2019 07:25 AM

LIBNAME TMP00001 "E:\SAS94\Lev1\Data\SASVisualAnalytics\ETPmanualUpload";

*POLAND;
data ETP_UPL_PL_PLASTICS_2019;
set TMP00001.'etp_upl_pl_PLASTICS_2019'n;

if 'Volume (in Ton)'n > 0;

run;

proc transpose data=ETP_UPL_PL_PLASTICS_2019(obs=0) out=vars;
var _all_;
run;

filename PTPL2019 temp;
data _null_;
file PTPL2019;
set vars;
put 'Rename "' _name_ '"n=Col' _n_ ';';

run;

proc datasets;
modify ETP_UPL_PL_PLASTICS_2019;
%inc PTPL2019;
run;
quit;

data ETP_UPL_PL_PLASTICS_2019 (drop=Col1 Col2 Col3 Col4 Col5 Col6 Col7 col8 Col9 Col10 Col11 Col12);
set ETP_UPL_PL_PLASTICS_2019;

format Upload_Date date9.;

Upload_Date = Col1;
Country_ISO = 'PL';
Month_No = input(Col3, best.);
Year = input(Col4, best.);
Customer_Name_Local = put(Col5, $55.);
Customer_type = put(substr(Col6,1,1),$25.);
Plastic_Grade_Name = input(Col7, $50.);
Plastic_Group_Grade = input(Col8, $50.);
Volume_Ton = input(Col9, best.);
EXW_SalesPrice_Ton = input(Col10, best.);
Sales_Area = input(Col11, $10.);
'EU/EXP/VAR'n = input(Col12, $15.);
run;

Solved!! Thank you all!!!

ChrisNZ · Posted 07-31-2019 06:17 PM

@TeddoVanMierle Remarks:

1. You are not changing data type

2. Proc contents seems more natural to use instead of proc transpose to list the variables names

3. You could apply a format on _COLn to arrive directly at your final wanted variable name

4. All this convoluted process could be replaced with a simple union in SQL as I mentioned

High-Performance SAS Coding - Third Edition

TeddoVanMierle · Posted 08-02-2019 04:09 AM

@ChrisNZ , thank you very much for your feedback!

The complete picture.

This code was only for answering the question the be able to recognize the uploaed column names.

Despite the column name used for uploading, it's now clear that Upload_Date will be always Col1, Country_ISO = Col2 etc.

Now Col1, Col2 etc. can be renamed to the standard names.

The overall goal is creating 1 uniform and standardized datalake out of 75 processes from multiple countries.

First priority is the convert those columns who should be numeric to numeric. Some are numeric and some not by upload.

Further we need a readable Upload_Date. Is some cases it is not, in some it is.

Then we need to modify the column contents for some countries and some materials, using scan or substr., specific filters

etc. etc.

This is only possible if you are able to call the right column. Due to the code it is now always possible.

After this is done and the content of all columns are equal in type and information, then there will a first append / union, creating 3 datasets. After this first append a second flow of modification will run over the datasets and next the second append / union will start, creating 1 uniform standardized datalake.

This process is running every 15 minutes automatically.

Indeed Union is a solution, but is it used in the end parts of the process.

Applying format directly on _COLn will not work in the first stage, in mine opinion, because only a few columns should be numeric. These should be recognized / called first. Using Col5 & Col6 it will be possible.

Maybe an integrated IF ELSE would be possible (IF Col5 then ..) but I think it won't help readability.

Now we have a part data collection and naming, then the part data preparation, etc.

Of course if there's a more efficiënt way, I am happy to know.

The process runs now in 0.03 second real time and cpu time.

ChrisNZ · Posted 08-04-2019 05:52 PM

> Applying format directly on _COLn will not work in the first stage, in mine opinion, because only a few columns should be numeric.

I meant:

proc format ;
  value $names 'Col1'='Sales_price_Ton'
               'Col2'='Customer_type  ';
run;

data _null_;
  file FT76F001;
  set VARS;
  NEWNAME = put(catt('Col',_n_), $names.);
  put 'rename "' _NAME_ '"n=' NEWNAME ';' ;
run;

High-Performance SAS Coding - Third Edition

TeddoVanMierle · Posted 08-09-2019 05:44 AM

Hi, thank you for the suggestion. Tried the code, but got an error "kolomname exists". Therefor modified the code and tried several combinations.

Enclosed the final code (one of the 75), which is running now for a few days without any issues. The use of templates ensures the consistend location of the columns. In case of shifting columns this code won't work, because it assume's that Column 1 is always on the first position etc.

I am aware that there are opportunities for better efficient coding. On the otherside, it's in a transperant and readible way.

LIBNAME TMP00001 "E:\SAS94\Lev1\Data\SASVisualAnalytics\ETPmanualUpload";

*POLAND;
data ETP_UPL_PL_PLASTICS_2019;
	set TMP00001.'etp_upl_pl_PLASTICS_2019'n;

 if 'Volume (in Ton)'n > 0;
run;

proc transpose data=ETP_UPL_PL_PLASTICS_2019(obs=0) out=vars;
	var _all_;
run;

filename PTPL2019 temp;
data _null_;
	file PTPL2019;
	set vars;
	put 'Rename "' _name_ '"n=Col' _n_ ';';

run;

proc datasets;
	modify ETP_UPL_PL_PLASTICS_2019;
	%inc PTPL2019;
run;
quit;

data ETP_UPL_PL_PLASTICS_2019 (drop=Col1 Col2 Col3 Col4 Col5 Col6 Col7 col8 Col9 Col10 Col11 Col12);	
	set ETP_UPL_PL_PLASTICS_2019; 

	format Upload_Date date9.;

	Upload_Date = Col1;
	Country_ISO = 'PL';
	Month_No = input(Col3, best.);
	Year = input(Col4, best.);
	Customer_Name_Local = put(Col5, $55.);
	Customer_type = put(substr(Col6,1,1),$25.);
	Plastic_Grade_Name = input(Col7, $50.);
	Plastic_Group_Grade = input(Col8, $50.);
	Volume_Ton = input(Col9, best.);
	EXW_SalesPrice_Ton = input(Col10, best.);
	Sales_Area = input(Col11, $10.);
	'EU/EXP/VAR'n = input(Col12, $15.);
run;

SAS Innovate 2025: Call for Content

Classroom Training Available!