Collecting data from several resources in different countries, in order to create an uniform data set, the following issue occurs.
Although using data templates for the country uploaders we still keep having small differences in the uploaded column names.
e.g. Sales_price_-/Ton; sales_price_€/Ton, Customer_type, customer type etc.
In total there are 10 unique columns per uploaded dataset. There are 9 datasets to be append after uniformation.
Although every country has it's own unformation datastep, still the program stoppes when an unexpected column name occurs.
e.g. Expected = Customer_type, recieved = Customer type.
What I want to achieve is to create an adaptive data acceptance model.
If a column name contains '%price%' then column name = Sales_Price;
If a column name contains '%type%' then column name = Customer_Type.
etc.
Having 9 countries and several materials, we are facing a 75 uniformation processes.
Therefor I think the best option would be to declare the variables in advance, covering all the processes and possible differences.
How could we get this code working?
LIBNAME TMP00001 "E:\SAS94\Lev1\Data\SASVisualAnalytics\ETPmanualUpload";
options validvarname=any;
%let Sales_Price = '%EXW%';
%let Customer_type = '%type%';
data UPLOAD_VIEW;
set TMP00001.'etp_upl_fr_wood_2019'n;
&Sales_Price;
&Customer_Type;
if 'Volume (in Ton)'n > 0;
run;
Thanks in advance!
LIBNAME TMP00001 "E:\SAS94\Lev1\Data\SASVisualAnalytics\ETPmanualUpload";
*POLAND;
data ETP_UPL_PL_PLASTICS_2019;
set TMP00001.'etp_upl_pl_PLASTICS_2019'n;
if 'Volume (in Ton)'n > 0;
run;
proc transpose data=ETP_UPL_PL_PLASTICS_2019(obs=0) out=vars;
var _all_;
run;
filename PTPL2019 temp;
data _null_;
file PTPL2019;
set vars;
put 'Rename "' _name_ '"n=Col' _n_ ';';
run;
proc datasets;
modify ETP_UPL_PL_PLASTICS_2019;
%inc PTPL2019;
run;
quit;
data ETP_UPL_PL_PLASTICS_2019 (drop=Col1 Col2 Col3 Col4 Col5 Col6 Col7 col8 Col9 Col10 Col11 Col12);
set ETP_UPL_PL_PLASTICS_2019;
format Upload_Date date9.;
Upload_Date = Col1;
Country_ISO = 'PL';
Month_No = input(Col3, best.);
Year = input(Col4, best.);
Customer_Name_Local = put(Col5, $55.);
Customer_type = put(substr(Col6,1,1),$25.);
Plastic_Grade_Name = input(Col7, $50.);
Plastic_Group_Grade = input(Col8, $50.);
Volume_Ton = input(Col9, best.);
EXW_SalesPrice_Ton = input(Col10, best.);
Sales_Area = input(Col11, $10.);
'EU/EXP/VAR'n = input(Col12, $15.);
run;
Solved!! Thank you all!!!
Have you looked at using SQL union operator?
If the order of the columns is reliable, that's a good option as the names are ignored and only the column order matters.
If the column order is unreliable, maybe a prior step such as this one would suit your needs.
It arranges the column order without fully naming the variables, provided variable prefixes are sufficient to differentiate variables.
data CLASS;
merge SASHELP.CLASS(keep=AG:);
merge SASHELP.CLASS(keep=NA:);
merge SASHELP.CLASS(keep=HE:);
merge SASHELP.CLASS(keep=WE:);
merge SASHELP.CLASS(keep=SE:);
run;
What I am trying to do here is avoid macro code as your question shows a lack of understanding of the basics of macro language.
Triggered by the =AG: option, I tried to combine this with a replace statement.Without success.
A macro culd be the best solution I think.
The reason why we need to recognize the column name by a part of the name is due to the uniformation process.
A few columns are supposed to be numerical, but due to used excel formats (as result from ERP export) some of them are categorical.
In the code are lines that convert these specific columns to numeric when they are categorical.
Further there's an upload date which is in some cases not recognized as date. This is also converted in the process.
There are 75 files in a scheduled process running every 15 minutes. Goal is to have this fully AI driven.
Evaluating several options it seems to me that recognizing column names by part of the name is key to a solution.
What do you think?
Could it be possible? (can't find much information about this)
Thanks in advance!
>Triggered by the =AG: option, I tried to combine this with a replace statement.
I am unsure what this means
> Evaluating several options it seems to me that recognizing column names by part of the name is key to a solution.
Is the prefix enough?
> In the code are lines that convert these specific columns to numeric when they are categorical.
Further there's an upload date which is in some cases not recognized as date. This is also converted in the process.
That's a much different -larger- requirement, when the data types change rather than just the names.
Is the order or of the columns constant?
Below a picture of one of the processes and snippets of code.
The order of the columns is constant. Ironically, in Python would this be a simple task. Just rename the Dataset[ ,2] second column.
Yesterday evening I sutmbled over this code which should rename the column names to numerical indexes.
Didn't try this yet.
data reference;
set sashelp.heart;
array _n
_numeric_;
*Reference to numeric column 3;
if _n[3] lt 50;
run;
Regarding to =AG: => I hoped that replacling Name for Na: would do the trick, but it doesn't. Having a replace statement on a 'joker' variable would answer the question.
data newclass (drop=Name);
*length poprank 8;
set CLASS(rename=(Name=tempvar));
*Name=input(tempvar,8.);
run;
LIBNAME TMP00001 "E:\SAS94\Lev1\Data\SASVisualAnalytics\ETPmanualUpload";
data ETP_UPL_FR_WOOD_2019 (drop='Volume (in Ton)'n 'EXW Sales price (€/Ton)'n 'EXW Sales price (¬/Ton)'n 'Types of Customer'n 'Month #'n);
set TMP00001.'etp_upl_fr_wood_2019'n;
EXW_SalesPrice_Ton = input('EXW Sales price (€/Ton)'n, best.);
Volume_Ton = input('Volume (in Ton)'n, best.);
Country_ISO = 'FR';
if Volume_Ton > 0;
if year = 2019;
run;
data ETP_UPL_CZ_WOOD_2019 (drop='Volume (in Ton)'n 'EXW Sales price (€/Ton)'n);
set TMP00001.'etp_upl_cz_wood_2019'n;
EXW_SalesPrice_Ton = input('EXW Sales price (€/Ton)'n, best.);
Volume_Ton = input('Volume (in Ton)'n, best.);
Country_ISO = 'CZ';
if Volume_Ton > 0;
if year = 2019;
run;
data ETP_UPL_BE_WOOD_2019 (drop='Volume (in Ton)'n 'EXW Sales price (€/Ton)'n);
set TMP00001.'etp_upl_be_wood_2019'n;
EXW_SalesPrice_Ton = input('EXW Sales price (€/Ton)'n, best.);
Volume_Ton = input('Volume (in Ton)'n, best.);
Country_ISO = 'BE';
if Volume_Ton > 0;
if year = 2019;
run;
data WOOD_2019;
set ETP_UPL_FR_WOOD_2019
ETP_UPL_BE_WOOD_2019
ETP_UPL_CZ_WOOD_2019
ETP_UPL_LU_WOOD_2019
ETP_UPL_GB_WOOD_2019
ETP_UPL_PL_WOOD_2019
ETP_UPL_NL_WOOD_2019
ETP_UPL_DE_WOOD_2019
ETP_UPL_SE_WOOD_2019;
run;
This could actually work! Tried it and will modify it to our data.
data heart;
set sashelp.heart (obs=10);
run;
proc transpose data=heart(obs=0 /*keep=_numeric_*/) out=vars;
var _all_;
run;
filename FT76F001 temp;
data _null_;
file FT76F001;
set vars;
put 'Rename ' _name_ '=Col' _n_ ';';
run;
proc datasets;
modify heart;
%inc FT76F001;
run;
quit;
Modified the code to the dataset.
It works only on column names without a space.
In reality people tend to forget adding underscores. It means that if there's a option to modify this code to accept column names with spaces, it would completely suite the task.
LIBNAME TMP00001 "E:\SAS94\Lev1\Data\SASVisualAnalytics\ETPmanualUpload";
data UPLOAD_VIEW;
set TMP00001.'etp_upl_pl_wood_2019'n;
if 'Volume (in Ton)'n > 0;
run;
proc transpose data=UPLOAD_VIEW(obs=0) out=vars;
var _all_;
run;
filename FT76F001 temp;
data _null_;
file FT76F001;
set vars;
put 'Rename ' _name_ '=Col' _n_ ';';
run;
proc datasets;
modify UPLOAD_VIEW;
%inc FT76F001;
run;
quit;
LIBNAME TMP00001 "E:\SAS94\Lev1\Data\SASVisualAnalytics\ETPmanualUpload";
*POLAND;
data ETP_UPL_PL_PLASTICS_2019;
set TMP00001.'etp_upl_pl_PLASTICS_2019'n;
if 'Volume (in Ton)'n > 0;
run;
proc transpose data=ETP_UPL_PL_PLASTICS_2019(obs=0) out=vars;
var _all_;
run;
filename PTPL2019 temp;
data _null_;
file PTPL2019;
set vars;
put 'Rename "' _name_ '"n=Col' _n_ ';';
run;
proc datasets;
modify ETP_UPL_PL_PLASTICS_2019;
%inc PTPL2019;
run;
quit;
data ETP_UPL_PL_PLASTICS_2019 (drop=Col1 Col2 Col3 Col4 Col5 Col6 Col7 col8 Col9 Col10 Col11 Col12);
set ETP_UPL_PL_PLASTICS_2019;
format Upload_Date date9.;
Upload_Date = Col1;
Country_ISO = 'PL';
Month_No = input(Col3, best.);
Year = input(Col4, best.);
Customer_Name_Local = put(Col5, $55.);
Customer_type = put(substr(Col6,1,1),$25.);
Plastic_Grade_Name = input(Col7, $50.);
Plastic_Group_Grade = input(Col8, $50.);
Volume_Ton = input(Col9, best.);
EXW_SalesPrice_Ton = input(Col10, best.);
Sales_Area = input(Col11, $10.);
'EU/EXP/VAR'n = input(Col12, $15.);
run;
Solved!! Thank you all!!!
@TeddoVanMierle Remarks:
1. You are not changing data type
2. Proc contents seems more natural to use instead of proc transpose to list the variables names
3. You could apply a format on _COLn to arrive directly at your final wanted variable name
4. All this convoluted process could be replaced with a simple union in SQL as I mentioned
@ChrisNZ , thank you very much for your feedback!
The complete picture.
This code was only for answering the question the be able to recognize the uploaed column names.
Despite the column name used for uploading, it's now clear that Upload_Date will be always Col1, Country_ISO = Col2 etc.
Now Col1, Col2 etc. can be renamed to the standard names.
The overall goal is creating 1 uniform and standardized datalake out of 75 processes from multiple countries.
First priority is the convert those columns who should be numeric to numeric. Some are numeric and some not by upload.
Further we need a readable Upload_Date. Is some cases it is not, in some it is.
Then we need to modify the column contents for some countries and some materials, using scan or substr., specific filters
etc. etc.
This is only possible if you are able to call the right column. Due to the code it is now always possible.
After this is done and the content of all columns are equal in type and information, then there will a first append / union, creating 3 datasets. After this first append a second flow of modification will run over the datasets and next the second append / union will start, creating 1 uniform standardized datalake.
This process is running every 15 minutes automatically.
Indeed Union is a solution, but is it used in the end parts of the process.
Applying format directly on _COLn will not work in the first stage, in mine opinion, because only a few columns should be numeric. These should be recognized / called first. Using Col5 & Col6 it will be possible.
Maybe an integrated IF ELSE would be possible (IF Col5 then ..) but I think it won't help readability.
Now we have a part data collection and naming, then the part data preparation, etc.
Of course if there's a more efficiënt way, I am happy to know.
The process runs now in 0.03 second real time and cpu time.
> Applying format directly on _COLn will not work in the first stage, in mine opinion, because only a few columns should be numeric.
I meant:
proc format ;
value $names 'Col1'='Sales_price_Ton'
'Col2'='Customer_type ';
run;
data _null_;
file FT76F001;
set VARS;
NEWNAME = put(catt('Col',_n_), $names.);
put 'rename "' _NAME_ '"n=' NEWNAME ';' ;
run;
Hi, thank you for the suggestion. Tried the code, but got an error "kolomname exists". Therefor modified the code and tried several combinations.
Enclosed the final code (one of the 75), which is running now for a few days without any issues. The use of templates ensures the consistend location of the columns. In case of shifting columns this code won't work, because it assume's that Column 1 is always on the first position etc.
I am aware that there are opportunities for better efficient coding. On the otherside, it's in a transperant and readible way.
LIBNAME TMP00001 "E:\SAS94\Lev1\Data\SASVisualAnalytics\ETPmanualUpload";
*POLAND;
data ETP_UPL_PL_PLASTICS_2019;
set TMP00001.'etp_upl_pl_PLASTICS_2019'n;
if 'Volume (in Ton)'n > 0;
run;
proc transpose data=ETP_UPL_PL_PLASTICS_2019(obs=0) out=vars;
var _all_;
run;
filename PTPL2019 temp;
data _null_;
file PTPL2019;
set vars;
put 'Rename "' _name_ '"n=Col' _n_ ';';
run;
proc datasets;
modify ETP_UPL_PL_PLASTICS_2019;
%inc PTPL2019;
run;
quit;
data ETP_UPL_PL_PLASTICS_2019 (drop=Col1 Col2 Col3 Col4 Col5 Col6 Col7 col8 Col9 Col10 Col11 Col12);
set ETP_UPL_PL_PLASTICS_2019;
format Upload_Date date9.;
Upload_Date = Col1;
Country_ISO = 'PL';
Month_No = input(Col3, best.);
Year = input(Col4, best.);
Customer_Name_Local = put(Col5, $55.);
Customer_type = put(substr(Col6,1,1),$25.);
Plastic_Grade_Name = input(Col7, $50.);
Plastic_Group_Grade = input(Col8, $50.);
Volume_Ton = input(Col9, best.);
EXW_SalesPrice_Ton = input(Col10, best.);
Sales_Area = input(Col11, $10.);
'EU/EXP/VAR'n = input(Col12, $15.);
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.