Hi,
My program is currently reading from a directory in Linux where it has multiple excel spreadsheets in it and import into SAS Dataset. 1 spreadsheet will be loaded into 1 dataset. So if there is 50 files in the directory, it will produce 50 datasets. However, after discussing with my team, there are some changes.
Changes 1:
Data starts Cell C:5th while column name starts at Cell C:4th row.
Difficulty: I tried adding "Range" statement but it prompts error.
If we look at Cell C4 D4 E4 and C5-7 D5-7 E5-7:
C D E
4 100 Yes 99
5 700 No 18
6 600 Yes 41
7 400 Yes 888
Changes 2:
On first row of excel, A1 to C1 and E1 data are important as we will need to transpose the data and populate it to the target table.
For example:
In Excel-
A B C D E
1 BMW M5 Turbo Blue
So in every excel spreadsheet, I have to come out with a code to scan A1, B1, C1 and E1 first. Then, populate the value to the target final table as below:
Make Model Variant Color A B C
BMW M5 Turbo Blue 100 Yes 99
BMW M5 Turbo Blue 700 No 18
BMW M5 Turbo Blue 600 Yes 41
BMW M5 Turbo Blue 400 Yes 888
However, my team and I are stuck at this transpose portion.
Changes 3:
This is 1 of our hardest challenges. In the 2nd row of every excel, the first value appear in D2 cell. The value could be 100 or 700 or 600 and etc. Then, Cell J2 could have value of 100 or 700 or 600 and etc as well. Occasionally, Cell P2 will have value of 100 or 700 or 600 and etc also.
Meaning to say that it is possible that the Row no.2 will have a value every 6 cells apart. May or may not have more than 2 cells of them. So the code gotta be dynamic enough to detect if every 6th Cell Row 2 contains 100 or 700 or 600 and etc.
Changes 4:
There are some formula in the excel. When we first load, it is getting the formula instead of the end value. Is there any bypass option for that?
Changes 5:
Instead of loading all excel into seperate datasets, we want to actually append them into 1 final table. However, we think it would be easier to achieve this once we settle the first 4 item changes above.
Here are the codes that I have currently(loading from multiple excel(xls) into multiple sas datasets):
%macro drive(dir,ext); 
   %local cnt filrf rc did memcnt name; 
   %let cnt=0;          
   %let filrf=mydir;    
   %let rc=%sysfunc(filename(filrf,&dir)); 
   %let did=%sysfunc(dopen(&filrf));
    %if &did ne 0 %then %do;   
   %let memcnt=%sysfunc(dnum(&did));    
    %do i=1 %to &memcnt;              
                       
      %let name=%qscan(%qsysfunc(dread(&did,&i)),-1,.);                    
                    
      %if %qupcase(%qsysfunc(dread(&did,&i))) ne %qupcase(&name) %then %do;
       %if %superq(ext) = %superq(name) %then %do;                         
          %let cnt=%eval(&cnt+1);       
          %put %qsysfunc(dread(&did,&i));  
          proc import datafile="&dir\%qsysfunc(dread(&did,&i))" out=dsn&cnt 
           dbms=csv replace;            
          run;          
       %end; 
      %end;  
    %end;
      %end;
  %else %put &dir cannot be open.;
  %let rc=%sysfunc(dclose(&did));      
             
 %mend drive;
 
%drive(/sas/source/tttt/_Files/Centre/ABBA/,xls) 
I would be more than happy to see any advice given. Feel free to comment if you have any questions or advices.
Thanks in advanced.
@imdickson wrote:
BMW M5 Turbo Blue 100 700 MW MVAR Power MW MVAR Power Day Date Time Imp Exp Imp Exp MVA Factor Imp Exp Imp Exp MVA Factor Wednesday 1 0:30 0 5.426 0 1.606 5.658684653 0.958880081 0 5.429 0 1.605 5.661277771 0.958970787 Wednesday 1 1:00 0 5.247 0 1.535 5.466921803 0.959772279 0 5.249 0 1.534 5.46856078 0.959850354 Wednesday 1 1:30 0 5.187 0 1.481 5.394286793 0.961572901 0 5.19 0 1.48 5.396897257 0.961663666 Wednesday 1 2:00 0 5.095 0 1.462 5.300610248 0.96121008 0 5.098 0 1.462 5.303493943 0.961253101 Wednesday 1 2:30 0 4.997 0 1.439 5.200070192 0.960948567 0 4.999 0 1.438 5.201715582 0.961029091 Wednesday 1 3:00 0 4.931 0 1.454 5.140902353 0.959170134 0 4.934 0 1.453 5.143497351 0.959269474 Wednesday 1 3:30 0 4.848 0 1.423 5.052527387 0.959519786 0 4.85 0 1.423 5.054446458 0.959551167Hi @ballardw. I agree that the data is kept in a very bad form and this shouldn't be the method of storing data in excel. Unfortunately, we had no choice but to proceed with the said requirements.
I've also prepared the raw data with Row and Column for your reference for ease of reading below.
A B C D E F G H I J K L M N O 1 BMW M5 Turbo Blue 2 100 700 3 MW MVAR Power MW MVAR Power 4 Day Date Time Imp Exp Imp Exp MVA Factor Imp Exp Imp Exp MVA Factor 5 Wednesday 1 0:30 0 5.426 0 1.606 5.658684653 0.958880081 0 5.429 0 1.605 5.661277771 0.958970787 6 Wednesday 1 1:00 0 5.247 0 1.535 5.466921803 0.959772279 0 5.249 0 1.534 5.46856078 0.959850354 7 Wednesday 1 1:30 0 5.187 0 1.481 5.394286793 0.961572901 0 5.19 0 1.48 5.396897257 0.961663666 8 Wednesday 1 2:00 0 5.095 0 1.462 5.300610248 0.96121008 0 5.098 0 1.462 5.303493943 0.961253101 9 Wednesday 1 2:30 0 4.997 0 1.439 5.200070192 0.960948567 0 4.999 0 1.438 5.201715582 0.961029091 10 Wednesday 1 3:00 0 4.931 0 1.454 5.140902353 0.959170134 0 4.934 0 1.453 5.143497351 0.959269474
I strongly suggest posting the data as CSV. What you have, at this point appears to be mostly TAB delimited but that can be hard to work with when filtered through copy and paste across multiple systems. Adding non-existent row/column headers really do not help much when an actual delimiter character, such as the , is visible and may introduce complications.
Here's a sort of working example using the | as a delimiter, comma should work similarly but I don't know if any of your values might have a comma imbedded in a value for the model type or variant text.
This sort of come by using a search and replace on the data you posted replacing the tabs with |.
data work.read;
   infile datalines dlm='|'   truncover n=4;
   informat Brand $15.  Model $15. Variant $15. Color $15. Type $15. Day $10. Date best4. Time time. MW_Imp MW_Exp MVAR_Imp MVAR_Exp MVA Factor  MW_Imp best12.;
   format time time.;
   retain  brand  model variant  color type;
   If _n_=1 then input brand model Variant color
      /  type
      /
      /
   ;
   else input Day  Date Time  MW_Imp MW_Exp MVAR_Imp MVAR_Exp MVA Factor  MW_Imp;
   if not missing (day);
datalines;
BMW|M5|Turbo||Blue
|||100||||||700|||||
|||             MW||           MVAR|||Power|             MW||           MVAR|||Power
Day|      Date|Time|Imp|Exp|Imp|Exp|MVA|        Factor|        Imp|Exp|Imp|Exp|MVA|        Factor
Wednesday|1|0:30|0|5.426|0|1.606|5.658684653|0.958880081|0|5.429|0|1.605|5.661277771|0.958970787
Wednesday|1|1:00|0|5.247|0|1.535|5.466921803|0.959772279|0|5.249|0|1.534|5.46856078|0.959850354
Wednesday|1|1:30|0|5.187|0|1.481|5.394286793|0.961572901|0|5.19|0|1.48|5.396897257|0.961663666
Wednesday|1|2:00|0|5.095|0|1.462|5.300610248|0.96121008|0|5.098|0|1.462|5.303493943|0.961253101
Wednesday|1|2:30|0|4.997|0|1.439|5.200070192|0.960948567|0|4.999|0|1.438|5.201715582|0.961029091
Wednesday|1|3:00|0|4.931|0|1.454|5.140902353|0.959170134|0|4.934|0|1.453|5.143497351|0.959269474
;
run;
If you have multiple brand/ model/ variant information on a single sheet though you're on your own.
I will also say that a value of "1" for a "date" is not going to be greatly helpful in the future. An actual calendar date with proper informat and format will likely be much more useful.
The n=4 on the infile line says to make up to 4 lines available for reading. It may not be needed reading an external file. The / in the input statement say "go to next line to read. The last two without variables are to advance past the two rows of embedded column headers.
The If not missing (day) prevents outputting the header row with only the brand and model information. The RETAIN statement keeps those brand related values along with the later read values.
Poor or complex data layouts are why you will find many comments here about "Excel is a poor tool to interchange data".
If the files have any semi consistent layout where the data from "row 2" is used one way and other data differently then you might consider saving those files to CSV and using data step code to read them. That way you don't get different lengths or types of variables because of the guesses that Proc Import has to make from context. Data step code will let you specify lines to read and different inputs for different lines of data in the body.
You might be better off posting an example of a CSV file, paste text into a code box opened on the forum using the {I} icon to preserve format as the message window can do odd things sometimes. The CSV should also resolve the formula cells to the result instead of the formula.
Another advantage with the CSV approach is that a data step can read multiple files at one time and include details such as the name of the source file if desired.
I am not going to attempt to translate Cell X terminology as without seeing it that is hard to keep straight.
BMW M5 Turbo Blue 100 700 MW MVAR Power MW MVAR Power Day Date Time Imp Exp Imp Exp MVA Factor Imp Exp Imp Exp MVA Factor Wednesday 1 0:30 0 5.426 0 1.606 5.658684653 0.958880081 0 5.429 0 1.605 5.661277771 0.958970787 Wednesday 1 1:00 0 5.247 0 1.535 5.466921803 0.959772279 0 5.249 0 1.534 5.46856078 0.959850354 Wednesday 1 1:30 0 5.187 0 1.481 5.394286793 0.961572901 0 5.19 0 1.48 5.396897257 0.961663666 Wednesday 1 2:00 0 5.095 0 1.462 5.300610248 0.96121008 0 5.098 0 1.462 5.303493943 0.961253101 Wednesday 1 2:30 0 4.997 0 1.439 5.200070192 0.960948567 0 4.999 0 1.438 5.201715582 0.961029091 Wednesday 1 3:00 0 4.931 0 1.454 5.140902353 0.959170134 0 4.934 0 1.453 5.143497351 0.959269474 Wednesday 1 3:30 0 4.848 0 1.423 5.052527387 0.959519786 0 4.85 0 1.423 5.054446458 0.959551167
Hi @ballardw. I agree that the data is kept in a very bad form and this shouldn't be the method of storing data in excel. Unfortunately, we had no choice but to proceed with the said requirements.
I've also prepared the raw data with Row and Column for your reference for ease of reading below.
A B C D E F G H I J K L M N O 1 BMW M5 Turbo Blue 2 100 700 3 MW MVAR Power MW MVAR Power 4 Day Date Time Imp Exp Imp Exp MVA Factor Imp Exp Imp Exp MVA Factor 5 Wednesday 1 0:30 0 5.426 0 1.606 5.658684653 0.958880081 0 5.429 0 1.605 5.661277771 0.958970787 6 Wednesday 1 1:00 0 5.247 0 1.535 5.466921803 0.959772279 0 5.249 0 1.534 5.46856078 0.959850354 7 Wednesday 1 1:30 0 5.187 0 1.481 5.394286793 0.961572901 0 5.19 0 1.48 5.396897257 0.961663666 8 Wednesday 1 2:00 0 5.095 0 1.462 5.300610248 0.96121008 0 5.098 0 1.462 5.303493943 0.961253101 9 Wednesday 1 2:30 0 4.997 0 1.439 5.200070192 0.960948567 0 4.999 0 1.438 5.201715582 0.961029091 10 Wednesday 1 3:00 0 4.931 0 1.454 5.140902353 0.959170134 0 4.934 0 1.453 5.143497351 0.959269474
As you can see, the first row of the data has value at A, B, C and E. At Row no.2, there is value at D and J. Do take note that it is possible for Q2 to have value of either 100 or 700 or 600 OR "SUM". If it is SUM, then have to ignore it/dont process it.
Here i will produce the expected output for better understanding.(Target Table)
A B C D E F G H I J K L M N O p 1 Brand Model Variant Color Type Day Date Time MW_Imp MW_Exp MVAR_ImpMVAR_ExpMVA Factor MW_Imp 2 BMW M5 Turbo Blue 100 Wednesday 1 0:30 0 5.426 0 1.606 5.658684653 0.958880081 0 3 BMW M5 Turbo Blue 100 Wednesday 1 1:00 0 5.247 0 1.535 5.466921803 0.959772279 0 4 BMW M5 Turbo Blue 100 Wednesday 1 1:30 0 5.187 0 1.481 5.394286793 0.961572901 0 5 BMW M5 Turbo Blue 100 Wednesday 1 2:00 0 5.095 0 1.462 5.300610248 0.96121008 0 6 BMW M5 Turbo Blue 100 Wednesday 1 2:30 0 4.997 0 1.439 5.200070192 0.960948567 0 7 BMW M5 Turbo Blue 100 Wednesday 1 3:00 0 4.931 0 1.454 5.140902353 0.959170134 0 8 BMW M5 Turbo Blue 700 Wednesday 1 3:30 0 4.85 0 1.423 5.054446458 0.959551167 0 9 BMW M5 Turbo Blue 700 Wednesday 1 4:00 0 4.765 0 1.414 4.970374332 0.958680309 0 10 BMW M5 Turbo Blue 700 Wednesday 1 4:30 0 4.743 0 1.426 4.952729046 0.957653842 0
Target table:
Notice on the first four column. The data in the first row will be transposed and populated in this form throughout the table.
Then we look at Column E where data in Row no.2 which can consists of 100 or 700 or 200 or 300 only in row no.2 will be populated into Cell E in the target table. In this case, Type 100 will be the first half of the data while Type 700 will be the 2nd half of the data in target table.
Also notice Column J K L M where data in row no.3(in source data) will be changed into this form(in target table) by adding MW_ or MVAR infront.
After all these, I have to also find way to output all these data into 1 final table instead of 1 excel = 1 dataset.
So far i have not done any transpose and manipulation of data when loading from excel file. Are these done in the same proc import statement or separate out?
If these are achievable, i would be more than happy to see the sample code if you are able to hint me some. 🙂
Once again i thank for your effort for trying to help me.
And to other forumer, feel free to leave a reply/advice.
Thanks a lot.
@imdickson wrote:
BMW M5 Turbo Blue 100 700 MW MVAR Power MW MVAR Power Day Date Time Imp Exp Imp Exp MVA Factor Imp Exp Imp Exp MVA Factor Wednesday 1 0:30 0 5.426 0 1.606 5.658684653 0.958880081 0 5.429 0 1.605 5.661277771 0.958970787 Wednesday 1 1:00 0 5.247 0 1.535 5.466921803 0.959772279 0 5.249 0 1.534 5.46856078 0.959850354 Wednesday 1 1:30 0 5.187 0 1.481 5.394286793 0.961572901 0 5.19 0 1.48 5.396897257 0.961663666 Wednesday 1 2:00 0 5.095 0 1.462 5.300610248 0.96121008 0 5.098 0 1.462 5.303493943 0.961253101 Wednesday 1 2:30 0 4.997 0 1.439 5.200070192 0.960948567 0 4.999 0 1.438 5.201715582 0.961029091 Wednesday 1 3:00 0 4.931 0 1.454 5.140902353 0.959170134 0 4.934 0 1.453 5.143497351 0.959269474 Wednesday 1 3:30 0 4.848 0 1.423 5.052527387 0.959519786 0 4.85 0 1.423 5.054446458 0.959551167Hi @ballardw. I agree that the data is kept in a very bad form and this shouldn't be the method of storing data in excel. Unfortunately, we had no choice but to proceed with the said requirements.
I've also prepared the raw data with Row and Column for your reference for ease of reading below.
A B C D E F G H I J K L M N O 1 BMW M5 Turbo Blue 2 100 700 3 MW MVAR Power MW MVAR Power 4 Day Date Time Imp Exp Imp Exp MVA Factor Imp Exp Imp Exp MVA Factor 5 Wednesday 1 0:30 0 5.426 0 1.606 5.658684653 0.958880081 0 5.429 0 1.605 5.661277771 0.958970787 6 Wednesday 1 1:00 0 5.247 0 1.535 5.466921803 0.959772279 0 5.249 0 1.534 5.46856078 0.959850354 7 Wednesday 1 1:30 0 5.187 0 1.481 5.394286793 0.961572901 0 5.19 0 1.48 5.396897257 0.961663666 8 Wednesday 1 2:00 0 5.095 0 1.462 5.300610248 0.96121008 0 5.098 0 1.462 5.303493943 0.961253101 9 Wednesday 1 2:30 0 4.997 0 1.439 5.200070192 0.960948567 0 4.999 0 1.438 5.201715582 0.961029091 10 Wednesday 1 3:00 0 4.931 0 1.454 5.140902353 0.959170134 0 4.934 0 1.453 5.143497351 0.959269474
I strongly suggest posting the data as CSV. What you have, at this point appears to be mostly TAB delimited but that can be hard to work with when filtered through copy and paste across multiple systems. Adding non-existent row/column headers really do not help much when an actual delimiter character, such as the , is visible and may introduce complications.
Here's a sort of working example using the | as a delimiter, comma should work similarly but I don't know if any of your values might have a comma imbedded in a value for the model type or variant text.
This sort of come by using a search and replace on the data you posted replacing the tabs with |.
data work.read;
   infile datalines dlm='|'   truncover n=4;
   informat Brand $15.  Model $15. Variant $15. Color $15. Type $15. Day $10. Date best4. Time time. MW_Imp MW_Exp MVAR_Imp MVAR_Exp MVA Factor  MW_Imp best12.;
   format time time.;
   retain  brand  model variant  color type;
   If _n_=1 then input brand model Variant color
      /  type
      /
      /
   ;
   else input Day  Date Time  MW_Imp MW_Exp MVAR_Imp MVAR_Exp MVA Factor  MW_Imp;
   if not missing (day);
datalines;
BMW|M5|Turbo||Blue
|||100||||||700|||||
|||             MW||           MVAR|||Power|             MW||           MVAR|||Power
Day|      Date|Time|Imp|Exp|Imp|Exp|MVA|        Factor|        Imp|Exp|Imp|Exp|MVA|        Factor
Wednesday|1|0:30|0|5.426|0|1.606|5.658684653|0.958880081|0|5.429|0|1.605|5.661277771|0.958970787
Wednesday|1|1:00|0|5.247|0|1.535|5.466921803|0.959772279|0|5.249|0|1.534|5.46856078|0.959850354
Wednesday|1|1:30|0|5.187|0|1.481|5.394286793|0.961572901|0|5.19|0|1.48|5.396897257|0.961663666
Wednesday|1|2:00|0|5.095|0|1.462|5.300610248|0.96121008|0|5.098|0|1.462|5.303493943|0.961253101
Wednesday|1|2:30|0|4.997|0|1.439|5.200070192|0.960948567|0|4.999|0|1.438|5.201715582|0.961029091
Wednesday|1|3:00|0|4.931|0|1.454|5.140902353|0.959170134|0|4.934|0|1.453|5.143497351|0.959269474
;
run;
If you have multiple brand/ model/ variant information on a single sheet though you're on your own.
I will also say that a value of "1" for a "date" is not going to be greatly helpful in the future. An actual calendar date with proper informat and format will likely be much more useful.
The n=4 on the infile line says to make up to 4 lines available for reading. It may not be needed reading an external file. The / in the input statement say "go to next line to read. The last two without variables are to advance past the two rows of embedded column headers.
The If not missing (day) prevents outputting the header row with only the brand and model information. The RETAIN statement keeps those brand related values along with the later read values.
Also, by using the code that i posted in the first post, the dataset is showing like this.
This is somehow due to unable to detech the variable name as column as the column name starts at row 3. I tried using range but i kept on getting strange error. Is there a way for me to specify the starting variable CELL(row and column combination) and starting data CELL(row and column combination but without specifying the final row as each excel have different rows of data)?
Furthermore, numeric data are formulated. Is there a way to load the calculated data instead of the formula?
@imdickson wrote:
Also, by using the code that i posted in the first post, the dataset is showing like this.
This is somehow due to unable to detech the variable name as column as the column name starts at row 3. I tried using range but i kept on getting strange error. Is there a way for me to specify the starting variable CELL(row and column combination) and starting data CELL(row and column combination but without specifying the final row as each excel have different rows of data)?
Furthermore, numeric data are formulated. Is there a way to load the calculated data instead of the formula?
Last first: Save as CSV (or other delimited text). The export nature of that from Excel will resolve the formulae.
Proc Import always expects a single column in the data to be the same variable with the same type of data, numeric or text. Always. The procedure is designed to import nice tabular data. If your data is not such then proc import, as you have discovered, is not going to come close to what you want. Proc import can deal with specified ranges (I don't use this feature for a number of reasons). In this case you would require 2 proc import steps to deal with the required two ranges of values and then combine the data afterward.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
