Hi,
My program currently reads a Linux directory that contains multiple Excel spreadsheets and imports each one into a SAS dataset. One spreadsheet is loaded into one dataset, so if there are 50 files in the directory, it produces 50 datasets. However, after discussing with my team, there are some changes.
Change 1:
Data starts at cell C5, while the column names start in row 4 (cell C4).
Difficulty: I tried adding a "Range" statement, but it throws an error.
If we look at cells C4, D4, E4 and C5:E7:
C D E
4 100 Yes 99
5 700 No 18
6 600 Yes 41
7 400 Yes 888
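For change 1, one thing that might work (untested against your files; the sheet name and file name below are assumptions) is DBMS=XLS with the NAMEROW= and STARTROW= statements, which point PROC IMPORT at the header row and first data row:

```sas
/* Sketch only - path and sheet name are made up.
   With DBMS=XLS, NAMEROW= gives the row holding the column
   names and STARTROW= gives the first row of data.          */
proc import datafile="/sas/source/tttt/_Files/Centre/ABBA/sample.xls"
            out=work.one dbms=xls replace;
   sheet="Sheet1";   /* assumed sheet name */
   namerow=4;        /* column names are in row 4 */
   startrow=5;       /* data begins in row 5 */
run;
```

No ending row is needed; the import runs to the last populated row.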
Change 2:
On the first row of each Excel file, the values in A1 to C1 and E1 are important, as we need to transpose them and populate the target table.
For example:
In Excel-
A B C D E
1 BMW M5 Turbo Blue
So for every Excel spreadsheet, I have to come up with code that scans A1, B1, C1 and E1 first, then populates those values into the final target table as below:
Make Model Variant Color A B C
BMW M5 Turbo Blue 100 Yes 99
BMW M5 Turbo Blue 700 No 18
BMW M5 Turbo Blue 600 Yes 41
BMW M5 Turbo Blue 400 Yes 888
However, my team and I are stuck at this transpose portion.
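One possible sketch for the transpose portion (hedged: the path and the dataset names WORK.HDR and WORK.DETAIL are invented, and it assumes the detail rows have already been imported): read only the one-row range A1:E1 into its own dataset, then attach it to every detail row with a conditional SET.

```sas
/* Read just the header cells A1:E1 into a one-observation dataset */
proc import datafile="/path/sample.xls" out=work.hdr dbms=xls replace;
   range="A1:E1";
   getnames=no;      /* no header row inside this range */
run;

/* Attach the single header observation to every detail row.
   The conditional SET reads work.hdr once on the first pass;
   its values are automatically retained on later iterations. */
data work.final;
   if _n_ = 1 then set work.hdr;
   set work.detail;
run;
```

The variable names that GETNAMES=NO assigns would then need a RENAME to Make/Model/Variant/Color.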
Change 3:
This is one of our hardest challenges. In the 2nd row of every Excel file, the first value appears in cell D2. The value could be 100, 700, 600, etc. Cell J2 could also hold 100, 700, 600, etc. Occasionally, cell P2 will have such a value as well.
In other words, row 2 can have a value every 6 cells apart, and there may or may not be more than two of them. So the code has to be dynamic enough to detect whether every 6th cell of row 2 contains 100, 700, 600, etc.
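For the "every 6th cell of row 2" detection, one sketch (the path, range width, and dataset names are assumptions) is to import row 2 alone with GETNAMES=NO and walk the columns with an array, stepping 6 at a time starting from column D:

```sas
/* Row 2 only; the A2:Z2 width is an assumption */
proc import datafile="/path/sample.xls" out=work.row2 dbms=xls replace;
   range="A2:Z2";
   getnames=no;
run;

/* D is the 4th column; step 6 columns at a time (D, J, P, ...) */
data work.types;
   set work.row2;
   array c{*} _numeric_;
   do k = 4 to dim(c) by 6;
      if not missing(c{k}) then do;
         type = c{k};
         output;       /* one observation per Type value found */
      end;
   end;
   keep type;
run;
```

If a cell in that position can hold text instead of a number, the array would need to be a character array with the comparison adjusted.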
Change 4:
There are some formulas in the Excel files. When we first load them, we get the formula instead of the evaluated value. Is there any way to bypass that?
Change 5:
Instead of loading all the Excel files into separate datasets, we want to append them into one final table. However, we think it will be easier to achieve this once we settle the first four changes above.
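For change 5, once each file imports cleanly, each per-file dataset could be stacked onto one table with PROC APPEND inside the macro loop (a sketch; WORK.FINAL_ALL is an invented name):

```sas
/* Append the dataset just imported for this file onto the running
   total. FORCE tolerates minor attribute differences across files. */
proc append base=work.final_all data=dsn&cnt force;
run;
```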
Here is the code I currently have (loading multiple Excel (xls) files into multiple SAS datasets):
%macro drive(dir,ext);
   %local cnt filrf rc did memcnt name i;
   %let cnt=0;
   %let filrf=mydir;
   %let rc=%sysfunc(filename(filrf,&dir));
   %let did=%sysfunc(dopen(&filrf));
   %if &did ne 0 %then %do;
      %let memcnt=%sysfunc(dnum(&did));
      %do i=1 %to &memcnt;
         /* NAME holds the extension of the current directory member */
         %let name=%qscan(%qsysfunc(dread(&did,&i)),-1,.);
         %if %qupcase(%qsysfunc(dread(&did,&i))) ne %qupcase(&name) %then %do;
            %if %qupcase(%superq(ext)) = %qupcase(%superq(name)) %then %do;
               %let cnt=%eval(&cnt+1);
               %put %qsysfunc(dread(&did,&i));
               /* forward slash for a Linux path; DBMS=XLS for .xls files */
               proc import datafile="&dir/%qsysfunc(dread(&did,&i))"
                           out=dsn&cnt dbms=xls replace;
               run;
            %end;
         %end;
      %end;
      /* close the directory only if it was opened successfully */
      %let rc=%sysfunc(dclose(&did));
   %end;
   %else %put &dir cannot be opened.;
   %let rc=%sysfunc(filename(filrf));
%mend drive;
%drive(/sas/source/tttt/_Files/Centre/ABBA/,xls)
I would be more than happy to see any advice. Feel free to comment if you have any questions or suggestions.
Thanks in advance.
Poor or complex data layouts are why you will find many comments here about "Excel is a poor tool to interchange data".
If the files have any semi-consistent layout, where the data from "row 2" is used one way and other data differently, then you might consider saving those files to CSV and using data step code to read them. That way you don't get different lengths or types of variables because of the guesses that Proc Import has to make from context. Data step code lets you specify which lines to read and use different INPUT statements for different lines in the body.
You might be better off posting an example of a CSV file; paste the text into a code box opened on the forum with the {I} icon to preserve the format, as the message window can do odd things sometimes. The CSV should also resolve the formula cells to their results instead of the formulas.
Another advantage with the CSV approach is that a data step can read multiple files at one time and include details such as the name of the source file if desired.
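To illustrate that last point, a hedged sketch of one data step reading every CSV in a directory and recording the source file on each observation (the path and variable list are placeholders, not your actual layout):

```sas
data work.all;
   length srcfile $256;
   /* A wildcard reads every matching file in a single step;
      FILENAME= exposes the name of the file being read.      */
   infile "/path/to/files/*.csv" dlm=',' truncover filename=fn;
   srcfile = fn;                 /* keep the source file name */
   input day :$10. date time :time. mw_imp mw_exp;   /* placeholders */
   format time time8.;
run;
```

Header lines at the top of each file would still need handling, for example by testing the EOV= automatic variable.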
I am not going to attempt to translate the "Cell X" terminology, as without seeing the sheet that is hard to keep straight.
BMW M5 Turbo Blue
100 700
MW MVAR Power MW MVAR Power
Day Date Time Imp Exp Imp Exp MVA Factor Imp Exp Imp Exp MVA Factor
Wednesday 1 0:30 0 5.426 0 1.606 5.658684653 0.958880081 0 5.429 0 1.605 5.661277771 0.958970787
Wednesday 1 1:00 0 5.247 0 1.535 5.466921803 0.959772279 0 5.249 0 1.534 5.46856078 0.959850354
Wednesday 1 1:30 0 5.187 0 1.481 5.394286793 0.961572901 0 5.19 0 1.48 5.396897257 0.961663666
Wednesday 1 2:00 0 5.095 0 1.462 5.300610248 0.96121008 0 5.098 0 1.462 5.303493943 0.961253101
Wednesday 1 2:30 0 4.997 0 1.439 5.200070192 0.960948567 0 4.999 0 1.438 5.201715582 0.961029091
Wednesday 1 3:00 0 4.931 0 1.454 5.140902353 0.959170134 0 4.934 0 1.453 5.143497351 0.959269474
Wednesday 1 3:30 0 4.848 0 1.423 5.052527387 0.959519786 0 4.85 0 1.423 5.054446458 0.959551167
Hi @ballardw. I agree that the data is kept in a very bad form and this shouldn't be the method of storing data in excel. Unfortunately, we had no choice but to proceed with the said requirements.
I've also prepared the raw data with Row and Column for your reference for ease of reading below.
A B C D E F G H I J K L M N O
1 BMW M5 Turbo Blue
2 100 700
3 MW MVAR Power MW MVAR Power
4 Day Date Time Imp Exp Imp Exp MVA Factor Imp Exp Imp Exp MVA Factor
5 Wednesday 1 0:30 0 5.426 0 1.606 5.658684653 0.958880081 0 5.429 0 1.605 5.661277771 0.958970787
6 Wednesday 1 1:00 0 5.247 0 1.535 5.466921803 0.959772279 0 5.249 0 1.534 5.46856078 0.959850354
7 Wednesday 1 1:30 0 5.187 0 1.481 5.394286793 0.961572901 0 5.19 0 1.48 5.396897257 0.961663666
8 Wednesday 1 2:00 0 5.095 0 1.462 5.300610248 0.96121008 0 5.098 0 1.462 5.303493943 0.961253101
9 Wednesday 1 2:30 0 4.997 0 1.439 5.200070192 0.960948567 0 4.999 0 1.438 5.201715582 0.961029091
10 Wednesday 1 3:00 0 4.931 0 1.454 5.140902353 0.959170134 0 4.934 0 1.453 5.143497351 0.959269474
As you can see, the first row of the data has values at A, B, C and E. In row 2, there are values at D and J. Do take note that it is possible for Q2 to hold 100, 700, 600, or "SUM". If it is SUM, we have to ignore it and not process it.
Here I will show the expected output for better understanding (target table):
A B C D E F G H I J K L M N O P
1 Brand Model Variant Color Type Day Date Time MW_Imp MW_Exp MVAR_Imp MVAR_Exp MVA Factor MW_Imp
2 BMW M5 Turbo Blue 100 Wednesday 1 0:30 0 5.426 0 1.606 5.658684653 0.958880081 0
3 BMW M5 Turbo Blue 100 Wednesday 1 1:00 0 5.247 0 1.535 5.466921803 0.959772279 0
4 BMW M5 Turbo Blue 100 Wednesday 1 1:30 0 5.187 0 1.481 5.394286793 0.961572901 0
5 BMW M5 Turbo Blue 100 Wednesday 1 2:00 0 5.095 0 1.462 5.300610248 0.96121008 0
6 BMW M5 Turbo Blue 100 Wednesday 1 2:30 0 4.997 0 1.439 5.200070192 0.960948567 0
7 BMW M5 Turbo Blue 100 Wednesday 1 3:00 0 4.931 0 1.454 5.140902353 0.959170134 0
8 BMW M5 Turbo Blue 700 Wednesday 1 3:30 0 4.85 0 1.423 5.054446458 0.959551167 0
9 BMW M5 Turbo Blue 700 Wednesday 1 4:00 0 4.765 0 1.414 4.970374332 0.958680309 0
10 BMW M5 Turbo Blue 700 Wednesday 1 4:30 0 4.743 0 1.426 4.952729046 0.957653842 0
Target table:
Notice the first four columns: the data from the first row of the source is transposed and repeated down the table in this form.
Then look at column E: the row 2 values from the source (which can only be 100, 700, 200 or 300) are populated into column E of the target table. In this case, Type 100 forms the first half of the data and Type 700 the second half.
Also notice columns J, K, L, M, where the row 3 names from the source are changed into this form in the target table by adding MW_ or MVAR_ in front.
After all this, I also have to find a way to output everything into one final table instead of one dataset per Excel file.
So far I have not done any transposing or manipulation of the data when loading from the Excel files. Is this done in the same PROC IMPORT step or separately?
If these are achievable, I would be more than happy to see some sample code if you are able to give me hints. 🙂
Once again, thank you for your effort in trying to help me.
And to the other forum members, feel free to leave a reply or advice.
Thanks a lot.
@imdickson wrote:
BMW M5 Turbo Blue
100 700
MW MVAR Power MW MVAR Power
Day Date Time Imp Exp Imp Exp MVA Factor Imp Exp Imp Exp MVA Factor
Wednesday 1 0:30 0 5.426 0 1.606 5.658684653 0.958880081 0 5.429 0 1.605 5.661277771 0.958970787
Wednesday 1 1:00 0 5.247 0 1.535 5.466921803 0.959772279 0 5.249 0 1.534 5.46856078 0.959850354
Wednesday 1 1:30 0 5.187 0 1.481 5.394286793 0.961572901 0 5.19 0 1.48 5.396897257 0.961663666
Wednesday 1 2:00 0 5.095 0 1.462 5.300610248 0.96121008 0 5.098 0 1.462 5.303493943 0.961253101
Wednesday 1 2:30 0 4.997 0 1.439 5.200070192 0.960948567 0 4.999 0 1.438 5.201715582 0.961029091
Wednesday 1 3:00 0 4.931 0 1.454 5.140902353 0.959170134 0 4.934 0 1.453 5.143497351 0.959269474
Wednesday 1 3:30 0 4.848 0 1.423 5.052527387 0.959519786 0 4.85 0 1.423 5.054446458 0.959551167
Hi @ballardw. I agree that the data is kept in a very bad form and this shouldn't be the method of storing data in Excel. Unfortunately, we had no choice but to proceed with the said requirements.
I've also prepared the raw data with Row and Column for your reference for ease of reading below.
A B C D E F G H I J K L M N O
1 BMW M5 Turbo Blue
2 100 700
3 MW MVAR Power MW MVAR Power
4 Day Date Time Imp Exp Imp Exp MVA Factor Imp Exp Imp Exp MVA Factor
5 Wednesday 1 0:30 0 5.426 0 1.606 5.658684653 0.958880081 0 5.429 0 1.605 5.661277771 0.958970787
6 Wednesday 1 1:00 0 5.247 0 1.535 5.466921803 0.959772279 0 5.249 0 1.534 5.46856078 0.959850354
7 Wednesday 1 1:30 0 5.187 0 1.481 5.394286793 0.961572901 0 5.19 0 1.48 5.396897257 0.961663666
8 Wednesday 1 2:00 0 5.095 0 1.462 5.300610248 0.96121008 0 5.098 0 1.462 5.303493943 0.961253101
9 Wednesday 1 2:30 0 4.997 0 1.439 5.200070192 0.960948567 0 4.999 0 1.438 5.201715582 0.961029091
10 Wednesday 1 3:00 0 4.931 0 1.454 5.140902353 0.959170134 0 4.934 0 1.453 5.143497351 0.959269474
I strongly suggest posting the data as CSV. What you have at this point appears to be mostly TAB delimited, but that can be hard to work with when filtered through copy and paste across multiple systems. Adding non-existent row/column headers really does not help much when an actual delimiter character, such as the comma, would be visible, and they may introduce complications.
Here's a sort-of-working example using | as the delimiter. A comma should work similarly, but I don't know whether any of your values might have a comma embedded in the model type or variant text.
This was produced by running a search-and-replace on the data you posted, replacing the tabs with |.
data work.read;
   infile datalines dlm='|' truncover n=4;
   informat Brand $15. Model $15. Variant $15. Color $15. Type $15.
            Day $10. Date best4. Time time.
            MW_Imp MW_Exp MVAR_Imp MVAR_Exp MVA Factor MW_Imp2 best12.;
   format Time time.;
   retain Brand Model Variant Color Type;
   /* first pass reads the four header lines; later passes read data  */
   /* MW_Imp2 holds the second group's MW value so it does not        */
   /* overwrite MW_Imp from the first group                           */
   if _n_=1 then input Brand Model Variant Color / Type / / ;
   else input Day Date Time MW_Imp MW_Exp MVAR_Imp MVAR_Exp MVA Factor MW_Imp2;
   if not missing(Day);
datalines;
BMW|M5|Turbo||Blue
|||100||||||700|||||
||| MW|| MVAR|||Power| MW|| MVAR|||Power
Day| Date|Time|Imp|Exp|Imp|Exp|MVA| Factor| Imp|Exp|Imp|Exp|MVA| Factor
Wednesday|1|0:30|0|5.426|0|1.606|5.658684653|0.958880081|0|5.429|0|1.605|5.661277771|0.958970787
Wednesday|1|1:00|0|5.247|0|1.535|5.466921803|0.959772279|0|5.249|0|1.534|5.46856078|0.959850354
Wednesday|1|1:30|0|5.187|0|1.481|5.394286793|0.961572901|0|5.19|0|1.48|5.396897257|0.961663666
Wednesday|1|2:00|0|5.095|0|1.462|5.300610248|0.96121008|0|5.098|0|1.462|5.303493943|0.961253101
Wednesday|1|2:30|0|4.997|0|1.439|5.200070192|0.960948567|0|4.999|0|1.438|5.201715582|0.961029091
Wednesday|1|3:00|0|4.931|0|1.454|5.140902353|0.959170134|0|4.934|0|1.453|5.143497351|0.959269474
;
run;
If you have multiple brand/ model/ variant information on a single sheet though you're on your own.
I will also say that a value of "1" for a "date" is not going to be greatly helpful in the future. An actual calendar date with proper informat and format will likely be much more useful.
The n=4 on the INFILE statement says to make up to 4 lines available for reading; it may not be needed when reading an external file. The / in the INPUT statement means "go to the next line to read". The last two, which have no variables after them, advance past the two rows of embedded column headers.
The "if not missing(day)" prevents outputting the header observation, which has only the brand and model information. The RETAIN statement keeps those brand-related values together with the later-read values.
Also, with the code I posted in the first post, the dataset comes out like this.
This is apparently because the variable names cannot be detected as columns, since the column names start at row 3. I tried using RANGE but kept getting a strange error. Is there a way for me to specify the starting variable-name cell (row and column combination) and the starting data cell (row and column combination), but without specifying the final row, since each Excel file has a different number of rows of data?
Furthermore, the numeric data is produced by formulas. Is there a way to load the calculated values instead of the formulas?
@imdickson wrote:
Also, with the code I posted in the first post, the dataset comes out like this.
This is apparently because the variable names cannot be detected as columns, since the column names start at row 3. I tried using RANGE but kept getting a strange error. Is there a way for me to specify the starting variable-name cell (row and column combination) and the starting data cell (row and column combination), but without specifying the final row, since each Excel file has a different number of rows of data?
Furthermore, the numeric data is produced by formulas. Is there a way to load the calculated values instead of the formulas?
Last things first: save as CSV (or another delimited text format). Exporting from Excel that way resolves the formulae.
Proc Import always expects a single column in the data to hold the same variable with the same type of data, numeric or text. Always. The procedure is designed to import nice tabular data; if your data is not like that, then Proc Import, as you have discovered, is not going to come close to what you want. Proc Import can deal with specified ranges (I don't use this feature, for a number of reasons). In this case you would need two Proc Import steps to handle the two required ranges of values, and then combine the data afterward.
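A sketch of that two-step approach (the file name and ranges are assumptions based on the layout posted earlier):

```sas
/* Range 1: the single header row with make/model/variant/color */
proc import datafile="/path/sample.xls" out=work.hdr dbms=xls replace;
   range="A1:E1";
   getnames=no;
run;

/* Range 2: the tabular block, with names taken from row 4 */
proc import datafile="/path/sample.xls" out=work.body dbms=xls replace;
   namerow=4;
   startrow=5;
run;

/* Combine: the one header observation is read once on the first
   pass and its values are retained across all body rows          */
data work.combined;
   if _n_ = 1 then set work.hdr;
   set work.body;
run;
```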