02-24-2016 01:40 AM
I have a .csv file with 1001 rows of entries. I saved it as an .xlsx file to be imported into SAS with PROC IMPORT. Because of the original data structure, I only want to select specific ranges of rows in Excel to create several subset datasets. However, when I run the following SAS code in SAS University Edition, SAS keeps reading all the records from Excel instead of the specified range. I also altered the RANGE statement, but the result was the same.
May I ask how to resolve this problem so that SAS only reads the range I specified? e.g. data with RANGE = "A2:N39" from Excel. Many thanks!
PROC IMPORT OUT = TEST
DATAFILE = "/folders/myfolders/Data/Source/health89.xlsx" DBMS = XLSX REPLACE;
SHEET = 'Data';
RANGE = "A2:N39"; (also tried RANGE = "Data$A2:N39";)
GETNAMES = YES;
DBSASLABEL = NONE;
RUN;
NOTE: The import data set has 1000 observations and 5 variables.
02-24-2016 01:45 AM
When you already have a .csv, then why do you involve the execrable Excel format at all?
Import the csv and make your selection in SAS, KEEPing the desired columns and selecting the desired observations with the FIRSTOBS and OBS dataset options.
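A minimal sketch of that suggestion, assuming the csv version of the file is named health89.csv (the actual csv file name is not given in the thread):

proc import out=alldata
    datafile="/folders/myfolders/Data/Source/health89.csv"
    dbms=csv replace;
    getnames=yes;
run;

/* FIRSTOBS= and OBS= as dataset options count observations, not
   file lines: file rows 2-39 become observations 1-38 once the
   header row has been consumed.  Add KEEP= to select columns. */
data test;
    set alldata(firstobs=1 obs=38);
run;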
02-24-2016 02:33 AM
Not quite sure I understand what you suggested. Could you help me out or share any references?
The variable names are in row 1, with 5 columns; the range of data I want to read is rows 2-39. Another row of variable names is at row 40, and its data range is rows 41-60.
How could I read these two chunks of data from a .csv via PROC IMPORT? Many thanks!
02-24-2016 02:45 AM
Does that mean you have different structures in one csv file, in consecutive blocks?
If so, how precise is the information about the location of header lines and columns within a row?
Do you need to determine the location of header lines dynamically?
Could you attach a sample csv file?
02-24-2016 04:15 AM
I am afraid you don't know your own question. Let me explain:
CSV = Comma Separated Values file. It is a text file with data elements separated by commas. It is NOT an Excel file. Excel can, however, read and parse this text file and render it in a worksheet using its own CSV parser.
XLSX, XLS are Office proprietary formats, these are native Excel files.
So which of the two do you have, is it an Excel file or a CSV?
If it is indeed a CSV (and then it must follow the general rules of a CSV: one header row with fields separated by commas, then one record per row, also comma-separated), your request to take rows 2-39 and 41-60 doesn't make any sense, i.e. why skip row 40?
It sounds to me like you have some kind of Excel dump with subheadings and such like, but without seeing the file I can't tell.
02-25-2016 08:10 PM
Hi! Sorry for the delayed response. I think I did not express my problem well...
I took a detailed look into the source data again. Here is what I found. It was originally provided with a .csv file extension, but it can be opened in an Excel workbook. So, in the first place, I used it as it was and imported the Excel version. After reading your replies, I opened the same file in Notepad and could see that it is a comma-delimited CSV file. However, it is not a typical CSV file, because there are several lines of similar header variables inserted at different positions. This finding about the data structure answers your concern below.
"If indeed it is a CSV, and then it must follow the general rules of a CSV = one header row separated by commas, then one row per data row, separated by commas - then your request to take row 2-39 and 41-60 doesn't make any sense, i.e. why skip row 40?"
At row 40, a line of header variables is inserted. I think this dataset is a master data file built by stacking up data updates made on different schedules - simply a running, growing set of records arriving in batches at different times.
Since converting to an Excel workbook and then importing gave me a simple way to read in the data with less complexity, I resolved it that way. But another problem comes from the unbalanced data, which has an inconsistent number of records in each data update... I am thinking about how to output the records of each update into different datasets...
Thank you for your insight!
02-26-2016 01:42 AM
Having multiple header lines takes PROC IMPORT out of the picture.
You will have to do it on your own:
1) scan through the csv and find the lines that contain headers. There should be some pattern to recognize
2) from that, store the locations of header lines and the content in macro variables; also keep the number of header lines
3) then, in a macro, create the datasteps for every section from the macro variables gained in step 2
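Steps 1 and 2 might be sketched like this, assuming every header line starts with the same first field - here the placeholder "ID"; adjust the test to whatever pattern your real headers follow:

data header_lines;
    infile "/folders/myfolders/Data/Source/health89.csv" truncover;
    input line $500.;
    if scan(line, 1, ',') = 'ID' then do;
        linenum = _n_;
        output;
    end;
    keep linenum;
run;

/* Store the header line numbers and their count in macro variables */
proc sql noprint;
    select linenum into :hdr1-:hdr99 from header_lines;
quit;
%let nhdr = &sqlobs;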
OTOH, I would go back to the original source of the data and have them deliver the tables in separate files.
02-26-2016 04:29 AM
There appear to be several things wrong here; it's hard to say without looking at the file. Firstly, whoever sent you the file has sent an incorrect file. As I pointed out, CSV is pretty well defined; they haven't followed the rules, so the file is invalid, and I would return it to them.
Secondly, you don't appear to have any documentation, transfer specification, or import agreement. Without these, your process will be harder to maintain - i.e. the data can change next time, meaning total re-work - and you have less idea of what the data is. A bad process will always work badly.
Thirdly, there is this "unbalanced data, which has inconsistent numbers of records in each data update". It appears that, far from following good Excel usage guides, the data you are using is not even consistent within itself.
Now, if I was given this file and, after many days of complaining, was still told to use it, I would do a pre-scan of the file: one data step that reads each line as a whole, with some logic to work out where the headers occur. At each header I would start writing the subsequent lines out to a new file. So basically, for each of those blocks you end up with a separate file which does conform to the CSV rules: one header row, then data separated by commas.
Then, for each of those files, write a data step import. But as they have no standard and you have no documentation, you would need to look at each file and identify what the data is, how it should be read in, etc. Sure, you could rely on PROC IMPORT and set ranges, but how do you know it has "guessed" the data correctly, how do you know it has taken the correct block, etc.? All in all, a lot of problems created by someone earlier down the line not producing a data file that follows standards.
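The pre-scan could look something like this - a sketch only, where "ID" as the first field of a header line is a placeholder assumption, as is the csv file name:

/* Copy each raw line to a per-block file, starting a new file
   whenever a header line appears. */
data _null_;
    infile "/folders/myfolders/Data/Source/health89.csv" truncover;
    length fname $200;
    retain fname;
    input line $500.;
    if scan(line, 1, ',') = 'ID' then do;
        blockno + 1;   /* sum statement: retained, starts at 0 */
        fname = cats("/folders/myfolders/Data/block", blockno, ".csv");
    end;
    if blockno > 0 then do;
        file out filevar=fname;  /* FILEVAR= switches output files */
        put line;
    end;
run;

Each blockN.csv then conforms to the one-header-row rule and can be read with its own data step or PROC IMPORT.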