Solved: Splitting row data in separate colums

AyoSho · Posted 01-31-2020 10:50 PM

Hi everyone,
I am in need of some help. I am very new here and would appreciate all assistance.
I have a dataset with 9000 rows but only the first column contains the data for each row. The data in each row has 27 variables which are separated by semicolon (some have 3 semicolons back to back). I am interested in splitting the data into different columns for each row so I can have each variable in a column. Could anyone please suggest a SAS code that I can adapt to the data in order to separate out the row data into separate columns? Thanks everyone for your help.
Best,
Ay.

Tom · Posted 02-02-2020 10:57 PM

@AyoSho wrote:
Hi Tom,
Unfortunately, I don't have the original text file.... I only have access
to the excel file. Does that mean the excel file would be unusable? Is
there a way to manipulate the excel file at least and make it usable?.
Thanks!

You will need to be careful using the file as you need watch out for the type of changes that excel is likely to make when reading a text file. Your method of just asking EXCEL to write out a text file and then reading it in again could work. You need to try to figure our what delimiter you want to insert between the cells on a row in the spreadsheet. Or you could do the same thing in SAS.

So I would do it in three steps. (1) Convert the XLSX file into a sas dataset. (2) Write it back out using comma (or tab or space) as the delimiter. (3) read the new text file using semi-colon as the delimiter.

libname old xlsx 'the_exisitng_excel_file.xlsx';
data work.step1;
  set old.sheetname;
run;

filename text 'new_name_for_a_text_file.txt';
data _null_;
  file text dsd dlm=',' ;
  set step1;
  put (_all_) (+0);
run;

data want;
  infile text dsd dlm=';' truncover ;
  input (var1-var20) (:$30.);
run;

View solution in original post

Tom · Posted 02-01-2020 12:36 AM

Do you have dataset? Or just a text file that you are trying to read?

If it is a text file then use semi-colon as the delimiter when reading the file.

data want;
  infile 'myfile.txt' dsd dlm=';' truncover ;
  input var1 var2 var3 ;
run;

If it is already a dataset then what do the other variables have in them if all of the data is in just one variable? Or do you mean you only have one variable and you want to parse it into more variables? Anyway use the SCAN() function. Let's assume your dataset is named HAVE the variable is named ONLYVAR. So something like this. Note you should really add a LENGTH statement to define the lengths of the new character variables.

data want;
  set have;
  var1 = scan(onlyvar,1,';');
  var2 = scan(onlyvar,2,';');
run;

If you can to interpret some of the text in ONLYVAR as numbers then use the INPUT() function.

data want;
  set have;
  var1 = scan(onlyvar,1,';');
  var2 = input(scan(onlyvar,2,';'),32.);
run;

AyoSho · Posted 02-02-2020 09:50 AM

Hi Tom,
I do have a dataset in excel, and it looks somewhat as shown below in excel:

A B
C D E
1 name;age;state;country;labs
2 duf;45;lopos;;7
3 steve;;bol;90 rolo;76
4 kevin;30;nu;zip;32
5 bela;9;hun; zip;87;Neo;lot;9

There are about 9000 rows with most data located under column A.
Occasionally, data is located in column B as well but nothing recorded in
columns C, D, E. The main challenge I have is how to separate the data in
column A and B (the A1 variables need to be split into separate columns and
made the header of the table before splitting the variables in A2, A3, B3,
and moving downwards till the 9000th row). The dataset is quite messed up,
but it has very important data for my project. Hopefully, I will be able to
work this out. Looking forward to a solution. Thanks so much.

AyoSho · Posted 02-02-2020 09:55 AM

Hi Tom,

I do have a dataset in excel, and it looks somewhat as shown below in excel:

A B C D E 1 name;age;state;country;labs
2 duf;45;lopos;;7
3 steve;;bol;90 rolo;76
4 kevin;30;nu;zip;32
5 bela;9;hun; zip;87;Neo;lot;9

There are about 9000 rows with most data located under column A. Occasionally, data is located in column B as well but nothing recorded in columns C, D, E. The main challenge I have is how to separate the data in column A and B (the A1 variables need to be split into separate columns and made the header of the table before splitting the variables in A2, A3, B3, and moving downwards till the 9000th row). The dataset is quite messed up, but it has very important data for my project. Hopefully, I will be able to work this out. Looking forward to a solution. Thanks so much.

Tom · Posted 02-02-2020 12:37 PM

Looks like somewhere along the way someone took a text file and opened it in EXCEL. So your third line was probably originally something like this in the text file:

3,steve;;bol;90,rolo;76

So EXCEL decided the comma was the separator and converted that line to 3 cells in the spreadsheet.

By saving the Excel file back into a text file you have made something similar to the original file. But note that opening a text file in EXCEL can cause it to change some of the values. For example strings that look like numbers are converted to numbers. So leading zeros can be removed. Strings like 10-12 might be interpreted as a date.

Do you have the original text file? It would be better to read that instead.

AyoSho · Posted 02-02-2020 12:51 PM

Hi Tom,
Unfortunately, I don't have the original text file.... I only have access
to the excel file. Does that mean the excel file would be unusable? Is
there a way to manipulate the excel file at least and make it usable?.
Thanks!

Tom · Posted 02-02-2020 10:57 PM

@AyoSho wrote:
Hi Tom,
Unfortunately, I don't have the original text file.... I only have access
to the excel file. Does that mean the excel file would be unusable? Is
there a way to manipulate the excel file at least and make it usable?.
Thanks!

You will need to be careful using the file as you need watch out for the type of changes that excel is likely to make when reading a text file. Your method of just asking EXCEL to write out a text file and then reading it in again could work. You need to try to figure our what delimiter you want to insert between the cells on a row in the spreadsheet. Or you could do the same thing in SAS.

So I would do it in three steps. (1) Convert the XLSX file into a sas dataset. (2) Write it back out using comma (or tab or space) as the delimiter. (3) read the new text file using semi-colon as the delimiter.

libname old xlsx 'the_exisitng_excel_file.xlsx';
data work.step1;
  set old.sheetname;
run;

filename text 'new_name_for_a_text_file.txt';
data _null_;
  file text dsd dlm=',' ;
  set step1;
  put (_all_) (+0);
run;

data want;
  infile text dsd dlm=';' truncover ;
  input (var1-var20) (:$30.);
run;

AyoSho · Posted 02-03-2020 04:10 PM

Hi Tom,

The syntax you provided worked wonderfully!.

I now have the data in SAS well delineated.

Thank you for your help!

I appreciate it.

Best wishes.

AyoSho · Posted 02-07-2020 05:26 PM

Hi Tom,

Thank for your help with the syntax!. I was able to import the EXCEL file into SAS and split it into columns.

However, the very first row which describes the data and which should be split into column headings was not included in the SAS dataset. I have attached what the import looks like in SAS explorer.

How can I make SAS keep the first row of the data within the excel file and split it just as it has done for the other rows...I believe it should be something that needs to be tweaked in the code you provided? I don't want SAS to substitute the first row with var1, var2, etc., but rather would want SAS to keep the first row in the excel sheet and split it as it has done for the other rows.

Thanks for your help once again.

Tom · Posted 02-07-2020 05:50 PM

Just fix the final data step to use the right variable names. You din't tell us what variables your file had so I just used a simple generic data step to read it in my example code.

Is the problem that you don't know what variables are in the file? Do you want SAS to guess what names it should use for the variables? If you include the header line in the fixed file and then use PROC IMPORT to GUESS what the variables are it will make up names from the header line.

One way you keep the header line is to change how you first read the Excel file so that it includes the first line as part of the data. Then it will get written as the first line of the resulting text file. So if you used PROC IMPORT to read the Excel file add the GETNAMES=NO statement to the proc step.

Or you could write the line with the variable names that SAS generated when it read the Excel file to the fixed text file. Insert steps to write that line and then modify the step that writes the data to use the MOD option on the FILE statement so that it appends the data to the file you just wrote with the header line.

filename text 'new_name_for_a_text_file.txt';

proc transpose data=step1(obs=0) out=names;
  var _all_;
run;

data _null_;
  file text dsd dlm=',';
  set names ;
  put _name_ @;
run;

data _null_;
  file text dsd dlm=',' MOD ;
  set step1;
  put (_all_) (+0);
run;

AyoSho · Posted 02-10-2020 11:40 AM

Thanks Tom!. The proc import fix (i.e GETNAMES=0) took care of it. The dataset is in a completely different language and I had no idea what the contents describe!. Now that i have it all split up, I can utilize reference materials to look up what each header stands for. I appreciate your help and patience!

Splitting row data in separate colums

Re: Splitting row data in separate colums

Re: Splitting row data in separate colums

Re: Splitting row data in separate colums

Re: Splitting row data in separate colums

Re: Splitting row data in separate colums

Re: Splitting row data in separate colums

Re: Splitting row data in separate colums

Re: Splitting row data in separate colums

Re: Splitting row data in separate colums

Re: Splitting row data in separate colums

Re: Splitting row data in separate colums

Ready to join fellow brilliant minds for the SAS Hackathon?