Solved: importing text file with unknown delimiter

Baraso · Posted 09-24-2021 01:19 PM

Hi,

I've received a huge zip dataset (zip file .gz). I was able to unzip it somehow, and I am trying to import (read) it in SAS. The dataset is too big to open and read on Notepad. So, I have to make a guess about the number of variables and their delimiter.

I have tried all possible delimiters for importing in SAS, but the imported data does not look fine. The first two lines of data look as below.

Claim_ID_MS_MDC_Cd_MSDRG_Cd_Serv

1873420141|||PRP0||||||

I am wondering if there is any way to import data in SAS when you don't have any idea how data look like.

Thanks.

ballardw · Posted 09-24-2021 04:07 PM

@Baraso wrote:

Hi,

I've received a huge zip dataset (zip file .gz). I was able to unzip it somehow, and I am trying to import (read) it in SAS. The dataset is too big to open and read on Notepad. So, I have to make a guess about the number of variables and their delimiter.

I have tried all possible delimiters for importing in SAS, but the imported data does not look fine. The first two lines of data look as below.

Claim_ID_MS_MDC_Cd_MSDRG_Cd_Serv

1873420141|||PRP0||||||

I am wondering if there is any way to import data in SAS when you don't have any idea how data look like.

Thanks.

I would start with Proc Import. That looks like the pipe character is the delimiter.

proc import datafile="C:\My Documents\myfiles\delimiter.txt"
            dbms=dlm
            out=mydata
            replace;
     delimiter='|';
     getnames=yes;
    guessingrows=max;

run;

change the file to yours. There may be an issue with your variable names as they should also have the delimiters between them

View solution in original post

ballardw · Posted 09-24-2021 04:07 PM

@Baraso wrote:

Hi,

I've received a huge zip dataset (zip file .gz). I was able to unzip it somehow, and I am trying to import (read) it in SAS. The dataset is too big to open and read on Notepad. So, I have to make a guess about the number of variables and their delimiter.

I have tried all possible delimiters for importing in SAS, but the imported data does not look fine. The first two lines of data look as below.

Claim_ID_MS_MDC_Cd_MSDRG_Cd_Serv

1873420141|||PRP0||||||

I am wondering if there is any way to import data in SAS when you don't have any idea how data look like.

Thanks.

I would start with Proc Import. That looks like the pipe character is the delimiter.

proc import datafile="C:\My Documents\myfiles\delimiter.txt"
            dbms=dlm
            out=mydata
            replace;
     delimiter='|';
     getnames=yes;
    guessingrows=max;

run;

change the file to yours. There may be an issue with your variable names as they should also have the delimiters between them

Baraso · Posted 09-30-2021 11:34 AM

Thank you Ballardw. I was able to import the text data to SAS.

ballardw · Posted 09-30-2021 12:41 PM

@Baraso wrote:
Thank you Ballardw. I was able to import the text data to SAS.

Great!

If you are going to have multiple files to read that should be similar in structure here are your next steps.

First, Proc Import generated data step code that appears in the log. Copy that data step into the editor and clean it up (basically remove line numbers and apply any code formatting such as indenting that you prefer.You can remove any FORMAT statements assigning $xx. formats.). Test that cleaned up code that it reads the file (just in case). SAVE the code.

Now you check the data and variable properties. If you understand the informats well enough you can read the code. If not run Proc Contents and see if the variable types (numeric and character) agree with the documentation I hope you have from the data source. Common things to look for are identity variables, account numbers, social security numbers and such that will not do arithmetic with being read as numeric. Those should likely be character to preserve such things as leading zeroes. Replace the likely Best32. or similar with $xx. on the INFORMAT for those variables. Check that anything with dates or datetimes is read correctly. If your data source does not use 4 digit years then Proc Import may use the wrong informat to read the specific values. Typically there will be a better informat in that case. Change and test the informats for those.

If your data source places values like NULL or N/A in variables that should be numeric you may see variables that should be numeric appear as character. You can force them to be numeric by using an Informat like Best32 (or just plain 32) which will have the values of Null or similar as SAS missing numeric values. Better is a custom informat that explicitly sets those known values to missing to keep the log clean.

Make sure the character variables are long enough to store expected values. If the documentation says a variable could contain 50 characters but your data only had 35 then sometime in the future you may have a problem with truncated values in later data sets. Change the informat to $50. (or match the documentation). If there is no documentation you may want to consider setting the informat for character variables to 10 or 20 longer as a "just in case" for later. The character length issue is a common question on this forum dealing with combining data sets and getting warnings about different lengths and truncated data. Worse is when Proc Import guesses different types for different files (one file doesn't have those NA values so the variable imports as numeric, but attempting to combine data is an error because of different variable types).

SAVE the program and test your changes.

Strongly recommend looking at the variable names and seeing if they make sense. I have one data source that will start almost every variable with "CLIENT_". When it appears on everything it is meaningless you can write code with shorter names. Also if the source had very long names you may just want to make shorter names (the editor search and replace is helpful if very careful). Then use LABELS to provide longer and more grammatically correct descriptions of the variables.

Test the code and results.

Then the next file you need to read you only change the name of the output data set and the source text file.

Exercise for the interested reader: If you have fields in the data that contain a small number of "code" values, such as 1, 2, 3 you can write a custom informat that will show an error if somehow a 99 is in the data. Or if one of the codes, such as 99, is to indicate "not collected" or "missing" or similar, the custom informat could read those code values into missing directly. Or if a continuous measurement should be an expected range you can set other values outside the expected range to missing. No need for a later step to "fix" values.

I have a project where one of the values is a Site code. Occasionally the source will insert a new code without warning. I have in informat to read that variable and generate an error message when they do this. Then I can contact the source and get the other associated information, name, location and such about that site to update my code and other data elements (calculations related to elevation above sea level for instance).

Tom · Posted 09-24-2021 05:07 PM

First thing is that it sounds like you do not have ZIP file. Instead you have a file made with GZIP, a totally different compression utility. A ZIP file can contain multiple files, but a GZIP file can only contain one file. Fortunately SAS can read both now.

If the file really has these two lines as the first two lines

Claim_ID_MS_MDC_Cd_MSDRG_Cd_Serv
1873420141|||PRP0||||||

then you seem to have something wrong.

From the second line it looks like | is the delimiter. But then what the heck is that first line? It looks like it is use _ as the delimiter. But still have an issue as there are 8 underscore delimited strings on that first line and 9 pipe characters on the second line. So even if there is an extra | at the end of the line (I have seen some files add that extra delimiter to avoid issues with extra CR inserted by Windows end of line character standards) you are still one name short.

To get a quick look at the file you don't need to open it with Notepad. Just use a data step to look at the file. This code will dump the first 10 lines of the file to the SAS log so you can see what is there.

data _null_;
   infile 'myfile.gz' zip gzip obs=10;
   input;
   list;
run;

Since it looks like you only have 9 or 10 fields in that file just write your own data step to read it. If you really have no idea what the columns are and how long they should just start by reading every column as text and taking a look at it.

data want;
   infile 'myfile.gz' zip gzip firstobs=2 dsd dlm='|' truncover;
   length var1-var10 $30 ;
   input var1-var10;
run;

importing text file with unknown delimiter

Re: importing text file with unknown delimiter

Re: importing text file with unknown delimiter

Re: importing text file with unknown delimiter

Re: importing text file with unknown delimiter

Re: importing text file with unknown delimiter

importing text file with unknown delimiter

Re: importing text file with unknown delimiter

Re: importing text file with unknown delimiter

Re: importing text file with unknown delimiter

Re: importing text file with unknown delimiter

Re: importing text file with unknown delimiter

Click image to register for webinar

Classroom Training Available!