dapenDaniel
Obsidian | Level 7

Hi! I have more than 200 txt files with the same variables (40 variables in total). I used PROC IMPORT to import the txt files into SAS separately and would like to combine them into one file. I used the code below (I did not list all the file names).

data all;
set b01a
    b01b
    b02a
    b02b
    b02c
    b02d
    b02e
.
.
.
b08b; run;

After I ran the code, it showed:

 

ERROR: Variable AppDate has been defined as both character and numeric.
ERROR: Variable PubDate has been defined as both character and numeric.
ERROR: Variable JPFI has been defined as both character and numeric.
ERROR: Variable ItemNum has been defined as both character and numeric.
ERROR: Variable AssCount has been defined as both character and numeric.
ERROR: Variable InvCount has been defined as both character and numeric.

ERROR: Variable AssCount has been defined as both character and numeric.
ERROR: Variable InvCount has been defined as both character and numeric.

 

Can anyone tell me what I should do to solve this problem? Thank you very much!


Reeza
Super User
You need to make them the same type. You will likely need to reimport the data; it has probably been imported incorrectly. Is there any chance this is patent data?
dapenDaniel
Obsidian | Level 7

Yes. It is patent data.

 

There is no error message shown when I import the txt files. Is there any way to solve this problem when I combine these data?

Reeza
Super User
There's no error message, but that doesn't mean it's correct. The code runs, but it's clearly reading the data differently from file to file. There was another user, https://communities.sas.com/t5/user/viewprofilepage/user-id/194466, who seemed to have the same issues and possibly the same data you're working with here. I believe he encountered this issue as well, and he also had significant trouble reading the data at first.

If you used PROC IMPORT, that would be why you're having this issue. It guesses at the variable types, and it can guess wrong. So in this case, for some files it guessed one type and for other files it guessed the other type. Then you cannot stack the data because of that variation.
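
One way to see which variables were guessed inconsistently is to query the dictionary tables. This is only a sketch; it assumes the imported data sets sit in the WORK library and are named B01A, B01B, and so on:

proc sql;
  select name, type, count(*) as n_files
    from dictionary.columns
    where libname = 'WORK' and memname like 'B%'
    group by name, type
    order by name;
quit;

Any variable that shows up twice here, once as char and once as num, is one that PROC IMPORT guessed differently across files.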
mkeintz
PROC Star

If

  1. all your imported txt data files have the same columns in the same order, and
  2. one of the PROC IMPORTs produces variables of the type you want in the final version,

then take the "successful" PROC IMPORT generated code and revise it to read each of the other txt files. Then check whether all the other txt imports produce the expected data. If so, run the data step you posted.

data_null__
Jade | Level 19

If all the TXT files have the same field layout, it would be easier to read them with a single data step. You can read them all as one file using a wildcard FILEREF:

 

FILENAME BTXT "path-text-files\B*.txt";

 

I would need more info for a complete program.
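
A rough sketch of how that FILEREF would then be used (the tab delimiter, variable names, and lengths here are just placeholders):

FILENAME BTXT "path-text-files\B*.txt";

data all;
  infile btxt dlm='09'x dsd truncover;  /* adjust the delimiter to whatever your files use */
  length PubNum $20 AppNum $50;         /* define all 40 variables here, in the order they occur */
  input PubNum AppNum;                  /* and list them in that same order */
run;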

 

If they don't have the same field layout but do have common field names, you can still read them all with one data step, but it is a bit more involved. Which scenario do you have?

dapenDaniel
Obsidian | Level 7

Does field layout mean the order of the variables? The order of the variables is the same and the variable names are the same. Does that make sense? Thanks.

Reeza
Super User

Change your process. 

 

1. Put all files in a single folder, or a folder for each type.

2. Pick one file and read it using PROC IMPORT. Copy the generated code from the log. Modify it to ensure variable types and lengths are correct and match your documentation or record layout.

3. Change the file reference to incorporate a wildcard, which would allow it to read all the files at once.

4. No need to append; you only need to merge for the final output now.

dapenDaniel
Obsidian | Level 7

Can you explain Step 3 in more detail? What specific code do I need to use? Sorry, I am still new to SAS and not familiar with wildcards. Thanks.

Tom
Super User

DO NOT use PROC IMPORT for this problem. You could try running it on one file and looking at the code it generates, but it generates really ugly code, so you will probably get to the right solution faster by just writing your own code.

 

Do you know the names of the variables ("fields")?

Do you know the type of data that each variable contains?

Are any of the variables DATE, TIME, DATETIME, or other values that SAS might need a special INFORMAT to read and a special FORMAT to display so humans can understand them?

Do your files have header rows?

Are your files delimited text files (like a CSV file or a tab-delimited file)? Or is the data in fixed character positions on the lines?

Is there any information you need that is only available in the name of the file? Such as a YEAR or a LOCATION?

 

The basic structure of a data step to read a delimited file is: DATA statement, INFILE statement, LENGTH statement, optional LABEL statement, optional INFORMAT statement, optional FORMAT statement, INPUT statement, and RUN statement.

 

So, assuming your files have just the variables (fields) in the error messages you showed, and making some assumptions about what type of data those variables hold based on their names, your program might look like this:

data want ;
  length fname $255;
  infile "/path to files/*.csv" dsd dlm=',' truncover filename=fname ;
  length AppDate 8 PubDate 8 JPFI $10 ItemNum 8 AssCount 8 InvCount 8;
  informat  AppDate PubDate mmddyy. ;
  format AppDate PubDate yymmdd10. ;
  input @;                            /* read the next line and hold it */
  if fname ne lag(fname) then input;  /* new file started: discard its header row */
  input AppDate -- InvCount ;
run;

The variable FNAME and the FILENAME=FNAME option, plus the extra INPUT @ and IF statements, are there to allow skipping header rows. If your files do not have header rows then you do not need the extra INPUT and IF statements. If there is any information in the filename that you need, add some statements after the real INPUT statement to pull that information out of the FNAME variable.

The * in the INFILE statement is a wildcard. SAS only supports a single use of the * in the filename.

Note that you should use a length of 8 for all numeric variables. For character variables use a $ and the maximum number of characters (actually bytes) that you will need to store.

The INPUT statement uses a positional variable list which means that you need to define the variables (see LENGTH statement) in the same order as they appear in the file.  If you did not define them in the same order then just list them in the INPUT statement in the order they do appear in the file.  If the variables are already defined (and those that need informats have them attached) then there is no need to add informat specifications to the INPUT statement.
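
For instance, a quick sketch (this particular column order is made up) where the file happens to have JPFI first even though the LENGTH statement declares the dates first:

data want ;
  infile "/path to files/*.csv" dsd dlm=',' truncover ;
  length AppDate 8 PubDate 8 JPFI $10 ;   /* declaration order */
  informat AppDate PubDate mmddyy. ;
  format AppDate PubDate yymmdd10. ;
  input JPFI AppDate PubDate ;            /* file order, listed explicitly */
run;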

 

To speed up the process you can just copy the first line of one of the files to begin creating the LENGTH statement.

 

dapenDaniel
Obsidian | Level 7

Thank you for your detailed answer. I tried the code you offered but I still have two problems.

 

1. I do not have a separate header row, but the first line of each file contains the variable names (field names). I used firstobs=2 but it only worked for the first file.

 

2. In the log, it shows that one or more lines were truncated for the second file, but the number of observations is correct except that it also read the line of variable names.

 

I attached my code and log below.

data abc.VV;
infile "D:\PV1\*.txt" delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=2 ;
length PubNum $ 20
       Title $ 32767
       TitleD $ 32767
       PrioNum $ 32767
       PrioNumD $ 32767
       PrioDate $ 32767
       PrioDateD $ 32767
       AppNum $ 50
      AppDate $ 20;
input
       PubNum 
       Title 
       TitleD 
       PrioNum 
       PrioNumD 
       PrioDate 
       PrioDateD 
       AppNum
       AppDate;
run;

 

NOTE: The infile "D:\PV1\*.txt" is:
            Filename=D:\PV1\V02a.txt,
            File List=D:\PV1\*.txt,RECFM=V,LRECL=32767

NOTE: The infile "D:\PV1\*.txt" is:
            Filename=D:\PV1\V02b.txt,
            File List=D:\PV1\*.txt,RECFM=V,LRECL=32767

NOTE: 29318 records were read from the infile "D:\PV1\*.txt". (This number is correct.)
           The minimum record length was 405.
           The maximum record length was 28301.
NOTE: 22952 records were read from the infile "D:\PV1\*.txt". (The correct number of observations is 22591)
            The minimum record length was 324.
            The maximum record length was 32767.
            One or more lines were truncated.
NOTE: The data set ABC.VV has 52270 observations and 40 variables.
NOTE: DATA statement used (Total process time):
real time 3:37.41
cpu time 1:12.51

Tom
Super User

Change the LRECL on the INFILE statement. The default is too short for your data files. You can use a value as large as 4 million without much issue on most systems.

 

Add the FILENAME= option and remove the FIRSTOBS=2 option.

Add the two lines that skip the header row by seeing when the filename changes.
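
In other words, the revised step would look roughly like this (only two placeholder variables shown; keep the rest of your LENGTH and INPUT statements as they were):

data abc.vv;
  length fname $255;
  infile "D:\PV1\*.txt" delimiter='09'x MISSOVER DSD lrecl=4000000 filename=fname;
  length PubNum $20 Title $32767;      /* ...rest of the LENGTH statement as before... */
  input @;                             /* read the next line and hold it */
  if fname ne lag(fname) then input;   /* first line of each new file is the variable names: skip it */
  input PubNum Title;                  /* ...rest of the INPUT statement as before... */
run;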

dapenDaniel
Obsidian | Level 7

Is there any way that I can import only specific variables? My code is below. Thanks.

data abc.V;
 length fname $255;
infile "D:\PV\*.txt" delimiter='09'x MISSOVER DSD lrecl=4000000 filename=fname;
length PubNum $ 50
       Title $ 32767
       TitleD $ 32767
       PrioNum $ 32767
       PrioNumD $ 32767
       PrioDate $ 32767
       PrioDateD $ 32767
       AppNum $ 50
	   AppDate $ 20
       PubDate $ 20
       IPCC $ 32767
       IPCCD $ 32767
       Assignee $ 32767
       AssUS $ 32767
       JPF $ 10
       JPFI $ 10
       ItemNum $ 8
       AssS $ 32767
       AssO $ 32767
       AssCode_D $ 32767
       AssCount $ 8
       InvO $ 32767
       InvD $ 32767
       InvCount $ 8
       Country $ 2
       PubYear $ 4
       AppYear $ 4 
       AppNumSer $ 100
       Class $ 32767
       ClassM $ 32767
       ClassO $ 32767
       ClassD $ 32767
       ManCodeD $ 32767
       CitedRefP $ 32767
       CitedRefCountP $ 8
       CitedNonPDOI $ 32767
       CitedRefCountNP $ 8
       CitingP $ 32767
       CitingCount $ 8
       MainStatus $ 50;
	input @;
	if fname ne lag(fname) then input;
input
       PubNum 
       Title 
       TitleD 
       PrioNum 
       PrioNumD 
       PrioDate 
       PrioDateD 
       AppNum
       AppDate
       PubDate
       IPCC 
       IPCCD 
       Assignee 
       AssUS 
       JPF 
       JPFI 
       ItemNum
       AssS 
       AssO 
       AssCode_D 
       AssCount
       InvO 
       InvD 
       InvCount
       Country 
       PubYear 
       AppYear 
       AppNumSer 
       Class 
       ClassM 
       ClassO 
       ClassD 
       ManCodeD 
       CitedRefP 
       CitedRefCountP
       CitedNonPDOI 
       CitedRefCountNP
       CitingP 
       CitingCount
       MainStatus;
if pubnum = "aaaaaaaaaaaaaaaaaaaa" then delete;
run;
Tom
Super User

Yes and No.

If the variables you don't want are at the end of the line then there is no need to read those.

But you cannot skip to delimited variable number 5 without first reading the variables ahead of it.

But you can DROP the variables you don't need.

And there is no need to define variables you don't need. Just read the columns you don't care about into a dummy variable and drop it.

For example, say you only want to read the first, third, and tenth variables out of 20 variables. Then you need to read 10 values from each line, but you only need to define 4 variables and drop one of them.

data example;
  length first $10 third 8 tenth $5 ;     /* only the variables you want to keep */
  length dummy $1 ;                       /* throw-away variable for the skipped columns */
  infile ... dsd truncover ... ;
  input first dummy third 6*dummy tenth;  /* 6*dummy reads and discards columns 4 through 9 */
  drop dummy;
run;

 
