
Dataflux combine multiple txt files into a single file for further processing

Occasional Contributor
Posts: 7

Hi All,

 

I am trying to find the best way to handle one or more txt files in Data Management Studio 2.7. The files are received from different sources but share the same format, and I need to combine them into a single file so that all the records can be processed together. The files may contain the same records, so they need to be combined in a way that ensures only the unique records are processed.

 

I have created a process job using the following method (see image Process image.png):

  • Create a blank generic txt file to hold all the data received with the correct heading record
  • Check to see if file 1 exists - if it does, union with the generic txt file and write back to the generic text file
  • Check to see if file 2 exists - if it does, union with the generic txt file and write back to the generic text file
  • Use this generic txt file in the main data job
  • Archive all files
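
For readers who want to see the logic of these steps outside DMS, here is a minimal Python sketch of the same union-and-deduplicate idea. It is not DataFlux configuration. The function name `combine_unique` is hypothetical, and it assumes comma-delimited files with a header on the first line and that "duplicate" means an exact match on every field.

```python
import csv
from pathlib import Path


def combine_unique(input_paths, output_path, delimiter=","):
    """Union same-format delimited text files, keeping each distinct record once.

    Assumptions (not from the original post): the first line of every file
    is the header, and two records are duplicates only if all fields match.
    """
    header = None
    seen = set()
    unique_rows = []
    for path in map(Path, input_paths):
        if not path.exists():
            # Plays the role of the "does file N exist?" check in the job.
            continue
        with path.open(newline="") as handle:
            reader = csv.reader(handle, delimiter=delimiter)
            file_header = next(reader, None)
            if file_header is None:
                continue  # empty file, nothing to add
            if header is None:
                header = file_header
            for row in reader:
                key = tuple(row)
                if key not in seen:
                    seen.add(key)
                    unique_rows.append(row)
    # All inputs are fully read before the output is opened for writing.
    with Path(output_path).open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        if header is not None:
            writer.writerow(header)
        writer.writerows(unique_rows)
    return len(unique_rows)
```

Because the function takes a list of paths, the number of incoming files does not change the logic; missing files are simply skipped.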

I acknowledge that only two files may be received in this process, but other processes receive larger numbers of files. How do you manage this in DMS when there are more files, without having to check for each file individually before adding it to the generic txt file?

 

I think there must be a better way but I just don't know what it could be.

 

I have also found an issue with reading the generic txt file in the "Data Job 1" node. The output is written correctly in "Generate data for OSHC Dir...", as, say, 458 records, but it is not read into "Data Job 1" in its entirety. The read always stops at the same location in the generic txt file when the file is read into the input file node within this data job.

 

I suspect this issue is caused by reading from and writing to the same file, but I am not sure how to avoid that while still combining these files.
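
As a general illustration of how the read/write conflict is usually avoided (again in plain Python, not in DMS), one common pattern is to write the new contents to a temporary file and only swap it into place once writing has finished. The helper name `replace_file_safely` is hypothetical; this is a sketch under the assumption that the combined file fits the simple line-per-record layout described above.

```python
import os
import tempfile


def replace_file_safely(target_path, lines):
    """Write lines to a temporary file, then swap it into place in one step.

    The target file is never open for writing while it may still be read,
    so a reader sees either the old contents or the complete new contents.
    """
    directory = os.path.dirname(os.path.abspath(target_path))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            for line in lines:
                handle.write(line + "\n")
        os.replace(temp_path, target_path)
    except BaseException:
        # Clean up the partial temp file if anything went wrong.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

The same idea applies whatever the tool: finish reading the old file (or land the data somewhere else) before the new version overwrites it.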

 

Any suggestions would be greatly appreciated.


Thanks,


Allan. 


Process image.png
SAS Super FREQ
Posts: 97

Re: Dataflux combine multiple txt files into a single file for further processing

You may be able to use the parallel iterator node in your process job to check for new files and write the contents to a work table using the work table writer node in your data job. A subsequent data job could read data out of the work table using the work table reader node after which you could perform duplicate removal processing using clustering and surviving record identification nodes.

 

For the possible issue related to reading from and writing to the same file at the same time, using a work table may help. Alternatively, if you want to stay with the current job design, you could use a branch node in between the steps where you are reading and writing data. In the branch node, you can set the option to "land all data locally before processing continues" to force the behavior you are looking for.

 

Ron

Occasional Contributor
Posts: 7

Re: Dataflux combine multiple txt files into a single file for further processing

Posted in reply to RonAgresta

Hi Ron,

 

Thanks for the reply.  I have converted the process to use a database table rather than the txt file to aggregate the multiple files into one.  I will look into the work table option to ensure it is all contained within DataFlux.

 

I have looked at the parallel processor node and it looks like it may be able to achieve what I am after.  The question I have now is: how do you identify the number of files that exist in a particular location and pass these file names to the process?
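
To make the question concrete, this is the kind of lookup I mean, sketched in plain Python rather than in DataFlux. It does not show how any DMS node is configured; the function name `find_input_files` and the `*.txt` pattern are assumptions for illustration only.

```python
from pathlib import Path


def find_input_files(directory, pattern="*.txt"):
    """Return the files in a directory that match the pattern, sorted by name."""
    return sorted(p for p in Path(directory).glob(pattern) if p.is_file())
```

The length of the returned list is the number of files to process, and each entry is a file name that could be handed to one iteration of the job.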

 

Thanks,

 

Allan.
