BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
djbateman
Lapis Lazuli | Level 10

I receive data files (.txt) from an external vendor.  The files contain subject IDs in the format XXX-XXX-XXX (e.g., 101-007-123).  I have noticed that when SAS reads this in using PROC IMPORT that it is trying to apply the mmddyy10. format to this field.  Because it is expecting a date string, it gives me an invalid data note for every record in the text file and then at the end gives me an unsuccessful import error (even though the data is successfully imported but with an empty subject field).

 

I'm wondering if there is a way to work around this.  I can ignore the invalid data notes, but the error in the log will cause a later flag that will keep later linked programs from executing.

 

I know that SAS provides the import code in the log, which I could copy and change the informat/format lines, but the program is standardized for different domains that have different variables, and this hardcoding would make things much more difficult to maintain.  Also, I don't even use this subject field in my program, so I don't need to have it.  Is there a way to have SAS not read in a specific field?  (I doubt it.)  Basically, I just need to know how I can get SAS to recognize SUBJECT as a character string and not a hyphenated date field.

 

Here is the SAS code I'm using (dummy text file attached):

filename inputtxt "S:\cdm\Development\Historical\864\002\Clinical_DB\20210715_Copy\864-002-UAT-CM 20210719091521.txt" termstr=crlf;	* encoding='UTF-8';

proc import out=temp datafile=inputtxt dbms=dlm replace;
	delimiter='09'x;
	getnames=yes;
	datarow=2;
	guessingrows=max;
run;

 

 

Here is the log output:

1002  filename inputtxt
1002! "S:\cdm\Development\Historical\864\002\Clinical_DB\20210715_Copy\864-002-UAT-CM
1002! 20210719091521.txt" termstr=crlf; * encoding='UTF-8';
1003
1004  proc import out=temp datafile=inputtxt dbms=dlm replace;
1005      delimiter='09'x;
1006      getnames=yes;
1007      datarow=2;
1008      guessingrows=max;
1009  run;

NOTE: A byte-order mark in the file
      "S:\cdm\Development\Historical\864\002\Clinical_DB\20210715_Copy\864-002-UAT-CM
      20210719091521.txt" (for fileref "INPUTTXT") indicates that the data is encoded in "utf-8".
      This encoding will be used to process the file.
NOTE: A byte-order mark in the file
      "S:\cdm\Development\Historical\864\002\Clinical_DB\20210715_Copy\864-002-UAT-CM
      20210719091521.txt" (for fileref "INPUTTXT") indicates that the data is encoded in "utf-8".
      This encoding will be used to process the file.
1010   /**********************************************************************
1011   *   PRODUCT:   SAS
1012   *   VERSION:   9.4
1013   *   CREATOR:   External File Interface
1014   *   DATE:      21JUL21
1015   *   DESC:      Generated SAS Datastep Code
1016   *   TEMPLATE SOURCE:  (None Specified.)
1017   ***********************************************************************/
1018      data WORK.TEMP    ;
1019      %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
1020      infile INPUTTXT delimiter='09'x MISSOVER DSD lrecl=13106 firstobs=2 ;
1021         informat ID best32. ;
1022         informat Verbatim $11. ;
1023         informat Status $8. ;
1024         informat External_Id $139. ;
1025         informat Assigned_Type $5. ;
1026         informat Assigned_Code best32. ;
1027         informat Assigned_Text $17. ;
1028         informat Path_Code $53. ;
1029         informat Assignment_Context $9. ;
1030         informat Source $1. ;
1031         informat Score best32. ;
1032         informat Dictionary $12. ;
1033         informat Version anydtdtm40. ;
1034         informat Study $17. ;
1035         informat Subject mmddyy10. ;
1036         informat Site best32. ;
1037         informat Form $33. ;
1038         informat Route $7. ;
1039         informat Indication $9. ;
1040         informat Parent1Type $4. ;
1041         informat Parent1Code $1. ;
1042         informat Parent1Text $31. ;
1043         informat Parent2Type $4. ;
1044         informat Parent2Code $3. ;
1045         informat Parent2Text $27. ;
1046         informat Parent3Type $4. ;
1047         informat Parent3Code $4. ;
1048         informat Parent3Text $33. ;
1049         informat Parent4Type $4. ;
1050         informat Parent4Code $5. ;
1051         informat Parent4Text $37. ;
1052         informat Parent5Type $4. ;
1053         informat Parent5Code best32. ;
1054         informat Parent5Text $11. ;
1055         informat Last_Modified anydtdtm40. ;
1056         informat Creation_Date anydtdtm40. ;
1057         format ID best12. ;
1058         format Verbatim $11. ;
1059         format Status $8. ;
1060         format External_Id $139. ;
1061         format Assigned_Type $5. ;
1062         format Assigned_Code best12. ;
1063         format Assigned_Text $17. ;
1064         format Path_Code $53. ;
1065         format Assignment_Context $9. ;
1066         format Source $1. ;
1067         format Score best12. ;
1068         format Dictionary $12. ;
1069         format Version datetime. ;
1070         format Study $17. ;
1071         format Subject mmddyy10. ;
1072         format Site best12. ;
1073         format Form $33. ;
1074         format Route $7. ;
1075         format Indication $9. ;
1076         format Parent1Type $4. ;
1077         format Parent1Code $1. ;
1078         format Parent1Text $31. ;
1079         format Parent2Type $4. ;
1080         format Parent2Code $3. ;
1081         format Parent2Text $27. ;
1082         format Parent3Type $4. ;
1083         format Parent3Code $4. ;
1084         format Parent3Text $33. ;
1085         format Parent4Type $4. ;
1086         format Parent4Code $5. ;
1087         format Parent4Text $37. ;
1088         format Parent5Type $4. ;
1089         format Parent5Code best12. ;
1090         format Parent5Text $11. ;
1091         format Last_Modified datetime. ;
1092         format Creation_Date datetime. ;
1093      input
1094                  ID
1095                  Verbatim $
1096                  Status $
1097                  External_Id $
1098                  Assigned_Type $
1099                  Assigned_Code
1100                  Assigned_Text $
1101                  Path_Code $
1102                  Assignment_Context $
1103                  Source $
1104                  Score
1105                  Dictionary $
1106                  Version
1107                  Study $
1108                  Subject
1109                  Site
1110                  Form $
1111                  Route $
1112                  Indication $
1113                  Parent1Type $
1114                  Parent1Code $
1115                  Parent1Text $
1116                  Parent2Type $
1117                  Parent2Code $
1118                  Parent2Text $
1119                  Parent3Type $
1120                  Parent3Code $
1121                  Parent3Text $
1122                  Parent4Type $
1123                  Parent4Code $
1124                  Parent4Text $
1125                  Parent5Type $
1126                  Parent5Code
1127                  Parent5Text $
1128                  Last_Modified
1129                  Creation_Date
1130      ;
1131      if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
1132      run;

NOTE: A byte-order mark in the file
      "S:\cdm\Development\Historical\864\002\Clinical_DB\20210715_Copy\864-002-UAT-CM
      20210719091521.txt" (for fileref "INPUTTXT") indicates that the data is encoded in "utf-8".
      This encoding will be used to process the file.
NOTE: The infile INPUTTXT is:

      Filename=S:\cdm\Development\Historical\864\002\Clinical_DB\20210715_Copy\864-002-UAT-CM
      20210719091521.txt,
      RECFM=V,LRECL=52424,File Size (bytes)=2067,
      Last Modified=19Jul2021:09:20:00,
      Create Time=20Jul2021:16:35:34

NOTE: Invalid data for Subject in line 2 296-306.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9--

2   CHAR  131.Tylenol.Exported.VX20-864-002_UAT3|United States|001|002-001-123|common_log_forms|1|comm
    ZONE  33305766666047767766055332333233355453756676625767677333733323332333766666656665667677376666
    NUMR  131949C5EFC9580F245496820D864D002F5143C5E94540341453C001C002D001D123C3FDDFEFCF7F6F2D3C1C3FDD

      93  on_log_forms|1|concomitant_medications|1|VV-000258|cmig|1|medication.TRADE.00020001005.TYLEN
    ZONE  66566656676773766666667667566666676667737552333333766667376666667666055444033333333333055444
    NUMR  FEFCF7F6F2D3C1C3FE3FD941E4FD5493149FE3C1C66D000258C3D97C1CD5493149FE942145900020001005949C5E

     185  OL.ATC1:N|ATC2:N02|ATC3:N02B|ATC4:N02BE|PREF:00020001001.pgsupport.D.1.WHODrug G B3.Mar 2021
    ZONE  44045433474543343374543343347454334334475544333333333333076777767704030544477624243046723333
    NUMR  FC91431AEC1432AE02C1433AE022C1434AE0225C0256A000200010019073500F249491978F4257070239D1202021

     277  .VX20-864-002_UAT3.002-001-123.1.Prior and Concomitant Medications.Oral.Migraines.ATC1.N.NER
    ZONE  05533233323335545303332333233303057667266624666666766724666667666704766046676666704543040445
    NUMR  96820D864D002F51439002D001D123919029F201E403FE3FD941E40D5493149FE39F21C9D97219E53914319E9E52

     369  VOUS SYSTEM.ATC2.N02.ANALGESICS.ATC3.N02B.OTHER ANALGESICS AND ANTIPYRETICS.ATC4.N02BE.ANILI
    ZONE  54552555544045430433044444454450454304334045445244444454452444244545554544504543043344044444
    NUMR  6F53039345D914329E0291E1C753933914339E0229F485201E1C75393301E401E4909254933914349E022591E9C9

     461  DES.PREF.00020001001.PARACETAMOL.16/Jul/2021 08:11:12.16/Jul/2021 08:10:56 534
    ZONE  44505544033333333333054544454444033247623333233333333033247623333233333333
    NUMR  45390256900020001001901213541DFC916FA5CF2021008A11A12916FA5CF2021008A10A56
ID=131 Verbatim=Tylenol Status=Exported
External_Id=VX20-864-002_UAT3|United States|001|002-001-123|common_log_forms|1|common_log_forms|1|conc
omitant_medications|1|VV-000258|cmig|1|medication Assigned_Type=TRADE Assigned_Code=20001005
Assigned_Text=TYLENOL Path_Code=ATC1:N|ATC2:N02|ATC3:N02B|ATC4:N02BE|PREF:00020001001
Assignment_Context=pgsupport Source=D Score=1 Dictionary=WHODrug G B3 Version=01MAR21:00:00:00
Study=VX20-864-002_UAT3 Subject=. Site=1 Form=Prior and Concomitant Medications Route=Oral
Indication=Migraines Parent1Type=ATC1 Parent1Code=N Parent1Text=NERVOUS SYSTEM Parent2Type=ATC2
Parent2Code=N02 Parent2Text=ANALGESICS Parent3Type=ATC3 Parent3Code=N02B
Parent3Text=OTHER ANALGESICS AND ANTIPYRETICS Parent4Type=ATC4 Parent4Code=N02BE Parent4Text=ANILIDES
Parent5Type=PREF Parent5Code=20001001 Parent5Text=PARACETAMOL Last_Modified=16JUL21:08:11:12
Creation_Date=16JUL21:08:10:56 _ERROR_=1 _N_=1
NOTE: Invalid data for Subject in line 4 300-310.

4   CHAR  134.Advil.Exported.VX20-864-002_UAT3|United States|001|002-001-888|common_log_forms|1|common
    ZONE  33304676604776776605533233323335545375667662576767733373332333233376666665666566767737666666
    NUMR  13491469C9580F245496820D864D002F5143C5E94540341453C001C002D001D888C3FDDFEFCF7F6F2D3C1C3FDDFE

      93  _log_forms|1|concomitant_medications|1|VV-000258|cmig|1|medication.TRADE.00109201030.ADVIL [
    ZONE  56665667677376666666766756666667666773755233333376666737666666766605544403333333333304454425
    NUMR  FCF7F6F2D3C1C3FE3FD941E4FD5493149FE3C1C66D000258C3D97C1CD5493149FE94214590010920103091469C0B

     185  IBUPROFEN].ATC1:C|ATC2:C01|ATC3:C01E|ATC4:C01EB|PREF:00109201001.islack.D..WHODrug G B3.Mar
    ZONE  44555444450454334745433433745433433474543343344755443333333333330676666040054447762424304672
    NUMR  92502F65ED91431A3C1432A301C1433A3015C1434A30152C0256A00109201001993C13B949978F4257070239D120

     277  2021.VX20-864-002_UAT3.002-001-888.1.Prior and Concomitant Medications.Oral.Rash.ATC1.C.CARD
    ZONE  33330553323332333554530333233323330305766726662466666676672466666766670476605676045430404454
    NUMR  202196820D864D002F51439002D001D888919029F201E403FE3FD941E40D5493149FE39F21C92138914319393124



RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9--

     369  IOVASCULAR SYSTEM.ATC2.C01.CARDIAC THERAPY.ATC3.C01E.OTHER CARDIAC PREPARATIONS.ATC4.C01EB.O
    ZONE  44545454452555544045430433044544442544545504543043340454452445444425545454544450454304334404
    NUMR  9F61335C12039345D914329301931249130485210991433930159F4852031249130025012149FE3914349301529F

     461  THER CARDIAC PREPARATIONS.PREF.00109201001.IBUPROFEN.19/Jul/2021 11:36:45.16/Jul/2021 08:10:
    ZONE  54452445444425545454544450554403333333333304455544440332476233332333333330332476233332333333
    NUMR  4852031249130025012149FE390256900109201001992502F65E919FA5CF2021011A36A45916FA5CF2021008A10A
     553  56 554
ID=134 Verbatim=Advil Status=Exported
External_Id=VX20-864-002_UAT3|United States|001|002-001-888|common_log_forms|1|common_log_forms|1|conc
omitant_medications|1|VV-000258|cmig|1|medication Assigned_Type=TRADE Assigned_Code=109201030
Assigned_Text=ADVIL [IBUPROFEN] Path_Code=ATC1:C|ATC2:C01|ATC3:C01E|ATC4:C01EB|PREF:00109201001
Assignment_Context=islack Source=D Score=. Dictionary=WHODrug G B3 Version=01MAR21:00:00:00
Study=VX20-864-002_UAT3 Subject=. Site=1 Form=Prior and Concomitant Medications Route=Oral
Indication=Rash Parent1Type=ATC1 Parent1Code=C Parent1Text=CARDIOVASCULAR SYSTEM Parent2Type=ATC2
Parent2Code=C01 Parent2Text=CARDIAC THERAPY Parent3Type=ATC3 Parent3Code=C01E
Parent3Text=OTHER CARDIAC PREPARATIONS Parent4Type=ATC4 Parent4Code=C01EB
Parent4Text=OTHER CARDIAC PREPARATIONS Parent5Type=PREF Parent5Code=109201001 Parent5Text=IBUPROFEN
Last_Modified=19JUL21:11:36:45 Creation_Date=16JUL21:08:10:56 _ERROR_=1 _N_=3
NOTE: 3 records were read from the infile INPUTTXT.
      The minimum record length was 534.
      The maximum record length was 588.
NOTE: The data set WORK.TEMP has 3 observations and 36 variables.
NOTE: DATA statement used (Total process time):
      real time           0.11 seconds
      cpu time            0.12 seconds


Errors detected in submitted DATA step. Examine log.
3 rows created in WORK.TEMP from INPUTTXT.



ERROR: Import unsuccessful.  See SAS Log for details.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           0.21 seconds
      cpu time            0.20 seconds

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Any process that uses multiple files and relies on Proc Import to read the data will have problems eventually unless the data has very low variability. Why you may ask.  The procedure examines each data and makes guesses as to the properties. So variable lengths can change. Depending on the actual data in different files the type of variable can change. With some unreliable sources the variable names can change.

 

If you have text file start with that data step in the log.

Copy to the editor and clean up, like remove line numbers and such.

Set the properties you want, such as the informat/format (basically only needed for date, time or datetime variables).

SAVE the program.

Reread the data.

Next time you get a file change the name of the input file and maybe the output data set depending on what you do with it.

 

If you have documentation that a character variable like External_id in your code could be up to 150 characters use that for the informat. Then you don't get truncated data.

For your specific problem variable use an informat like $10 (or $12 or how ever long you expect the values).

 

View solution in original post

5 REPLIES 5
ballardw
Super User

Any process that uses multiple files and relies on Proc Import to read the data will have problems eventually unless the data has very low variability. Why you may ask.  The procedure examines each data and makes guesses as to the properties. So variable lengths can change. Depending on the actual data in different files the type of variable can change. With some unreliable sources the variable names can change.

 

If you have text file start with that data step in the log.

Copy to the editor and clean up, like remove line numbers and such.

Set the properties you want, such as the informat/format (basically only needed for date, time or datetime variables).

SAVE the program.

Reread the data.

Next time you get a file change the name of the input file and maybe the output data set depending on what you do with it.

 

If you have documentation that a character variable like External_id in your code could be up to 150 characters use that for the informat. Then you don't get truncated data.

For your specific problem variable use an informat like $10 (or $12 or how ever long you expect the values).

 

djbateman
Lapis Lazuli | Level 10

ballardw,

 

Thanks for your input.  I have thought about copying the data step code from the log, but this would only apply to this CM file that is being read in.  I'm also reading in AE, MH, and a couple of other domains that differ slightly from each other.  With a total of 5 domains, I would have to write 5 different blocks of data step code and conditionally run depending on which domain is detected.  It's not impossible, but I was hoping to avoid this mess, especially only to fix one variable that isn't even being used anywhere else in the program.  If there is no better option, then I might have to suck it up and just use your method.  I'm hoping the vendor can just send me the file without the Subject variable, but they may have configured and validated things on their end and won't want to accommodate that change.

ballardw
Super User

@djbateman wrote:

ballardw,

 

Thanks for your input.  I have thought about copying the data step code from the log, but this would only apply to this CM file that is being read in.  I'm also reading in AE, MH, and a couple of other domains that differ slightly from each other.  With a total of 5 domains, I would have to write 5 different blocks of data step code and conditionally run depending on which domain is detected.  It's not impossible, but I was hoping to avoid this mess, especially only to fix one variable that isn't even being used anywhere else in the program.  If there is no better option, then I might have to suck it up and just use your method.  I'm hoping the vendor can just send me the file without the Subject variable, but they may have configured and validated things on their end and won't want to accommodate that change.


"Write Read program once properly, run on thousands of files.

Proc import each file, fix thousands of data problems later."

 

Maybe not a real "saying" but experience.

 

When multiple files are being read you want to make sure they have the same structure. Otherwise any process that involves combining the data is going to be fraught with problems.

 

If there is a key value to identify the type of "records" then 5 is not that much of an issue. I used to work with remote sensor data that had literally dozens of different record layouts in a single file with different reporting intervals.

That is part of what the Input @; is for. So you can examine a row of text and determine what type of record you are on.

 

 

So you "import" one of each, refer to the data documentation clean up and have 5 programs.

I have data sources that change files by changing order constantly. Proc Import would create garbage because the data is sparse for some fields, i.e. some of the fields don't get populated in every file. So those would be "$1." variables in one file but should be numeric or much longer character variables.

When I see a problem with the data order (like every other file), all I do is change the order of variables on the INPUT statement. Otherwise when I combine the data from different time periods the sets will not append because variable types are wrong or result in truncated data.

Or when someone changes a heading from "subject id" to "Identification". Proc import will have two different variables. The data step still reads column 3 (just an example) as the named "subject_id" originally set up for my SAS data sets.

 

If that sounds like too much work then search this forum for "proc import" and find out how many ways you have to fix data later.

djbateman
Lapis Lazuli | Level 10

Okay, you have convinced me.  After looking at the file structure of the 5 different domains, they are almost identical with about 1-2 variables differing.  It will be much easier and trustworthy to use the data step.  I was just hoping there was a magic fix out there that I was unaware of.

Tom
Super User Tom
Super User

Just write the data step to read the text file yourself and you will have complete control over how name, type, storage length, formats permanently attached, informat used to read from the text file for each variable.

 

Your data step does not need to be as ugly as the code that PROC IMPORT generates.

 

In general you should just define the variables using a LENGTH statement. Attach any formats needed (most variables do not need formats attached.)  Attach any informats needed (most variables do not need informats).  Then read the variables. If you defined the variables in the same order as they will appear in the text file then the INPUT statement is trivial using a position based variable list.  

 

Plus since you are in control you can even rename repeating variables to have the series number at the end instead of the  in the middle to make them easier to use in SAS code.

 

data want;
  infile "&path/&fname" dsd dlm='09'x truncover firstobs=2;
  length
    id                     8      
    verbatim               $50    
    status                 $8     
    external_id            $200   
    assigned_type          $5     
    assigned_code          8      
    assigned_text          $20    
    path_code              $80    
    assignment_context     $9     
    source                 $1     
    score                  8      
    dictionary             $12    
    version                $30    
    study                  $20    
    subject                $10    
    site                   8      
    form                   $40    
    route                  $7     
    indication             $40    
    parent_type1           $8             
    parent_code1           $8             
    parent_text1           $50            
    parent_type2           $8             
    parent_code2           $8             
    parent_text2           $50            
    parent_type3           $8             
    parent_code3           $8             
    parent_text3           $50            
    parent_type4           $8             
    parent_code4           $8             
    parent_text4           $50            
    parent_type5           $8             
    parent_code5           $8             
    parent_text5           $50            
    last_modified          8       
    creation_date          8       
  ;
  format  last_modified creation_date datetime19.;
  informat  last_modified creation_date datetime.;
  input id -- creation_date ;
run;

 

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 917 views
  • 0 likes
  • 3 in conversation