<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Dealing with text in "newspaper archive formats" in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/Dealing-with-text-in-quot-newspaper-archive-formats-quot/m-p/558360#M10019</link>
    <description>&lt;P&gt;Has anyone had to process scanned/OCRed text in any of the following formats?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL type="disc" style="margin-top: 0cm;"&gt;
&lt;LI style="margin: 0px; color: #000000; font-family: 'Calibri',sans-serif; font-size: 11pt; font-style: normal; font-weight: normal;"&gt;&lt;SPAN style="margin: 0px;"&gt;GALEN XML - &lt;/SPAN&gt;&lt;SPAN style="margin: 0px;"&gt;&lt;A href="https://github.com/alan-turing-institute/i_newspaper_rods/blob/epcc-master/newsrods/test/fixtures/2000_04_24.xml" target="_blank" rel="noopener"&gt;https://github.com/alan-turing-institute/i_newspaper_rods/blob/epcc-master/newsrods/test/fixtures/2000_04_24.xml&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL type="disc" style="margin-top: 0cm;"&gt;
&lt;LI style="margin: 0px; color: #000000; font-family: 'Calibri',sans-serif; font-size: 11pt; font-style: normal; font-weight: normal;"&gt;&lt;SPAN style="margin: 0px;"&gt;METS and ALTO format &lt;/SPAN&gt;&lt;SPAN style="margin: 0px;"&gt;→&lt;/SPAN&gt;&lt;SPAN style="margin: 0px;"&gt; &lt;A href="https://es.wikipedia.org/wiki/ALTO_(XML)" target="_blank" rel="noopener"&gt;https://en.wikipedia.org/wiki/ALTO_(XML)&lt;/A&gt; and &lt;A href="https://veridiansoftware.com/knowledge-base/metsalto/" target="_blank" rel="noopener"&gt;https://veridiansoftware.com/knowledge-base/metsalto/&lt;/A&gt; . &lt;/SPAN&gt;&lt;SPAN style="margin: 0px;"&gt;Here is an example: &lt;A href="https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml" target="_blank" rel="noopener"&gt;https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml&lt;/A&gt; .&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN style="margin: 0px;"&gt;These are "digital archive" formats, designed to capture and preserve historic documents, with more emphasis on maintaining the "look" of the documents than on their content.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="margin: 0px;"&gt;I envisage having to parse large corpora of these kinds of documents and process them into many discrete, paragraph-sized fragments, each tagged with metadata about the file it came from and with the tagging from the input files themselves.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="margin: 0px;"&gt;From the looks of things, these are not nicely behaved bodies of text that will extract neatly, unlike, say, emails or business reports.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="margin: 0px;"&gt;Any experience or insights would be gratefully received. Thanks in advance.&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 14 May 2019 14:50:59 GMT</pubDate>
    <dc:creator>AngusLooney</dc:creator>
    <dc:date>2019-05-14T14:50:59Z</dc:date>
    <item>
      <title>Dealing with text in "newspaper archive formats"</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Dealing-with-text-in-quot-newspaper-archive-formats-quot/m-p/558360#M10019</link>
      <description>&lt;P&gt;Has anyone had to process scanned/OCRed text in any of the following formats?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL type="disc" style="margin-top: 0cm;"&gt;
&lt;LI style="margin: 0px; color: #000000; font-family: 'Calibri',sans-serif; font-size: 11pt; font-style: normal; font-weight: normal;"&gt;&lt;SPAN style="margin: 0px;"&gt;GALEN XML - &lt;/SPAN&gt;&lt;SPAN style="margin: 0px;"&gt;&lt;A href="https://github.com/alan-turing-institute/i_newspaper_rods/blob/epcc-master/newsrods/test/fixtures/2000_04_24.xml" target="_blank" rel="noopener"&gt;https://github.com/alan-turing-institute/i_newspaper_rods/blob/epcc-master/newsrods/test/fixtures/2000_04_24.xml&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL type="disc" style="margin-top: 0cm;"&gt;
&lt;LI style="margin: 0px; color: #000000; font-family: 'Calibri',sans-serif; font-size: 11pt; font-style: normal; font-weight: normal;"&gt;&lt;SPAN style="margin: 0px;"&gt;METS and ALTO format &lt;/SPAN&gt;&lt;SPAN style="margin: 0px;"&gt;→&lt;/SPAN&gt;&lt;SPAN style="margin: 0px;"&gt; &lt;A href="https://es.wikipedia.org/wiki/ALTO_(XML)" target="_blank" rel="noopener"&gt;https://en.wikipedia.org/wiki/ALTO_(XML)&lt;/A&gt; and &lt;A href="https://veridiansoftware.com/knowledge-base/metsalto/" target="_blank" rel="noopener"&gt;https://veridiansoftware.com/knowledge-base/metsalto/&lt;/A&gt; . &lt;/SPAN&gt;&lt;SPAN style="margin: 0px;"&gt;Here is an example: &lt;A href="https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml" target="_blank" rel="noopener"&gt;https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml&lt;/A&gt; .&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN style="margin: 0px;"&gt;These are "digital archive" formats, designed to capture and preserve historic documents, with more emphasis on maintaining the "look" of the documents than on their content.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="margin: 0px;"&gt;I envisage having to parse large corpora of these kinds of documents and process them into many discrete, paragraph-sized fragments, each tagged with metadata about the file it came from and with the tagging from the input files themselves.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="margin: 0px;"&gt;From the looks of things, these are not nicely behaved bodies of text that will extract neatly, unlike, say, emails or business reports.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="margin: 0px;"&gt;Any experience or insights would be gratefully received. Thanks in advance.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 14 May 2019 14:50:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Dealing-with-text-in-quot-newspaper-archive-formats-quot/m-p/558360#M10019</guid>
      <dc:creator>AngusLooney</dc:creator>
      <dc:date>2019-05-14T14:50:59Z</dc:date>
    </item>
  </channel>
</rss>

