Digitization of Newspapers for Posterity

Posted by Managed Outsource
Apr 1, 2016
Newspapers are prime candidates for document scanning and conversion, mainly because they contain important information that is hard to access in print. Local newspapers are authoritative sources of news, government and community information, and birth, marriage and death notices spanning decades. Digitizing them produces a searchable digital edition of a community’s history and widens access to valuable information about the locality: digitized newspapers can be used by multiple readers at once and are keyword searchable.

Digitizing Herald and Merritt News

The Nicola Valley Museum and Archive, the TNRD library, TRU and the Kamloops Museum and Archives are digitizing old copies of the Herald and Merritt News in a project called the Newshound Digitization Project. Their objective is to digitize every newspaper that has ever been published within the area the TNRD library serves. About 100,000 pages have been digitized, and the images have been converted from various shades of brown to black and white. The team has also photographed all the pages of the newspapers. TNRD chair John Ranta said that the newspapers were initially stored in sheds and on library shelves, where they were damaged over time, but that with this digitization project millions of people will be able to access them easily.

What does newspaper digitization involve?

The Process of Digitization

The process of digitization involves the following steps:

  • Scanning the original paper or microfilm copy
  • Generating web images and master files
  • Assigning metadata to each issue, page and article to improve search results
  • Running OCR software on the high-resolution images to extract the text
  • Importing the OCR text, images and metadata into the digital library software program

Scanning Microfilms

Microfilms of newspapers are scanned in batches at 300-400 dpi, depending on the quality of the source. Each microfilm roll consists of nearly 700 frames, or images. The images are then cropped and deskewed, to around 34MB each, and the image formats are decided.
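
As a rough sizing aid, the figures above can be turned into a per-roll storage estimate. This is only a sketch: it assumes every frame ends up at about 34MB and every roll holds about 700 frames, and the function name and the decimal convention (1 GB = 1,000 MB) are choices made for illustration.

```python
FRAMES_PER_ROLL = 700   # approximate frames per microfilm roll, from the text
MB_PER_FRAME = 34       # approximate size of one processed image, from the text


def roll_storage_gb(rolls, frames_per_roll=FRAMES_PER_ROLL, mb_per_frame=MB_PER_FRAME):
    """Estimate storage in GB (1 GB = 1000 MB) for a number of scanned rolls."""
    return rolls * frames_per_roll * mb_per_frame / 1000


if __name__ == "__main__":
    for rolls in (1, 10, 100):
        print(f"{rolls:>3} roll(s): about {roll_storage_gb(rolls):,.1f} GB")
```

Under these assumptions a single roll comes to about 23.8 GB, so storage for master files needs to be planned before scanning starts.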

Metadata

Each page image would have metadata that includes:

  • Publication date
  • Publication title
  • Volume and issue number
  • Page number


Each page image is segmented into articles, and each article’s metadata includes:


  • Headline
  • Byline
  • Classification
  • Whether the article is a lead story

Metadata improves search accuracy and lets searchers go straight to particular sections of the paper, which is why it is an important part of the process.
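
The page and article fields listed above could be modelled along the following lines. This is a minimal sketch, not a description of any particular digital library program: the class names, field names and the `find_articles` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple


@dataclass
class ArticleMetadata:
    headline: str
    byline: str
    classification: str          # e.g. "news", "sports", "notices"
    is_lead_story: bool = False


@dataclass
class PageMetadata:
    publication_title: str
    publication_date: date
    volume: int
    issue: int
    page_number: int
    articles: List[ArticleMetadata] = field(default_factory=list)


def find_articles(pages: List[PageMetadata], classification: str) -> List[Tuple[PageMetadata, ArticleMetadata]]:
    """Return (page, article) pairs whose classification matches, ignoring case."""
    wanted = classification.lower()
    return [
        (page, article)
        for page in pages
        for article in page.articles
        if article.classification.lower() == wanted
    ]
```

With records like these, a search for a section such as "notices" only has to compare one field per article instead of scanning the full OCR text.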

OCR Processing

The accuracy of OCR depends on many factors, ranging from the quality of the source image to the complexity of the page layout. Newspaper type is usually small, so a high image resolution is required for optimal OCR performance. For historic newspapers, choose OCR software that handles these conditions well.
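
To see why small type calls for a high resolution, it helps to work out how many pixels a character gets. The sketch below uses the standard definition of a typographic point (1/72 inch); the 7-point font size is an assumed example rather than a figure from any specific newspaper, and the resolution is taken relative to the original page size.

```python
POINTS_PER_INCH = 72  # 1 typographic point = 1/72 inch


def em_height_px(font_size_pt, dpi):
    """Approximate pixel height of a font's em box at a given scan resolution."""
    return font_size_pt / POINTS_PER_INCH * dpi


if __name__ == "__main__":
    for dpi in (200, 300, 400):
        print(f"7 pt type at {dpi} dpi: about {em_height_px(7, dpi):.0f} px tall")
```

At 300 dpi a 7-point em box is roughly 29 pixels tall, and lowercase letters occupy only part of that height, so dropping the resolution quickly leaves too few pixels per character for reliable recognition.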
 
XML Conversion

The scanned data is converted into XML or SGML format according to the Document Type Definition (DTD) supplied by the client. First, the text of the source document is converted into ASCII format using OCR. This is followed by analysis of the text content, insertion of XML/SGML tags, and validation of the tagged data with a parser in appropriate application software.
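
The tagging and parsing steps might look something like this in Python's standard library. The element names (`article`, `headline`, `byline`, `body`) are hypothetical; in practice they come from the client's DTD. Note also that `xml.etree.ElementTree` only checks that the XML is well formed. It does not validate against a DTD, so a validating parser would still be needed for that step.

```python
import xml.etree.ElementTree as ET


def article_to_xml(headline, byline, body_text):
    """Wrap OCR text in XML tags; special characters such as & are escaped."""
    article = ET.Element("article")
    ET.SubElement(article, "headline").text = headline
    ET.SubElement(article, "byline").text = byline
    ET.SubElement(article, "body").text = body_text
    return ET.tostring(article, encoding="unicode")


def is_well_formed(xml_text):
    """Return True if the text parses as XML (well-formedness only, no DTD check)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

Building the tree through the library rather than by string concatenation means characters that OCR commonly produces, such as `&` or `<`, cannot break the output.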
 
Finally, the digital objects consisting of the images, OCR text and metadata are assembled. Two items are sent to the client:

  • Objects for each issue of the newspaper, consisting of searchable PDF image files that record the location of each recognized word and the metadata gathered during processing
  • Archival 4-bit greyscale TIFF images
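
Recording the location of each word is what lets a viewer highlight search hits on the page image. The sketch below shows the idea with a plain list of word records; the `OcrWord` fields and the pixel coordinates are hypothetical and do not reflect how any particular PDF or OCR tool stores its data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrWord:
    text: str
    page: int
    x: int        # left edge of the word box, in pixels
    y: int        # top edge of the word box, in pixels
    width: int
    height: int


def find_word(words: List[OcrWord], keyword: str) -> List[OcrWord]:
    """Return every word box matching the keyword, ignoring case and edge punctuation."""
    needle = keyword.casefold()
    return [w for w in words if w.text.strip(".,;:!?\"'").casefold() == needle]
```

Each match carries its page number and box, so the software can jump to the right page and draw a highlight over the word.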

Newspaper digitization is a challenging process and is best performed by an experienced vendor. The ideal partner is a reliable document scanning company that can handle large-volume digitization without damaging the original newspapers. Ask the vendor for examples of successful projects so that you know what service standards to expect.
