Text Processing

Quality at Every Level

At Language Computer, we've learned that high-quality text processing is the key to creating robust natural language processing applications.

NLP applications -- such as customizable information extraction or question answering systems -- often rely on a "pipeline" of text processing components to extract value from text. In such a pipeline, the performance of each successive component depends on the performance of every component that precedes it. As a result, errors made by an "upstream" component (such as a part-of-speech tagger) can negatively impact the performance of every "downstream" component (such as a named entity recognizer or coreference resolution system).
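
The sketch below illustrates this pipeline idea in Python: each component reads the document (and whatever annotations upstream components have already added) and writes its own annotations back for downstream components to use. The component names and interfaces here are assumptions for illustration only, not Language Computer's actual software.

```python
# A minimal sketch of a sequential text processing pipeline; the component
# names and interfaces are hypothetical, not LCC's actual implementation.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """A document plus the annotations added by each pipeline stage."""
    text: str
    annotations: Dict[str, object] = field(default_factory=dict)


# Each component reads the document (and any upstream annotations) and
# writes its own annotations back onto the document.
Component = Callable[[Document], Document]


def run_pipeline(doc: Document, components: List[Component]) -> Document:
    """Run components in order; each downstream stage sees upstream output."""
    for component in components:
        doc = component(doc)
    return doc


def pos_tagger(doc: Document) -> Document:
    # An upstream stage: a tagging mistake made here propagates downstream.
    doc.annotations["pos_tags"] = [(tok, "NN") for tok in doc.text.split()]
    return doc


def named_entity_recognizer(doc: Document) -> Document:
    # A downstream stage that depends on the part-of-speech tags above.
    tags = doc.annotations.get("pos_tags", [])
    doc.annotations["entities"] = [tok for tok, tag in tags if tag.startswith("NN")]
    return doc


if __name__ == "__main__":
    result = run_pipeline(Document("Language Computer is based in Texas"),
                          [pos_tagger, named_entity_recognizer])
    print(result.annotations["entities"])
```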

In order to guarantee performance for each of our end-to-end applications, we employ dedicated error tracking systems which can estimate the number of errors made by each component in our text processing pipeline -- and take steps to minimize their impact on each "downstream" component. This type of error tracking allows all of Language Computer's applications to provide relatively consistent levels of performance, regardless of the type, domain, or genre of the documents you need to process.
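
As a rough sketch of how such error tracking might work, the example below records an estimated error rate for each stage (for instance, measured on held-out evaluation data), compounds them into an expected end-to-end accuracy, and flags the stages whose downstream impact most needs to be minimized. The interfaces and numbers are hypothetical and do not describe LCC's actual error tracking system.

```python
# A minimal, hypothetical sketch of per-component error tracking; it is
# illustrative only and not LCC's actual error tracking system.
from dataclasses import dataclass
from typing import List


@dataclass
class StageReport:
    name: str
    estimated_error_rate: float  # e.g., measured on held-out evaluation data


class ErrorTracker:
    """Accumulate per-stage error estimates and flag risky pipeline stages."""

    def __init__(self, threshold: float = 0.15) -> None:
        self.threshold = threshold
        self.reports: List[StageReport] = []

    def record(self, name: str, estimated_error_rate: float) -> None:
        self.reports.append(StageReport(name, estimated_error_rate))

    def expected_accuracy(self) -> float:
        # Rough compounding estimate: downstream quality degrades with every
        # upstream error, so multiply per-stage accuracies together.
        accuracy = 1.0
        for report in self.reports:
            accuracy *= 1.0 - report.estimated_error_rate
        return accuracy

    def risky_stages(self) -> List[str]:
        # Stages whose estimated error rate exceeds the threshold are the
        # ones whose impact on downstream components should be minimized.
        return [r.name for r in self.reports
                if r.estimated_error_rate > self.threshold]


if __name__ == "__main__":
    tracker = ErrorTracker()
    tracker.record("pos_tagger", 0.03)
    tracker.record("named_entity_recognizer", 0.08)
    tracker.record("coreference_resolver", 0.20)
    print(f"expected end-to-end accuracy: {tracker.expected_accuracy():.2f}")
    print("stages to improve first:", tracker.risky_stages())
```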

LCC takes the performance of its core text processing systems very seriously. Since errors in core components can lead to a significant degradation in overall performance, we spend considerable time and effort testing and evaluating each of our text annotation components in each of the languages we support.

Supporting Technologies

Language Computer provides a wide range of text annotation components, including the following (illustrated in the sketch after this list):

  • Sentence Segmentation: Sentence Segmentation systems are responsible for breaking up documents (whether they be newswire documents, e-mail messages, chat logs, or blog posts) into sentences (or sentence-like objects) which can be processed and annotated by "downstream" components.
  • Tokenization: Tokenization systems break sentences into sequences of word-like tokens which represent the smallest units of linguistic meaning considered by a natural language processing system.
  • Named Entity Recognition: Named Entity Recognition systems categorize phrases (referred to as entities) found in text with respect to a potentially large number of semantic categories, such as person, organization, or geopolitical location.
  • Coreference Resolution: Coreference Resolution systems identify the linguistic expressions which make reference to the same entity or individual within a single document -- or across a collection of documents.
  • Event Timestamping: Event Timestamping systems associate each mention of an event in a collection of documents with an exact date, time, and duration.
  • Geocoding: Geocoding systems associate each mention of a location in a collection of documents with its exact geospatial coordinates.
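
To make the output of these components concrete, the sketch below shows one way the annotations they produce might be represented. The class and field names are illustrative assumptions, not LCC's actual data structures.

```python
# A minimal, hypothetical data model for the annotations produced by the
# components listed above; names and fields are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Token:
    text: str
    start: int   # character offset where the token begins
    end: int     # character offset where the token ends


@dataclass
class Sentence:
    start: int
    end: int
    tokens: List[Token] = field(default_factory=list)


@dataclass
class EntityMention:
    text: str
    start: int
    end: int
    label: str                                            # e.g., PERSON, ORGANIZATION, GPE
    coref_chain: Optional[int] = None                     # id shared by coreferent mentions
    geocoordinates: Optional[Tuple[float, float]] = None  # (latitude, longitude) for locations


@dataclass
class EventMention:
    text: str
    start: int
    end: int
    timestamp: Optional[str] = None                       # e.g., an ISO-8601 date or interval


# Example: a PERSON mention that belongs to coreference chain 0.
mention = EntityMention("Barack Obama", 0, 12, "PERSON", coref_chain=0)
```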

Related Products

CiceroLite

High-performance named entity recognition for English, Arabic, Chinese, Farsi, and Korean texts.

CiceroCoref

Accurate pronominal and nominal coreference resolution for English.

PinPoint

Temporal and spatial awareness for absolute or relative mentions of times, dates, or locations.


For More Information

For more information on how Language Computer can help your organization meet its text processing and annotation needs, contact us at (972) 231-0052 or e-mail us today.