The UIMA Analysis Engine interface provides support for developing and integrating algorithms that analyze unstructured data.
The Collection Processing Architecture defines additional components for reading raw data formats from data collections, preparing the data for processing by Analysis Engines, executing the analysis, extracting analysis results, and deploying the overall flow in a variety of local and distributed configurations.
The functionality defined in the Collection Processing Architecture is implemented by a Collection Processing Engine (CPE).
A CPE includes an Analysis Engine and adds a Collection Reader, a CAS Initializer (deprecated as of version 2), and CAS Consumers. The part of the UIMA Framework that supports the execution of CPEs is called the Collection Processing Manager, or CPM.
Collection Reader – interfaces to a collection of data items (e.g., documents) to be analyzed. Collection Readers return CASes that contain the documents to analyze, possibly along with additional metadata.
CAS Initializer – prepares an individual data item for analysis and loads it into the CAS.
CAS Consumer – consumes the enriched CAS produced by the sequence of Analysis Engines before it, and produces an application-specific data structure, such as a search engine index or database.
Analysis Engines and CAS Consumers are both instances of CAS Processors. A CPE may contain multiple CAS Processors. An Analysis Engine contained in a CPE may itself be a Primitive or an Aggregate (composed of other Analysis Engines), and an Aggregate may contain CAS Consumers. While Collection Readers and CAS Initializers always run in the same JVM as the CPM, a CAS Processor may be deployed in a variety of local and distributed modes, providing a number of options for scalability and robustness.
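The data flow described above can be pictured with a small, self-contained Java sketch. It is a toy model only: ToyCas, ToyCasProcessor, CapitalizedWordAnnotator, and CountingConsumer are hypothetical names invented for illustration and are not the UIMA interfaces. It assumes a CAS can be reduced to a document text plus a list of annotations, and that the CPM simply passes each CAS through the processors in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the CPE data flow; these types are NOT the UIMA interfaces. */
class CpeFlowSketch {

    /** Stand-in for a CAS: the document text plus annotations added during analysis. */
    static final class ToyCas {
        final String documentText;
        final List<String> annotations = new ArrayList<>();
        ToyCas(String documentText) { this.documentText = documentText; }
    }

    /** Analysis Engines and CAS Consumers are both CAS Processors. */
    interface ToyCasProcessor {
        void process(ToyCas cas);
    }

    /** Hypothetical Analysis Engine: annotates every capitalized word. */
    static final class CapitalizedWordAnnotator implements ToyCasProcessor {
        @Override public void process(ToyCas cas) {
            for (String token : cas.documentText.split("\\s+")) {
                if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                    cas.annotations.add(token);
                }
            }
        }
    }

    /** Hypothetical CAS Consumer: builds an index from annotation to occurrence count. */
    static final class CountingConsumer implements ToyCasProcessor {
        final Map<String, Integer> index = new TreeMap<>();
        @Override public void process(ToyCas cas) {
            for (String annotation : cas.annotations) {
                index.merge(annotation, 1, Integer::sum);
            }
        }
    }

    /** Plays the role of the CPM: pulls each CAS from the reader and runs every processor in order. */
    static void run(Iterator<ToyCas> collectionReader, List<ToyCasProcessor> processors) {
        while (collectionReader.hasNext()) {
            ToyCas cas = collectionReader.next();
            for (ToyCasProcessor processor : processors) {
                processor.process(cas);
            }
        }
    }

    /** Wraps the documents in CASes, runs the pipeline, and returns the consumer's index. */
    static Map<String, Integer> indexDocuments(List<String> documents) {
        List<ToyCas> cases = new ArrayList<>();
        for (String document : documents) {
            cases.add(new ToyCas(document));
        }
        CountingConsumer consumer = new CountingConsumer();
        run(cases.iterator(),
            Arrays.<ToyCasProcessor>asList(new CapitalizedWordAnnotator(), consumer));
        return consumer.index;
    }

    public static void main(String[] args) {
        System.out.println(indexDocuments(Arrays.asList("Meeting in Room A", "Room B is free")));
    }
}
```

In this sketch the consumer sits last in the processor list, mirroring the description of a CAS Consumer as working on the CAS enriched by the Analysis Engines before it.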
Deploy:
There are three deployment modes for CAS Processors (Analysis Engines and CAS Consumers) in a CPE:
Integrated (runs in the same Java instance as the CPM)
Managed (runs in a separate process on the same machine), and
Non-managed (runs in a separate process, perhaps on a different machine).
For both managed and non-managed CAS Processors, the CAS must be transmitted between separate processes and possibly between separate computers. This is accomplished using Vinci, a communication protocol that is used by the CPM and provided as part of Apache UIMA. Vinci handles service naming and location as well as data transport. Service naming and location are provided by a Vinci Naming Service, or VNS. For managed CAS Processors, the CPE uses its own internal VNS. For non-managed CAS Processors, a separate VNS must be running.
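The differences between the three deployment modes can be summarized in a compact form. The enum below is an illustrative sketch, not a UIMA type; the name DeploymentModeSketch and its accessor names are assumptions introduced here, and the values simply restate the properties given in the text.

```java
/** Summary of the three CAS Processor deployment modes; an illustrative enum, not a UIMA type. */
enum DeploymentModeSketch {
    // separateProcess, mayRunOnAnotherMachine, namingService
    INTEGRATED(false, false, "none"),
    MANAGED(true, false, "internal VNS of the CPE"),
    NON_MANAGED(true, true, "separately running VNS");

    private final boolean separateProcess;
    private final boolean mayRunOnAnotherMachine;
    private final String namingService;

    DeploymentModeSketch(boolean separateProcess, boolean mayRunOnAnotherMachine,
                         String namingService) {
        this.separateProcess = separateProcess;
        this.mayRunOnAnotherMachine = mayRunOnAnotherMachine;
        this.namingService = namingService;
    }

    /** The CAS must travel over Vinci whenever the processor runs outside the CPM's process. */
    boolean usesVinci() { return separateProcess; }

    /** Only non-managed processors may be located on a different machine. */
    boolean mayRunOnAnotherMachine() { return mayRunOnAnotherMachine; }

    /** Which Vinci Naming Service, if any, locates the processor. */
    String namingService() { return namingService; }

    public static void main(String[] args) {
        for (DeploymentModeSketch mode : values()) {
            System.out.println(mode + ": usesVinci=" + mode.usesVinci()
                + ", remote=" + mode.mayRunOnAnotherMachine()
                + ", vns=" + mode.namingService());
        }
    }
}
```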
Running the non-managed example requires some additional steps:
Start a VNS service by running the startVNS script in the /bin directory, or by using the Eclipse launcher “UIMA Start VNS”.
Deploy the Meeting Detector Analysis Engine as a Vinci service by running the startVinciService script in the /bin directory and passing it the location of the descriptor to deploy, in this case %UIMA_HOME%/examples/deploy/vinci/Deploy_MeetingDetectorTAE.xml. Alternatively, if you are using Eclipse and have the uimaj-examples project in your workspace, use the Eclipse menu Run → Run... and pick the launch configuration “UIMA Start Vinci Service”.
Now run the runCPE script (or, in Eclipse, run the launch configuration “UIMA Run CPE”), passing it the CPE descriptor for the non-managed version (%UIMA_HOME%/examples/descriptors/collection_processing_engine/MeetingFinderCPE_NonManaged.xml).
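The order of the three steps matters: the VNS must be up before the service registers with it, and the service must be deployed before the CPE looks it up. As a rough sketch, the command lines could be assembled as follows. The class NonManagedRunSketch, the method commands, and the scriptSuffix parameter are hypothetical; the sketch assumes that the scripts live under the bin directory of the UIMA installation and that a platform-specific suffix such as ".sh" or ".bat" is appended to the script names. It only builds the command lines and does not launch anything.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Assembles the command lines for the non-managed example; it does not start any process. */
class NonManagedRunSketch {

    /**
     * Returns the three command lines in the order in which they must be started.
     * uimaHome is the installation directory; scriptSuffix is an assumed
     * platform-specific script extension (for example ".sh" or ".bat").
     */
    static List<List<String>> commands(String uimaHome, String scriptSuffix) {
        String bin = uimaHome + "/bin/";
        List<List<String>> steps = new ArrayList<>();

        // Step 1: the Vinci Naming Service has to be running first.
        steps.add(Arrays.asList(bin + "startVNS" + scriptSuffix));

        // Step 2: deploy the Meeting Detector Analysis Engine as a Vinci service.
        steps.add(Arrays.asList(
            bin + "startVinciService" + scriptSuffix,
            uimaHome + "/examples/deploy/vinci/Deploy_MeetingDetectorTAE.xml"));

        // Step 3: run the CPE that refers to the non-managed service.
        steps.add(Arrays.asList(
            bin + "runCPE" + scriptSuffix,
            uimaHome + "/examples/descriptors/collection_processing_engine/"
                + "MeetingFinderCPE_NonManaged.xml"));

        return steps;
    }

    public static void main(String[] args) {
        // "/opt/uima" is a placeholder installation path.
        for (List<String> command : commands("/opt/uima", ".sh")) {
            System.out.println(String.join(" ", command));
        }
    }
}
```

The first two commands start long-running services, so in practice each would be run in its own terminal (or as a background process) and left running while the CPE executes.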