The “Ellip Workflows” solution provides a Platform-as-a-Service environment, based on the Hadoop framework, to integrate and test scalable processing chains, with control of code, parameters, and data flows. It embeds standard APIs to stage data, run jobs, and package applications.
The online developer documentation is available here:
To integrate your application with this solution, you need skills in software library integration and dependency management, and you will work from within a Linux CentOS virtual machine.
See https://github.com/ec-coresyf/ for two examples of Research Applications integrated by Co-ReSyF partners via this solution:
The first step is to picture the application workflow as a data pipeline; this helps you optimize the parallelization strategy and move data between processing steps. In this application workflow, the processing steps are the nodes of a Directed Acyclic Graph (DAG). A directed graph can represent a succession of processing elements: data enters a processing element through its incoming edges and leaves it through its outgoing edges.
The processing elements can be executed as several tasks (providing parallelization by input split) or as a single processing task (aggregation of individual results). All the orchestration of the data flow is handled by the Hadoop framework. You only describe processing elements with your code.
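As an illustration of this model, the sketch below represents a processing chain as a DAG and executes its elements in topological order. The node names, the split/aggregate annotations, and the toy `run_element` function are assumptions for illustration, not the platform's actual API; on the platform itself, Hadoop performs this orchestration for you.

```python
# Minimal sketch of a processing chain as a Directed Acyclic Graph.
from graphlib import TopologicalSorter

# Each processing element lists the elements whose outputs it consumes.
dag = {
    "ingest":    [],                  # stages the input products
    "calibrate": ["ingest"],          # could run as several parallel tasks
    "mosaic":    ["calibrate"],       # single task aggregating the results
    "publish":   ["mosaic"],
}

def run_element(name, inputs):
    """Stand-in for a processing step: records which inputs it consumed."""
    return f"{name}({', '.join(inputs)})"

# Execute every element after all of its predecessors.
results = {}
for node in TopologicalSorter(dag).static_order():
    inputs = [results[dep] for dep in dag[node]]
    results[node] = run_element(node, inputs)

print(results["publish"])  # → publish(mosaic(calibrate(ingest())))
```

The dictionary maps each node to its dependencies; `TopologicalSorter.static_order()` guarantees that a node runs only once its inputs exist, which is exactly the ordering constraint the DAG expresses.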
At the design stage, you will carefully think about how to structure the workflow by answering questions such as:
- How many nodes do I need?
- Can the node execution be split into several tasks?
- What will each node read as inputs?
- What will each node write as outputs?
- What parameters does each node need?
- Is my workflow optimized in terms of I/O?
At the implementation phase, you will write the code invoked by your workflow steps (processing elements). Each task defined by a processing element triggers the execution of a run executable file containing the logic of that processing step.
The overall workflow of the processing chain being integrated is defined by an application descriptor. It lets you define, in a structured way, the steps your application needs in order to:
- Provide a service interface (based on the OGC WPS standard);
- Discover and download input data (based on the OGC OpenSearch Geo & Time extensions standard);
- Process data, producing intermediate and final results;
- Trigger remote web processing services;
- Publish result files.
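The actual descriptor format is defined by the platform and documented in the developer documentation; purely as an illustration of the capabilities listed above, the hypothetical structure below mirrors them (a WPS-facing service interface, an OpenSearch data query, processing nodes, and result publication) and checks that the nodes reference each other consistently. Every field name and the endpoint URL are invented for this sketch.

```python
# Hypothetical, illustrative structure only -- not the platform's real
# descriptor format. It mirrors the four capabilities listed above.
descriptor = {
    "service": {                      # OGC WPS-facing interface
        "identifier": "my-app",       # invented identifier
        "parameters": [{"id": "startdate"}, {"id": "enddate"}],
    },
    "data": {                         # OGC OpenSearch Geo & Time query
        "catalogue": "https://example.org/search",  # hypothetical endpoint
    },
    "workflow": [                     # processing nodes and their sources
        {"id": "node_extract", "sources": ["data"]},
        {"id": "node_aggregate", "sources": ["node_extract"]},
    ],
    "publish": {"results_from": "node_aggregate"},  # result files
}

# Basic consistency check: every source a node cites must exist.
known = {"data"} | {n["id"] for n in descriptor["workflow"]}
for node in descriptor["workflow"]:
    assert all(src in known for src in node["sources"]), node["id"]
print("descriptor is internally consistent")
```

Whatever the concrete syntax, the key idea is the same: the descriptor names the steps, wires their inputs and outputs, and exposes the whole chain as a service.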
The Co-ReSyF project welcomes new partnerships to collaborate on additional research applications.
You can contact us here.
Learn more about Integrating Research Applications in Co-ReSyF