DataStage Adapter
The DataStage proxy connects an IBM DataStage environment (via Information Governance Catalog) to an Egeria Data Engine OMAS.
Ensure you have already completed the Setup steps before proceeding.
5. Configure the connector
a. Configure the OMAS
Egeria's Open Metadata Access Services (OMASs) provide sets of typically coarse-grained services through which specific consumers or producers of metadata can integrate directly. Because the DataStage connector is implemented as a data engine proxy, it integrates through Egeria's Data Engine OMAS. This allows the integration to submit a large number of objects together in a single request, whereas a repository connector's finer-grained services would typically require many separate requests.
i. Set OMAS repository
curl -k -X POST "https://localhost:9443/open-metadata/admin-services/users/admin/servers/omas_server/local-repository/mode/local-graph-repository"
Detailed explanation
OMASs need to be configured with a local repository. In this example, we configure the server on which we will enable the Data Engine OMAS to use Egeria's built-in graph repository.
Response from OMAS repository configuration
{"class":"VoidResponse","relatedHTTPCode":200}
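Every admin call in this guide shares the same URL prefix, so when scripting the steps it can help to build the URL in one place. The following is a minimal sketch only: the admin_url function and the EGERIA_PLATFORM and EGERIA_ADMIN_USER variable names are illustrative assumptions, not part of Egeria.

```shell
#!/bin/sh
# Hypothetical helper (not part of Egeria): builds an admin-services URL.
EGERIA_PLATFORM="${EGERIA_PLATFORM:-https://localhost:9443}"   # platform root URL used in this guide
EGERIA_ADMIN_USER="${EGERIA_ADMIN_USER:-admin}"                # admin user used in this guide

admin_url() {
  # $1 = server name, $2 = path below the server (for example "instance")
  printf '%s/open-metadata/admin-services/users/%s/servers/%s/%s\n' \
    "$EGERIA_PLATFORM" "$EGERIA_ADMIN_USER" "$1" "$2"
}

# Print the command for this step rather than running it, so it can be reviewed first.
echo "curl -k -X POST \"$(admin_url omas_server local-repository/mode/local-graph-repository)\""
```

The same helper can produce the URLs for the later steps by changing the server name and path.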
ii. Configure the event bus
curl -k -X POST -H "Content-Type: application/json" \
--data '{"producer":{"bootstrap.servers":"localhost:9092"},"consumer":{"bootstrap.servers":"localhost:9092"}}' \
"https://localhost:9443/open-metadata/admin-services/users/admin/servers/omas_server/event-bus?connectorProvider=org.odpi.openmetadata.adapters.eventbus.topic.kafka.KafkaOpenMetadataTopicProvider"
Detailed explanation
The event bus is how Egeria coordinates communication amongst its various servers and repositories: for example, ensuring that any new type definitions are registered with each repository capable of handling them, notifying other repositories when the metadata in one repository changes, etc.
The URL parameter connectorProvider defines the type of event bus to use (in this case Apache Kafka).
The JSON payload gives details about how to connect to Apache Kafka, in this case assuming it is running on the local machine (localhost) on its default port (9092).
Response from event bus configuration
{"class":"VoidResponse","relatedHTTPCode":200}
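If Kafka is not running on localhost:9092, only the two bootstrap.servers values in the payload need to change. As a hedged sketch, the payload can be generated from a single variable; the KAFKA_BOOTSTRAP and EVENT_BUS_PAYLOAD names are assumptions made for this example.

```shell
#!/bin/sh
# Assumed variable name; the default matches the broker address used in this guide.
KAFKA_BOOTSTRAP="${KAFKA_BOOTSTRAP:-localhost:9092}"

# The same address is used for both the producer and the consumer, as in the call above.
EVENT_BUS_PAYLOAD=$(printf '{"producer":{"bootstrap.servers":"%s"},"consumer":{"bootstrap.servers":"%s"}}' \
  "$KAFKA_BOOTSTRAP" "$KAFKA_BOOTSTRAP")

echo "$EVENT_BUS_PAYLOAD"
```

The resulting string can then be passed to the curl command above with --data "$EVENT_BUS_PAYLOAD".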
iii. Enable the OMASs
curl -k -X POST "https://localhost:9443/open-metadata/admin-services/users/admin/servers/omas_server/access-services?serviceMode=ENABLED"
Detailed explanation
This call enables the access services that the omas_server should run: for simplicity here we are enabling all of the access services.
This ensures that the Data Engine OMAS will be running and available for the DataStage connector to communicate any metadata through.
Response from enabling the OMASs
{"class":"VoidResponse","relatedHTTPCode":200}
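Each configuration call so far returns a response containing relatedHTTPCode, so a script can stop as soon as a step fails. The following is a rough sketch under the assumption that simple text matching on the response is good enough; response_ok is a hypothetical helper name, and a JSON-aware tool would be more robust.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only when an Egeria admin response reports HTTP code 200.
# It uses plain text matching, so it works on the compact or pretty-printed JSON shown in this guide.
response_ok() {
  printf '%s\n' "$1" | grep -Eq '"relatedHTTPCode"[[:space:]]*:[[:space:]]*200([^0-9]|$)'
}

# Example using the response shown above.
if response_ok '{"class":"VoidResponse","relatedHTTPCode":200}'; then
  echo "step succeeded"
else
  echo "step failed" >&2
fi
```

In a real script the argument would be the captured output of the curl call, for example response_ok "$(curl -k -s -X POST "$URL")", where URL is a variable you define.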
b. Configure DataStage connector
i. Set repository
curl -k -X POST "https://localhost:9443/open-metadata/admin-services/users/admin/servers/datastage_proxy/local-repository/mode/in-memory-repository"
Detailed explanation
As with any other server, we configure the persistence (if any) of the DataStage connector itself.
In this example we configure the in-memory-repository, so the DataStage connector will not in fact have any persistence. For this connector that is fine, since persistence is handled through the Data Engine OMAS rather than through the connector itself.
Response from connector repository configuration
{"class":"VoidResponse","relatedHTTPCode":200}
ii. Configure connector
curl -k -X POST -H "Content-Type: application/json" \
--data '{"class":"DataEngineProxyConfig","accessServiceRootURL":"https://localhost:9443","accessServiceServerName":"omas_server","dataEngineConnection":{"class":"Connection","connectorType":{"class":"ConnectorType","connectorProviderClassName":"org.odpi.egeria.connectors.ibm.datastage.dataengineconnector.DataStageConnectorProvider"},"endpoint":{"class":"Endpoint","address":"infosvr:9446","protocol":"https"},"userId":"isadmin","clearPassword":"isadmin"},"pollIntervalInSeconds":60}' \
"https://localhost:9443/open-metadata/admin-services/users/admin/servers/datastage_proxy/data-engine-proxy-service/configuration"
Detailed explanation
The JSON payload defines how the connector itself should be configured: specifically, which Java class should be used. Here the payload refers to DataStageConnectorProvider, which tells the proxy to use this class (specific to the DataStage connector) to configure its connectivity to DataStage as a data engine.
Response from connector configuration
{"class":"VoidResponse","relatedHTTPCode":200}
Be sure to replace hostname and credential information
If copying and pasting the command above, be sure to replace the hostname and credential information with the appropriate settings for your own environment before running it.
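One way to avoid editing the long payload by hand is to substitute the environment-specific values from variables. This is a minimal sketch, not a definitive approach: the IGC_HOST, IGC_PORT, IGC_USER, IGC_PASSWORD and PROXY_CONFIG names are assumptions for this example, and their defaults are simply the sample values from the command above.

```shell
#!/bin/sh
# Assumed variable names; the defaults are the sample values from the command above
# and should be overridden for a real environment.
IGC_HOST="${IGC_HOST:-infosvr}"
IGC_PORT="${IGC_PORT:-9446}"
IGC_USER="${IGC_USER:-isadmin}"
IGC_PASSWORD="${IGC_PASSWORD:-isadmin}"

# Build the same DataEngineProxyConfig payload, substituting the endpoint and credentials.
# Note: values containing double quotes or backslashes would need JSON escaping first.
PROXY_CONFIG=$(cat <<EOF
{"class":"DataEngineProxyConfig","accessServiceRootURL":"https://localhost:9443","accessServiceServerName":"omas_server","dataEngineConnection":{"class":"Connection","connectorType":{"class":"ConnectorType","connectorProviderClassName":"org.odpi.egeria.connectors.ibm.datastage.dataengineconnector.DataStageConnectorProvider"},"endpoint":{"class":"Endpoint","address":"${IGC_HOST}:${IGC_PORT}","protocol":"https"},"userId":"${IGC_USER}","clearPassword":"${IGC_PASSWORD}"},"pollIntervalInSeconds":60}
EOF
)

# Confirm the non-sensitive settings without printing the password.
echo "Configured endpoint ${IGC_HOST}:${IGC_PORT} for user ${IGC_USER}"
```

The payload can then be sent with --data "$PROXY_CONFIG" in place of the literal JSON in the command above.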
Required user permissions
To operate, the Information Server user credentials must have (at a minimum) the following roles:
- Suite User
- Information Governance Catalog User
- Information Governance Catalog Glossary Author
- Information Governance Catalog Asset Administrator (v11.7+)
- Information Governance Catalog Information Asset Author (when includeVirtualAssets is set to true)
Detailed explanation
The first two are read-only, non-administrative roles, while the Information Governance Catalog Glossary Author role allows synchronization objects to be created to track the last synchronization point of the DataStage job information.
For v11.7 and above, the Information Governance Catalog Asset Administrator role is necessary to automate the detection of lineage within IGC, prior to having a complete set of lineage for the DataStage connector itself to retrieve. (This is a necessary step to avoid potential race conditions between lineage being fully calculated within IGC and the DataStage connector polling for the lineage information.)
The Information Governance Catalog Information Asset Author role is needed to be able to retrieve the full details of virtual assets.
6. Start the server instances
curl -k -X POST "https://localhost:9443/open-metadata/admin-services/users/admin/servers/omas_server/instance"
curl -k -X POST "https://localhost:9443/open-metadata/admin-services/users/admin/servers/datastage_proxy/instance"
Detailed explanation
Up to this point we have only configured the connector; we have not actually started it.
These final API calls tell Egeria to start the servers needed: first the server hosting the OMASs with which the connector will communicate, and second the connector itself.
Response from server instances startup
{
  "class": "SuccessMessageResponse",
  "relatedHTTPCode": 200,
  "successMessage": "Thu Mar 11 12:54:46 GMT 2021 omas_server is running the following services: [Open Metadata Repository Services (OMRS), Connected Asset Services, Digital Service OMAS, Data Manager OMAS, Subject Area OMAS, Design Model OMAS, Glossary View OMAS, Asset Manager OMAS, Security Officer OMAS, IT Infrastructure OMAS, Data Science OMAS, Community Profile OMAS, Discovery Engine OMAS, Data Engine OMAS, Digital Architecture OMAS, Asset Owner OMAS, Stewardship Action OMAS, Governance Program OMAS, Asset Lineage OMAS, Analytics Modeling OMAS, Asset Consumer OMAS, Asset Catalog OMAS, DevOps OMAS, Software Developer OMAS, Project Management OMAS, Governance Engine OMAS, Data Privacy OMAS]"
}
The first call may take 10-15 seconds to complete. The response above indicates that the Egeria OMASs are now running.
{
  "class": "SuccessMessageResponse",
  "relatedHTTPCode": 200,
  "successMessage": "Thu Mar 11 13:07:35 GMT 2021 datastage_proxy is running the following services: [Open Metadata Repository Services (OMRS), Data Engine Proxy Services]"
}
The second call may also take 10-15 seconds to complete. The example response above indicates that the DataStage connector instance is now running.
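The successMessage in each startup response lists the services that are running, so a script can confirm that the expected service is present. The following is a hedged sketch using text matching; service_running is a hypothetical helper name, and SAMPLE is the second response from this guide collapsed onto one line.

```shell
#!/bin/sh
# Hypothetical helper: checks whether a startup response lists a given service as running.
# $1 = response body, $2 = service name exactly as it appears in successMessage.
service_running() {
  printf '%s\n' "$1" | grep '"successMessage"' | grep -Fq -- "$2"
}

# The datastage_proxy startup response shown above, on a single line.
SAMPLE='{"class":"SuccessMessageResponse","relatedHTTPCode":200,"successMessage":"Thu Mar 11 13:07:35 GMT 2021 datastage_proxy is running the following services: [Open Metadata Repository Services (OMRS), Data Engine Proxy Services]"}'

if service_running "$SAMPLE" "Data Engine Proxy Services"; then
  echo "DataStage proxy is up"
else
  echo "DataStage proxy is not running" >&2
fi
```

The same check with "Data Engine OMAS" as the service name can be applied to the omas_server response.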
Other startup information of potential interest
Back in the console where the server chassis is running, you should see the audit log printing out a large amount of information as the startup is running. Most of this is related to the registration of type definition details with the repository.
In this output, look for the lines indicating that the Data Engine OMAS has been configured and started, and that the DataStage proxy has also been started.
Connector options
There are currently five configuration options for the connector itself:
Option | Description |
---|---|
pageSize | An integer giving the number of results to include in each page of processing. |
includeVirtualAssets | A boolean that indicates whether to include the creation of schemas for virtual assets (when true) or not (when false). |
createDataStoreSchemas | A boolean that indicates whether to include the creation of data store-level schemas (when true) or not (when false). When the DataStage connector is used alone in a cohort, without an IGC proxy also running in the cohort, this should be set to true to ensure that the data stores used as sources or targets by DataStage exist in lineage. If an IGC proxy is also being used in the cohort, this should be left at the default value (false) to ensure that the IGC proxy remains the home metadata collection of data store entities. |
limitToProjects | A list of projects to which any lineage information should be limited. When not specified, all projects will be included. When specified, only jobs within those projects will be included. |
limitToLineageEnabledJobs | A boolean that indicates if the connector should only process lineage-enabled jobs. If this is set to true then only jobs having include_for_lineage set to true will be processed for lineage information. |
Example configuration
Details about how to connect to both the Data Engine OMAS and the IBM Information Server environment must also be provided in the connection's configuration:
- the URL and name of the OMAS server running within either this or another Egeria OMAG Server Platform (server chassis)
- the endpoint details covering the hostname and port of the Information Server environment's domain (services) tier, and the username and password through which we can access its REST APIs
For the additional options, if no values are provided in the configuration of the connector, they will automatically take their default values (for example, false for createDataStoreSchemas).
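To show how the options and connection details might fit together, here is a minimal sketch. It assumes the options are supplied in a configurationProperties map inside dataEngineConnection, which is the usual place for connector options in an Egeria Connection; verify this against your Egeria version. The option values are illustrative rather than the defaults, dstage1 is a hypothetical project name, and the CONNECTOR_OPTIONS and PROXY_CONFIG variable names are assumptions for this example.

```shell
#!/bin/sh
# Assumption: connector options go in a "configurationProperties" map within the connection.
# Values are illustrative only; "dstage1" is a hypothetical project name.
CONNECTOR_OPTIONS='{"pageSize":100,"includeVirtualAssets":true,"createDataStoreSchemas":false,"limitToProjects":["dstage1"],"limitToLineageEnabledJobs":true}'

# Same sample endpoint and credentials as the configuration call earlier in this guide.
PROXY_CONFIG=$(cat <<EOF
{
  "class": "DataEngineProxyConfig",
  "accessServiceRootURL": "https://localhost:9443",
  "accessServiceServerName": "omas_server",
  "dataEngineConnection": {
    "class": "Connection",
    "connectorType": {
      "class": "ConnectorType",
      "connectorProviderClassName": "org.odpi.egeria.connectors.ibm.datastage.dataengineconnector.DataStageConnectorProvider"
    },
    "endpoint": {
      "class": "Endpoint",
      "address": "infosvr:9446",
      "protocol": "https"
    },
    "userId": "isadmin",
    "clearPassword": "isadmin",
    "configurationProperties": ${CONNECTOR_OPTIONS}
  },
  "pollIntervalInSeconds": 60
}
EOF
)

echo "$PROXY_CONFIG"
```

Keeping the options in their own variable makes it easy to change one setting, such as limitToProjects, without touching the connection details.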