DataStage Interview Questions

A list of top frequently asked DataStage Interview Questions and answers are given below.


1) What is IBM DataStage?

DataStage is one of the most powerful ETL tools. It comes with the feature of graphical visualizations for data integration. It extracts, transforms, and loads data from source to the target.

DataStage is an integrated set of tools for designing, developing, running, compiling, and managing applications. It can extract data from one or more data sources, achieve multi-part conversions of the data, and load one or more target files or databases with the resultant data.


2) Describe the Architecture of DataStage?

DataStage follows the client-server model. It has different types of client-server architecture for different versions of DataStage.

DataStage architecture contains the following components.

  • Projects
  • Jobs
  • Stages
  • Servers
  • Client Components
DataStage Interview Questions 1

3) Explain the DataStage Parallel Extender (PX) or Enterprise Edition (EE)?

DataStage PX is an IBM data integration tool. It is one of the most widely used extractions, transformation, and loading (ETL) tools in the data warehousing industry. This tool collects the information from various sources to perform transformations as per the business needs and load data into respective data warehouses.

DataStage PX is also called as DataStage Enterprise Edition.


4) Describe the main features of DataStage?

The main features of DataStage are as follows.

  • DataStage provides partitioning and parallel processing techniques which allow the DataStage jobs to process an enormous volume of data quite faster.
  • It has enterprise-level networking.
  • It's a data integration component of IBM InfoSphere information server.
  • It's a GUI based tool.
  • In DataStage, we need to drag and drop the DataStage objects, and also we can convert it to DataStage code.
  • DataStage is used to perform the various ETL operations (Extract, transform, load)
  • It provides connectivity with different sources & multiple targets at the same time

5) What are some prerequisites for DataStage?

For DataStage, following set ups are necessary.

  • InfoSphere
  • DataStage Server 9.1.2 or above
  • Microsoft Visual Studio .NET 2010 Express Edition C++
  • Oracle client (full client, not an instant client) if connecting to an Oracle database
  • DB2 client if connecting to a DB2 database

6) How to read multiple files using a single DataStage job if files have the same metadata?

  • Search if the metadata of files is different or same then specify file names in the sequential stage.
  • Attach the metadata with a sequential stage in its properties.
  • Select Read method as 'Specific File(s)'then add all files by selecting 'file' property from the 'available properties to add.'

It will look like:


7) Explain IBM InfoSphere information server and highlight its main features?

IBM InfoSphere Information Server is a leading data integration platform which contains a group of products that enable you to understand, filter, monitor, transform, and deliver data. The scalable solution facilitates with massively parallel processing capabilities to help you to manage small and massive data volumes. It assists you in forwarding reliable information to your key business goals such as big data and analytics, data warehouse modernization, and master data management.

Features of IBM InfoSphere information server

  • IBM InfoSphere can connect with multiple source systems as well as write to various target systems. It acts as a single platform for data integration.
  • It is based on centralized layers. All the modules of the suit can share the baseline architecture of the suite.
  • It has some additional layers for the unified repository, for integrated metadata services, and sharing a parallel engine.
  • It has tools for analysis, monitoring, cleansing, transforming and delivering data.
  • It has extremely parallel processing capabilities that provide high-speed processing.

8) What is IBM DataStage Flow Designer?

IBM DataStage Flow Designer allows you to create, edit, load, and run jobs in DataStage. DFD is a thin client, web-based version of DataStage. Its a web-based UI for DataStage than DataStage Designer, which is a Window-based thick client.


9) How do you run DataStage job from the command line?

To run a DataStage job, use command"dsjob" command as follows.


10) What are some different alternative commands associated with "dsjob"?

Many alternative optional commands can be used with dsjob command to perform any specific task. These commands are used in the below format.

A list of commonly used alternative options of dsjob command is given below.

Stop: it is used to stop the running job

Lprojects: it is used to list the projects

ljobs: it is used to list the jobs in project

lparams: it is used to list the parameters in a job

paraminfo: it returns the parameters info

Linkinfo: It returns the link information

Logdetail: it is used to display details like event_id, time, and message

Lognewest: it is used to display the newest log id.

log: it is used to add a text message to log.

Logsum: it is used to display the log.

lstages: it is used to list the stages present in the job.

Llinks: it is used to list the links.

Projectinfo: it returns the project information (hostname and project name)

Jobinfo: it returns the job information (Job-status, job runtime,end time, etc.)

Stageinfo: it returns the stage name, stage type, input rows, etc.)

Report: it is used to display a report which contains Generated time, start time, elapsed time, status, etc.

Jobid: it is used to provide Job id information.


11) What is a Quality Stage in DataStage tool?

A Quality Stage helps in integrating different types of data from multiple sources.

It is also termed as the Integrity Stage.


12) What is the process of killing a job in DataStage?

To kill a job, you must destroy the particular processing ID.


13) What is a DS Designer?

DataStage Designer is used to design the job. It also develops the work area and adds various links to it.


14) What are the Stages in DataStage?

Stages are the basic structural blocks in InfoSphere DataStage. It provides a rich, unique set of functionality to perform advanced or straightforward data integration task. Stages hold and represent the processing steps that will be performed on the data.


15) What are Operators in DataStage?

The parallel job stages are made on operators. A single-stage might belong to a single operator or a number of operators. The number of operators depends on the properties you have set. During compilation, InfoSphere DataStage estimates your job design and sometimes will optimize operators.


16) Explain connectivity between DataStage with DataSources?

IBM InfoSphere Information Server supports connectors and enables jobs for data transfer between InfoSphere Information Server and data sources.

IBM InfoSphere DataStage and QualityStage jobs can access data from enterprise applications and data sources such as:

  • Relational databases
  • Mainframe databases
  • Enterprise Resource Planning (ERP) or Customer Relationship Management (CRM) databases
  • Online Analytical Processing (OLAP) or performance management databases
  • Business and analytic applications

17) Describe Stream connector?

The Stream connector allows integration between the Streams and the DataStage. InfoSphere Stream connector is used to send data from a DataStage job to a Stream job and vice versa.

InfoSphere Streams can perform close to real-time analytic processing in parallel to the data loading into a data warehouse. Alternatively, the InfoSphere Streams job performs RTAP processing. After RTAP processing, it forwards the data to InfoSphere DataStage to transform, enrich, and store the details for archival purposes.


18) What is the use of HoursFromTime() Function in Transformer Stage in DataStage?

HoursFromTime Function is used to return hour portion of the time. Its input is time, and Output is hours (int8).

Examples: If myexample1.time contains the time 22:30:00, then the following two functions are equivalent and return the integer value 22.


19) What is the Difference between Informatica and DataStage?

The DataStage and Informatica both are powerful ETL tools. Both tools do almost the same work in nearly the same manner. In both tools, the performance, maintainability, and learning curve are similar and comparable. Below are the few differences between both tools.

ParameterDataStageInformatica
Multiple PartitionsDataStage's pipeline partitioning uses multiple partitions.Informatica offers to partition as dynamic partitioning.
User InterfaceDataStage offers 3 GUIs
IBM DataStage Designer
Job Sequence Designer(workflow design)
Director (for monitoring)
Informatica offers 4 GUIs
Informatica PowerDesigner
Repository Manager
Workflow Designer
Workflow Manager.
Data EncryptionData encryption needs to be done before reaching the DataStage Server.Informatica allows "Data Masking Transformation" inside PowerCenter Designer as a separate transformation.
Transformations_DataStage becomes a powerful transformation engine by using functions (Oconv and IConv) and routines. It offers about 40 data transforming stages/objects. Almost any transformation can be performed in DataStage.Informatica allows about 30 necessary transformations to process incoming data.
ReusabilityWe can achieve re-usability of a job in DataStage by using containers(local&shared). To re-use a Job Sequence, you will have to make a copy, compile it, and run.It offers access of re-usability through Mapplets and Worklets for re-using mappings and workflows. Reusability improves performance.

20) How We Can Covert Server Job To A Parallel Job?

We can convert a server job into a parallel job by using Link Collector and IPC Collector.


21) What are the different layers in the information server architecture?

The different layers of information server architecture are as follows.

  • Unified user interface
  • Common services
  • Unified parallel processing
  • Unified Metadata
  • Common connectivity

22) If you want to use the same piece of code in different jobs, how will you achieve it?

DataStage facilitates with a feature called shared containers which allows sharing the same piece of code for a different job. The containers are shared for reusability. A shared container consists of a reusable job element of stages and links. We can call a shared container in, unlike DataStage jobs.


23) How many types of Sorting methods are available in DataStage?

There are two types of sorting methods available in DataStage for parallel jobs.

  • Link sort
  • Standalone Sort stage

24) Describe Link Sort?

The Link sort supports fewer options than other sorts. It is easy to maintain in a DataStage job as there are only few stages in the DataStage job canvas.

Link sort is used unless a specific option is needed over Sort Stage. Most often, the Sort stage is used to specify the Sort Key mode for partial sorts.

Sorting on a link option is available on the input/partitioning stage options. We cannot specify a keyed partition if we use auto partition method.


25) Which commands are used to import and export the DataStage jobs?

We use the following commands for the given operations.

For Import: we use the dsimport.exe command

For Export, we use the dsexport.exe command


26) Describe routines in DataStage? Enlist various types of routines.

Routine is a set of tasks which are defined by the DS manager. It is run via the transformer stage.

There are three kinds of routines

  • Parallel routines
  • Mainframe routines
  • Server routines

27) What is the different type of jobs in DataStage?

There are two types of jobs in DataStage

  • Server jobs: These jobs run in a sequential manner
  • Parallel jobs: These jobs get executed in a parallel way

28) State the difference between an Operational DataStage and a Data Warehouse?

An Operational DataStage can be considered as a presentation area for user processing and real-time analysis. Thus, operational DataStage is a temporary repository. Whereas the Data Warehouse is used for durable data storage needs and holds the complete data of the entire business.


29) What is the importance of the exception activity in DataStage?

The reason behind the importance of exception activity is that during the job execution, exception activity handles all the unfamiliar error activity.


30) What is "Fatal Error/RDBMS Code 3996" error?

This error occurs while testing jobs in DataStage 8.5 during Teradata 13 to 14 upgrade.

It is because the user tries to assign a longer string to a shorter string destination, and sometimes if the length of one or more range boundaries in a RANGE_N function is a string literal with a length higher than that of the test value.




Latest Courses