DATA INGESTION PLUGIN GUIDE


About

This document is a reference for using the built-in plugins, and a guide for developing and implementing your own custom plugins.

Data Ingestion Pipeline

The Data Ingestion component enables the Microsoft CityNext Big Data Solution Accelerator to ingest arbitrary data from arbitrary channels and convert the ingested data into a unified form before storing it into City Artifacts Management (CAM). There are two types of ingestion: pull ingestion and push ingestion. As shown in the images below, the ingestion pipeline is the core of data ingestion; it is extensible and configurable, and the unified form it produces is key-value paired and blob based.

The extensibility of the ingestion pipeline is based on plugins. There are 20+ built-in plugins already integrated within the Microsoft CityNext Big Data Solution Accelerator. In addition, you can develop your own plugins to extend the functionality of the solution accelerator when customizing it for your own purposes.

2 Data Ingestion Pipeline.png

2 Data Ingestion Pipeline 2.png

Back to top


Channel Driver Plugins

Shared Channel Creation Time Properties

3.1 Shared Channel Creation Time Properties.png

Data Table Channel Driver Plugins

DB Channel Driver

Description

The database channel driver is used to connect to database sources and retrieve data from them. It supports SQL Server, Oracle, and any database with an OLE DB or ODBC driver.

Creation time properties
DB Channel Driver - 3.2.1.1.1 Creation time properties.png

Ingestion time properties

DB Channel Driver - 3.2.1.2 Ingestion time properties.png

Scenarios and examples

Scenario: I want to ingest sales order data from the legacy SQL Server database, and I want to ingest only completed orders.

Configuration example:
Creation Time Properties
  • ConnectionString: “Data Source=MyDBServer;Initial Catalog=AdventureWorks; Integrated Security=true;”
  • Provider: “SqlClient”

Ingestion Time Properties
  • SQLQuery: “SELECT * FROM dbo.SalesOrders WHERE isCompleted = 1”
  • TableName: <empty>
  • BatchSize: 1000
Back to top

Email IMAP Channel Driver

Description

The Email IMAP Channel Driver is used to retrieve emails from email service providers such as Hotmail. It uses IMAP as the message exchange protocol.

Creation time properties

Email IMAP Channel Driver - 3.2.2.2 Creation time properties.png

Ingestion time properties

Email IMAP Channel Driver - 3.2.2.3 Ingestion time properties.png

Scenarios and examples

Suppose you want to ingest emails received by a Hotmail account. First, create a pull channel with the following creation time properties:

Host: imap-mail.outlook.com
Port: empty (use the default value 993)
SSL Protocol: empty (use the default value 'default')
Login Name: cignycoem@hotmail.com
Password: (password)
For Hotmail, the header for the thread ID is 'Thread-Index'.

Second, you need to provide information for ingestion time properties when creating a pull source based on the above email pull channel. If you want to ingest emails into the folder 'Inbox/News', you need to set the value of the folder path to 'Inbox/News'. If you do not want to filter emails using the subject, you can just leave the subject filter empty.

Back to top


Stream Channel Driver Plugins

Web Channel Driver

Description

Ingests arbitrary data from a set of web URIs, and then passes the ingested data as streams into the ingestion pipeline for further processing.

Creation time properties

Web Channel Driver - 3.3.1.2 Creation time properties.png

Web Channel Driver - 3.3.1.2 Creation time properties 2.png

Ingestion time properties

Web Channel Driver - 3.3.1.3 Ingestion time properties.png

Scenarios and examples

Scenario: You want to ingest NYC 311 historical data from the NYC open data website; specifically, you would like to ingest all 311 requests created in the past two days. By default, the REST API provided by the source returns only 1000 records per request, so you need to send multiple requests with paging parameters in the URL to download the whole data set.

Here you need to solve two problems:
  1. How to put the day before yesterday in the URL, so that the date changes dynamically at ingestion time?
  2. How to send multiple requests in order to download all records until there are none left?

The web channel supports embedded immediately-invoked function expressions (IIFE) in the URL, and the expressions in the URL will be dynamically interpreted during runtime.

For problem one, you can add a “time” IIFE to represent the day before yesterday in New York time. The expression is: “::time(eastern standard time,1,-2,yyyy-MM-dd)”
  • The first parameter tells the function to generate NYC local time (time zone name “eastern standard time”)
  • The second parameter tells the function to only generate the date by rounding the date time to “1” day.
  • The third parameter tells the function to shift the time by “-2” days, i.e., two days earlier.
  • The fourth parameter tells the function to format the computed time with the format “yyyy-MM-dd”.

For problem two, you can define a sequence in the web channel and set a paging parameter in the URL. Since every HTTP request can only retrieve 1000 records, you should generate a sequence from 0 to the maximum expected number of 311 requests in two days, in steps of 1000. Assuming the number of 311 requests in two days won’t exceed 100,000, the sequence is “0, 100000, 1000”, and the paging parameter in the query string is “$offset=::sequence()”.
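The arithmetic behind the two expressions can be sketched in plain Python. The function names below are illustrative, not the accelerator's API, and a fixed UTC-5 offset stands in for the driver's “eastern standard time” time-zone lookup:

```python
from datetime import datetime, timedelta, timezone

# Sketch of what "::time(eastern standard time,1,-2,yyyy-MM-dd)" evaluates to.
# Assumption: a fixed UTC-5 offset stands in for "eastern standard time".
def time_expr(now_utc, utc_offset_hours, round_to_day, shift_days, fmt):
    local = now_utc + timedelta(hours=utc_offset_hours)
    if round_to_day:
        local = local.replace(hour=0, minute=0, second=0, microsecond=0)
    local += timedelta(days=shift_days)
    return local.strftime(fmt)

# Sketch of the "0,100000,1000" sequence used for $offset paging.
def sequence(start=0, stop=100_000, step=1_000):
    return list(range(start, stop, step))

now = datetime(2015, 2, 17, 3, 0, tzinfo=timezone.utc)
print(time_expr(now, -5, True, -2, "%Y-%m-%d"))  # the day before yesterday, NY date
print(sequence()[:3])                            # first three paging offsets
```

In the real channel, the sequence ends early when EndTheSequenceWhenLastOutputIsEmpty is true and a request returns no records.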

Configuration Example:
Creation Time Properties
  • SiteRoot: “http://data.cityofnewyork.us/”

Ingestion Time Properties
  • URLs: “http://data.cityofnewyork.us/resource/bxtq-h7wr.xml?$where=created_date >=’::time(eastern standard time,1,-2,yyyy-MM-dd)’&$offset=::sequence()”
  • Sequence: 0,100000,1000
  • EndTheSequenceWhenLastOutputIsEmpty: true
Back to top

File Channel Driver

Description

Ingests arbitrary data from a set of file paths and then transfers the ingested data as streams into the ingestion pipeline for further processing.

Creation time properties

File Channel Driver - 3.3.2.2 Creation time properties.png

Ingestion time properties

File Channel Driver - 3.3.2.3 Ingestion time properties.png
File Channel Driver - 3.3.2.3 Ingestion time properties 2.png

Scenarios and examples

Imagine that there is a shared folder \\Server101\Share0\Folder1\, which contains the following files:

File Channel Driver - 3.3.2.4 Scenarios and examples.png

Now, you want to ingest all the CSV files in the folder above, including those in its sub-folders, but you don’t want to ingest any file whose name is suffixed with temp. The folder grants access to user user101, whose password is Password01!.
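This kind of include/exclude selection can be sketched with Python's fnmatch. The file names below are invented for illustration (the real listing is in the image above), and the real driver applies its own include/exclude properties:

```python
import fnmatch

# Hypothetical listing of \\Server101\Share0\Folder1 (names invented for the sketch).
paths = [
    r"\\Server101\Share0\Folder1\orders.csv",
    r"\\Server101\Share0\Folder1\orders_temp.csv",
    r"\\Server101\Share0\Folder1\sub\customers.csv",
    r"\\Server101\Share0\Folder1\sub\notes.xml",
]

def select(paths, include="*.csv", exclude="*temp.*"):
    """Keep files matching `include` whose name does not match `exclude`."""
    picked = []
    for p in paths:
        name = p.rsplit("\\", 1)[-1]
        if fnmatch.fnmatch(name, include) and not fnmatch.fnmatch(name, exclude):
            picked.append(p)
    return picked

print(select(paths))  # the two CSVs whose names are not suffixed with temp
```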

First, you need to create a custom channel using the File Channel Driver, with the following channel creation time property values:

File Channel Driver - 3.3.2.4 Scenarios and examples 2.png

After that, you have a custom channel named FileChannel101, which can be used by multiple data sources to ingest files from folder \\Server101\Share0\Folder1. For example, one data source can ingest all the CSV files, while another only ingests the xml files. If the credentials to the share folder were changed, you only need to update the user name and password for FileChannel101, and you won’t need to change all of the data sources that use FileChannel101.

Second, you can create a data source that uses FileChannel101, with the following channel ingestion time property values:

File Channel Driver - 3.3.2.4 Scenarios and examples 3.png

Finally, you will have the following files ingested:

File Channel Driver - 3.3.2.4 Scenarios and examples 4.png

The ingestion pipeline will have two streams that contain the two files above for further processing. You can also utilize stream filters/stream converters/data table filters to process them further.

Note that the two files above must share the same schema if you want to use a format-specific stream converter, such as the CSV converter. This limitation does not apply if you use a format-irrelevant stream converter, such as the Blob converter.

Back to top

MQ Channel Driver

Description

Ingests arbitrary data from an IBM message queue server into the ingestion pipeline for further processing.

Creation time properties

MQ Channel Driver - 3.3.3.2 Creation time properties.png

Ingestion time properties

MQ Channel Driver - 3.3.3.2 Creation time properties 2.png

Scenarios and examples

Scenario: You have some data on a remote computer and want to ingest it into the solution accelerator.

Example steps:
  1. Send the data to the IBM queue.
  2. Create an MQ channel:
    1. Set the IBM queue manager and queue parameters as properties.
    2. Start the channel driver process.
Note: The MQ channel driver will receive all data from the remote queue in one pass, and it deletes the data from the queue as it receives it.

Back to top

Stream Filter Plugins

Decompressor

Description

The Decompressor Stream Filter is used to decompress the zip format stream.

Ingestion Time Properties

Stream Filter Plugins - 4.1.2 Ingestion Time Properties.png


Scenarios and Examples

Scenario: You have a zip file and want to ingest it into the solution accelerator.

Configuration example:
If you want all files in this zip package to be ingested: Filter = .

If you only want all xml files to be ingested: Filter = *.xml
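What the Decompressor does to a zip stream can be sketched with Python's zipfile module; the entry names here are invented, and the real plugin's Filter syntax is defined by its ingestion time properties:

```python
import fnmatch, io, zipfile

# Build a small in-memory zip to stand in for the ingested stream.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("report.xml", "<r/>")
    z.writestr("readme.txt", "hi")

# Sketch of the Decompressor's behaviour: expand the zip stream and keep
# only the entries matching the Filter property (here "*.xml").
def decompress(stream, name_filter="*.xml"):
    out = {}
    with zipfile.ZipFile(stream) as z:
        for name in z.namelist():
            if fnmatch.fnmatch(name, name_filter):
                out[name] = z.read(name)
    return out

buf.seek(0)
print(list(decompress(buf)))  # only the .xml entry survives the filter
```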

Back to top

XSL Transformer

Description

The XSL Transformer is used to apply XSL transformations to XML streams.

Ingestion Time Properties

XSL Transformer - 4.2.2 Ingestion Time Properties.png

Scenarios and Examples

Scenario: You have a source with XML format, but the source XML files have multiple schemas. You want to convert them to a unified format, then ingest the data into the solution accelerator.

Assume that you have three schemas A, B, and C, and you want to convert them all to schema X. Create three XSL files:
  • A-X.xslt
  • B-X.xslt
  • C-X.xslt

In order to make these files accessible for the data ingestion pipeline, you should put them into a network share folder: \\fileserver\schemas

Each of the source schemas includes a special XML namespace, for example:
  • http://xxx.com/schema/a
  • http://xxx.com/schema/b
  • http://xxx.com/schema/c

Configuration Example:
Transform Rules:
  • http://xxx.com/schema/a,\\fileserver\schemas\A-X.xslt
  • http://xxx.com/schema/b,\\fileserver\schemas\B-X.xslt
  • http://xxx.com/schema/c,\\fileserver\schemas\C-X.xslt
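The namespace-to-stylesheet matching can be sketched as a simple lookup keyed on the root element's namespace. This only shows the rule selection; applying the XSLT itself requires an XSLT processor, which the Python standard library does not provide, and how the plugin matches internally is an assumption here:

```python
import xml.etree.ElementTree as ET

# Transform rules from the example: source namespace -> XSLT path.
rules = {
    "http://xxx.com/schema/a": r"\\fileserver\schemas\A-X.xslt",
    "http://xxx.com/schema/b": r"\\fileserver\schemas\B-X.xslt",
    "http://xxx.com/schema/c": r"\\fileserver\schemas\C-X.xslt",
}

def pick_stylesheet(xml_text):
    """Read the root element's namespace and look up the matching stylesheet."""
    root = ET.fromstring(xml_text)
    ns = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""
    return rules.get(ns)

doc = '<record xmlns="http://xxx.com/schema/b"><id>1</id></record>'
print(pick_stylesheet(doc))  # the B-X.xslt rule is selected
```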
Back to top

Checksum Calculator

Description

Computes the checksum of the raw stream data. The checksum can serve as the identity of the stream data.

Ingestion Time Properties

Checksum Calculator - 4.3.2 Ingestion Time Properties.png

Scenarios and Examples

Scenario: You need to ingest one file, but you want to bring in new records only if the file has been changed. If no one has changed this file, then there should not be new records in the solution accelerator. If someone has changed some information, then you should ingest the whole file.

Example: You need to ingest Test.xlsx. Test.xlsx contains three rows.

Test.xlsx: (Old file)
Checksum Calculator - Test.xlsx (Old file).png

Test.xlsx: (New file)
Checksum Calculator - Test.xlsx (New file).png

When you ingest data, and you use “Student” as the identity, there should be three records listed in the solution accelerator. Even if there are some changes, there should only be three records listed. The new record will overwrite the old version (assume that there is no new student record in the file).

After ingesting data two times, there should be three records listed in the solution accelerator as shown below:

Checksum Calculator 3.png

If you use the Checksum Calculator to process data, then use “Checksum” and “Student” as the identity, then you can obtain more than three records. Once there are changes in the file, there will be three more records listed in the solution accelerator (assume that there is no new student record in the file).

After ingesting data two times, there should be six records listed in the solution accelerator as shown below:

Checksum Calculator 4.png

Checksum Calculator should be used when you want to record snapshots of the changes to the source. When you want to use this value as the identity, you should follow the steps below:
  1. Use the “Checksum Calculator” stream filter.
  2. Set the “IncludeAdvancedSourceProperties” property value to “true”.
  3. (Optional) Set the “AdvancedSourcePropertiesColumnNamesPrefix” property value.
  4. Add “Checksum” to the identity.
Note: The properties in steps 2 and 3 (“IncludeAdvancedSourceProperties” and “AdvancedSourcePropertiesColumnNamesPrefix”) are the ingestion time properties of stream converters which were inherited from the “AdvancedPropertiesConverter”.
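The effect of adding “Checksum” to the identity can be sketched with hashlib. The student names and scores below are invented (the guide's sample is in the images above), and the hash choice is an assumption; the point is only that the checksum changes with the file, so each file version contributes its own records:

```python
import hashlib

# Two "ingestions" of the same logical rows, before and after an edit.
old_rows = [("Alice", 90), ("Bob", 85), ("Carol", 88)]
new_rows = [("Alice", 95), ("Bob", 85), ("Carol", 88)]

def checksum(rows):
    """Checksum of the raw data; changes whenever any byte changes."""
    return hashlib.sha256(repr(rows).encode()).hexdigest()[:8]

def records(rows):
    """Identity = (Checksum, Student): every file version yields new records."""
    c = checksum(rows)
    return {(c, student): score for student, score in rows}

store = {}
store.update(records(old_rows))
store.update(records(new_rows))
print(len(store))  # 6: three records per file version, none overwritten
```

With “Student” alone as the identity, the second ingestion would overwrite the first and only three records would remain.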

AdvancedPropertiesConverter is an abstract class used to add advanced source properties (such as source metadata and checksum) to the output data table of stream converters. There are two extra properties if your stream converter inherits from this base converter:

Extra ingestion time properties of stream converters which were inherited from the “AdvancedPropertiesConverter”

Checksum Calculator 5.png

Back to top


Stream Converter Plugins

Blob Converter

Description

The Blob Converter converts the blob data in a stream and extracts metadata from the blob, such as the blob name, length, and URI.

Ingestion time properties

BLOB Converter - 5.1.2 Ingestion time properties.png

Scenarios and Examples

Scenario: You have some blob data (e.g., .txt, .bat) and do not care which data format is being used.

Example: For the scenario above, create a blob converter and set the column names, or just use the defaults.

Results of key-value pairs as shown below:

BLOB Converter - 5.1.3 Scenarios and Examples.png

Back to top


JSONPath Converter

Description

Breaks down the JSON data source contents into a stream of key-value pairs via JSONPath. The JSONPath provided selects the entities whose properties are extracted as the basic key-value pairs.

Ingestion Time Properties

JSON Path Converter - 5.2.2 Ingestion Time Properties.png

Scenarios and Examples

There are JSON data source contents similar to the following:
JSONPath Converter - 5.2.3 Scenarios and Examples.png

If you want to select all the books, with only their title and author, then the ingestion time properties should be:

JSONPath Converter - 5.2.3 Scenarios and Examples 2.png

Then the result should look similar to that of the listing below:

JSONPath Converter - 5.2.3 Scenarios and Examples 3.png
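Since the sample data is only in the images, here is a stand-in sketch of the mechanics using the conventional store/book JSONPath sample document; it shows what a converter configured with a path like $.store.book[*] and the columns title/author would emit:

```python
import json

# A conventional JSONPath sample document (the guide's actual sample is in the
# image above; this stand-in only illustrates the mechanics).
doc = json.loads("""
{"store": {"book": [
    {"title": "Sayings of the Century", "author": "Nigel Rees", "price": 8.95},
    {"title": "Moby Dick", "author": "Herman Melville", "price": 8.99}
]}}
""")

# One key-value row per entity selected by "$.store.book[*]",
# keeping only the title and author properties.
rows = [{"title": b["title"], "author": b["author"]}
        for b in doc["store"]["book"]]
print(rows)
```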

Back to top


CSV Converter

Description

Breaks down the CSV data source contents into a stream of key-value pairs. Typically, the first row is considered the header row for the column definition.

Ingestion Time Properties

CSV Converter - Ingestion Time Properties.png

CSV Converter - Ingestion Time Properties 2.png

Scenarios and Examples

There are CSV data source contents such as:

ID,CREATIONDATE,CREATED_BY,TIMEZONE
aaaaaa,2011-07-21 13:51:49.583,TestUserCreate,EST5EDT
bbbbbb,2011-07-21 13:51:49.583,NULL,NULL
cccccc,2011-07-21 13:51:49.583,TestUserCreate,EST5EDT

The CSV contents naturally represent a table; you only need to take care of the NULL values. Below is a sample of configured properties:
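The header-plus-NULL handling can be sketched with Python's csv module; mapping the literal string "NULL" to a real null is the assumption here, mirroring the note above:

```python
import csv, io

raw = """ID,CREATIONDATE,CREATED_BY,TIMEZONE
aaaaaa,2011-07-21 13:51:49.583,TestUserCreate,EST5EDT
bbbbbb,2011-07-21 13:51:49.583,NULL,NULL
cccccc,2011-07-21 13:51:49.583,TestUserCreate,EST5EDT
"""

# Sketch of the converter: the header row defines the columns, and the
# literal string "NULL" is mapped to a real null (None).
def convert(text, null_token="NULL"):
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({k: (None if v == null_token else v) for k, v in rec.items()})
    return rows

rows = convert(raw)
print(rows[1]["CREATED_BY"])  # None
```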

CSV Converter - 5.3.3 Scenarios and Examples.png

Then the result should be displayed similar to the table below:

CSV Converter - 5.3.3 Scenarios and Examples 2.png

Back to top


Excel Converter

Description

The Excel Converter converts an Excel file into a stream of key-value pairs.

Ingestion Time Properties

Excel Converter - 5.4.2 Ingestion Time Properties.png

Scenarios and Examples

If you have an Excel file and the data in the sheets is similar to the table below:

Excel Converter - 5.4.3 Scenarios and Examples.png

Then, after the Excel Converter converts the file with default parameters, the result will be similar to the table below:

Excel Converter - 5.4.3 Scenarios and Examples 2.png

Back to top


Image Converter

Description

Converts the image data contents into a stream of key-value pairs.

Ingestion Time Properties

Image Converter - 5.5.2 Ingestion Time Properties.png

Scenarios and Examples

Suppose you have a JPEG image with details such as dimensions, width, height, horizontal resolution, and vertical resolution. After the Image Converter converts the file, you will receive the stream data and the detailed information as key-value pairs, similar to the table below:

Image Converter - 5.5.3 Scenarios and Examples.png

Back to top


HTML XPath Converter

Description

Converts the HTML data source contents into a stream of key-value pairs by XPath.

Ingestion Time Properties

HTML XPath Converter - 5.6.2 Ingestion Time Properties.png

Scenarios and Examples

There are HTML data source contents similar to the following:

(image macro unresolved: sample HTML contents)

If you want to select AQI forecast table contents, then the ingestion time properties should be:

HTML XPath Converter - 5.6.3 Scenarios and Examples 2.png

Then the result should be shown similar to the table below:

HTML XPath Converter - 5.6.3 Scenarios and Examples 3.png

Back to top


JSON LINQ Converter

Description

Converts JSON data source contents into key-value pairs according to the specified Language-Integrated Query (LINQ).

Ingestion Time Properties

JSON LINQ Converter - 5.7.2 Ingestion Time Properties.png

Scenarios and Examples

There are JSON data source contents similar to the following:

JSON LINQ Converter - 5.7.3 Scenarios and Examples script.png
JSON LINQ Converter - 5.7.3 Scenarios and Examples script 2.png

If you want to select latitude/longitude and the address, then the ingestion time properties should be:

JSON LINQ Converter - 5.7.3 Scenarios and Examples.png

Then the result should be displayed similar to the table below:

JSON LINQ Converter - 5.7.3 Scenarios and Examples 2.png

If you do not provide the LINQ query, then the selected entity is displayed similar to the table below:

JSON LINQ Converter - 5.7.3 Scenarios and Examples 3.png

Back to top


XML LINQ Converter

Description

Breaks down the XML data source contents into a stream of key-value pairs via LINQ.

Ingestion Time Properties

XML LINQ Converter - 5.8.2 Ingestion Time Properties.png

Scenarios and Examples

There are XML data source contents similar to the following:

XML LINQ Converter - 5.8.3 Scenarios and Examples script.png
XML LINQ Converter - 5.8.3 Scenarios and Examples script 2.png
XML LINQ Converter - 5.8.3 Scenarios and Examples script 3.png

If you want to select latitude/longitude and the street name, then the ingestion time properties should be:

XML LINQ Converter - 5.8.3 Scenarios and Examples.png

Then the result should be similar to the table below:

XML LINQ Converter - 5.8.3 Scenarios and Examples 2.png

Back to top


XML XPath Converter

Description

Breaks down the XML data source contents into a stream of key-value pairs via XPath.

Ingestion Time Properties

XML XPath Converter - 5.9.2 Ingestion Time Properties.png
XML XPath Converter - 5.9.2 Ingestion Time Properties 2.png

Scenarios and Examples

There are XML data source contents similar to the following:

XML XPath Converter - 5.9.3 Scenarios and Examples script.png

If you want to select all the books, then the ingestion time properties should be:

XML XPath Converter - 5.9.3 Scenarios and Examples 2.png

Then the result should be similar to the table below:

XML XPath Converter - 5.9.3 Scenarios and Examples 3.png
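Since the sample XML is only in the images, here is a stand-in sketch with a conventional bookstore document, showing what selecting all book elements by XPath produces (ElementTree's limited XPath syntax is used here; the plugin's own XPath support may be richer):

```python
import xml.etree.ElementTree as ET

xml_text = """
<bookstore>
  <book><title>Everyday Italian</title><author>Giada De Laurentiis</author></book>
  <book><title>Harry Potter</title><author>J K. Rowling</author></book>
</bookstore>
"""

# Select all the books (".//book") and emit one key-value row per element,
# with the child elements as columns.
root = ET.fromstring(xml_text)
rows = [{child.tag: child.text for child in book}
        for book in root.findall(".//book")]
print(rows)
```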

Back to top


Data Table Filter Plugins

Rename a Column

Description

This filter is used to rename an existing column in a data table that was output by a data table channel, or by a stream channel plus a stream converter. Note that multiple data table filters can be applied to a data table consecutively; each filter is applied to the previous filter’s result, like a chained list.

Ingestion Time Properties

(image macro unresolved: ingestion time properties)

Scenarios and Examples

There is a column of ingested content named CreatedDate that you want to rename to Timestamp, so the ingestion time properties should be:

Rename a Column - 6.1.3 Scenarios and Examples.png

After applying this filter, the original column named CreatedDate is renamed to Timestamp.
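On key-value rows, the rename amounts to re-emitting the old key under the new name, which can be sketched as:

```python
# Sketch of the Rename a Column filter on key-value rows:
# every occurrence of the old key is re-emitted under the new key.
def rename_column(rows, old, new):
    return [{(new if k == old else k): v for k, v in row.items()}
            for row in rows]

rows = [{"CreatedDate": "2015-02-17", "Id": 1}]
print(rename_column(rows, "CreatedDate", "Timestamp"))
```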

Back to top


Computed Column Filter

Description

This data table filter is used for adding the Computed Column Expression (CCX) during ingestion.

Ingestion Time Properties

Computed Column Filter - 6.2.2 Ingestion Time Properties.png

Scenarios and Examples

There is an output data table similar to the table below:

Computed Column Filter - 6.2.3 Scenarios and Examples.png

If you want to compute the product name without brand and percentage strings, then the ingestion time properties should be:

Computed Column Filter - 6.2.3 Scenarios and Examples 2.png

Then the result should be similar to the table below:

Computed Column Filter - 6.2.3 Scenarios and Examples 3.png
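The actual expression syntax is shown only in the property images, so here is a hypothetical sketch of the same idea: deriving a clean product name by stripping a leading brand word and a percentage string (the rule, column names, and sample value are all invented):

```python
import re

# Hypothetical computed-column rule: "Contoso Milk 2%" -> "Milk".
def computed_column(rows, source, target):
    for row in rows:
        value = re.sub(r"\s*\d+%\s*", "", row[source])            # drop "2%" etc.
        value = value.split(" ", 1)[1] if " " in value else value  # drop brand word
        row[target] = value.strip()
    return rows

rows = [{"Product": "Contoso Milk 2%"}]
print(computed_column(rows, "Product", "ProductName")[0]["ProductName"])
```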

Here is a list of built-in operators and functions for computed columns (note: function names are case-insensitive).
  1. Operators
Computed Column Filter - Operators.png
  2. Functions
Computed Column Filter - Functions.png
Computed Column Filter - Functions 2.png
Back to top


Column Filter

Description

This data table filter is used to remove (filter out) existing columns in a data table that was output by a data table channel, or by a stream channel plus a stream converter. Note that multiple data table filters can be applied to a data table consecutively; each filter is applied to the previous filter’s result, like a chained list.

Ingestion Time Properties

Column Filter - 6.3.2 Ingestion Time Properties.png

Scenarios and Examples

There is an output data table similar to the table below:

Column Filter - 6.3.3 Scenarios and Examples.png

Since the street name can be derived from the latitude and longitude via geotagging, and the StreetName column contains only null values, you can remove the useless StreetName column. The ingestion time properties should be:

Column Filter - 6.3.3 Scenarios and Examples 2.png

Then the result should be similar to the table below (without the StreetName column):

Column Filter - 6.3.3 Scenarios and Examples 3.png

Back to top


Time Stamper

Description

This data table filter is used to mark the ingested data with a timestamp value by adding a new column to the data table.

Ingestion Time Properties

Time Stamper - 6.4.2 Ingestion Time Properties.png

Scenarios and Examples

There is an output data table, similar to the one below:

Time Stamper - 6.4.3 Scenarios and Examples.png

Since the data table does not contain an ingestion time field, you can add a timestamp column. The ingestion time properties should look similar to the table below:

Time Stamper - 6.4.3 Scenarios and Examples 2.png

Then after applying this data table filter, the result should look similar to the table below (please note that the timestamp is the same value within the same data table):

Time Stamper - 6.4.3 Scenarios and Examples 3.png

Back to top


Keyword Extractor

Description

Keyword extractor is used to extract keywords from the text stored in a column of a data table.

Ingestion Time Properties

Keyword Extractor - 6.5.2 Ingestion Time Properties.png

Scenarios and Examples

If you want to extract keywords from email content, you need to provide a CSV file containing the list of keywords of interest. Each line holds one keyword together with its weight. An example of the CSV file is as follows:

Keyword Extractor - 6.5.3 Scenarios and Examples.png

You also need to specify whether the keyword extraction is case sensitive or not. For an email data table, the source column name should be the column name for the email body. The keywords column name is used to specify the newly added column where the extracted keywords should be stored.
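The extraction described above can be sketched as a case-insensitive substring scan over the source column. The keyword list and email body below are invented (the guide's sample CSV is in the image), and the real plugin may match more cleverly than plain substrings:

```python
import csv, io

# A stand-in keyword/weight CSV (the guide's sample is in the image above).
keyword_csv = "traffic,5\naccident,8\nflood,6\n"

def extract_keywords(text, keyword_csv, case_sensitive=False):
    """Sketch: report each configured keyword found in the source column."""
    haystack = text if case_sensitive else text.lower()
    found = []
    for word, weight in csv.reader(io.StringIO(keyword_csv)):
        needle = word if case_sensitive else word.lower()
        if needle in haystack:
            found.append((word, int(weight)))
    return found

body = "Heavy Traffic after an accident on 5th Avenue."
print(extract_keywords(body, keyword_csv))
```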

Back to top


Date/Time Adjuster

Description

The date/time adjuster is used to convert date/time values from a specified time zone to UTC. The ingestion pipeline always tries to convert date/time values to UTC before they are stored in CAM, but if the source date/time doesn’t contain time zone information, the converted time will not be correct. In that case, you need to explicitly apply this plugin to help the pipeline with the date/time conversion.

Ingestion Time Properties

(image macro unresolved: ingestion time properties)

Scenarios and Examples

Scenario: You have the data source of a CSV file, and there is one date/time column named “timestamp”. The timestamp values don’t include the time zone information, but you know it is Eastern Standard Time.

Configuration Example:
Columns to Adjust: timestamp, Eastern Standard Time
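The adjustment itself can be sketched with Python's zoneinfo: localize the naive timestamp to the configured zone, then convert to UTC. Note zoneinfo uses IANA names such as "America/New_York", whereas the accelerator's property uses the Windows name "Eastern Standard Time":

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Sketch of the adjustment: the source "timestamp" carries no zone info,
# the configuration says it is Eastern time, so localize then convert to UTC.
def to_utc(naive_text, zone="America/New_York"):
    naive = datetime.strptime(naive_text, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=ZoneInfo(zone)).astimezone(ZoneInfo("UTC"))

print(to_utc("2015-02-17 08:00:00"))  # 13:00 UTC: February is EST, UTC-5
```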

Back to top


Regex Validator

Description

This data table filter is used to validate the output data table’s values against an expected pattern, to make sure no corrupted data is inserted into the solution accelerator. If the data does not match the expected pattern, the whole ingestion pipeline stops and reports an error.

Ingestion Time Properties

Regex Validator - 6.7.2 Ingestion Time Properties.png

Scenarios and Examples

There is an output data table similar to the table below:

Regex Validator - 6.7.3 Scenarios and Examples.png

Since the raw data table contains a column named “Age” with a string data type, you may later want to change its type to integer, and you do not expect it to contain abnormal values. The ingestion pipeline’s validation capability covers this. The ingestion time properties should be defined as:

Regex Validator - 6.7.3 Scenarios and Examples 2.png

“Age,NotNull,\d{1,3}” means you expect the value in the ‘Age’ column to be one to three digits and never null. In the example above, the age of User1 (“-1”) and the age of User2 (“forty-one”) are not expected values, so the pipeline will reject them.
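The rule “Age,NotNull,\d{1,3}” can be sketched with Python's re module; the error-raising behaviour stands in for the pipeline stopping and reporting an error:

```python
import re

# Sketch of a rule like "Age,NotNull,\d{1,3}".
def validate(rows, column, pattern, not_null=True):
    """Raise if any value is null or does not fully match the pattern."""
    for row in rows:
        value = row.get(column)
        if value is None:
            if not_null:
                raise ValueError(f"{column} is null")
            continue
        if not re.fullmatch(pattern, value):
            raise ValueError(f"{column}={value!r} does not match {pattern}")

validate([{"Age": "41"}, {"Age": "7"}], "Age", r"\d{1,3}")  # passes silently
try:
    validate([{"Age": "-1"}, {"Age": "forty-one"}], "Age", r"\d{1,3}")
except ValueError as e:
    print("rejected:", e)
```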

Back to top


Strong Type Transformer

Description

This data table filter is used to convert an existing column’s values to a new data type.

Ingestion Time Properties

Strong Type Transformer - 6.8.2 Ingestion Time Properties.png

Scenarios and Examples

There is an output data table similar to the table below:

Strong Type Transformer - 6.8.3 Scenarios and Examples.png

You may want to convert the Age column’s data type from string to integer and the Birthday column’s data type from string to date/time. The ingestion time properties should be defined as:
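The conversion can be sketched as a per-column type map applied to each row. The type names, date format, and sample values below are invented for the sketch; the plugin's own type syntax is shown in the property images:

```python
from datetime import datetime

# Hypothetical conversion rules: Age -> int, Birthday -> date/time.
CONVERTERS = {
    "int": int,
    "datetime": lambda s: datetime.strptime(s, "%Y-%m-%d"),
}

def transform(rows, rules):
    """Convert each named column's string value to the requested type."""
    for row in rows:
        for column, type_name in rules.items():
            row[column] = CONVERTERS[type_name](row[column])
    return rows

rows = [{"Name": "User1", "Age": "41", "Birthday": "1974-02-17"}]
out = transform(rows, {"Age": "int", "Birthday": "datetime"})
print(type(out[0]["Age"]).__name__, out[0]["Birthday"].year)
```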

Strong Type Transformer - 6.8.3 Scenarios and Examples 2.png

Then after applying this data table filter, the result should look similar to the table below:

Strong Type Transformer - 6.8.3 Scenarios and Examples 3.png

Back to top

Last edited Feb 17, 2015 at 8:51 AM by gheadd, version 10