Assistance with Open Source adoption

Syndicate content Talend Real-Time Open Source Data Integration Software
Updated: 19 hours 27 min ago

Introducing Talend API Services: Providing Best in Class Purpose-Built Applications

Thu, 11/15/2018 - 12:50

Have you heard?  Talend Fall ’18 is here and continues on Talend’s plan to meet the challenges of today’s data professionals around organizing, processing and sharing data at scale. Earlier Jean-Michel Franco wrote up about the Data Catalog portion of this exciting Fall 2018 launch. In this blog I’d like to focus on our new API features.

For many organizations, APIs are no longer just technological creations of engineers to connect components of a distributed system. Today, APIs are directly driving business revenues, enabling innovation, and are the source of connectivity with partners.

With our Fall ‘18 Launch, Talend Cloud will include a new API delivery platform, Talend Cloud API Services. Our delivery platform provides best in class purpose-built applications for API-first creation, testing, and documentation. Essentially, this platform enables organizations to be more agile in their API development. The platform provides productivity gains compared to hand coding through easy to use graphical design supporting both technical and less-technical personal.

Additional enhancements to existing tools for API implementation and operation ensures organizations have a comprehensive approach to building user-friendly data APIs. And finally, Talend Cloud’s full support for open standards such as OAS, Swagger, and RAML makes the Talend API delivery platform complementary to existing third-party API gateways and catalogs. Allowing for easy implementation with your existing gateway or catalog.

Talend Cloud API Designer

With Fall ’18, Talend Cloud provides a new purpose-built application for designing API contracts visually instead of having to go the traditional route of hand coding. Developers can start from scratch or import an existing OAS / RAML definition. The interface allows developers to define API data types, resources, operations, parameters, responses, media types, and errors.

Once a developer is finished defining their contract, the API designer will generate the OAS / RAML definition for you! This can be used later as part of the service(s) creation or imported into an existing API gateway / API catalog. I took a quick screenshot to show what the interface is going to look like.

Now I know how much everyone likes to write documentation (or maybe not). Thankfully with the API designer, the basic documentation is auto-generated for you. Users can then host it on Talend Cloud and easily share it with end consumers in a public or private mode. Talend Cloud API Designer also provides users with the ability to extend the generated documentation through an included rich text editor. Below is an example of the documentation generated by Talend Cloud API Designer.

Talend Cloud API Designer also provides Automatic API mocking that can act as a live preview for end consumers, decoupling support for consumer application development while the backend services are developed. Mocked API’s can return data specified during the contract design or automatically generate the data based on the defined data structure.

This mock is kept up to date throughout the development process and enabled using the interface below. This will be a huge benefit for the consumers of my API, they won’t have to wait for me to finish building the back end before they start writing their applications. It’s pretty easy to turn on inside API designer. A single click and users are off to the races.

Talend Cloud API Tester

Fall ’18 also includes a new application within Talend Cloud called Talend Cloud API Tester. Though this interface, QA / DevOps teams can easily call and inspect any HTTP based API. It works with complex JSON or XML responses enabling teams to validate the API’s behavior. Calls made are stored so I can easily look back into my history for what I’ve done before. An example of the interface is shown below.

My favorite feature of Talend Cloud API Tester is the ability to chain API calls together to create scenarios. These chained requests can utilize data returned from a previous call as parameters in the next call enabling teams to create real-world examples of how the API will be used. Thankfully this will keep my notepad++ tabs down to a minimum. An example of this scenario design is provided below.

Throughout the testing process, I can define assertions to help validate API responses. Responses can check payloads for completeness, how timely a response was or even if a field has a specific value. Here’s an example of an assertion I made recently.

The last benefit I’d like to highlight is that test cases created using API Tester can be exported for use within a continuous integration / continuous delivery pipeline ensuring further updates to the API’s maintain consistency.

Talend Studio for API implementation

We’ve made it simple to start working with the API’s built-in API designer. There’s a new metadata group called REST API definitions.  A couple clicks and I’ve downloaded the API and am ready to start building.

We can use this contract to bootstrap a service using the contract’s defined URIs, media types, parameters, etc. This approach expedites delivery of the backend service by reducing the complexity of defining the various functions.

There are also some updates to Talend Data Mapper, I can use the defined schema from the API definition as the return schema from TDM!  It is a lot easier converting data into the expected media type/structure. 

Talend Cloud for API Operation     

Yep, Talend cloud can now manage the services you’ve built in the studio, just like we manage data integration jobs. If this is your first time hearing about Talend Cloud, it provides a managed environment that enables developers to publish services developed in Talend studio into the Talend Management Console’s artifact repository or a third-party repository and manage the various environments the service needs to be deployed as part of the QA / DevOps process. An example of this management can be seen in the snippet

As you can see there is a mountain of functionality available in the new Talend Cloud API services. If you’d like to see more of this stay tuned we have a series of videos/enablement material to get you all up to speed.

I look forward to hearing about the API’s you plan to build and stay tuned to upcoming blogs from Talend if you’re looking for some inspiration. I’ll be following this blog up with a series of use cases we’ve seen and are hearing about!

The post Introducing Talend API Services: Providing Best in Class Purpose-Built Applications appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Simplifying Data Warehouse Optimization

Sat, 11/10/2018 - 13:04

When I hear the phrase “Data Warehouse Optimization”, shivers go down my spine.  It sounds like such a complicated undertaking.  After all, data warehouses are big, cumbersome and complex systems that can store terabytes and even petabytes of data that people depend on to make important decisions on the way their business is run.  The thought of any type of tinkering with such an integral part of a modern business would make even the most seasoned CIO’s break out into cold sweats.

However, the value of optimizing a data warehouse isn’t often disputed.  Minimizing costs and increasing performance are mainstays on the “to-do” lists of all Chief Information Officers.  But that is just the tip of the proverbial iceberg.  Maximize availability.  Increase data quality.  Limit data anomalies.  Eliminate depreciating overhead.  These are the challenges that become increasingly more difficult to achieve when stuck with unadaptable technologies and confined by rigid hardware specifications.

The Data Warehouse of the Past

Let me put it into some perspective.  Not long ago many of today’s technologies (i.e. Big Data Analytics, Spark engines for processing and Cloud Computing and storage) didn’t exist,  yet the reality of balancing the availability of quality data with the efforts required to cleanse and load the latest information proved a constant challenge.  Every month, IT was burdened with loading the latest data into the data warehouse for the business to analyze.  However, often the loading itself took days to complete and if the load failed, or worse, the data warehouse became corrupted, recovery efforts could take weeks.  By the time last month’s errors were corrected, this month’s data needed to be loaded. 

It was an endless cycle that produced little value.  Not only was the warehouse out-of-date with its information, but it was also tied up in data loading and data recovery processes, thus making it unavailable to the end user.  With the added challenges of today’s continuously increasing data volumes, a wide array of data sources and more demands from the business for real-time data in their analysis, the data warehouse needs to be a nimble and flexible repository of information, rather than a workhorse of processing power.

Today’s Data Warehouse Needs

In this day and age, CIO’s can rest easy knowing that optimizing a data warehouse doesn’t have to be so daunting.  With the availability of Big Data Analytics, lightning-quick processing with Apache Spark, and the seemingly limitless and instantaneous scalability of the cloud, there are surely many approaches one can take to address the optimization conundrum.  But I have found the most effective approach to simplifying data warehouse optimization (and providing the biggest return on investment) is to remove unnecessary processing (i.e. data processing, transformation and cleansing) from the warehouse itself.  By removing the inherent burden of ETL processes, the warehouse has nearly instantaneously increased availability and performance.  This is commonly referred to as “Offloading ETL”. 

This isn’t to say that the data doesn’t need to be processed, transformed and cleansed.  On the contrary, data quality is of utmost importance.  But relying on the same systems that serve up the data to be responsible for processing and transforming the data is robbing the warehouse of its sole purpose; providing accurate, reliable and up-to-date analysis to end-users in a timely fashion, with minimal downtime.  By utilizing Spark and it’s in-memory processing architecture, you can shift the burden of ETL onto other in-house servers designed for such workloads. Or better yet, shift the processing to the cloud’s scalable infrastructure and not only optimize your data warehouse, but ultimately cut IT spend by eliminating the capital overhead of unnecessary hardware.

Talend Big Data & Machine Learning Sandbox

In the new Talend Big Data and Machine Learning Sandbox, one such example illustrates how effective ETL Offloading can be.  Utilizing Talend Big Data and Spark, IT can work with business analysts to perform Pre-load analytics – analyzing the data in its raw form, before it is loaded into a warehouse – in a fraction of the time of standard ETL.  Not only does this give business users insight into the quality of the data before it is loaded into the warehouse, it also allows IT a sort of security checkpoint to prevent poor data from corrupting the warehouse and causing additional outages and challenges.

Optimizing a data warehouse can surely produce a fair share of challenges.  But sometimes the best solution doesn’t have to be the most complicated.  That is why Talend offers industry leading data quality, native Spark connectivity and subscription-based affordability, giving you a jump-start on your optimization strategy.  Further, Data Integration tools need to be as nimble as the systems they are integrating.  Therefore, leveraging Talend’s future-proof architecture means you will never be out of style with the latest technology trends; giving you piece of mind that today’s solutions won’t become tomorrow’s problems.

Download the Talend Big Data and Machine Learning Sandbox today and dive into our cookbook

The post Simplifying Data Warehouse Optimization appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

4 Ways You Should be Using the Talend tMap Component

Fri, 11/09/2018 - 12:36

At Talend one of my “Shadow IT” jobs is reporting on the component usage. If you have ever used Talend Studio (either the open source Talend Open Studio or the commercial version you most likely know the component tMap

It is the most used component by a long shot. Why? Simply put, it’s because it is extremely versatile and useful.  When I was first asked to write this article I thought, “Why not, this will be easy.”  But, as I started actually going about picking out use cases I quickly realized that there are so many more than just 4 to choose from! I actually challenge everyone to respond and tell me below what you think the top features are of the tMap component. I would love to hear from you.

I will start out by listing the most obvious but most needed and then get into some more advanced uses.  (I will try and sneak a couple in together so I can have more than 4, hopefully, our editor doesn’t catch it.)

Editors note: I did catch it, but that’s just fine Mark.

#1 Mapping, of course. 

The tMap’s most basic use is to map inputs to outputs.  This can be as simple as Source fields to Target fields of the data integration job.  It can also be from some other input components like aggregators, matching or data quality components. 

With the tMap you can also limit the fields mapped from left to right, basically filtering unneeded columns.  You can create new columns coming out of the tMap, say for example adding sequence keys, or concatenating multiple input columns into a new column, for example address fields into one column to make a single mailing data field.   This leads into the next big use case I want to cover….

#2 Expression Builder. 

Within the tMap on any column or variable you can open the expression builder wizard where you get access to hundreds of Talend functions, and if you can’t find a Talend function that meets your needs then you can fall back on a native Java function (don’t fear, if you don’t know Java just Google it).

If you happen to know some Java, you can easily build custom Java routines which will then be available within the expression builder.   Also, expression builder allows you do complex math on multiple fields if needed. You can extract parts of dates, do data conversions, even case statements to build in “what if” logic.  The same conditional statements can be used to determine if a row should pass through the tMap at all, acting as a filter.  As you can see with the tMap Expression Builder you get great transformation powers.

#3 Lookups. 

The tMap is where you can do what many Data Integration specialists refer to as “lookups” on data. For those unfamiliar, this is basically joining data from one source to another source.  The tMap has a lot of functionality on lookups, like inner join or outer join, reject if join is not found, cache the lookup data and much more.  Lookups are critical to data transformation process as you often need to pull in reference data or get expanded views of records. 

With Talend the lookup source can be anything that can be sourced into a Talend job.  This has almost endless possibilities.  To illustrate, let’s imagine a multi-cloud scenario quickly. Let’s say you have customer data in AWS on S3 and some other critical data on Azure Blob storage. in a single Talend job using tMap and lookups you can easily join the two sources together and write your data anywhere you need, like say Google BigQuery just to be crazy!

#4 Route Multiple Outputs. 

The tMap can only have one input (not counting Lookups) but you can have multiple outputs with any number of columns as outputs on each stream. 

This becomes a fast and powerful way to route errors down a different flow or to just duplicate the data flows to different streams one flow can go to an aggregation component while another output could go direct to your target outputs and a third have a conditional statement looking for errors.  All this becomes extremely useful as your data flows become complex with multiple outputs, error processing and conditional outputs.


There you have it, a quick and (hopefully) helpful introduction to our most popular component in Talend. If you want some more tMap knowledge, let me know in the comments below and I’ll spin up my next article around some more advanced mapping functionalities. Happy connecting!


The post 4 Ways You Should be Using the Talend tMap Component appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

It’s Official! Talend to Welcome Stitch to the Family!

Wed, 11/07/2018 - 16:12
The Acquisition

Today, Talend announced it has signed a definitive agreement to acquire Stitch, a self-service cloud data integration company that provides an exceptionally quick, easy, and intuitive experience in modern cloud environments. With Stitch, Talend will be able to provide both SMB and enterprise customers with a highly efficient way to move data from cloud sources to cloud data warehouses and buy in a frictionless manner.

As companies standardize on using the cloud for analytics, Talend Cloud has become a compelling solution for customers of all sizes to meet their complete data and application needs. We have developed Talend Cloud into the ideal choice by bringing together a broad range of functionality in one platform – from native big data and embedded data quality to enterprise-level CI/CD capabilities and data governance. Now, with the addition of Stitch, Talend customers have an even more comprehensive tool to complete their cloud-first and digital transformation mission.

Why Stitch?

To improve customer experiences, companies need to collect and analyze vast amounts of data across cloud and on-premises systems. When we look deeper at the challenge, it is often the data analysts and business analysts to data engineers who want to want to collect data from their cloud apps, such as Salesforce, Marketo and Google Analytics, and put it into a cloud data warehouse such as Amazon Redshift and Snowflake. And when we look across a company, each department, from marketing to finance to HR to manufacturing has this need to collect more data and derive more insight.

Unfortunately, many companies struggle to efficiently collect data which means they cannot reach their data-driven potential. There may be an IT bottleneck, it may take time to get the systems up and running, or the line of business may be doing it with inefficient hand-coding. This is the problem that Stitch addresses, it provides self-service tools, in the cloud, that automate loading data into cloud data warehouses. And the process of getting started and loading data just takes minutes. So now anyone in a company can easily and quickly load data into a cloud data lake or data warehouse.

At Talend, we saw this emergence of a new data integration category and how it would immediately benefit our customers. Talend provides tools that address all types of integration complexity, where you build data pipelines to collect, govern, transform and share data.  Stitch provides a complementary solution that will enable many more people in an organization to collect more data, which can then be governed, transformed and shared with Talend, which will mean faster and better insight for all.

Stitch is Available for Free Trial Now

Over the next few months, we will build out more features and services that are part of our focus on addressing any integration use cases by connecting any data and application with Talend Cloud in a seamless and frictionless manner.

Stitch is available for purchase or evaluation today. Sign up for a free trial at For complex integration use cases, try Talend Cloud for 30 days for free at

The post It’s Official! Talend to Welcome Stitch to the Family! appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Continuous Integration Best Practices – Part 2

Tue, 10/30/2018 - 11:36

This is the second part of my blog series on CI/CD best practices. For those of you who are new to this blog, please refer to Part 1 of the same series and for those who want to see the first 10 best practices. Also, I want to give a big thank you for all the support and feedback! In my last blog, we saw the first ten best practices when working with Continuous integration. In this blog, I want to touch on some more best practices. So, with that, let’s jump right in!

Best Practice 11 – Run Your Fastest Tests First

The CI/CD system serves as a channel for all changes entering your system and hence discovering failures as early as possible is important to minimize the resources devoted to problematic builds. It is recommended you to prioritize and run your fastest tests first and save the complex, long-running tests for later. Once you have validated the build with smaller, quick-running tests and have ensured that the build looks good initially, you could test your complex and long-running test cases.

Following this best practice will keep your CI/CD process healthy by enabling you to understand the performance impact of individual tests as well as complete most of your tests early. It also increases the likelihood of fast failures. Which means that problematic changes can be reverted or fixed before blocking another members’ work.

Best Practice 12 – Minimize Branching

One of the version control best practices is integrating changes into the parent branch/shared repository early and often. This helps avoid costly integration problems down the line when multiple developers attempt to merge large, divergent, and conflicting changes into the main branch of the repository in preparation for release.

To take advantage of the benefits that CI provides, it is best to limit the number and scope of branches in your repository. It is suggested that developers merge changes from their local branches to remote branch at least once a day. The branches which are not being tracked by your CI/CD systems contain untested code and it should be discarded as it is a threat to the stable code.

Best Practice 13 – Run Tests Locally

Another best practice (and it is related to my earlier point about discovering failures early) is that teams should be encouraged to run as many tests locally prior to committing to the shared repository. This ensures that the piece of code you are working on is good and it will also make it easy to recognize unforeseen problems after integrating the code to the master code.

Best Practice 14 – Write extensive test cases

To be successful with CI/CD one needs a test suit to test every code which gets integrated. As a best practice it is recommended to develop a culture of writing tests for each code change which is integrate into the master branch. The test cases should include unit, integration, regression, smoke tests or any kind of test which covers the end to end project.

Best Practice 15 – Rollback

This is probably the most important part of any implementation. As a next best practice, always think of an easy way to roll back the changes if something goes wrong. Normally I have seen organizations doing a rollback via redeploying the previous release or redoing the build from the stable code.

Best Practice 16 – Deploy the Same Way Every Time

I have seen organizations having multiple ways of CI/CD pipeline. It varies from using multiple tools to multiple mechanisms. Though not a very prominent one but as a best practice, I would say deploy the code same way every time to avoid unnecessary issues like configuration and maintenance across environments. If you are using same deploy method every time, then the same set of steps are triggered in all environments making it easier to understand and maintain. 

Best Practice 17 – Automate the build and deployment

Although the manual build and deployment can work, I would recommend you automate the pipeline. Automation to start with eliminates all the manual errors but further to that ensures that the development team only works on the latest source code from the repository and compiles it every time to build the final product. With lots of tools available like Jenkins, Bamboo, BitBucket etc it is very easy to automate the create the workspace, compile the code, convert it to binaries and publish it to Nexus.

Best Practice 18 – Measure Your Pipeline Success

It is a good practice to track the CI/CD pipeline success rate. You could measure these both ways, before & after you start automation and compare the results. Although the metrics for evaluating CI/CD success rate depends on organizations, some of the points to be considered are:

  • Number of jobs deployed monthly/weekly/daily to Dev/TEST/Pre-PROD/PROD
  • Number of Successful/Failure Builds
  • Time is taken for each build
  • Rollback time
Best Practice 19 – Backup

Your CI/CD pipeline has lot of process/steps involved and as the next best practice I recommend you take periodic backup of your pipeline. If using Jenkins, this can be accomplished via Backup Manager as shown below.

Best Practice 20 – Clean up CI/CD environment

Lots of builds can make your CI/CD system too clumsy and over a period of time and it might impact the overall performance. As the next best practice, I recommend you to clean the CI/CD server periodically. This cleaning could include CI pipeline, temporary workspace, nexus repositories etc.


And with this, I come to an end of the two-part series blog. I hope these best practices are helpful and you would embed these while working with CI/CD.  

The post Continuous Integration Best Practices – Part 2 appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Getting Started with Talend Open Studio: Preparing Your Environment

Mon, 10/29/2018 - 09:56

With millions of downloads, Talend Open Studio is the leading open source data integration solution. Talend makes it easy to design and deploy integration jobs quickly with graphical tools, native code generation, and hundreds of pre-built components and connectors. Sometimes people like to get some more resources they can use to get started with Talend Open Studio, so we have put some of them together in this blog and in a webinar on-demand titled “Introduction to Talend Open Studio for Data Integration.”

In this first blog, we will be discussing how to prepare your environment to ensure you have a smooth download and install process. Additionally, we will go through a quick tour of the tool so that you can what it has to offer as you begin to with Talend Open Studio.

Remember, if you want to follow along with us in this tutorial you can download Talend Open Studio here.

Getting Ready to Install Talend Open Studio

Before installing, there are a couple prerequisites to address: First, make sure you have enough memory and disk space to complete the install (see the documentation for more information). Second, make sure that you have downloaded the most recent version of the Java 8 JDK Oracle (as found within the Java SE Runtime Environment 8 Downloads page on their website) as Java 9 is not yet supported.

If you’re strictly a Mac user, you need to have Java 8 version 151 installed. To find out which version you have currently on your machine, open your command prompt and search for java-version. Here, you can see that Java 8 is already installed.

You can also discover the version a couple other ways, including from within the “About Java” section from within Java’s control panel.

So now we need to set up a JAVA_HOME environment variable. To do this, right click on This PC and open the properties. Then select Advanced System Settings and under Environment Variables, click New to create a new variable. Name your new variable JAVA_HOME, enter the path to your Java environment and click ok. Under System Variables, add the same variable information and then click ok.

Installing Talend Open Studio

Alright, now the environment is ready for an Open Studio install. If you haven’t already, go to Talend’s official download page and locate the “download free tool” link for Open Studio for Data Integration. Once the installation folder is downloaded, open it and save the extracted files to a new folder. To follow best practice, create a new folder within your C drive. From here you can officially begin the install.

Once it’s finished, open the newly created TOS folder on your C drive and drill in to locate the application link you need to launch Open Studio. If you have the correct java version installed, have enough available memory and RAM and completed all the prerequisites you should easily be able to launch Talend Open Studio on your machine.

When launching the program for the first time, you are presented with the User License Agreement, which you need to read and accept the terms. Now you’re given the chance to create a new project, import a demo project or import an existing project. To explore pre-made projects, feel free to import a demo project or to start with your own project right away, create a new project.  

Upon opening Studio for the first time, you will need to install some required additional Talend Packages—specifically the package containing all required third-party libraries. These are external modules that are required for the software to run correctly. Before clicking Finish, you must accept all licenses for each library you install.

Getting to Know Your New Tool

Next, let’s walk through some of Open Studio’s features. The initial start-up presents a demo project for you to play around with in order to get familiar with Studio. To build out this project, we need to start within the heart of Open Studio: the Repository. The Repository—found on the left side of the screen— is where data is gathered related to the technical items used to design jobs. This is where you can manage metadata (database connections, database tables, as well as columns) and jobs once you begin creating them.  

You can drag and drop this metadata from the Repository into the “Design Workspace”. This is where you lay out and design Jobs (or “flows”). You can view the job graphically within the Designer tab or use the Code tab to see the code generated and identify any possible errors.

To the right, you can access the Component Pallet, which contains hundreds of different technical components used to build your jobs, grouped together in families. A component is a preconfigured connector used to perform a specific data integration operation. It can minimize the amounts of hand-coding required to work on data from multiple heterogeneous sources.

As you build out your job, you can reference the job details within the “Component Tabs” below the Design Workspace. Each following tab displays the properties of the selected element within the design workspace. It’s here that component properties and parameters can be added or edited. Finally, next to the Components Tab, you can find the Run tab, which lets you execute your job. We hope this has been useful, and in our next blog, we will build a simple job moving data into a cloud data warehouse. Want to see more tutorials? Comment below to share what videos would be most helpful to you when starting your journey with Talend Open Studio for Data Integration.

The post Getting Started with Talend Open Studio: Preparing Your Environment appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Continuous Integration Best Practices – Part 1

Tue, 10/23/2018 - 18:00

In this blog, I want to highlight some of the best practices that I’ve come across as I’ve implemented continuous integration with Talend. For those of you who are new to CI/CD please go through the part 1 and part 2 of my previous blogs on ‘Continuous Integration and workflows with Talend and Jenkins’. This blog would also introduce you to some basic guidance on how to implement and maintain a CI/CD system. These recommendations will help in improving the effectiveness of CI/CD.

Without any further delay – let’s jump right in!

Best Practice 1 – Use Talend Recommended Architectures

For every product Talend offers, there is also a recommended architecture. For details on the recommended architecture for our different products please refer to our Git repo: . This repository has details on every product Talend offers. It’s truly a great resource, however, for this blog I am focusing only on the CI/CD aspect with Data Integration platform.

In the architecture for Talend Data Integration, it’s recommended to have a separate Software Development Life Cycle (SDLC) environment. This SDLC environment should typically consist of an artifact repository like Nexus, a version control, server tool like Git, Talend Commandline, Maven, and the Talend CI Builder Plugin. The SDLC server would typically look as follows (the optional components are marked in yellow):

As the picture shows, the recommendation suggests having separate servers for Nexus, Version control system and CI server.  One Nexus is shared across all environments. All environments access the binaries stored in Nexus for job execution.  The version control system is needed only in the development environment. This server is also not accessed from other environments.

Best Practice 2 –  Version Control System

Continuous integration requires working with a version control system and hence it is important to have a healthy, clean system for the CI work fast. Talend recommends a few best practices while working with Git. Please go through the links below on best practices while working with Git.

Best Practice 3 –  Continuous Integration Environment

Talend recommends 4 environments with a continuous integration set up (see below). The best practice is illustrated with Git, Jenkins and Nexus as an example. The GIT in the SDLC server is only accessible from the Development Environment.  Other environments cannot access the GIT.

All the coding activities take place in the development environment and are pushed to the version control system Git. The CI/CD process takes the code from Git, converts it into binaries and publishes the code to Nexus Snapshot repository. All non-prod environments have access to Nexus Release and Snapshot repository, however, the production environment has access only to the Release repository.

It is important to note that One nexus is shared among all environments.

Best Practice 4 –  Maintaining Nexus

The artifact repository Nexus plays a very vital role in continuous integration. Nexus is used by Talend not only for providing software updates/patches but is also used as an artifact repository to hold the job binaries.

These binaries are then called via the Talend Administrator Center or Talend Cloud to execute the jobs. If your Talend Studio is connected with the Talend Administration Centre, all the Nexus artifact repository settings are automatically retrieved from the Talend Administration Center. You can choose to use the retrieved settings to publish your Jobs or configure your own artifact repositories.

If your Studio is working on a local connection, all the fields are pre-filled with the locally-stored default settings.

Now, with respect to CI/CD, as a best practice, it is recommended to upload the CI builder plugin and all the external jar files used by the project to the third-party repository in Nexus. The third-party folder will look as given below once Talend ci builder is uploaded.

Best Practice 5 –  Release and Snapshot Artifacts

To implement a CI/CD pipeline it is important to understand the difference between release and snapshot Artifacts/Repositories. Release artifacts are stable and everlasting in order to guarantee that builds which depend on them are repeatable over time. By default, the Release repositories do not allow redeployment. Snapshots capture a work in progress and are used during development. By default, snapshot repositories allow redeployment.

As a best practice, it is recommended that development teams learn the difference between the repositories and to implement the pipeline in such a way that the development artifacts refer to the Snapshot repository and rest of the environments refer only to the Release Repository.

Best Practice 6 – Isolate and Secure the CI/CD Environment

The Talend Recommended Architecture talks about “isolating” the CI/CD environment. The CI/CD system typically has some of the most critical information, complete access to your source/target, has all credentials and hence it is critically important to secure and safeguard the CI/CD system.

As a best practice, the CI/CD system should be deployed to internal and protected networks. Setting up VPNs or other network access control technology is recommended to ensure that only authenticated operators are able to access the system. Depending on the complexity of your network topology, your CI/CD system may need to access several different networks to deploy code to different environments. The important point to keep in mind is that your CI/CD systems are highly valuable targets, and in many cases, they have a broad degree of access to your other vital systems. Protecting all external access to the servers and tightly controlling the types of internal access allowed will help reduce the risk of your CI/CD system being compromised.

The image given below shows Talend’s recommended architecture where the CI server, SDLC and the Version control system (git) are isolated and secured.

Best Practice 7 –  Maintain Production like environment

It is a best practice to have one environment (QA environment) as close to that of the production environment. This includes infrastructure, operating system, databases, patches, network topology, firewalls and configuration.

Having an environment close to the production environment and validating code changes in this environment helps in ensuring that the integration accurately reflects how the change would behave in production. It would identify mismatches and last-minute surprises which can be effectively eliminated, and code can be released safely to production at any time. Note that the more differences between your production environment and the QA environment, the less chances are that your tests will measure how the code will perform when released. Some differences between QA and production are expected but keeping them manageable and making sure they are well-understood is essential.

Best Practice 8 – Keep Your Continuous Integration process fast

CI/CD pipelines help in driving the changes through automated testing cycles to different environments like test, stage and finally to production. Making the CI/CD pipeline fast is very important for the team to be productive.

If a team has to wait long for the build to finish, then it defeats the whole purpose. Since all the changes must follow this CI/CD process, keeping the pipeline fast and dependable is very important or else it would diminish the purpose. There are multiple aspects to keep the CD/CD process fast. To being with the CI/CD infrastructure must be good enough not only to suit the current need but also to scale out if needed. Also, it is important to visit the test cases in a timely manner to ensure that no test cases are adding any overhead to the system

Best Practice 9 –  CI/CD should be the only way to deploy to Production

Promoting code through the CI/CD pipelines validates each change so that it fits the organization’s standards and doesn’t introduce any bugs to the existing code. Any failures in a CI/CD pipeline are immediately visible and it should stop the further code integration/deployment. This is a gatekeeping mechanism that safeguards the important environments from untrusted code.

To utilize these advantages, it is important to ensure that every change in the production environment goes through only the CI/CD pipeline. The CI/CD pipeline should be the only mechanism by which code enters the production environment. This CI/CD pipeline could be automated or via a manual trigger.

Generally, a team follows all this until a production fix or a show stopper error occurs. When the error is critical, there is a pressure to resolve them quickly. It is recommended that even in such scenarios the fix should be introduced to other environments via the CI/CD pipeline. Putting your fix through the pipeline (or just using the CI/CD system to rollback) will also prevent the next deployment from erasing an ad hoc hotfix that was applied directly to production. The pipeline protects the validity of your deployments regardless of whether this was a regular, planned release, or a fast fix to resolve an ongoing issue. This use of the CI/CD system is yet another reason to work to keep your pipeline fast.

As an example, the image below shows the option to publish the job via studio. It is not recommended to use this approach. The recommended approach is to use the pipeline. The example here shows Jenkins.

Best Practice 10 – Build binaries only once

A primary goal of a CI/CD pipeline is to build confidence in your changes and minimize the chance of unexpected impact. If your pipeline requires building, packaging, or bundling step, that step should be executed only once, and the resulting output should be reused throughout the entire pipeline. This practice helps prevent problems that arise while the code is being compiled or packaged. Also, if you have test cases written and you are building the same code for the different environments this ensures you are replicating the testing effort and time in each environment.

To avoid this problem, CI systems should include a build process as the first step in the pipeline that creates and packages the software in a clean environment/temporary workspace. The resulting artifact should be versioned and uploaded to an artifact storage system to be pulled down by subsequent stages of the pipeline, ensuring that the build does not change as it progresses through the system.


In this blog, we’ve started with CI/CD best practices. Hopefully, it has been useful. My next blog in this series will focus on some more best practice in CI/CD world so keep watching and happy reading. Until next time!

The post Continuous Integration Best Practices – Part 1 appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Introduction to the Agile Data Lake

Thu, 10/18/2018 - 09:41

Let’s be honest, the ‘Data Lake’ is one of the latest buzz-words everyone is talking about. Like many buzzwords, few really know how to explain what it is, what it is supposed to do, and/or how to design and build one.  As pervasive as they appear to be, you may be surprised to learn that Gartner predicts that only 15% of Data Lake projects make it into production.  Forrester predicts that 33% of Enterprises will take their attempted Data Lake projects off life-support.  That’s scary!  Data Lakes are about getting value from enterprise data, and given these statistics, its nirvana appears to be quite elusive.  I’d like to change that and share my thoughts and hopefully providing some guidance for your consideration on how to design, build, and use a successful Data Lake; An Agile Data Lake.  Why agile? Because to be successful, it needs to be.

Ok, to start, let’s look at the Wikipedia definition for what a Data Lake is:

“A data lake is a storage repository that holds a vast amount of raw data in its native format, incorporated as structured, semi-structured, and unstructured data.”

Not bad.  Yet considering we need to get value from a Data Lake this Wikipedia definition is just not quite sufficient. Why? The reason is simple; you can put any data in the lake, but you need to get data out and that means some structure must exist. The real idea of a data lake is to have a single place to store of all enterprise data, ranging from raw data (which implies an exact copy of source system data) through transformed data, which is then used for various business needs including reporting, visualization, analytics, machine learning, data science, and much more.

I like a ‘revised’ definition from Tamara Dull, Principal Evangelist, Amazon Web Services, who says:

“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data, where the data structure and requirements are not defined until the data is needed.”

Much better!  Even Agile-like. The reason why this is a better definition is that it incorporates both the prerequisite for data structures and that the stored data would then be used in some fashion, at some point in the future.  From that we can safely expect value and that exploiting an Agile approach is absolutely required.  The data lake therefore includes structured data from relational databases (basic rows and columns), semi-structured data (like CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and even binary data (typically images, pictures, audio, & video) thus creating a centralized data store accommodating all forms of data.  The data lake then provides an information platform upon which to serve many business use cases when needed.  It is not enough that data goes into the lake, data must come out too.

And, we want to avoid the ‘Data Swamp’ which is essentially a deteriorated and/or unmanaged data lake that is inaccessible to and/or unusable by its intended users, providing little to no business value to the enterprise.  Are we on the same page so far?  Good.

Data Lakes – In the Beginning

Before we dive deeper, I’d like to share how we got here.  Data Lakes represent an evolution resulting from an explosion of data (volume-variety-velocity), the growth of legacy business applications plus numerous new data sources (IoT, WSL, RSS, Social Media, etc.), and the movement from on-premise to cloud (and hybrid). 

Additionally, business processes have become more complex, new technologies have recently been introduced enhancing business insights and data mining, plus exploring data in new ways like machine learning and data science.  Over the last 30 years we have seen the pioneering of a Data Warehouse (from the likes of Bill Inmon and Ralph Kimball) for business reporting all the way through now to the Agile Data Lake (adapted by Dan Linstedt, yours truly, and a few other brave souls) supporting a wide variety of business use cases, as we’ll see.

To me, Data Lakes represent the result of this dramatic data evolution and should ultimately provide a common foundational, information warehouse architecture that can be deployed on-premise, in the cloud, or a hybrid ecosystem. 

Successful Data Lakes are pattern based, metadata driven (for automation) business data repositories, accounting for data governance and data security (ala GDPR & PII) requirements.  Data in the lake should present coalesced data and aggregations of the “record of truth” ensuring information accuracy (which is quite hard to accomplish unless you know how), and timeliness.  Following an Agile/Scrum methodology, using metadata management, applying data profiling, master data management, and such, I think a Data Lake must represent a ‘Total Quality Management” information system.  Still with me?  Great!

What is a Data Lake for?

Essentially a data lake is used for any data-centric, business use case, downstream of System (Enterprise) Applications, that help drive corporate insights and operational efficiency.  Here are some common examples:

  • Business Information, Systems Integration, & Real Time data processing
  • Reports, Dashboards, & Analytics
  • Business Insights, Data Mining, Machine Learning, & Data Science
  • Customer, Vendor, Product, & Service 360

How do you build an Agile Data Lake? As you can see there are many ways to benefit from a successful Data Lake.  My question to you is, are you considering any of these?  My bet is that you are.  My next questions are; Do you know how to get there?  Are you able to build a Data Lake the RIGHT way and avoid the swamp?  I’ll presume you are reading this to learn more.  Let’s continue…

There are three key principles I believe you must first understand and must accept:

  • ⇒ A PROPERLY implemented Ecosystem, Data Models, Architecture, & Methodologies
  • ⇒ The incorporation of EXCEPTIONAL Data Processing, Governance, & Security
  • ⇒ The deliberate use of Job Design PATTERNS and BEST PRACTICES

A successful Data Lake must also be agile which then becomes a data processing and information delivery mechanism designed to augment business decisions and enhance domain knowledge.  A Data Lake, therefore, must have a managed lifecycle.  This life cycle incorporates 3 key phases:

    • Extracting raw source data, accumulating (typically written to flat files) in a landing zone or staging area for downstream processing & archival purposes
    • Loading & Transformation of this data into usable formats for further processing and/or use by business users
    • Data Aggregations (KPI’s, Data-points, or Metrics)
    • Analytics (actuals, predictive, & trends)
    • Machine Learning, Data Mining, & Data Science
    • Operational System Feedback & Outbound Data Feeds
    • Visualizations, & Reporting

The challenge is how to avoid the swamp.  I believe you must use the right architecture, data models, and methodology.  You really must shift away your ‘Legacy’ thinking; adapt and adopt a ‘Modern’ approach.  This is essential.  Don’t fall into the trap of thinking you know what a data lake is and how it works until you consider these critical points.

Ok then, let’s examine then these three phases a bit more.  Data Ingestion is about capturing data, managing it, and getting it ready for subsequent processing.  I think of this like a box crate of data, dumped onto the sandy beach of the lake; a landing zone called a ’Persistent Staging Area’.  Persistent because once it arrives, it stays there; for all practical purposes, once processed downstream, becomes an effective archive (and you don’t have to copy it somewhere else).  This PSA will contain data, text, voice, video, or whatever it is, which accumulates.

You may notice that I am not talking about technology yet.  I will but, let me at least point out that depending upon the technology used for the PSA, you might need to offload this data at some point.  My thinking is that an efficient file storage solution is best suited for this 1st phase.

Data Adaptation is a comprehensive, intelligent coalescence of the data which must adapt organically to survive and provide value.  These adaptations take several forms (we’ll cover them below) yet essentially reside 1st in a raw, lowest level of granulation, data model which then can be further processed, or as I call it, business purposed, for a variety of domain use cases.  The data processing requirements here can be quite involved so I like to automate as much of this as possible.  Automation requires metadata.  Metadata management presumes governance.  And don’t forget security.  We’ll talk about these more shortly.

Data Consumption is not just about business users, it is about business information, the knowledge it supports, and hopefully, the wisdom derived from it.  You may be familiar with the DIKW Pyramid; Data > Information > Knowledge > Wisdom.  I like to insert ‘Understanding’ after ‘Knowledge’ as it leads wisdom.

Data should be treated as a corporate asset and invested as such.  Data then becomes a commodity and allows us to focus on the information, knowledge, understanding, and wisdom derived from it.  Therefore, it is about the data and getting value from it. 

Data Storage Systems: Data Stores

Ok, as we continue to formulate the basis for building a Data Lake, let’s look at how we store data.  There are many ways we do this.  Here’s a review:

    • ROW: traditional Relational Database System (RDBMS) (ie: Oracle, MS SQL Server, MySQL, etc)
  • COLUMNAR: relatively unknown; feels like a RDBMS but optimized for Columns  (ie: Snowflake, Presto, Redshift, Infobright, & others)
  • NoSQL – “Not Only SQL”:
    • Non-Relational, eventual consistency storage & retrieval systems (ie: Cassandra, MongoDB, & more)
    • Distributed data processing framework supporting high data Volume, Velocity, & Variety (ie: Cloudera, Hortonworks, MapR, EMR, & HD Insights)
  • GRAPH – “Triple-Store”:
    • Subject-Predicate-Object, index-free ‘triples’; based upon Graph theory (ie: AlegroGraph, & Neo4J)
    • Everything else under the sun (ie: ASCII/EBCDIC, CSV, XML, JSON, HTML, AVRO, Parquet)

There are many ways to store our data, and many considerations to make, so let’s simplify our life a bit and call them all ‘Data Stores’, regardless of them being Source, Intermediate, Archive, or Target data storage.  Simply pick the technology for each type of data store as needed.

Data Governance

What is Data Governance?  Clearly another industry enigma.  Again, Wikipedia to the rescue:

Data Governance is a defined process that an organization follows to ensure that high quality data exists throughout the complete lifecycle.”

Does that help?  Not really?  I didn’t think so.  The real idea of data governance is to affirm data as a corporate asset, invest & manage it formally throughout the enterprise, so it can be trusted for accountable & reliable decision making.  To achieve these lofty goals, it is essential to appreciate Source through Target lineage.  Management of this lineage is a key part of Data Governance and should be well defined and deliberately managed.  Separated into 3 areas, lineage is defined as:

  • ⇒ Schematic Lineage maintains the metadata about the data structures
  • ⇒ Semantic Lineage maintains the metadata about the meaning of data
  • ⇒ Data Lineage maintains the metadata of where data originates & its auditability as it changes allowing ‘current’ & ‘back-in-time’ queries

It is fair to say that a proper, in-depth discussion on data governance, metadata management, data preparation, data stewardship, and data glossaries are essential, but if I did that here we’d never get to the good stuff.  Perhaps another blog?  Ok, but later….

Data Security

Data Lakes must also ensure that personal data (GDPR & PII) is secure and can be removed (disabled) or updated upon request.  Securing data requires access policies, policy enforcement, encryption, and record maintenance techniques.  In fact, all corporate data assets need these features which should be a cornerstone of any Data Lake implementation.  There are three states of data to consider here:

  • ⇒ DATA AT REST in some data store, ready for use throughout the data lake life cycle
  • ⇒ DATA IN FLIGHT as it moves through the data lake life cycle itself
  • ⇒ DATA IN USE perhaps the most critical, at the user-facing elements of the data lake life cycle

Talend works with several technologies offering data security features.  In particular, ‘Protegrity Cloud Security’ provides these capabilities using Talend specific components and integrated features well suited for building an Agile Data Lake.  Please feel free to read “BUILDING A SECURE CLOUD DATA LAKE WITH AWS, PROTEGRITY AND TALEND” for more details.  We are working together with some of our largest customers using this valuable solution.

Agile Data Lake Technology Options

Processing data into and out of a data lake requires technology, (hardware/software) to implement.  Grappling with the many, many options can be daunting.  It is so easy to take these for granted, picking anything that sounds good.  It’s only after or until better understanding the data involved, systems chosen, and development efforts does one find that the wrong choice has been made.  Isn’t this the definition of a data swamp?  How do we avoid this?

A successful Data Lake must incorporate a pliable architecture, data model, and methodology.  We’ve been talking about that already.  But picking the right ‘technology’ is more about the business data requirements and expected use cases.  I have some good news here.  You can de-couple the data lake designs from the technology stack.  To illustrate this, here is a ‘Marketecture’ diagram of depicting the many different technology options crossing through the agile data lake architecture.

As shown above, there are many popular technologies available, and you can choose different capabilities to suit each phase in the data lake life cycle.  For those who follow my blogs you already know I do have a soft spot for Data Vault.  Since I’ve detailed this approach before, let me simply point you to some interesting links:

You should know that Dan Linstedt created this approach and has developed considerable content you may find interesting.  I recommend these:

I hope you find all this content helpful.  Yes, it is a lot to ingest, digest, and understand (Hey, that sounds like a data lake), but take the time.  If you are serious about building and using a successful data lake you need this information.

The Agile Data Lake Life Cycle

Ok, whew – a lot of information already and we are not quite done.  I have mentioned that a data lake has a life cycle.  A successful Agile Data Lake Life Cycle incorporates the 3 phases I’ve described above, data stores, data governance, data security, metadata management (lineage), and of course: ‘Business Rules’.  Notice that what we want to do is de-couple ‘Hard’ business rules (that transform physical data in some way) from ‘Soft’ business rules (that adjust result sets based upon adapted queries).  This separation contributes to the life cycle being agile. 

Think about it, if you push physical data transformations upstream then when the inevitable changes occur, the impact is less to everything downstream.  On the flip side, when the dynamics of business impose new criteria, changing a SQL ‘where’ clauses downstream will have less impact on data models it pulls from.  The Business Vault provides this insulation from the Raw Data Vault as it can be reconstituted when radical changes occur.

Additionally, a Data Lake is not a Data Warehouse but in fact, encapsulates one as a use case.  This is a critical takeaway from this blog.  Taking this further, we are not creating ‘Data Marts’ anymore, we want ‘Information Marts’.   Did you review the DIKW Pyramid link I mentioned above?  Data should, of course, be considered and treated as a business asset.  Yet simultaneously, data is now a commodity leading us to information, knowledge, and hopefully: wisdom.

This diagram walks through the Agile Data Lake Life Cycle from Source to Target data stores.  Study this.  Understand this.  You may be glad you did.  Ok, let me finish to say that to be agile a data lake must:

    • Data Models should be additive without impact to existing model when new sources appear
    • Especially for Big Data technologies where Updates & Deletes are expensive
    • Hybrid infrastructures can offer extensive capabilities
    • Metadata, in many aspects, can drive the automation of data movement
    • A key aspect of Data Lineage

And finally, consider that STAR Schemas are, and always were, designed to be ‘Information Delivery Mechanisms’, a misunderstanding some in the industry has fostered for many years.  For many years we have all built Data Warehouses using STAR schemas to deliver reporting and business insights.  These efforts all too often resulted in raw data storage of the data warehouse in rigid data structures, requiring heavy data cleansing, and frankly high impact when upstream systems are changed or added. 

The cost in resources and budget has been a cornerstone to many delays, failed projects, and inaccurate results.  This is a legacy mentality and I believe it is time to shift our thinking to a more modern approach.  The Agile Data Lake is that new way of thinking.  STAR schemas do not go away, but their role has shifted downstream, where they belong and always intended for.


This is just the beginning, yet I hope this blog post gets you thinking about all the possibilities now.

As a versatile technology and coupled with a sound architecture, pliable data models, strong methodologies, thoughtful job design patterns, and best practices, Talend can deliver cost-effective, process efficient and highly productive data management solutions.  Incorporate all of this as I’ve shown above and not only will you create an Agile Data Lake, but you will avoid the SWAMP!

Till next time…

The post Introduction to the Agile Data Lake appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Introducing Talend Data Catalog: Creating a Single Source of Trust

Tue, 10/16/2018 - 06:07

Talend Fall ’18 is here! We’ve released a big update to the Talend platform this time around including support for APIs, as well as new big data and serverless capabilities. You will see blogs from my colleagues to highlight those major new product and features introductions. On my side, I’ve been working passionately to introduce Talend Data Catalog, which I believe has the potential to change the way data is consumed and managed within our enterprise. Our goal with this launch is to help our customers deliver insight-ready data at scale so they can make better and faster decisions, all while spending less time looking for data or making decisions with incomplete data.

You Can’t Be Data Driven without a Data Catalog

Before we jump into features, let’s look at why you need a data catalog. Remember the early days of the Internet? Suddenly, it became so easy and cheap to create content and publish it to anyone that everybody actually did it. Soon enough, that created a data sprawl, and the challenge was not any more to create content but to find it. After two decades we know that winners in the web economy are those that created a single point of access to content in their category: Google, YouTube, Baidu, Amazon, Wikipedia.

Now, we are faced with a similar data sprawl in out data-driven economy. IDC research has found that today data professionals are spending 81% of their time searching, preparing, and protecting data with little time left to turn it into business outcomes. It has become crucial that organizations establish this same single source of access to their data to be in the winner’s circle.

Although technology can help to fix the issue, and I’ll come back on it later in the article, among these, enterprises need to set up a discipline to organize their data at scale, and this discipline is called data governance. But traditional data governance must be re-invented with this data sprawl:  according to Gartner, “through 2022, only 20% of organizations investing in information will succeed in scaling governance for digital business.” Given the sheer number of companies that are awash in data, that percentage is just too small.

Modern data governance is not only about minimizing data risks but also about maximizing data usage, which is why traditional authoritative data governance approaches are not enough. There is a need for a more agile, bottom-up approach. That strategy starts with the raw data, links it to its business context so that it becomes meaningful, takes control of its data quality and security, and fully organizes it for massive consumption.

Empowering this new discipline is the promise of data catalogs, leveraging modern technologies like smart semantics and machine learning to organize data at scale and turns data governance into a team sport by engaging anyone for social curation. 

With the newly introduced Talend Data Catalog, companies can organize their data at scale to make data accessible like never before and address challenges head-on. By empowering organizations to create a single source of trusted data, it’s a win for both the business with the ability to find the right data, as well as the CIO and CDO who can now control data better to improve data governance. Now let’s dive into some details on what the Talend Data Catalog is.

Intelligently discover your data

Data catalogs are a perfect fit for companies that modernized their data infrastructures with data lakes or cloud-based data warehouses, where thousands of raw data items can reside and can be accessed at scale. The catalog acts as the fish finder for that data lake, leveraging crawlers across different file systems, traditional, Hadoop, or cloud, and across typical file format. Then automatically extracts metadata and profiling information, for referencing, change management classification and accessibility.

Not only can it bring all of those metadata together in a single place, but it can also automatically draw the links between datasets and connect them to a business glossary. In a nutshell, this allows businesses to:

  • Automate the data inventory
  • Leverage smart semantics for auto-profiling, relationships discovery and classification
  • Document and drive usage now that the data has been enriched and becomes more meaningful

The goal of the data catalog is to unlock data from the application where they reside.

Orchestrate data curation

Once the metadata has been automatically harvested in a single place, data governance can be orchestrated in a much more efficient way. Talend Data Catalog allows businesses to define the critical data elements in its business glossary and assign data owners for those critical data elements. The data catalog then relates those critical data elements to the data points that refer it across the information system.

Now data is in control and data owners can make sure that their data is properly documented and protected. Comments, warnings, or validation can be crowdsourced from any business user for collaborative, bottom-up governance. Finally, the data catalog draws end-to-end data lineage and manages version control. It guarantees accuracy and provides a complete view of the information chain, which are both critical for data governance and data compliance.

Easy search-based access to trusted data

Talend Data Catalog makes it possible for businesses to locate, understand, use, and share their trusted data faster by searching and verifying data’s validity before sharing with peers. Its collaborative user experience enables anyone to contribute metadata or business glossary information.

Data governance is most often associated with control. A discipline that allows businesses to centrally collect data, process, and consume under certain rules and policies. The beauty of Talend Data Catalog is that not only does it control data but liberates it for consumption as well. This allows data professionals to find, understand, and share data ten times faster. Now data engineers, scientists, analysts, or even developers can spend their time on extracting value from those data sets rather than searching for them or recreating them – removing the risk of your data lake turning into a data swamp.

A recently published IDC report, “Data Intelligence Software for Data Governance,” advocates the benefits of modern data governance and positions the Data Catalog as the cornerstone of what they define as Data Intelligence Software. In the report, IDC calls it a “technology that supports enablement through governance is called data intelligence software and is delivered in metadata management, data lineage, data catalog, business glossary, data profiling, mastering, and stewardship software.”

For more information, check out the full capabilities of the Talend Data Catalog here.

The post Introducing Talend Data Catalog: Creating a Single Source of Trust appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Astrazeneca: Building the Data Platform of the Future

Mon, 10/15/2018 - 06:28

AstraZeneca plc is a global, science-led biopharmaceutical company that is the world’s seventh-largest pharmaceutical business, with operations in more than 100 countries. The company focuses on the discovery, development, and commercialization of prescription medicines, which are used by millions of patients worldwide.

It’s one of the few companies to span the entire lifecycle of a medicine; from research and development to manufacturing and supply, and the global commercialization of primary care and specialty care medicines.

Beginning in 2013, AstraZeneca was faced with industry disruption and competitive pressure. For business sustainability and growth, AstraZeneca needed to change their product and portfolio strategy.

As the starting point, they needed to transform their core IT and finance functions. Data is at the heart of these transformations. They had a number of IT-related challenges, including inflexible and non-scalable infrastructure; data silos and diverse data models and file sizes within the organization; a lack of enterprise data governance; and infrastructure over-provisioning for required performance.

The company had grown substantially, including through mergers and acquisitions, and had data dispersed throughout the organization in a variety of systems. Additionally, financial data volume fluctuates depending on where they are in the financial cycle and peaks at month-end, quarters or financial year end are common.

In addition to causing inconsistencies in reporting, silos of information prevented the company and its Science and Enabling Unit division from finding insights hiding in unconnected data sources.

For transforming their IT and finance function and accelerating financial reporting, AstraZeneca needed to put in place a modern architecture that could enable a single source of the truth. As part of its solution, AstraZeneca began a move to the cloud, specifically Amazon Web Services (AWS), where it could build a data lake to hold data from a range of source systems, The potential benefits of a cloud-based solution included increased innovation and accelerated time to market, lower costs, and simplified systems.

But the AWS data lake was only part of the answer. The company needed a way to capture the data, and that’s where solutions such as Talend Big Data and Talend Data Quality come into play. AstraZeneca selected Talend for its AWS connectivity, flexibility, and licensing model, and valued its ability to scale rapidly without incurring extra costs.

The Talend technologies are responsible for lifting, shifting, transforming, and delivering data into the cloud, extracting from multiple sources, and then pushing that data into Amazon S3.

Their IT and Business Transformation initiative was successful and has paved the way for Business Transformation initiatives across five business units and they are leveraging this modern data platform for driving new business opportunities.

Attend this session at Talend Connect UK 2018 to learn more about how AstraZeneca transformed its IT and finance functions by developing an event-driven, scalable data-platform to support massive month-end peak activity, leading to financial reporting in half the time and half the cost.

The post Astrazeneca: Building the Data Platform of the Future appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Elsevier: How to Gain Data Agility in the Cloud

Fri, 10/12/2018 - 12:50

Presenting at Talend Connect London 2018 is Reed Elsevier (part of RELX Group), a $7 billion data and analytics company with 31,000 employees, serving scientists, lawyers, doctors, and insurance companies among its many clients. The company helps scientists make discoveries, lawyers win cases, doctors save lives, insurance companies offer customers lower prices, and save taxpayers money by preventing fraud.

Standardizing business practices for successful growth

As the business grew over the years, different parts of the organization began buying and deploying integration tools, which created management challenges for central IT. It was a “shadow IT” situation, where individual business departments were implementing their own integrations with their own different tools.

With lack of standardization, integration was handled separately between different units, which made it more difficult for different components of the enterprise to share data. Central IT wanted to bring order to the process and deploy a system that was effective at meeting the company’s needs as well as scalable to keep pace with growth.

Moving to the cloud

One of the essential requirements was that any new solution be a cloud-based offering. Elsevier a few years ago became a “cloud first” company, mandating that any new IT services be delivered via the cloud and nothing be hosted on-premises. It also adopted agile methodologies and a continuous deployment approach, to become as nimble as possible when bringing new products or releases to market.

Elsevier selected Talend as a solution and began using it in 2016. Among the vital selection factors were platform flexibility, alignment with the company’s existing infrastructure, and its ability to generate Java code as output and support microservices and containers.

In their Talend Connect session, Delivering Agile integration platforms, Elsevier will discuss how it got up and running rapidly with Talend despite having a diverse development environment. And, how it’s using Talend, along with Amazon Web Services, to build a data platform for transforming raw data into insight at scale across the business. You’ll learn how Elsevier created a dynamic platform using containers, serverless data processing and continuous integration/continuous development to reach a level of agility and speed.

Agility is among the most significant benefits of their approach using Talend. Elsevier spins up servers as needed and enables groups to independently develop integrations on a common platform without central IT being a bottleneck. Since building the platform, internal demand has far surpassed the company’s expectations—as it is delivering cost savings and insight at a whole new level.

Attend this session to learn more about how you can transform your integration environment.


The post Elsevier: How to Gain Data Agility in the Cloud appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Bitwise: Cloud Data Warehouse Modernization – Inside Look at Talend Connect London

Thu, 10/11/2018 - 09:28

With expectations of business users evolving beyond limitations of traditional BI capabilities, we see a general thrust of organizations developing a cloud-based data strategy that enterprise users can leverage to build better analytics and make better business decisions. While this vision for cloud strategy is fairly straightforward, the journey of identifying and implementing the right technology stack that caters to BI and analytical requirements across the enterprise can create some stumbling blocks if not properly planned from the get-go.

As a data management consulting and services company, Bitwise helps organizations with their modernization efforts. Based on what we see at our customers when helping to consolidate legacy data integration tools to newer platforms, modernize data warehouse architectures or implement enterprise cloud strategy, Talend fits as a key component of a modern data approach that addresses top business drivers and delivers ROI for these efforts.

For this reason, we are very excited to co-present “Modernizing Your Data Warehouse” with Talend at Talend Connect UK in London. If you are exploring cloud as an option to overcome limitations you may be experiencing with your current data warehouse architecture, this session is for you. Our Talend partner is well equipped to address the many challenges with the conventional data warehouse (that will sound all too familiar to you) and walk through the options, innovations, and benefits for moving to cloud in a way that makes sense to the traditional user.

For our part, we aim to show “how” people are moving to cloud by sharing our experiences for building the right business case, identifying the right approach, and putting together the right strategy. Maybe you are considering whether Lift & Shift is the right approach, or if you should do it in one ‘big bang’ or iterate – we’ll share some practical know-how for making these determinations within your organization.

With so many tools and technologies available, how do you know which are the right fit for you? This is where vendor neutral assessment and business case development, as well as ROI assessment associated with the identified business case, becomes essential for getting the migration roadmap and architecture right from the start. We will highlight a real-world example for going from CIO vision to operationalizing cloud assets, with some lessons learned along the way.

Ultimately, our session is geared to help demonstrate that by modernizing your data warehouse in cloud, you not only get the benefits of speed, agility, flexibility, scalability, cost efficiency, etc. – but it puts you in a framework with inherent Data Governance, Self-Service and Machine Learning capabilities (no need to develop these from scratch on your own), which are the cutting-edge areas where you can show ROI for your business stakeholders…and become a data hero.

Bitwise, a Talend Gold Partner for consulting and services, is proud to be a Gold Sponsor of Talend Connect UK. Be sure to visit our booth to get a demo on how we convert ANY ETL (such as Ab Initio, OWB, Informatica, SSIS, DataStage, and PL/SQL) to Talend with maximum possible automation.

About the author:

Ankur Gupta

EVP Worldwide Sales & Marketing, Bitwise

The post Bitwise: Cloud Data Warehouse Modernization – Inside Look at Talend Connect London appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

New Talend APAC Cloud Data Infrastructure Now Available!

Wed, 10/10/2018 - 16:37

As businesses in the primary economic hubs in Asia such as Tokyo, Banglore, Sydney and Singapore are growing at a historical level, they are moving to the cloud like never before. For those companies, their first and foremost priority is to fully leverage the value of their data while meeting strict local data residency, governance, and privacy requirements. Therefore, keeping data in a cloud data center that’s on the other side of the globe simply won’t be enough.

That’s why Talend is launching a new cloud data infrastructure in Japan, in addition to its US data center and the EU data center across Frankfurt and Dublin, in a secure and highly scalable Amazon Web Services (AWS) environment, to allow APAC customers to get cloud data integration and data management services closer to where the data is stored. This is most beneficial to local enterprise businesses and foreign companies who have plans to open up offices in the local region.

There are several benefits Talend Cloud customers can expect from this launch.

Accelerating Enterprise Cloud Adoption

Whether your cloud-first strategy is about modernizing legacy IT infrastructure, leveraging a hybrid cloud architecture, or building a multiple cloud platform, Talend new APAC cloud data infrastructure will allow your transition to the cloud become more seamless. With a Talend Cloud instance independently available in APAC, companies can build a cloud data lake or a cloud data warehouse for faster, more scalable and more agile analytics with more ease.

More Robust Performance

For customers who are using Talend Cloud services in the Asia Pacific, this new cloud data infrastructure will lead to faster extract, transform and load time despite of the data volume. Additionally, it will boost performance for customers using AWS services such as Amazon EMR, Amazon Redshift, Amazon Aurora and Amazon DynamoDB.

Increased Data Security with Proximity

Maintaining data within the local region means the data do not have to make a long trip outside of the immediate area, which can reduce the risk of data security breaches at rest, in transit,  and in use and ease companies’ worries about security measures.

Reduced Compliance and Operational Risks

Because the new data infrastructure offers an instance of Talend Cloud that is deployed independently from the US or the EU, companies can maintain higher standards regarding their data stewards, data privacy, and operational best practices.

For Japan customers, they are likely to be better compliant with Japan’s stringent data privacy and security standards. In the case of industry and government regulation adjustments, Talend Cloud customers would still be able to maintain flexibility and agility to keep up with the changes.

If you are a Talend customer, you will soon have the opportunity to migrate your site to the new APAC data center. Log in or contact your account manager for more information.

Not a current Talend Cloud customers? Test drive Talend Cloud for 30 days free of charge or learn how Talend Cloud can help you connect your data from 900+ data sources to deliver big data cloud analytics instantly.




The post New Talend APAC Cloud Data Infrastructure Now Available! appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

5 Questions to Ask When Building a Cloud Data Lake Strategy

Tue, 10/09/2018 - 15:41

In my last blog post, I shared some thoughts on the common pitfalls when building a data lake. As the movement to the cloud gets more and more common, I’d like to further discuss some of the best practices when building a cloud data lake strategy. When going beyond the scope of integration tools or platforms for your cloud data lake, here are 5 questions to ask, that can be used as a checklist:

1. Does your Cloud Data Lake strategy include a Cloud Data Warehouse?

As many differences as there are between the two, people often times compare the two types of technology approaches. Data warehouses being the centralization of structured data, and Data Lakes often times being the holy grail of all types of data. (You can read more about the two approaches here.)

Not to confuse the two, as these technology approaches should actually be brought together. You will need a data lake to accommodate all types of data that your business deal with today, make it structured, semi-structured or unstructured, on-premise or in the cloud, or those newer types of data such as IoT data. The data lake often time has a landing zone and staging zone for raw data – data at this stage are not yet consumable, but you may want to keep them for future discovery or data science projects. On the other hand, a cloud data warehouse will be in the picture after data is cleansed, mapped and transformed, so that it is more consumable for business analysts to access and make the use of data for reporting or other analytical use. Data at this stage is often time highly processed to adjust to the data warehouse.

If your approach currently only works with a cloud data warehouse, then often time you are losing raw and some formats of data already, it is not so helpful for any prescriptive or advanced analytics projects, or machine learning and AI initiatives as some meanings within the data is already lost. Vice versa, if you don’t have a data warehouse alongside with your data lake strategy, you will end up with a data swamp where all data is kept with no structure, and not consumable by analysts.

From the integration perspective, make sure your integration tool work with both data lake and data warehouse technologies, which will lead us to the next question. 

Download Data Lakes: Purposes, Practices, Patterns, and Platforms now.
Download Now

2. Does your integration tool have ETL & ELT?

As much as you may know about ETL in your current on-premises data warehouse, moving it to the cloud is a different story, not to mention in a cloud data lake context. Where and how data is processed really depends on what you need for your business.

Similar to what we described in the first question, sometimes you need to keep more of the raw nature of the data, and other times you need more processing. This would require your integration tool to cope with both ETL and ELT capabilities, where the data transformation can be handled either before the data is loaded to your final target, e.g. a cloud data warehouse, or after data is landed there. ELT is more often leveraged when the speed of data ingestion is key to your project, or when you want to keep more intel about your data. Typically, cloud data lakes have a raw data store, then a refined (or transformed) data store. Data scientists, for example, prefer to access the raw data, whereas business users would like the normalized data for business intelligence.

Another use of ELT refers to the massive parallel processing capabilities coming with big data technologies such as Spark and Flink. If your use case requires such a strong processing power, then ELT is a better choice where the processing has more scalability.

3. Can your cloud data lake handle both simple ETL tasks and complex big data ones?

This may look like an obvious question but when you ask about this question, put yourself in the users’ shoes and really think through if your choice of tool can meet both requirements.

Not all of your data lake usage will be complex ones that require advanced processing and transformation, many of them can be simple activities such as ingesting new data into the data lake. Often times, the tasks go beyond the data engineering or IT team as well. So ideally the tool of your choice should be able to handle simple tasks fast and easy, but also can scale to the complexity to meet the requirements of advanced use cases.  Building a data lake strategy that can cope with both can help you make your data lake more consumable and practical for various types of users for different purposes.

4. How about batch and streaming needs?

You may think your current architecture and technology stack is good enough, and your business is not really in the Netflix business where streaming is a necessity. Get it? Well think again.

Streaming data has become a part of our everyday lives whether you realize it or not. The “Me” culture has put everything at the moment of now. If your business is on social media, you are in streaming. If IoT and sensor is the next growth market for your business, you are in streaming. If you have a website for customer interaction, you are in streaming. In IDC’s 2018 Data Integration and Integrity End User Survey, 93% of the respondents indicate the plan to use streaming technology by 2020. Real-time and streaming analytics have become a must for modern businesses today to create that competitive edge. So, this naturally raises the questions: can your data lake handle both your batch and streaming needs? Do you have the technology and people to work with streaming, which is fundamentally different from typical batch needs?

Streaming data is particularly challenging to handle because it is continuously generated by an array of sources and devices as well as being delivered in a wide variety of formats.

One prime example of just how complicated streaming data can be comes from the Internet of Things (IoT). With IoT devices, the data is always on; there is no start and no stop, it just keeps flowing. A typical batch processing approach doesn’t work with IoT data because of the continuous stream and the variety of data types it encompasses.

So make sure your data lake strategy and data integration layer can be agile enough to work with both use cases.

You can find more tips on streaming data, here.

5. Can your data lake strategy help cultivate a collaborative culture?

Last but not least, collaboration.

It may take one person to implement the technology, but it will take a whole village to implement it successfully. The only way to make sure your data lake is a success is to have people use it, improving the workflow one way or another.

In a smaller scope, the workflow in your data lake should be able to be reused and leveraged among data engineers. Less recreation will be needed, and operationalization can be much faster. In a bigger scope, the data lake approach can help improve the collaboration between IT and business teams. For example, your business teams are the experts of their data and they know the meaning and the context of data better than anyone else. Data quality can be much improved if the business team can work on the data for business rule transformations, while IT still governs that activity. Defining such a line with governance in place is a delicate work and no easy task. But you may think through your data lake approach, whether it’s governed but open at the same time to encourage not only final consume /usage of the data, but the improvement of data quality in the process, and be recycled to be available to a broader organization.

To summarize, there we go the 5 questions I would recommend asking when thinking about building a cloud data lake strategy. By no means are these the only questions you should think, but hopefully it initiates some thinking outside of your typical technical checklist. 

The post 5 Questions to Ask When Building a Cloud Data Lake Strategy appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

How to Implement a Job Metadata Framework using Talend

Tue, 10/09/2018 - 14:39

Today, data integration projects are not just about moving data from point A to point B, there is much more to it. The ever-growing volumes of data, the speed at which the data changes presents a lot of challenges in managing the end-to-end data integration process. In order to address these challenges, it is paramount to track the data-journey from source to target in terms of start and end timestamps, job status, business area, subject area, and the individuals responsible for a specific job. In other words, metadata is becoming a major player in data workflows. In this blog, I want to review how to implement a job metadata framework using Talend. Let’s get started!

Metadata Framework: What You Need to Know

The centralized management and monitoring of this job metadata are crucial to data management teams. An efficient and flexible job metadata framework architecture requires a number of things. Namely, a metadata-driven model and job metadata.

A typical Talend Data Integration job performs the following tasks for extracting the data from source systems and loading them into target systems.

  1. Extracting data from source systems
  2. Transforming the data involves:
    • Cleansing source attributes
    • Applying business rules
    • Data Quality
    • Filtering, Sorting, and Deduplication
    • Data aggregations
  3. Loading the data into a target systems
  4. Monitoring, Logging, and Tracking the ETL process

Figure 1: ETL process

Over the past few years, the job metadata has evolved to become an essential component of any data integration project. What happens when you don’t have job metadata in your data integration jobs? It may lead to incorrect ETL statistics and logging as well as difficult to handle errors occurred during the data integration process. A successful Talend Data Integration project depends on how well the job metadata framework is integrated with the enterprise data management process.

Job Metadata Framework

The job metadata framework is a meta-data driven model that integrates well with Talend product suite. Talend provides a set of components for capturing the statistics and logging information during the flight of the data integration process.

Remember, the primary objective of this blog is to provide an efficient way to manage the ETL operations with a customizable framework. The framework includes the Job management data model and the Talend components that support the framework.

Figure 2: Job metadata model

Primarily, the Job Metadata Framework model includes:

  • Job Master
  • Job Run Details
  • Job Run Log
  • File Tracker
  • Database High Water Mark Tracker for extracting the incremental changes

This framework is designed to allow the production support to monitor the job cycle refresh and look for the issues relating to job failure and any discrepancies while processing the data loads. Let’s go through each of piece of the framework step-by-step.

Talend Jobs

Talend_Jobs is a Job Master Repository table that manages the inventory of all the jobs in the Data Integration domain.




Unique Identifier to identify a specific job


Job Name is the name of the job as per the naming convention (<type>_<subject area>_<table_name>_<target_destination>


Business Unit / Department or Application Area


Job author Information


Additional Information related to the job


The last updated date

Talend Job Run Details

Talend_Job_Run_Details registers every run of a job and its sub jobs with statistics and run details such as job status, start time, end time, and total duration of main job and sub jobs.




Unique Identifier to identify a specific job run


Business Unit / Department or Application Area


Job author Information


Unique Identifier to identify a specific job


Job Name is the name of the job as per the naming convention (<type>_<subject area>_<table_name>_<target_destination>


Unique Identifier to identify a specific sub job


Sub Job Name is the name of the sub job as per the naming convention (<type>_<subject area>_<table_name>_<target_destination>


Main Job Start Timestamp


Main Job End Timestamp


Main Job total job execution duration


Sub Job Start Timestamp


Sub Job End Timestamp


Sub Job total job execution duration


Sub Job Status (Pending / Complete)


Main Job Status (Pending / Complete)


The last updated date

Talend Job Run Log

Talend_Job_Run_Log logs all the errors occurred during particular job execution. Talend_Job_Run_Log extracts the details from the Talend components specially designed for catching logs (tLogCatcher) and statistics (tStatCacher).

Figure 3: Error logging and Statistics

The tLogCatcher component in Talend operates as a log function triggered during the process by one of these components: Java exceptions, tDie or tWarn. In order catch exceptions coming from the job, tCatch function needs to be enabled on all the components.

The tStatCatcher component gathers the job processing metadata at the job level.




Unique Identifier to identify a specific job run


Unique Identifier to identify a specific job


The time when the message is caught


The Process ID of the Job


The Parent process ID


The root process ID


The system process ID


The name of the project


The name of the Job


The ID of the Job file stored in the repository


The version of the current Job


The Name of the current context


The priority sequence


The name of the component if any


Begin or End


The error message generated by the component when an error occurs. This is an After variable. This variable functions only if the Die on error checkbox is cleared.




Time for the execution of a Job or a component with the tStatCaher Statistics check box selected


Record counts


Job references


Log thresholds for managing error handling workflows

Talend High Water Marker Tracker

Talend_HWM_Tracker helps in processing delta and incremental changes of a particular table. The High Water Tracker is helpful when the “Change Data Capture” is not enabled and the changes are extracted based on specific conditions such as “last_updated_date_time” or ‘revision_date_time.” In some cases, the High Water Mark relates to the highest sequence number when the records are processed based on the sequence number.




Unique Identifier to identify a specific source table


Unique Identifier to identify a specific job


The name of the Job


The name of the source table


The source table environment


The source table database type


High Water Field (Datetime)


High Water Field (Number)


High Water SQL Statement

Talend File Tracker

Talend_File_Tracker registers all the transactions related to file processing. The transaction details include source file location, destination location, file name pattern, file name suffix, and the name of the last file processed.




Unique Identifier to identify a specific source file


Unique Identifier to identify a specific job


The name of the Job


The file server environment


The file name pattern


The source file location


The target file location


The file suffix


The name of the last file processed for a specific file


The override flag to re-process the file with the same name


The last updated date


This brings to the end of the implementing Job metadata framework using Talend. The following are key takeaways from this blog:

  1. The need and the importance of Job metadata framework
  2. The data model to support the framework
  3. The customizable data model to support different types of job patterns.

As always – let me know if you have any questions below and happy connecting!

The post How to Implement a Job Metadata Framework using Talend appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Cloudera 2.0: Cloudera and Hortonworks Merge to form a Big Data Super Power

Thu, 10/04/2018 - 18:39

We’ve all dreamed of going to bed one day and waking up the next with superpowers – stronger, faster and even perhaps with the ability to fly.  Yesterday that is exactly what happened to Tom Reilly and the people at Cloudera and Hortonworks.   On October 2nd they went to bed as two rivals vying for leadership in the big data space. In the morning they woke up as Cloudera 2.0, a $700M firm, with a clear leadership position.  “From the edge to AI”…to infinity and beyond!  The acquisition has made them bigger, stronger and faster. 

Like any good movie, however, the drama is just getting started, innovation in the cloud, big data, IoT and machine learning is simply exploding, transforming our world over and over, faster and faster.  And of course, there are strong villains, new emerging threats and a host of frenemies to navigate.

What’s in Store for Cloudera and Hortonworks 2.0

Overall, this is great news for customers, the Hadoop ecosystem and the future of the market.  Both company’s customers can now sleep at night knowing that the pace of innovation from Cloudera 2.0 will continue and accelerate.  Combining the Cloudera and Hortonworks technologies means that instead of having to pick one stack or the other, now customers can have the best of both worlds. The statement from their press release “From the Edge to AI” really sums up how complementary some of the investments that Hortonworks made in IoT complement Cloudera’s investments in machine learning.  From an ecosystem and innovation perspective, we’ll see fewer competing Apache projects with much stronger investments.  This can only mean better experiences for any user of big data open source technologies.

At the same time, it’s no secret how much our world is changing with innovation coming in so many shapes and sizes.  This is the world that Cloudera 2.0 must navigate.  Today, winning in the cloud is quite simply a matter of survival.  That is just as true for the new Cloudera as it is for every single company in every industry in the world.  The difference is that Cloudera will be competing with a wide range of cloud-native companies both big and small that are experiencing explosive growth.  Carving out their place in this emerging world will be critical.

The company has so many of the right pieces including connectivity, computing, and machine learning.  Their challenge will be, making all of it simple to adopt in the cloud while continuing to generate business outcomes. Today we are seeing strong growth from cloud data warehouses like Amazon RedshiftSnowflakeAzure SQL Data Warehouse and Google Big Query.  Apache Spark and service players like Databricks and Qubole are also seeing strong growth.  Cloudera now has decisions to make on how they approach this ecosystem and they choose to compete with and who they choose to complement.

What’s In Store for the Cloud Players

For the cloud platforms like AWS, Azure, and Google, this recent merger is also a win.  The better the cloud services are that run on their platforms, the more benefits joint customers will get and the more they will grow their usage of these cloud platforms.  There is obviously a question of who will win, for example, EMR, Databricks or Cloudera 2.0, but at the end of the day the major cloud players will win either way as more and more data, and more and more insight runs through the cloud.

Talend’s Take

From a Talend perspective, this recent move is great news.  At Talend, we are helping our customers modernize their data stacks.  Talend helps stitch together data, computing platforms, databases, machine learning services to shorten the time to insight. 

Ultimately, we are excited to partner with Cloudera to help customers around the world leverage this new union.  For our customers, this partnership means a greater level of alignment for product roadmaps and more tightly integrated products. Also, as the rate of innovation accelerates from Cloudera, our support for what we call “dynamic distributions” means that customers will be able to instantly adopt that innovation even without upgrading Talend.  For Talend, this type of acquisition also reinforces the value of having portable data integration pipelines that can be built for one technology stack and can then quickly move to other stacks.  For Talend and Cloudera 2.0 customers, this means that as they move to the future, unified Cloudera platform, it will be seamless for them to adopt the latest technology regardless of whether they were originally Cloudera or Hortonworks customers. 

You have to hand it to Tom Reilly and the teams at both Cloudera and Hortonworks.  They’ve given themselves a much stronger position to compete in the market at a time when people saw their positions in the market eroding.  It’s going to be really interesting to see what they do with the projected $125 million in annualized cost savings.  They will have a lot of dry powder to invest in or acquire innovation.  They are going to have a breadth in offerings, expertise and customer base that will allow them to do things that no one else in the market can do. 

The post Cloudera 2.0: Cloudera and Hortonworks Merge to form a Big Data Super Power appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Why Cloud-native is more than software just running on someone else’s computer

Thu, 10/04/2018 - 10:17

The cloud is not “just someone else’s computer”, even though that meme has been spreading so fast on the internet. The cloud consists of extremely scalable data centers with highly optimized and automated processes. This makes a huge difference if you are talking about the level of application software.

So what is “cloud-native” really?

“Cloud-native” is more than just a marketing slogan. And a “cloud-native application” is not simply a conventionally developed application which is running on “someone else’s computer”. It is designed especially for the cloud, for scalable data centers with automated processes.

Software that is really born in the cloud (i.e. cloud-native) automatically leads to a change in thinking and a paradigm shift on many levels. From the outset, cloud-native developed applications are designed with scalability in mind and are optimized with regard to maintainability and agility.

They are based on the “continuous delivery” approach and thus lead to continuously improving applications. The time from development to deployment is reduced considerably and often only takes a few hours or even minutes. This can only be achieved with test-driven developments and highly automated processes.

Rather than some sort of monolithic structure, applications are usually designed as a loosely connected system of comparatively simple components such as microservices. Agile methods are practically always deployed, and the DevOps approach is more or less essential. This, in turn, means that the demands made on developers increase, specifically requiring them to have well-founded “operations” knowledge.

Download The Cloud Data Integration Primer now.
Download Now

Cloud-native = IT agility

With a “cloud-native” approach, organizations expect to have more agility and especially to have more flexibility and speed. Applications can be delivered faster and continuously at high levels of quality, they are also better aligned to real needs and their time to market is much faster as well. In these times of “software is eating the world”, where software is an essential factor of survival for almost all organizations, the significance of these advantages should not be underestimated.

In this context: the cloud certainly is not “just someone else’s computer”. And the “Talend Cloud” is more than just an installation from Talend that runs in the cloud. The Talend Cloud is cloud-native.

In order to achieve the highest levels of agility, in the end, it is just not possible to avoid changing over to the cloud. Potentially there could be a complete change in thinking in the direction of “serverless”, with the prospect of optimizing cost efficiency as well as agility.  As in all things enterprise technology, time will tell. But to be sure, cloud-native is an enabler on the rise.

About the author Dr. Gero Presser

Dr. Gero Presser is a co-founder and managing partner of Quinscape GmbH in Dortmund. Quinscape has positioned itself on the German market as a leading system integrator for the Talend, Jaspersoft/Spotfire, Kony and Intrexx platforms and, with their 100 members of staff, they take care of renowned customers including SMEs, large corporations and the public sector. 

Gero Presser did his doctorate in decision-making theory in the field of artificial intelligence and at Quinscape he is responsible for setting up the business field of Business Intelligence with a focus on analytics and integration.

The post Why Cloud-native is more than software just running on someone else’s computer appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

California Leads the US in Online Privacy Rules

Tue, 10/02/2018 - 12:12

With California often being looked to as the state of innovation, the newly enforced California Consumer Privacy Act (CCPA) came as no surprise. This new online privacy law gives consumers the right to know what information companies are collecting about them, why they are collecting that data, and who they are sharing it with.

Some specific industries such as Banking or Health Sciences had already considered this type of compliance at the core of their digital transformation. But as the CCPA applies to potentially any company, no matter its size or industry, anyone serious about personalizing interactions with their visitors, prospects, customers, and employees needs to pay attention.

Similarities to GDPR

Although there are indeed some differences between GDPR and the CCPA, in terms of the data management and governance frameworks that needs to be established, the two are similar. These similarities include:

  • You need to know where your personal data is across your different system, which means that you need to run a data mapping exercise
  • You need to create a 360° view of your personal data and manage consent at a fine grain, although CCPA looks more permissive on consent than GDPR
  • You need to publish a privacy notice where you tell the regulation authorities, customers and other stakeholders what you are doing with the personal information within your database. You need to anonymize data (i.e. through data masking) for any other systems that includes personal data, but that you want to scope out from your compliance effort and privacy notice.   
  • You need to foster accountabilities so that the people in the companies that participate in the data processing effort are engaged for compliance
  • You need to know where your data is, including when it is shared or processed through third parties such as business partners or cloud providers. You need to control cross border data transfers and potential breaches while transparently communicating in cases of breaches 
  • You need to enact the data subject access rights, such as the right for data access, data rectification, data deletion, and data portability. CCPA allows a little more time to answer to a request, 45 days versus 1 month.  

Download Data Governance & Sovereignty: 16 Practical Steps towards Global Data Privacy Compliance now.
Download Now

Key Takeaways from the CCPA

The most important takeaway is that data privacy regulations are burgeoning for companies all over the world. With the stakes getting higher and higher, from the steep fines to the reputation risks, compliance consumers that can negatively affect the benefits of digital transformation).

While this law in its current state is specific to California, the idea of a ripple effect at the federal level might not be far off.  So instead of seeing it as a burden, such regulations should be taken as an opportunity. In fact, one of the side effects of all those regulations, today with data scandals now negatively impacting millions of consumers, is that data privacy now makes the headlines. Consumers are now understanding how valuable their data can be and how damaging the impact of losing control over personal data could be.

The lesson learned is that, although regulatory compliance is often what triggers a data privacy compliance project, it shouldn’t be the only driver. The goal is rather to establish a system of trust with your customers for their personal data. In a recent benchmark, where we exercised our right of data access and privacy against more than 100 companies, we could demonstrate that most company are very low on their maturity for achieving that goal. But it demonstrated as well that the best in class are setting the standards for turning it into a memorable experience.






The post California Leads the US in Online Privacy Rules appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Making the Bet on Open Source

Fri, 09/28/2018 - 17:20

Today, Docker or Kubernetes are obvious choices. But, back in 2015, these technologies were just emerging and hoping for massive adoption. How do tech companies make the right open source technology choices early?

As a CTO today, if you received an email from your head of Engineering saying, “Can we say that Docker is Enterprise production ready now?,” Your answer would undoubtedly be “yes”. If you hadn’t started leveraging Docker already, you would be eager to move on the technology that Amazon and so many other leading companies are now using as the basis of their applications’ architectures. However, what would your reaction be if you had received that email four years ago when Docker was still far from stable, lacked integration, support or tooling with all the major operating systems and Enterprise platforms, On-Premise or Cloud? Well, that is the situation that we at Talend were facing in 2015.

By sharing our approach and our learnings from choosing to develop with Docker and Kubernetes, I hope that we can help other CTOs and tech companies’ leaders with their decisions to go all-in with today’s emerging technologies.

Increasing Docker use from demos to enterprise-ready products

Back in 2014, as we were architecting our next generation Cloud Integration platform, micro-services and containerization were two trends that we closely monitored.

Talend, which is dedicated to monitoring emerging projects and technologies identified Docker as a very promising containerization technology that we could use to run our micro-services. That same year, one of our pre-sales engineers had heard about Docker at a Big Data training and learned about its potential to accelerate the delivery of product demos to its prospects as a more efficient alternative to VMWare or Virtual Box images.

From that day, Docker usage across Talend has seen an explosive growth, from the pre-sales use case of packaging demos to providing reproduction environments to tech support or quality engineering and of course its main usage around service and application containerization for R&D and Talend Cloud.

During our evaluation, we did consider some basic things like we would with any up-and-coming open source technology. First, we needed to determine the state of the security features offered by Docker. Luckily, we found that we didn’t need to build anything on top of what Docker already provided which was a huge plus for us.

Second, like many emerging open source technologies, Docker was not as mature as it is today, so it was still “buggy.” Containers would sometimes fail without any clear explanation, which would mean that we would have to invest time to read through the logs to understand what went wrong—a reality that anyone who has worked with a new technology understands well. Additionally, we had to see how this emerging technology would fit with our existing work and product portfolio, and determine whether they would integrate well. In our case, we had to check how Docker would work with our Java-based applications & services and evaluate if the difficulties that we ran into there would be considered a blocker for future development.

Despite our initial challenges, we found Docker to be extremely valuable and promising as it greatly improved our development life cycle by facilitating the rapid exchange and reuse of pieces of work between different teams. In the first year of evaluation, Docker quickly became the primary technology used by QA to rapidly setup testing environments at a fraction of the cost and with better performance compare to the more traditional Virtual environments (VMWare or VirtualBox).

After we successfully used Docker in our R&D processes, we knew we had made the right choice and that it was time to take it to the next level and package our own services for the benefit of our customers. With the support of containers and more specifically Docker by major cloud players such as AWS, Azure or Google, we had the market validation that we needed to completely “dockerize” our entire cloud-based platform, Talend Cloud.

While the choice to containerize our software with Docker was relatively straightforward, the choice to use Kubernetes to orchestrate those containers was not so evident at the start.

Talend’s Road to Success with Kubernetes

In 2015, Talend started to consider technologies that might orchestrate containers which were starting to make up our underlying architectures, but the technology of choice wasn’t clear. At this point, we faced a situation that every company has experienced: deciding what technology to work with and determining how to decide what technology would be the best fit.

At Talend, portability and agility are key concepts, and while Docker was clearly multiplatform, each of the cloud platform vendors had their own flavor of the orchestration layer.

We had to bet on an orchestration layer that would become the de facto standard or be compatible with major cloud players. Would it be Kubernetes, Apache Mesos or Docker Swarm?

Initially, we were evaluating both Mesos and Kubernetes. Although Mesos was more stable than Kubernetes at the time and its offering was consistent with Talend’s Big Data roadmap, we were drawn to the comprehensiveness of the Kubernetes applications. The fact that Google was behind Kubernetes gave us some reassurance around its scalability promises.

At the time, we were looking for container orchestration offerings, using Mesos required that we bundle several other applications for it to have the functionality we needed. On the other hand, Kubernetes’ applications had everything we needed already bundled together. We also thought about our customers: We wanted to make sure we chose the solution that would be the easiest for them to configure and maintain. Last—but certainly not least—we looked at the activity of the Kubernetes community. We found it promising that many large companies were not only contributing to the project but were also creating standards for it as well. The comprehensive nature of Kubernetes and the vibrancy of its community led us to switch gears and go all-in with Kubernetes.

As with any emerging innovative technology, there are constant updates and project releases with Kubernetes, which results in several iterations of updates in our own applications. However, this was a very small concession to make to use such a promising technology.

Similar to our experience with Docker, I tend to believe that we made the right choice with Kubernetes. Its market adoption (AWS EKS, Azure AKS, OpenShift Kubernetes) proved us right. The technology has now been incorporated into several of our products, including one of our recent offerings, Data Streams.

Watching the team go from exploring a new technology to actually implementing it was a great learning experience that was both exciting and very rewarding.

Our Biggest Lessons in Working with Emerging Technologies

Because we have been working with and contributing to the open source community since we released our open source offering Talend Open Studio for Data Integration in 2006, we are no strangers to working with innovative, but broadly untested technologies or sometimes uncharted territories. However, this experience with Docker and Kubernetes has emphasized some of the key lessons we have learned over the years working with emerging technologies:

  • Keep your focus: During this process, we learned that working with a new, promising technology requires that you keep your end goals in mind at all times. Because the technologies we worked with are in constant flux, it could be easy to get distracted by any new features added to the open source projects. It is incredibly important to make sure that the purpose of working with a particular emerging technology remains clear so that development won’t be derailed by new features that could be irrelevant to the end goal.
  • Look hard at the community: It is incredibly important to look to the community of the project you choose to work with. Be sure to look at the roadmap and the vision of the project to make sure it aligns with your development (or product) vision. Also, pay attention to the way the community is run—you should be confident that it is run in a way that will allow the project to flourish.
  • Invest the time to go deep into the technology: Betting on innovation is hard and does not work overnight. Even if it is buggy, dive into the technology because it can be worth it in the end. From personal experience, I know it can be a lot of work to debug but be sure to keep in mind that the technology’s capabilities—and its community—will grow, allowing your products (and your company) to leverage the innovations that would be very time consuming, expensive and difficult to build on your own.

Since we first implemented Docker and Kubernetes, we have made a new bet on open source: Apache Beam. Will it be the next big thing like Docker and Kubernetes? There’s no way to know at this point—but when you choose to lead with innovation, you can never take the risk-free, well-travelled path. Striving for innovation is a never-ending race, but I wouldn’t have it any other way.

The post Making the Bet on Open Source appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

From Dust to Trust: How to Make Your Salesforce Data Better

Wed, 09/26/2018 - 17:15

Salesforce is like a goldmine. You own it but it’s up to you to extract gold out of it. Sound complicated? With Dreamforce in full swing, we are reminded that trusted data is the key to success for any organization.

According to a Salesforce survey, “68% of sales professionals say it is absolutely critical or very important to have a single view of the customer across departments/roles. Yet, only 17% of sales teams rate their single view of the customer capabilities as outstanding.”

As sales teams are willing to change into high-performing trusted advisors, they are still spending most of their time on non-selling activities. The harsh reality is that sales people cannot wait to get clean, complete, accurate and consistent data into their systems.  They often end up spending lots of time on their own correcting bad records and reuniting customer insights. To minimize their time spent on data and boost their sales numbers, they need your help to rely on single customer view filled with trusted data.

Whether you’re working for a nonprofit that’s looking for more donors or at a company looking to get qualified leads, managing data quality in your prospects or donator CRM pipeline is crucial.

Watch Better Data Quality for All now.
Watch Now

Quick patches won’t solve your data quality problem in the long run

Salesforce was intentionally designed to digitally transform your business processes but was unfortunately not natively built to process and manage your data. As data is exploding, getting trusted data is becoming more and more critical. As a result, lots of Incubators’ apps started emerging on the Salesforce Marketplace. You may be tempted to use them and patch your data with quick data quality operations. 

But you may end up with separate features built by separate companies with different levels of integration, stability, and performance. You also take the risk of having the app not supported over the long term, putting your data pipeline and operations at risk. This in turn, will only make things worse by putting all the data quality on your shoulders whereas you may rely on your sales representative to resolve data. And you do not want to become the bottleneck of your organization.

After the fact Data Quality is not your best option

Some Business Intelligence Solutions have started emerging, further allowing you to prepare your data at the Analytical Level. But this is often a one-shot option for one single need and not solving the fulfilling the full need. You will still have bad data to input into Salesforce. Salesforce Data can be used in multiple scenarios by multiple persons. Operating Data Quality directly into Salesforce Marketing, Service or Commerce Cloud is the best approach to deliver trusted data at its source so that everybody can benefit from it.

The Rise of Modern Apps to boost engagement:

Fortunately, Data Quality has evolved to become a team activity rather than a single isolated job. You then need to find ways and tools to engage your sales org into data resolution initiatives. Modern apps are key here to make that it a success.

Data Stewardship to delegate errors resolution with business experts

Next-generation data stewardship tools such as Talend Data Stewardship give you the ability to reach everyone who knows the data best within the organization. In parallel, business experts will be comfortable editing and enriching data within UI friendly tool that makes the job easier. Once you captured tacit knowledge from end users, you can scale it to millions of records thru built in machine learning capabilities within Talend Data Stewardship.

Data Preparation to discover and clean data directly with Salesforce

Self-service is the way to get data quality standards to scale. Data analyst spend 60% of their time cleaning data and getting it ready to use. Reduced time and effort mean more value and more insight to be extracted from data. Talend Data Preparation deals with this problem. It is a self – service application that allows potentially anyone to access a data set and then cleanse, standardize, transform, or enrich the data. With it’s ease of use, Data Preparation helps to solves  organizational pain points where often times employees are spending so much time crunching data in Excel or expecting their colleagues to do that on their behalf.

Here are two use cases to learn from:

Use Case 1: Standardizing Contact Data and removing duplicates from Salesforce

Duplicates are the bane of CRM Systems. When entering data into Salesforce, Sales Rep can be in a rush and create duplicates that stay for long. Let them pollute your CRM and it will impact every user and sales rep confidence in your data.

Data Quality here has a real direct business impact on your sales productivity and your marketing campaigns too.

Bad Data mean unreachable customers or untargeted prospects that escape from your customized campaigns leading to low conversion rate and lower revenue. 

With Talend Data Prep, you can really be a game changer: Data Prep allows you to connect natively and directly to your Salesforce platform and perform some ad-hoc data quality operations.

  • By entering your SDFC Credentials, you will get native access to customer fields you want to clean
  • Once data is displayed into Data Prep, Quality Bar and smart assistance will allow you to quickly spot your duplicates
  • Click the header of any column containing duplicates from your dataset.
  • Click the Table tab of the functions panel to display the list of functions that can be applied on the whole table
  • Point your mouse over the Remove duplicate rows function to preview its result and click to apply it
  • Once you perform this operation, your duplicates can be removed
  • You can also register this as a recipe you may want to apply it to other data sources
  • You also have some options in Data Prep to certify your dataset so other team members know this data source can be trusted
  • Collaborate with IT to expand your jobs with Talend Studio to fully automate your data quality operations and proceed with advanced matching operations

Use case 2:  Real time Data Masking into Salesforce

The GDPR defines pseudonymization as “the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information.” Pseudonymization or anonymization therefore, may significantly reduce the risks associated with data processing, while also maintaining the data’s utility.

Using Talend Cloud, you can process it directly into Salesforce. Talend Data Preparation enables any business users to obfuscate data the easy way. After native connection with Salesforce Dataset:

  • Click the header of any column containing data to be masked from your dataset
  • Click the Table tab of the functions panel to display the list of functions that can be applied
  • Point your mouse over the Obfuscation function and click to apply it
  • Once you perform this operation, data will be masked and anonymized

When confronted with in-depth fields and more sophisticated data masking techniques, data engineers will take the lead operating pattern data masking techniques directly into Talend Studio and perform them into Salesforce within personal fields such as Security Numbers or Credit Cards.  You can still easily spot data to be masked into Data Prep and ask data engineers to perform anonymization techniques into Talend Studio in a second phase.


Without data quality tools and methodology, you will then end up with unqualified, unsegmented or unprotected customers’ accounts leading to lower revenue, lower marketing effectiveness and more importantly frustrated sales rep spending their time for trusted client data.  As strong as it may be, your Salesforce goldmine can easily transform itself into dust if you don’t put trust into your systems. Only platforms such as Talend Cloud with powerful data quality solutions can help you to extract hidden gold from your Salesforce data and deliver it trusted to the whole organization.

Want to know more? Go to Talend Connect London on October 15th & 16th or Talend Connect Paris on October 17th & 18th to learn from real business cases such as Greenpeace, Petit Bateau.

Whatever your background, technical or not, there will be a session that meets your needs.  We have plenty of use cases and data quality jobs we’ll expose both in technical and customer tracks.





The post From Dust to Trust: How to Make Your Salesforce Data Better appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL