AdoptOS

Assistance with Open Source adoption

ETL

Is self-service big data possible?

SnapLogic - Tue, 08/28/2018 - 07:30

By now, we all know about and are experiencing the rise in the volume of data generated and available to an organization and the issues it can cause. One can see that there is little end in sight to the data tsunami, which is in large part due to the increased variety of data from[...] Read the full article here.

The post Is self-service big data possible? appeared first on SnapLogic.

Categories: ETL

Four sessions to attend at the Strata Data Conference in New York

SnapLogic - Mon, 08/27/2018 - 13:17

The Strata Data Conference in New York is where thousands of cutting-edge companies deep dive into emerging big data technologies and techniques. From hot topics like AI and machine learning to implementing data strategy, this seven-year running conference series is a hotbed for new ideas and strategies to tackle the challenges that have emerged in the[...] Read the full article here.

The post Four sessions to attend at the Strata Data Conference in New York appeared first on SnapLogic.

Categories: ETL

Proven ways to speed data integration

SnapLogic - Fri, 08/24/2018 - 12:58

Data integration is a challenge that keeps getting more difficult. It’s no surprise considering the explosion of cloud-based tools, the proliferation of devices that consume and produce information, and the way information is shared between systems and from systems to humans. Plus, IDC predicts that the volume of data will reach around 40 Zettabytes (1[...] Read the full article here.

The post Proven ways to speed data integration appeared first on SnapLogic.

Categories: ETL

SnapLogic Snaplex management: Powered by Mesosphere

SnapLogic - Thu, 08/23/2018 - 13:01

As enterprises continue to invest heavily to optimize their IT infrastructure to keep up with business demands and become more agile, many embrace container technology and seek out applications that are compatible with it. In previous releases, SnapLogic introduced its support for Dockerization of Snaplex nodes to add flexibility to existing environments. Recently, the SnapLogic[...] Read the full article here.

The post SnapLogic Snaplex management: Powered by Mesosphere appeared first on SnapLogic.

Categories: ETL

Integrating Workday with ERP systems

SnapLogic - Wed, 08/22/2018 - 12:30

As companies modernize their enterprise software environments, many are turning to SaaS applications like Workday for key financial and Human Capital Management (HCM) capabilities. However, the new platforms must still co-exist with old-guard ERP systems, which demands a new approach to seamless integration. Workday HCM covers a lot of ground, including traditional back-office HR functions[...] Read the full article here.

The post Integrating Workday with ERP systems appeared first on SnapLogic.

Categories: ETL

Why data is no longer just an IT function

Talend - Tue, 08/21/2018 - 12:56

Data – or at least the collection, storage, protection, transfer and processing of it – has traditionally been seen as the role of a modern data-driven technical division. However, as data continues to explode in both volume and importance, it is not enough to gather huge amounts of disparate data into a data lake and expect that it will be properly consumed. With data becoming the defining factor of a business’s strategy, this valuable gold dust needs to be in the hands of the right business function, in the right form, at the right time, to be at its most effective. This means that traditional roles within the organization need to adapt, as CIOs and CTOs oversee digital transformation projects across the business landscape.

The aim of digital transformation is to create an adaptive, dynamic company that is powered by digital technology – it is the perfect marriage of the business and IT functions and requires both to collaborate to successfully harness the data at a company’s disposal. This will be imperative to deliver the types of rapid growth and customer-centric developments that modern businesses are determined to achieve. In recent years, the groundwork for this has already been laid by the increasing use of cloud within businesses – which the Cloud Industry Forum revealed earlier this year stands at 88% in the UK, with 67% of users expecting to increase their cloud usage over the coming years. However, while the cloud provides the perfect platform for scalable, agile digitization, three further challenges stand between organizations and digital transformation success, and the business and IT functions need to work together to ensure their business emerges victorious at the other end.

Challenge 1: Business Wants Data, But IT Can’t Keep Up

With cloud applications, sensors, online data streams and new types of technology emerging week on week, businesses are seeing an explosion of data – both in volume and variety. At the same time, consumers are expecting the very latest products, with personalized services, in real time. The data businesses have access to can help, but it frequently ends up siloed, out of context, or of poor quality. Industry estimates suggest that working on flawed data costs a business in the region of 10x more than working on perfect data.

Traditionally, employees within the business have maintained this data, but this is no longer feasible in the face of the sheer volume of information that businesses receive. Instead, businesses will need to be empowered by modern technologies such as Big Data and machine learning to ensure that as much data preparation, cleansing and analysis as possible is guided or automated. Without a combined data landscape of high-quality data, businesses risk missing opportunities by simply failing to analyze their own data successfully, or even risk deriving improper insights and taking the wrong actions.

Being data-driven is a mandate for modern business, and the strain cannot be placed on IT to simply keep pace with the latest technological innovations. Instead, the business function must help create a digital strategy, focused on the latest business objectives, in order for the company to succeed.

Challenge 2: Digitization is Changing the Job Description

In the not-too-distant past, IT resources were centralized, with a core IT organization managing on-premises data using legacy systems. While this was an effective way of keeping data safe and organized, it resulted in the data being hard to access and even harder to use. As recently as 2015, BARC statistics stated that from a sample of over 2,000 responses, 45% of business users say their companies have less than 10% of employees using business intelligence (BI).

However, in today’s data-centric world, where surveys estimate that 38% of overall job postings require digital skills, empowering 10% of employees to be self-sufficient with data is nowhere near enough. Furthermore, Gartner research asserts that by 2019, citizen data scientists will surpass data scientists in terms of the amount of advanced analysis they produce. Everyone throughout the business, from the CIO to the business process analyst, increasingly needs data right at their fingertips. These figures need access to data to ensure they can strategize, execute and deliver for the business with the most relevant and up-to-date insights available. This means the business must fully equip its employees at every level to empower their decision-making with highly available and insightful data. As well as providing self-service technologies and applications which provide a turnkey solution to mining insight from data, this involves using training and internal communications to define a data-driven culture throughout business divisions.

Challenge 3: The threats to data, and to businesses, are increasing by the day

The knee-jerk reaction to this might be to make as much data as possible available to as many people as possible. However, any well-versed CIO knows this is not viable. With regulations like the GDPR, organizations have an increasing obligation to make sure only the right people have access to each piece of information, or they place their entire organization at risk. This is especially important against a backdrop where, according to the Ponemon Institute, 71% of users admit to having access to data they should not.

The solution to this is successfully implemented self-service IT, which automates functions such as data access requests and data preparation. This is fundamental to giving business employees quicker access to the right data, as well as providing clear lineage of who accessed what information, and when – which will be crucial to monitor under the GDPR. At the same time, automated data preparation tools are essential to reduce the burden of manual cleansing and formatting tasks on the IT team. This, in turn, will enable the IT team to focus on delivering new technologies for the organization, rather than troubleshooting legacy issues.

The rise of the cloud has created the possibility for every person in every business to be data driven – but to date, this has not been the case. Instead, organizations experience siloing and limits on innovation. The key is creating an approach to data that is built with the business objectives in mind. A successful digital transformation project is centered on achieving real business outcomes, which is then operationalized by IT – making both vital players in evolving the role and use of data within an organization.

The post Why data is no longer just an IT function appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Talking about AI, Machine Learning, and APIs at API World in September

SnapLogic - Tue, 08/21/2018 - 12:44

While artificial intelligence, machine learning, and APIs are often talked about, many companies are just beginning to work with them. That’s not the case with SnapLogic as we’ve been working with these technologies for years and have productized them with our Iris AI platform, which was released about a year and a half ago. It’s[...] Read the full article here.

The post Talking about AI, Machine Learning, and APIs at API World in September appeared first on SnapLogic.

Categories: ETL

New Sources, New Data, New Tools: Large Health Insurance Group Confronts the Future of Financial Data Management in Healthcare

Talend - Fri, 08/17/2018 - 15:42

The world of healthcare never rests, and one of the largest health insurance groups in the U.S. is not alone in wanting to provide “quality care, better patient outcomes, and more engaged consumers.” They achieve this through careful consideration of how they manage their financial data. This is the backbone of providing that quality care while also managing cost. The future of healthcare data integration involves fully employing cloud technology to manage financial information while recognizing traditional patterns. For this, Talend is this insurance company’s ETL tool of choice.

In shifting from PeopleSoft to Oracle Cloud, the company realized that they needed a stronger infrastructure when it came to managing their data and also meeting the demands of a cost-benefit analysis.  They needed the right tools and the ability to use them in a cost-efficient manner.  This is where they made the investment in upgrading from using Talend’s Open Studio to Talend’s Big Data Platform.

The Question of Integration

The company’s financial cloud team receives a large quantity of data in over six different data types that come from a variety of sources (including in-house and external vendors). They are handling XML, CSV, text, and mainframe files, destined for the Oracle Cloud or the mainframe in a variety of formats. These files must be integrated and transferred between the cloud and their mainframe as inbound and outbound jobs. The ability to blend and merge this data for sorting is necessary for report creation.

It is imperative that the company be able to sort and merge files, blending the data into a canonical format that can be handled in batch and in real-time.  Source targets vary, as do the file types that they require, and these must be able to be drawn from either the mainframe or Oracle Cloud.  This is an around-the-clock unidirectional process of communication involving a multitude of user groups in their disparate locations.

Amid all of this, they must also anticipate the demands of the future, which will escalate the number of new types of data, sources, and destinations as the company grows.  A seamless integration of new data sources and destinations will reduce, if not eliminate, downtime and loss of revenue.  Ultimately, it is departmental building with a global impact.

 

From Open Studio to Big Data

The company started off like many Talend users, employing Open Studio at the design stage. It is important to note that they did not have to train in an entirely new skillset to move their infrastructure to Oracle Cloud. They used the same skillset in the same way that their on-premises integrations had always been accomplished, since Talend natively works with any cloud architecture or data landscape. This helps companies create an effective prototype. However, while Talend’s Open Studio is the most flexible tool in the open-source market, the company ultimately needed something for design execution, hence their switch to Talend’s Big Data Platform.

It was also critical that the financial cloud team was able to run their testing locally.  Fostering a DevOps culture has been critical to many IT teams because they can do their development locally.  Talend allows for local development as well as remote management.  Project development can be physically separated from the web, and project management can be handled remotely for global execution.  It can also be used anywhere and does not require a web browser, negating the need for an internet connection at this stage, and adding to the level of flexibility.

It is vital for continued development that developers do not have to depend on internet access; they should be able to work independently and from anywhere.  When the team gets together after the development stage is concluded, they can migrate the data and then upload it to another system easily.  Even managing all their tools can be done remotely.  There is limited need for an extensive infrastructure team to manage operations and this leads to further IT efficiency.

 

Utilizing User Groups

Within three weeks of being introduced to Talend, the team from this health insurance group was competent in its use. Zero-to-sixty learning was achieved through CBT web-based training bolstered with in-person instruction. The benefits of migrating to Talend are many, but the company most values the Talend User Groups as a source of continuing education.

The company realized that User Groups offer a source of practical experience that no manual could ever fully capture. The local (Chicago) user group meetups offer in-person assistance and a wealth of practical information and best practices. According to the team at this company, taking advantage of the Talend User Groups is the prescription for success.

The post New Sources, New Data, New Tools: Large Health Insurance Group Confronts the Future of Financial Data Management in Healthcare appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

CI/CD support with SnapLogic’s GitHub Cloud Integration feature

SnapLogic - Thu, 08/16/2018 - 12:32

In my previous blog post, “How to practice CI/CD the SnapLogic way,” I provided three approaches that pipeline developers and DevOps engineers can implement to support their organization’s continuous integration and continuous delivery (CI/CD) methodologies. Several approaches that are widely used today include Project Import/Export, Project Import/Export via the SnapLogic public API, and CI/CD through[...] Read the full article here.

The post CI/CD support with SnapLogic’s GitHub Cloud Integration feature appeared first on SnapLogic.

Categories: ETL

SnapLogic August 2018 Release: Intelligent connectivity meets continuous integration

SnapLogic - Wed, 08/15/2018 - 07:45

We are pleased to announce the general availability of the SnapLogic Enterprise Integration Cloud (EIC) August 2018 4.14 Release. This release brings new levels of integration and added benefits for DevOps processes, including continuous integration and continuous delivery support, new support for container management, added API integration functionalities, and expanded intelligent connectivity – all critical[...] Read the full article here.

The post SnapLogic August 2018 Release: Intelligent connectivity meets continuous integration appeared first on SnapLogic.

Categories: ETL

Data silos are the greatest stumbling block to an effective use of firms’ data

SnapLogic - Mon, 08/13/2018 - 12:43

Originally published on lse.ac.uk. Greater access to data has given business leaders real, valuable insights into the inner workings of their organizations. Those who have been ahead of the curve in utilizing the right kinds of data for the right purposes have reaped the rewards of better customer engagement, improved decision-making, and a more productive[...] Read the full article here.

The post Data silos are the greatest stumbling block to an effective use of firms’ data appeared first on SnapLogic.

Categories: ETL

Going Serverless with Talend through CI/CD and Containers

Talend - Mon, 08/13/2018 - 04:20
Why should we care about CI/CD and Containers?

Continuous integration, delivery and deployment, known as CI/CD, has become such a critical piece of every successful software project that we cannot deny the benefits it can bring to a project. At the same time, containers are everywhere right now and are very popular among developers. In practice, CI/CD allows users to gain confidence in the applications they are building by continuously testing and validating them. Meanwhile, containerization gives you the agility to distribute and run your software by building once and being able to deploy “anywhere” thanks to a standard format that companies adopt. These common DevOps practices avoid the “it works on my machine!” effect. The direct consequences are a better time to market and more frequent deliveries.

How does it fit with Talend environment?

At Talend, we want to give you the possibility to be part of this revolution by giving you access to the best of both worlds. Since the release of Talend 7.0 you can build your Talend Jobs within Docker images thanks to a standard Maven build. In addition, we also help you to smoothly plug this process into your CI/CD pipeline.

What about serverless?

The serverless piece comes at the end of it. This is the way we will deploy our Talend jobs. In fact, by shipping our jobs in containers we now have the freedom to deploy integration jobs anywhere. Among all the possibilities, a new category of services defined as serverless is emerging. Most of the major cloud providers are starting to offer serverless services for containers, such as AWS Fargate or Azure Container Instances, to name a few. They allow you to run containers without the need to manage any infrastructure (servers or clusters). You are only billed for your container usage.

These new features have been presented at Talend Connect US 2018 during the main keynote and have been illustrated with a live demo of a whole pipeline from the build to the run of the job in AWS Fargate as well as Azure ACI. In this blog post, we are going to take advantage of Jenkins to create a CI/CD pipeline. It consists of building our jobs within Docker images, making the images available in a Docker registry, and eventually calling AWS Fargate and Azure ACI to run our images.

Let’s see how to reproduce this process. If you would like to follow along, please make sure you fulfill the following requirements.

Requirements
  • Talend Studio 7.0 or higher
  • Talend Cloud Spring ’18 or higher
  • Have a Jenkins server available
  • Your Jenkins server needs to have access to a Docker daemon installed on the same machine
  • Talend CommandLine installed on the Jenkins’ machine
  • Nexus version 2 or 3
  • Have the Talend CI Builder installed and configured along with your Talend CommandLine.

* All Talend components here are available in the Talend Cloud 30 Day Trial

Talend Environment

If you are new to Talend, then let me walk you through a high-level overview of the different components. You need to start out with Talend Cloud and create a project in the Talend Management Console (TMC), then configure your project to use a Git repository to store your job artifacts. You also need to configure the Nexus settings in TMC to point to your hosted Nexus server to store any 3rd-party libraries your job may need. The Talend Cloud account provides overall project and artifact management through the TMC.

Next, you need to install the Studio and connect it to the Talend Cloud account. The Studio is the main Talend design environment where you build integration jobs. When you log in to the cloud account from the Studio following these steps, you will see the projects from the TMC; use the project that is configured with the Git repository. Follow the steps below to add the needed plugins to the Studio.

The last two components you need are the CI Builder and the Talend CommandLine, or cmdline (if you are using the cloud trial, the CommandLine tool is included with the Studio install files). When installing or using the CommandLine tool for the first time, you will need to give it a license as well; you can use the same license as your Studio. The CommandLine tool and the CI Builder are the components that allow you to take the code from the job in the Studio (really in Git) and build and deploy fully executable processes to environments via scripts. The CI Builder, along with the profile in the Studio, determines whether the target is, say, a Talend Cloud runtime environment or a container. Here are the steps to get started!

1) Create a project in Talend Cloud

First, you need to create a project in your Talend Cloud account and link your project to a GitHub repository. Please, have a look at the documentation to perform this operation.

Don’t forget to configure your Nexus in your Talend Cloud account. Please follow the documentation for configuring your Nexus with Talend cloud. As a reminder your Nexus needs to have the following repositories:

  • releases
  • snapshots
  • talend-custom-libs-release
  • talend-custom-libs-snapshot
  • talend-updates
  • thirdparty
2) Add the Maven Docker profiles to the Studio

We are going to configure the Studio by adding the Maven Docker profiles to the configuration of our project and job pom files. Please find the two files you will need here under project.xml and standalone-job.xml.

You can do so in your studio, under the menu Project Properties -> Build -> Project / Standalone Job.

You just need to replace them with the ones you had copied above. No changes are needed.

In fact, what we are really doing here is adding a new profile called “docker” using the fabric8 Maven plugin. When building our jobs with this Maven profile, openjdk:8-jre-slim will be used as the base image, then the jars of our jobs are added to this image along with a small script indicating how to run the job. Please be aware that Talend does not support OpenJDK or Alpine Linux. For testing purposes only, you can keep the openjdk:8-jre-slim image, but for production purposes you will have to build your own Java Oracle base image. For more information please refer to our supported platforms documentation.

3) Set up Jenkins

The third step is to set up our Jenkins server. In this blog post, the initial configuration of Jenkins will not be covered. If you have never used it before please follow the Jenkins Pipeline getting started guide. Once the initial configuration is completed, we will be using the Maven, Git, Docker, Pipeline and Blue Ocean plugins to achieve our CI/CD pipeline.

We are going to store our Maven settings file in Jenkins. In the settings of Jenkins (Manage Jenkins), go to “Managed files” and create a file with ID “maven-file”. Copy this file in it as in the screenshot below. Make sure to modify the CommandLine path according to your own settings and to specify your own nexus credentials and URL.

You also need to define some credentials before going into the details of the pipeline. To do so, go to “Manage Jenkins” and “Configure credentials”, then on the left “Credentials”. Look at the screenshot below:

Create four credentials for GitHub, Docker Hub, AWS and Azure. If you only plan to use AWS, you don’t need to specify your Azure credentials, and vice versa. Make sure you set your ACCESS KEY as the username and SECRET ACCESS KEY as the password for the AWS credentials.

Finally, before going through the pipeline, we must make two CLI Docker images available on the Jenkins machine. Indeed, Jenkins will use Docker images containing the AWS and Azure CLIs to run CLI commands against the different services. This is an easy way to use these CLIs without the need to install them on the machine. Here are the images we will use:

  • vfarcic/aws-cli (docker pull vfarcic/aws-cli:latest; docker tag vfarcic/aws-cli:latest aws-cli)
  • microsoft/azure-cli:latest (docker pull microsoft/azure-cli:latest; docker tag microsoft/azure-cli:latest azure-cli)

You can of course use different images at your convenience.

These Docker images need to be pulled on the Jenkins machine; this way, in the pipeline we can use the Jenkins Docker plugin’s “withDockerContainer(‘image’)” function to execute the CLI commands, as you will see later. You can find more information about running build steps inside a Docker container in the Jenkins documentation here.

Now that all the pre-requisites have been fulfilled let’s create a “New item” on the main page and choose “Pipeline”.

Once created you can configure your pipeline. This is where you will define your pipeline script (Groovy language).

You can find the script here.

Let’s go through this file and I will highlight the main steps.

At the top of the file you can set your own settings through environment variables. You have an example that you can follow with a project called “TALEND_JOB_PIPELINE” and a job “test”. The project git name should match the one in your GitHub repository. That is why the name is uppercase. Please be aware that in this script we use the job name as the Docker image name, so you cannot use underscores in your job name. If you want to use an underscore you need to define another name for your Docker image. The following environment variables must be set:

env.PROJECT_GIT_NAME = 'TALEND_JOB_PIPELINE'
env.PROJECT_NAME = env.PROJECT_GIT_NAME.toLowerCase()
env.JOB = 'test'
env.VERSION = '0.1'
env.GIT_URL = 'https://github.com/tgourdel/talend-pipeline-job.git'
env.TYPE = "" // if big data = _mr
env.DOCKERHUB_USER = "talendinc"

In this file, each step is defined by a “stage”. The first two stages are here for pulling the latest version of the job using the Git plugin.

Then comes the build of the job itself. As you can see we are utilizing the Maven plugin. The settings are in a Jenkins Config file. This is the file we added earlier in the Jenkins configuration with the maven-file ID.

In the stages “Build, Test and Publish to Nexus” and “Package Jobs as Container” the line to change is:

-Dproduct.path=/cmdline -DgenerationType=local -DaltDeploymentRepository=snapshots::default::http://nexus:8081/repository/snapshots/ -Xms1024m -Xmx3096m

Here you need to specify your own path to the CommandLine directory (relative to the Jenkins server) and your Nexus URL.

After the job has been built into a Docker image, we are going to push the image to the Docker Hub registry. For this step and the next one we will use CLIs to interact with the different third parties. As the Docker daemon should be running on the Jenkins machine, you can use the docker CLI directly. We use the withCredentials() function to get your Docker Hub username and password:

stage ('Push to a Registry') {
    withCredentials([usernamePassword(credentialsId: 'dockerhub', passwordVariable: 'dockerhubPassword', usernameVariable: 'dockerhubUser')]) {
        sh 'docker tag $PROJECT_NAME/$JOB:$VERSION $DOCKERHUB_USER/$JOB:$VERSION'
        sh "docker login -u ${env.dockerhubUser} -p ${env.dockerhubPassword}"
        sh "docker push $DOCKERHUB_USER/$JOB:$VERSION"
    }
}

The stage “Deployment environment” is simply an interaction with the user when running the pipeline. It asks whether you want to deploy your container in AWS Fargate or Azure ACI. You can remove this step if you want to have a continuous build until the deployment. This step is for demo purposes.

The next two stages are the deployment itself to AWS Fargate or Azure ACI. In each of the two stages you need to modify with your own settings. For example, in the AWS Fargate deployment stage you need to modify this line:

aws ecs run-task --cluster TalendDeployedPipeline --task-definition TalendContainerizedJob --network-configuration awsvpcConfiguration={subnets=[subnet-6b30d745],securityGroups=[],assignPublicIp=ENABLED} --launch-type FARGATE

You need to modify the name of your Fargate cluster and your task definition. For your information, you need to create them in your AWS console; you can read the documentation to achieve this operation. At the time of writing, AWS Fargate is only available in the N. Virginia region, but other regions will come. The container defined in your task definition is the one that will be created in your Docker Hub account with the name of your job as the image name. For example, it would be talendinc/test:0.1 with the default configuration in the pipeline script.

The same applies to Azure ACI, you need to specify your own resource group and container instance.

4) Configure the Command line

As a matter of fact, Maven will use the CommandLine to build your job. The CommandLine can be used in two modes: script mode and server mode. Here we will use the CommandLine in server mode. First, you need to indicate the workspace of your CommandLine (which in our case will be the Jenkins workspace). Modify the command-line.sh file as follows, with your own path to the Jenkins workspace (it depends on the pipeline name you chose in the previous step):

./Talend-Studio-linux-gtk-x86_64 -nosplash -application org.talend.commandline.CommandLine -consoleLog -data /var/jenkins_home/workspace/talend_job_pipeline startServer -p 8002

Change the Jenkins home path according to your own settings. The last thing to do is to modify the /configuration/maven_user_settings.xml file. To do so, copy-paste this file and fill in your own Nexus URL and login information.

Then launch the CommandLine in background:

$ /cmdline/commandline-linux.sh &

5) Run the pipeline

Once all the necessary configuration has been done, you can run your pipeline. To do so, go to the Open Blue Ocean view and click on the “Run” button. It will trigger the pipeline and you should see its progress:

Jenkins Pipeline to build Talend Jobs into Docker Containers

The pipeline in the context of this blog will ask you where you want to deploy your container. Choose either AWS Fargate or Azure ACI. Let’s take the Fargate example.

After the deployment has been triggered, your Fargate cluster should now have one pending task:

If you go into the detail of your task once run, you should be able to access the logs of your job:

You can now run your Talend integration job packaged in a Docker container anywhere such as:

  • AWS Fargate
  • Azure ACI
  • Google Container Engine
  • Kubernetes or OpenShift
  • And more …

Thanks to Talend CI/CD capabilities you can automate the whole process from the build to the run of your jobs. 

If you want to become cloud-agnostic and take advantage of the portability of containers, this example shows you how you can use a CI/CD tool (Jenkins in our case) to automate the build and run in different cloud container services. This is only one example among others, but being able to build your jobs as containers opens up a whole new world for your integration jobs. Depending on your use cases, you could find yourself spending far less money thanks to these new serverless services (such as Fargate or Azure ACI). You could also spend less time configuring your infrastructure and focus on designing your jobs.

If you want to learn more about how to take advantage of containers and serverless technology, join us at Talend Connect 2018 in London and Paris. We will have dedicated break-out sessions on serverless to help you go hands on with these demos. See you there!

The post Going Serverless with Talend through CI/CD and Containers appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Talend Connect Europe 2018: Liberate your Data. Become a Data Hero

Talend - Fri, 08/10/2018 - 11:03

Save the date! Talend Connect will be back in London and Paris in October

Talend will welcome customers, partners, and influencers to its annual company conference, Talend Connect, taking place in two cities, London and Paris, in October. A must-attend event for business decision makers, CIOs, data scientists, chief architects, and developers, Talend Connect will share innovative approaches to modern cloud and big data challenges, such as streaming data, microservices, serverless, API, containers and data processing in the cloud.

<< Reserve your spot for Talend Connect 2018: Coming to London and Paris >>

Talend customers from different industries including AstraZeneca, Air France KLM, BMW Group, Greenpeace and Euronext will go on stage to explain how they are using Talend’s solutions to put more data to work, faster. Our customers now see making faster decisions and monetizing data as a strategic competitive advantage. They are faced with the opportunity and challenge of having more data than ever spread across a growing range of environments that change at an increasing speed, combined with the pressure to manage this growing complexity whilst simultaneously reducing operational costs. At Talend Connect you can learn how Talend customers leverage more data across more environments to make faster decisions and support faster innovation whilst significantly reducing operational costs when compared with traditional approaches. Here’s what to expect at this year’s show.

 

Day 1: Training Day and Partner Summit

Partners play a critical role in our go-to-market plan for cloud and big data delivery and scale. Through their expertise in Talend technology and their industry knowledge, they can support organisations’ digital transformation strategies. During this first day, attendees will learn about our partner strategy and enablement, our Cloud-First strategy, as well as customer use cases.

The first day will also be an opportunity for training. Designed for developers, the two training sessions will enable attendees to get started with Talend Cloud through hands-on practice. Attendees will also be able to get certified on the Talend Data Integration solution. An experienced Talend developer will lead the participants through a review of relevant topics, with hands-on practice in a pre-configured environment.

The training day and partner summit will be organised in both London and Paris.

Day 2: User Conference

Talend Connect is a forum in which customers, partners, company executives and industry analysts can exchange ideas and best approaches for tackling the challenges presented by big data and cloud integration. This year’s conference will offer attendees a chance to gain practical hands-on knowledge of Talend’s latest cloud integration innovations, including a new product that will be unveiled during the conference.

Talend Connect provides an ideal opportunity to discover the leading innovations and best practices of cloud integration. With cloud subscriptions growing over 100% year-over-year, Talend continues to invest in this area, including serverless integration, DevOps and API capabilities.

Talend Data Master Awards

The winners of the Talend Data Master Awards will be announced at Talend Connect London. Talend Data Master Awards is a program designed to highlight and reward the most innovative uses of Talend solutions. The winners will be selected based on a range of criteria including market impact and innovation, project scale and complexity as well as the overall business value achieved.

Special Thanks to Our Sponsors

Talend Connect benefits from the support of partners including Datalytyx, Microsoft, Snowflake, Bitwise, Business&Decision, CIMT AG, Keyrus, Virtusa, VO2, Jems Group, Ysance, Smile and SQLI.

I am looking forward to welcoming users, customers, partners and all the members of the community to our next Talend Connect Europe!

The post Talend Connect Europe 2018: Liberate your Data. Become a Data Hero appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Why I hired a university professor to join our tech startup

SnapLogic - Thu, 08/09/2018 - 13:19

Originally published on minutehack.com. It’s universally accepted that AI is a game-changer and is having a huge impact on organizations across all industries. In the years to come, this isn’t something that is going to change. Indeed, AI and machine learning will be become so pervasive, making companies more innovative and allowing employees to offload[...] Read the full article here.

The post Why I hired a university professor to join our tech startup appeared first on SnapLogic.

Categories: ETL

3 Common Pitfalls in Building Your Data Lake and How to Overcome Them

Talend - Wed, 08/08/2018 - 11:36

Recently I had the chance to talk to an SVP of IT at one of the largest banks in North America about their digital transformation strategy. As we spoke, what struck me was that they described their approach to big data and digital transformation as ever-evolving: new technologies would come to market, which required new pivots and approaches to leverage these capabilities for the business. It is more important than ever to have an agile architecture that can sustain and scale with your data and analytics growth. Here are three common pitfalls we often see when building a data lake, and our thoughts on how to overcome them:

“All I need is an ingestion tool”

Ah yes, the development of a data lake is often seen as the holy grail of everything. After all, now you have a place to dump all of your data. The first issue most people run into is data ingestion: how can they collect and ingest the sheer variety and volume of data coming into a data lake? Any success in data collection is a quick win for them. So they buy a solution for data ingestion, and all the data can now be captured and collected like never before. Well, problem solved, right? Temporarily, maybe, but the real battle has just begun.

Soon enough you will realize that simply getting your data into the lake is just the start. Most data lake projects fail because the lake turns into a big data swamp with no structure, no quality, a lack of talent and no trace of where the data actually came from. Raw data is rarely useful on its own, since it still needs to be processed, cleansed, and transformed in order to provide quality analytics. This often leads to the second pitfall.

Hand coding for data lake

We have had many blogs in the past on this, but you can’t emphasize this topic enough. It’s strikingly true that hand coding may look promising in terms of initial deployment costs, but the maintenance costs can increase by upwards of 200%. The lack of big data skills, on both the engineering and analytics sides, as well as the move to the cloud, adds even more complexity to hand coding. Run through the checklist here to help you determine when and where to use custom coding for your data lake project.

Self-service

With the rising demand for faster analytics, companies today are looking for more self-service capabilities when it comes to integration. But this can easily cause peril without proper governance and metadata management in place. As many basic integration tasks may go to citizen integrators, it’s more important than ever to ask: is there governance in place to track that? Is access to your data given to the right people at the right time? Is your data lake enabled with proper metadata management so your self-service data catalog is meaningful?

Don’t look for an avocado slicer.

As the data lake market matures, everyone is looking for more and yet struggling with each phase as they go through filling, processing and managing their data lake projects. To put this in perspective, here is a snapshot of the big data landscape from VC firm FirstMark from 2012:

And this is how it looks in 2017:

The big data market landscape is growing like never before as companies are now clearer on what they need. From these three pitfalls, the biggest piece of advice I can offer is to avoid what I like to call “an avocado slicer”. Yes, it might be interesting, fancy, and work perfectly for what you are looking for, but you will soon realize it’s a purpose-built point solution that might only work for ingestion, only be compatible with one processing framework, or only work for one department’s particular needs. Instead, take a holistic approach to your data lake strategy; what you really need is a well-rounded culinary knife! Otherwise, you may end up with an unnecessary number of technologies and vendors to manage in your technology stack.

In my next post, I’ll be sharing some of the best questions to ask for a successful data management strategy.

The post 3 Common Pitfalls in Building Your Data Lake and How to Overcome Them appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

How to practice CI/CD the SnapLogic way

SnapLogic - Tue, 08/07/2018 - 14:27

With the SnapLogic Enterprise Integration Cloud’s (EIC) superior handling of different types of integrations for organizations, forward-looking companies are leveraging DevOps methodologies for their own data and application integration workflows and initiatives. CI/CD – continuous integration and continuous delivery – is a practice where code is built, integrated, and delivered in a frequent manner. This[...] Read the full article here.

The post How to practice CI/CD the SnapLogic way appeared first on SnapLogic.

Categories: ETL

How to Develop a Data Processing Job Using Apache Beam – Streaming Pipelines

Talend - Tue, 08/07/2018 - 12:21

In our last blog, we talked about developing data processing jobs using Apache Beam. This time we are going to talk about one of the most demanded things in modern Big Data world nowadays – processing of Streaming data.

The principal difference between Batch and Streaming is the type of input data source. When your data set is limited (even if it’s huge in terms of size) and it is not being updated while it is processed, then you would likely use a batching pipeline. The input source, in this case, can be anything: files, database tables, objects in object storage, etc. I want to underline one more time that, with batching, we assume that data is immutable during all the processing time and that the number of input records is constant. Why should we pay attention to this? Because even with files we can have an unlimited data stream when files are constantly added or changed. In this instance, we have to apply a streaming approach to work with the data. So, if we know that our data is limited and immutable then we need to develop a batching pipeline.

Things get more complicated when our data set is unlimited (continuously arriving) and/or mutable. Some examples of such sources might be the following – message systems (like Apache Kafka), new files in a directory (web server logs) or some other system collecting real-time data (like IoT sensors). The common theme among all of these sources is that we always have to wait for new data. Of course, we can split our data into batches (by time or by data size) and process every split in a batching way, but it would be quite difficult to apply some functions across all consumed datasets and create a whole pipeline for this. Luckily, there are several streaming engines that allow us to cope with this type of data processing easily – Apache Spark, Apache Flink, Apache Apex, Google Dataflow. All of them are supported by Apache Beam and we can run the same pipeline on different engines without any code changes. Moreover, we can use the same pipeline in batching or in streaming mode with minimal changes – one just needs to properly set the input source and voilà – everything works out of the box! Just like magic! I could only dream of this a while ago when I was rewriting my batch jobs into streaming ones.

So, enough theory – it’s time to take an example and write our first streaming code. We are going to read some data from Kafka (unbounded source), perform some simple data processing and write results back to Kafka as well.

Let’s suppose we have an unlimited stream of geo-coordinates (X and Y) of some objects on a map (for this example, let’s say the objects are cars) which arrive in real time, and we want to select only those that are located inside a specified area. In other words, we have to consume text data from a Kafka topic, parse it, filter by the specified limits and write it back into another Kafka topic. Let’s see how we can do this with the help of Apache Beam.

Every Kafka message contains text data in the following format:
id,x,y

where:
  id – unique id of the object,
  x, y – coordinates on the map (integers).

We will need to take care of records whose format is not valid and skip them.

Creating a pipeline

Much like our previous blog, where we did batch processing, we create a pipeline in the same way:

Pipeline pipeline = Pipeline.create(options);

We can elaborate our own Options object to pass command-line options into the pipeline. Please see the whole example on GitHub for more details.
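The Options interface itself is not reproduced in this post. As a rough, hypothetical sketch (the interface name FilterOptions, the descriptions and the option types are assumptions based on the getters used in the snippets below), it could look something like this:

import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Hypothetical options interface; Beam generates the implementation at runtime.
public interface FilterOptions extends PipelineOptions {

    @Description("Kafka bootstrap servers, e.g. localhost:9092")
    String getBootstrap();
    void setBootstrap(String value);

    @Description("Kafka topic to read from")
    String getInputTopic();
    void setInputTopic(String value);

    @Description("Kafka topic to write the filtered records to")
    String getOutputTopic();
    void setOutputTopic(String value);

    @Description("Upper limit for the X coordinate")
    Integer getCoordX();
    void setCoordX(Integer value);

    @Description("Upper limit for the Y coordinate")
    Integer getCoordY();
    void setCoordY(Integer value);
}

// In the main method, the command-line arguments are parsed into this interface:
FilterOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(FilterOptions.class);

Because Beam builds the implementation from the getter/setter pairs, each option can then be passed on the command line as, for example, "--inputTopic=..." or "--coordX=...".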

Then, we have to read data from the Kafka input topic. As stated before, Apache Beam already provides a number of different IO connectors, and KafkaIO is one of them. Therefore, we create a new unbounded PTransform which consumes arriving messages from the specified Kafka topic and propagates them further to the next step:

pipeline.apply(
    KafkaIO.<Long, String>read()
        .withBootstrapServers(options.getBootstrap())
        .withTopic(options.getInputTopic())
        .withKeyDeserializer(LongDeserializer.class)
        .withValueDeserializer(StringDeserializer.class))

By default, KafkaIO encapsulates every consumed message into a KafkaRecord object. The next transform simply retrieves the payload (the string value) with a newly created DoFn object:

.apply(
    ParDo.of(
        new DoFn<KafkaRecord<Long, String>, String>() {
            @ProcessElement
            public void processElement(ProcessContext processContext) {
                KafkaRecord<Long, String> record = processContext.element();
                processContext.output(record.getKV().getValue());
            }
        }
    )
)

After this step, it is time to filter the records (see the initial task stated above), but before we do that, we have to parse our string value according to the defined format. The parsing and filtering logic is encapsulated into one functional object, which is then used by Beam’s internal Filter transform.

.apply(
    "FilterValidCoords",
    Filter.by(new FilterObjectsByCoordinates(
        options.getCoordX(), options.getCoordY()))
)
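The FilterObjectsByCoordinates class is not shown in this post either (the complete version lives in the GitHub repository mentioned above). A minimal sketch of such a predicate, assuming the area is the rectangle from (0, 0) to (coordX, coordY) and that malformed records are simply dropped, might look like this:

import org.apache.beam.sdk.transforms.SerializableFunction;

// Hypothetical sketch of the predicate used with Filter.by() above.
public class FilterObjectsByCoordinates implements SerializableFunction<String, Boolean> {

    private final Integer limitX;
    private final Integer limitY;

    public FilterObjectsByCoordinates(Integer limitX, Integer limitY) {
        this.limitX = limitX;
        this.limitY = limitY;
    }

    @Override
    public Boolean apply(String input) {
        // Expected format: "id,x,y"; skip records that don't match it.
        String[] parts = input.split(",");
        if (parts.length != 3) {
            return false;
        }
        try {
            int x = Integer.parseInt(parts[1].trim());
            int y = Integer.parseInt(parts[2].trim());
            // Keep only objects inside the specified area (assumed to be 0..limitX, 0..limitY).
            return x >= 0 && x <= limitX && y >= 0 && y <= limitY;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}

Returning false for malformed records keeps the pipeline running instead of failing on bad input, which matches the requirement to skip invalid records.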

Then, we have to prepare the filtered messages to be written back to Kafka by creating new key/value pairs using Beam’s internal KV class, which can be used across different IO connectors, including KafkaIO.

.apply(
    "ExtractPayload",
    ParDo.of(
        new DoFn<String, KV<String, String>>() {
            @ProcessElement
            public void processElement(ProcessContext c) throws Exception {
                c.output(KV.of("filtered", c.element()));
            }
        }
    )
)

The final transformation is needed to write the messages into Kafka, so we simply use KafkaIO.write(), the sink implementation, for this purpose. As with reading, we have to configure this transform with some required options, like the Kafka bootstrap servers, the output topic name and the serializers for the key/value pair.

.apply(
    "WriteToKafka",
    KafkaIO.<String, String>write()
        .withBootstrapServers(options.getBootstrap())
        .withTopic(options.getOutputTopic())
        .withKeySerializer(org.apache.kafka.common.serialization.StringSerializer.class)
        .withValueSerializer(org.apache.kafka.common.serialization.StringSerializer.class)
);

In the end, we just run our pipeline as usual:

pipeline.run();
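For an unbounded pipeline like this one, you will often want the launching process to block until the pipeline is cancelled or fails; a common variant (not used in the snippet above) is:

// Block until the (unbounded) pipeline is cancelled or fails.
PipelineResult result = pipeline.run();
result.waitUntilFinish();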

This time it may seem a bit more complicated than it was in the previous blog, but, as one can easily notice, we didn’t do anything specific to make our pipeline streaming-compatible. This is the whole responsibility of the Apache Beam data model implementation, which makes it very easy to switch between batching and streaming processing for Beam users.

Building and running a pipeline

Let’s add the required dependencies to make it possible to use Beam KafkaIO:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-kafka</artifactId>
  <version>2.4.0</version>
</dependency>

<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>1.1.0</version>
</dependency>

Then, just build a jar and run it with DirectRunner to test how it works:

# mvn clean package
# mvn exec:java -Dexec.mainClass=org.apache.beam.tutorial.analytic.FilterObjects -Pdirect-runner -Dexec.args="--runner=DirectRunner"

If needed, we can add other arguments used in the pipeline with the help of the “exec.args” option. Also, make sure that your Kafka servers are available and properly specified before running the Beam pipeline. Lastly, the Maven command will launch the pipeline and run it forever until it is stopped manually (optionally, it is possible to specify a maximum running time). So, it means that data will be processed continuously, in streaming mode.

As usual, all the code for this example is published in this GitHub repository.

Happy streaming!

The post How to Develop a Data Processing Job Using Apache Beam – Streaming Pipelines appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Looking Beyond OAS 3 (Part 1)

Talend - Mon, 08/06/2018 - 10:39

After reviewing the history of the OAI and API specifications last year, I wanted to take a look at the future of the OpenAPI Initiative (OAI) and outline my calculated, plausible evolution paths for the project moving forward.

State of the OAI

Over the past year, we’ve hit several milestones. We have released the OpenAPI Specification (OAS) version 3.0, a major revision achieved by the long and intense work of the technical community with the support of the Technical Steering Committee (TSC). The OAI is now 30+ members strong, and we keep gaining new members. We are also ramping up our marketing efforts, as highlighted by the transition of the API Strategy & Practice conference from RedHat/3scale to the OAI, under the Linux Foundation umbrella.

These are impressive achievements to attain in a short amount of time, and we need to build on this momentum as support for OAS 3.0 by various API tools is quickly expanding this year. Now, what are the next opportunities in front of us? Where should we collectively focus our energy so we can enable the Open API ecosystem to thrive and expand?

Based on some recent discussions around the potential evolution of the OAS scope and our mission to help describe REST APIs, there are several compelling opportunities from which one should carefully choose. We’ll cover these options in a series of blog posts over the next few weeks, but this first post will focus on the opportunities around web API styles.

Web API styles

a) Enhanced support for Resource-driven APIs

With the active involvement of the creators of RAML and API Blueprint within the OAI, version 3 of OAS has added many enhancements including better examples, modularity and reuse. These capabilities are essential, especially when collectively designing a large set of API contracts, and helped reduce the functional gap with alternative specifications. We could continue to look at ways to close this gap even further, so that OAS eventually becomes a full replacement for RAML and API Blueprint in the future.

I grouped these three API specifications in a Resource-driven category because the notion of a Resource, identified by a Uniform Resource Identifier (URI) and manipulated via a standard set of HTTP methods (Uniform Interface), is central to the REST architecture style that they get inspiration from.

It is worth mentioning additional specs in this category, such as JSON-API, which proposes conventions to define resource fetching (sorting, pagination, filtering, linking, errors, etc.) and can help improve the consistency of REST APIs. With OAS 3 capabilities, it should be possible to capture those conventions in a base OAS file reusable by several concrete API contract files (see the $ref mechanism). It’s feasible to see the OAI supporting the JSON-API spec as an additional project next to the core OAS project one day in the future.

Projects like HAL, JSON-LD and ALPS can help experienced API designers implement long-lasting APIs by reducing the coupling between clients and servers by better supporting the hypermedia principle in REST, at the media type level. They are rather complementary with OAS and could also be candidate as additional OAI projects to increase its support for Resource-driven APIs.

Streaming APIs based on the Server-Sent Events (SSE) media type, which are compatible with REST, are also becoming more and more frequent and could be better described in OAS with potentially limited changes.

The way OAS describes web APIs couples clients and servers at the resource level, isolating the clients from lower-level server implementation details such as the more complicated underlying microservices or databases, and covering a very broad range of use cases, which has made it widely adopted in our industry.

However, there are situations where a web API is essentially a Function-driven API or a Data-driven API as illustrated below. Let’s now explore what that means and how the OAI could help describe those APIs in a standardized way.

b) Support for Function-driven APIs

For function-driven APIs (aka RPC APIs), a set of functions (aka methods or procedures) with their custom names, input and output parameters constitutes the central API piece that is described in the contract. They also tend to rely on HTTP as a simple transport protocol, ignoring its application-level capabilities in order to offer a more direct binding to programming languages, which are in the majority function-driven as well.

While W3C SOAP was the most widely deployed variant due to its popularity 10 years ago as part of the SOA (Service-Oriented Architecture) and WS-* bandwagon, it was quite complicated technically and resulted in low performance and limited support outside the two major programming environments (Java and .NET).

Because of its shortcomings, many RPC alternatives were developed, such as XML-RPC and JSON-RPC, which were already simpler than SOAP. However, the latest trend replaces textual data serialization formats with more efficient binary formats such as Apache Avro or Apache Thrift.

Today, the leading RPC project is gRPC, which was created by Google and donated to the Cloud Native Computing Foundation (CNCF), also part of the Linux Foundation. It is successful in high-performance microservices projects due to its optimized data format based on Protocol Buffers, also created by Google, and its reliance on the high-performance HTTP/2 protocol. It would be an interesting development if the gRPC project became part of the OAI, or at least if OAS supported the description of such APIs (aka gRPC service definitions).

If you are interested in this idea, there is already a related OAS request for enhancement.

c) Support for Data-driven APIs

Finally, there is a category of web APIs where a data model (or data schema) is the central piece that can be almost directly exposed. In this case, the API can be generated, offering rich data access capabilities including filtering, sorting, pagination and relationship retrieval.

Microsoft was the first to develop this idea more than ten years ago, with Google pursuing and then retiring a similar project called GData. OData is based on either Atom/XML or JSON formats and is now a mature project supported in many business tools like Excel, Tableau or SAP. It also exposes the data schema as metadata about the OData service at runtime, facilitating tool discovery. It can be compared to standards like JDBC and ODBC, but is more web-native.

Recently, alternatives to OData have emerged, first from Netflix with its Falcor project, which lets JavaScript clients manipulate a data model expressed as a JSON graph. Facebook also released GraphQL in 2016, and it has received a very good level of interest and adoption in the API developer community.

Interestingly, OData automatically exposes both a resource-driven API and a function-driven API at the same time, offering a hybrid API style. GraphQL also supports exposing custom functions in a function-driven manner, but it lacks resource-based capabilities and instead reimplements its own mechanisms, for example for caching.
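For comparison, here is what a GraphQL request typically looks like from a client, sketched in Python; GraphQL is usually served over an HTTP POST carrying a JSON body with a `query` field, and the endpoint and schema fields (customer, orders) used here are hypothetical.

```python
import requests

# A GraphQL query sent over the usual HTTP POST transport; the /graphql
# endpoint and the customer/orders fields are hypothetical schema elements.
query = """
query GetCustomer {
  customer(id: "7") {
    name
    orders {
      id
      total
    }
  }
}
"""
response = requests.post("https://api.example.com/graphql", json={"query": query})
response.raise_for_status()
print(response.json().get("data"))
```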

There is a need for this style of API in the ecosystem, and the OAI could support the description of API contracts based on data models compatible with GraphQL and OData, for example (see this OAS issue for more discussion), and potentially even host such a project if there were joint interest.

Wrapping up

Beyond the incremental updates that the community will naturally add to the OAS 3 specification (see the list of candidates here, and please contribute!), I believe the original scope of OAS could be extended to support additional web API styles. In the second part of this blog post series, I will explore common API lifecycle activities and how OAS could be extended to improve its support for them.

The post Looking Beyond OAS 3 (Part 1) appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Four ways an iPaaS accelerates Workday value

SnapLogic - Fri, 08/03/2018 - 16:40

Enterprises are flocking to cloud-based platforms like Workday to transform how they do business, yet integration and data migration challenges can trigger deployment headaches and unforeseen costs that impede time to value. Workday, which marries financial management with critical HCM functionality in areas including workforce planning, recruiting, talent management, human resource management, benefits, payroll and[...] Read the full article here.

The post Four ways an iPaaS accelerates Workday value appeared first on SnapLogic.

Categories: ETL

Making Data a Team Sport: Muscle your Data Quality by Challenging IT & Business to Work Together

Talend - Thu, 08/02/2018 - 11:29

Data Quality is often perceived as the solo task of a data engineer. In fact, nothing could be further from the truth. People close to the business are eager to work on and resolve data-related issues, as they are the first to be impacted by bad data. But they are often reluctant to update data, either because Data Quality apps are not really made for them or simply because they are not allowed to use them. That's one of the reasons bad data keeps increasing. According to Gartner, the cost of poor data quality rose by 50% in 2017, reaching an average of $15 million per year per organization. This cost will explode in the coming years if nothing is done.

But things are changing: Data Quality is increasingly becoming a company-wide strategic priority involving professionals from different backgrounds. Team sports are a fair analogy for the key ingredients needed to succeed and win any data quality challenge:

  • As in team sports, you will hardly succeed with a solo approach; the challenge must be tackled from all angles
  • As in team sports, it takes practice to make the team succeed and win
  • As in team sports, business and IT teams need the right tools, the right approach, and the right people to tackle the data quality challenge

That said, it is not as difficult as one might imagine. You just need to take up the challenge and do things the right way from the get-go.

  1. The right tools: how to fight complexity with simple but interconnected apps

There is a plethora of data quality tools on the market. Register for a big data tradeshow and you will discover plenty of data preparation, stewardship, and other tools offering various benefits in the fight against bad data. But only a few of them cover Data Quality for everyone. On one side, you will find sophisticated tools that require deep expertise for a successful deployment.

These tools are often complex and require in-depth training to deploy. Their user interface is not suitable for everyone, so only IT people can really manage them. If you have short-term data quality priorities, you will miss your deadlines. It would be like trusting a rookie to pilot a jumbo jet: the flight instruments are simply too sophisticated for things to end well.

On the other side, you will find simple and powerful apps that are often too siloed to fit into a data quality process. Even if they successfully target business people with a simple UI, they miss a big piece of the puzzle: collaborative data management. And that's precisely the challenge: success relies not only on the tools and capabilities themselves, but on their ability to talk to one another. For that, you need a platform-based solution that shares, operates on, and transfers data, actions, and models together. That's precisely what Talend provides.

You will confront multiple use cases where it is next to impossible to manage your data successfully alone. By working together, users empower themselves across the full data lifecycle, giving the business the power to overcome traditional obstacles such as cleaning, reconciling, matching, and resolving data.

  2. The right approach

It all starts with a simple three-step approach to managing data better together: analyze, improve, and control.

Analyze your data environment

Start by getting the big picture and identifying your key data quality challenges. Rather than profiling data alone with the data profiling features in Talend Studio, a data engineer can simply delegate that task to a business analyst who knows the customers best. In that case, Talend Data Preparation offers simple yet powerful features that help the team get a glimpse of data quality with in-flight indicators, such as the quality of every column in a data set. Data Preparation also allows you to easily create a preparation based on a data set.

Let's take the example of a team wishing to prepare a marketing campaign together with sales but suffering from bad data in the Salesforce CRM system. With Data Preparation, you can automatically as well as interactively profile and browse business data coming from Salesforce. Connected to Salesforce through Data Preparation, you get a clear picture of your data quality. Once you have identified a problem, you can solve it on your own with simple but powerful operations. But you've only just scratched the surface; that's where you need the expertise of a data engineer to go deeper and improve your data quality flows.

Improve your data with in-depth tools and start remediation by designing stewardship campaigns

Using Talend Studio as your data quality engine, the data engineers in your IT department get access to a wide array of very powerful features. You can, for example, separate the wheat from the chaff using a simple filter operation such as a t-filter to identify wrong email patterns or to exclude improper domain addresses from your domain list. At that stage, you need to make sure bad data is isolated within your data quality process. Once filtering is done, you will continue to improve your data, and for that you will call on others for help. Talend Studio acts as the pivot of your data quality process: from Talend Studio, you can log in with your credentials to Talend Cloud and extend data quality to users close to the business. Whether you're a business user or a data engineer, Talend Data Stewardship, now in the cloud, then allows you to launch cleaning campaigns and solve the bad data challenge with your extended team. This starts with designing your campaign.
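To make the filtering idea concrete outside of Talend Studio, here is a plain-Python sketch of the kind of rule a filter step applies: reject rows whose email does not match a simple pattern or whose domain is on an exclusion list. The regex, blocked-domain list, and sample rows are all made up for illustration.

```python
import re

# Plain-Python illustration of the email-pattern filtering described above;
# the regex, blocked-domain list, and sample rows are illustrative only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
BLOCKED_DOMAINS = {"test.com", "example.org"}

rows = [
    {"email": "jane.doe@acme.com"},
    {"email": "broken-address"},
    {"email": "spam@test.com"},
]

valid_rows, rejected_rows = [], []
for row in rows:
    email = row["email"]
    domain = email.rsplit("@", 1)[-1].lower()
    if EMAIL_RE.match(email) and domain not in BLOCKED_DOMAINS:
        valid_rows.append(row)
    else:
        rejected_rows.append(row)  # isolated for a remediation/stewardship pass

print(len(valid_rows), "valid rows,", len(rejected_rows), "rows to remediate")
```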

With the same UI look and feel as Talend Data Preparation, Talend Data Stewardship offers the same ease of use that business users love. Because it is fully operationalized and connected to Talend Studio, it enables IT or business-process people to extend data quality operations to people who are unfamiliar with technical tools but keen to clean data with simple apps, relying on their business knowledge and experience.

That’s the essence of collaborative data management: one app for each dedicated operation but seamlessly connected on a single platform that manages your data from ingestion to consumption.

As an example, feel free to view this webinar to learn how to use cloud-based tools to make data better for all: https://info.talend.com/en_tld_better_dataquality.html

Control your data quality process to the last mile with the whole network of stewards

Once you have designed your stewardship campaign, you need to call on stewards for help and run the campaign so they can check the data at their disposal. Talend Data Stewardship plays a massive role here. Unlike other tools on the market, its ability to extend data quality to stewards through UI-friendly applications makes it easier to resolve your data and ensures that key business contributors are engaged in an extended data resolution campaign. They will feel comfortable resolving business data using simple apps.

Engaging business people in your data quality process brings several other benefits too. You will get more accurate results, as business analysts have the experience and skills required to choose the proper data. You will soon realize that they feel committed and are eager to cooperate and work with you, since they are ultimately the ones most affected by data quality.

Machine learning acts here as a virtual companion to your data-driven strategy: as stewards complete missing details, the machine learning capabilities of Talend's data quality solutions learn from them and predict future matching records based on the initial records the stewards have resolved. As the system learns from users, it frees your hands to pursue other stewardship campaigns and reinforces the impact and control of your data processes.

Finally, you build a data flow from your stewardship campaign back to your Salesforce CRM system so that the bad data cleaned and resolved by stewards is reinjected into Salesforce. Such operations can only be achieved simply if your apps are connected together on a single platform. You can also mark data sets as certified directly in a business app like Data Preparation, so that users accessing the data work with cleaned and trusted data.

Remember, this three-step approach is a continuous improvement process that will only get better with time.

To learn more about Data Quality, please download our Definitive Guide to Data Quality

The post Making Data a Team Sport: Muscle your Data Quality by Challenging IT & Business to Work Together appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL