

API World recap and self-service integration with AI

SnapLogic - Thu, 09/20/2018 - 16:57

It’s been about a week since I attended API World in San Jose, California to present an hour-long keynote, “Supercharging Self-Service API Integration with AI,” and I wanted to share some of my takeaways from this great conference. For those who were unable to attend, I shared SnapLogic’s journey with machine learning (ML), going from[...] Read the full article here.

The post API World recap and self-service integration with AI appeared first on SnapLogic.

Categories: ETL

Why Organizations Are Choosing Talend vs Informatica

Talend - Thu, 09/20/2018 - 15:10

Data has the power to transform businesses in every industry, from finance to retail to healthcare. 2.5 quintillion bytes of data are created every day, and the volume of data is doubling each year. The low cost of sensors, ubiquitous networking, cheap processing in the cloud, and dynamic computing resources are not only increasing the volume of data but also intensifying the enterprise imperative to do something with it. Plus, not only is there more data than ever, but also more people than ever who want to work with it to create business value.

Businesses that win today are using data to set themselves apart from the competition, transform customer experience, and operate more efficiently. But with so much data, and so many people who want to use it, extracting business value out of it is nearly impossible — and certainly not scalable — without data management software. But what is the right software to choose? And what criteria should data and IT professionals be looking for when selecting data integration software?

In a recent survey of 299 Talend users, conducted with TechValidate, respondents expressed clear preferences for data integration tools that had the following characteristics:

• Flexibility. Respondents wanted data integration tools that can connect to any number of data sources, wherever they happen to be located (in the cloud, on-premises, in a hybrid infrastructure, etc.).
• Portability. Respondents wanted the ability to switch deployment environments at the touch of a button.
• Ease of use. Respondents wanted data integration tools that are easy to use, with an intuitive interface.

Large majorities of survey respondents who selected Talend vs Informatica made their choice based on those factors.

Talend vs Informatica: A Common Choice

Data and IT professionals have numerous choices when deciding how to manage their enterprise data for business intelligence and analytics. Among our customers surveyed, we found the most common choice was between Talend and Informatica, followed by Talend vs hand coding.

The reasons to use a tool over traditional hand-coded approaches to data processes like ETL are numerous; interestingly, survey respondents that chose Talend over hand-coding see productivity gains of 2x or more.



They also find that their maintenance costs are reduced when they use Talend over hand coding. Clearly, choosing a data management tool like Talend over hand-coding data integrations is the right choice. But when organizations are trying to decide between tools, what factors are they considering?

Talend vs Informatica: Talend is More Flexible and Easier to Use

As we've seen, customers who chose Talend over Informatica cited Talend's flexibility, portability, and ease of use as differentiating factors. In fact, 95% of these customers said that Talend's flexibility and open source architecture distinguished it from the competition, 90% cited portability as a competitive differentiator, and 85% noted that ease of use set Talend apart as well.

Given the growing impact of cloud data integration on the data management landscape, these preferences make sense. The increasing amount of data spread across a wide variety of environments must be processed and analyzed efficiently, and enterprises need to be able to change cloud providers and servers as easily as possible, so flexibility and portability gain greater importance. You don't want your data management tools to hold you back from your digital transformation goals. And with more people than ever wanting access to data, easy-to-use tools become essential for serving all the lines of business that want and need data for their analytics operations.

Talend: A Great Choice for Cloud Data Integration

Customers who are using Talend find its open-source architecture and collaborative tools useful for a number of business objectives, including using data to improve business efficiency and improving data governance.



Talend has proved extremely useful in helping organizations get true value out of their data.

If you're considering using data management software, why not try Talend free for 30 days and see what results you can achieve for your business? Data can be truly transformative. Harness it with an open-source, scalable, easy-to-manage tool.

The post Why Organizations Are Choosing Talend vs Informatica appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

The data journey: From the data warehouse to data marts to data lakes

SnapLogic - Wed, 09/19/2018 - 15:59

With data increasingly recognized as the corporate currency of the digital age, new questions are being raised as to how that data should be collected, managed, and leveraged as part of an overall enterprise data architecture. Data warehouses: Model of choice For the last few decades, data warehouses have been the model of choice, used[...] Read the full article here.

The post The data journey: From the data warehouse to data marts to data lakes appeared first on SnapLogic.

Categories: ETL

The Big Data Debate: Batch vs. Streaming Processing

Talend - Tue, 09/18/2018 - 12:48

While data is the new currency in today's digital economy, it's still a struggle to keep pace with the changes in enterprise data and the growing business demands for information. That's why companies are liberating data from legacy infrastructures by moving over to the cloud to scale data-driven decision making. This ensures that their precious resource, data, is governed, trusted, managed, and accessible.

While businesses can agree that cloud-based technologies are key to ensuring the data management, security, privacy and process compliance across enterprises, there’s still an interesting debate on how to get data processed faster — batch vs. stream processing.

Each approach has its pros and cons, but your choice of batch or streaming all comes down to your business use case. Let’s dive deep into the debate to see exactly which use cases require the use of batch vs. streaming processing.

Batch vs. Stream Processing: What’s the Difference?

A batch is a collection of data points that have been grouped together within a specific time interval; another term often used for this is a window of data. Stream processing deals with continuous data and is key to turning big data into fast data. Both models are valuable, and each can be used to address different use cases. And to make it even more confusing, you can process windows of batched data within a stream, often referred to as micro-batches.

While the batch processing model requires a set of data collected over time, streaming processing requires data to be fed into an analytics tool, often in micro-batches, and in real time. Batch processing is often used when dealing with large volumes of data or data sources from legacy systems, where it's not feasible to deliver data in streams. Batch data also, by definition, requires all the data needed for the batch to be loaded to some type of storage (a database or file system) before processing. At times, IT teams may sit idle waiting for all the data to load before the analysis phase can start.

Data streams can also be involved in processing large quantities of data, but batch works best when you don't need real-time analytics. Because streaming processing is in charge of processing data in motion and providing analytics results quickly, it generates near-instant results using platforms like Apache Spark and Apache Beam. For example, Talend's recently announced Talend Data Streams is a free Amazon Marketplace application, powered by Apache Beam, that simplifies and accelerates ingestion of massive volumes and wide varieties of real-time data.
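The difference between the two models can be sketched in a few lines of Python. This is a toy illustration, not Spark or Beam code: the batch function must wait for the whole window of data before producing an answer, while the streaming function yields an updated answer after every event.

```python
from typing import Iterable, Iterator, List

def batch_average(readings: List[float]) -> float:
    """Batch model: wait until the whole window is collected, then process."""
    return sum(readings) / len(readings)

def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Streaming model: update the result incrementally as each point arrives."""
    total, count = 0.0, 0
    for value in readings:
        total += value
        count += 1
        yield total / count  # a fresh result is available after every event

sensor_values = [20.0, 22.0, 21.0, 23.0]
print(batch_average(sensor_values))            # one answer, after all data is in
print(list(streaming_average(sensor_values)))  # intermediate answers along the way
```

The trade-off is visible even here: the streaming version gives partial answers immediately, at the cost of bookkeeping state between events.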

Is One Better Than the Other?

Whether you are pro-batch or pro-stream processing, both are better when working together. Although streaming processing is best for use cases where time matters, and batch processing works well when all the data has been collected, it’s not a matter of which one is better than the other — it really depends on your business objective.

Watch Big Data Integration across Any Cloud now.
Watch Now

However, we've seen a big shift in companies trying to take advantage of streaming. A recent survey of more than 16,000 data professionals showed that the most common challenges in data science include everything from dirty data to overall access or availability of data. Unfortunately, streaming tends to accentuate those challenges because data is in motion. Before jumping into real-time, it is key to solve those accessibility and data quality issues.

When we talk to organizations about how they collect data and accelerate time-to-innovation, they usually share that they want data in real time, which prompts us to ask, "What does real-time mean to you?" The business use cases may vary, but real-time depends on how close the processing time is to the event or data creation time, which could be every hour, every five minutes, or every millisecond.

To draw an analogy for why organizations would convert their batch data processes into streaming data processes, let’s take a look at one of my favorite beverages—BEER. Imagine you just ordered a flight of beers from your favorite brewery, and they’re ready for drinking. But before you can consume the beers, perhaps you have to score them based on their hop flavor and rate each beer using online reviews. If you know you have to complete this same repetitive process on each beer, it’s going to take quite some time to get from one beer to the next. For a business, the beer translates into your pipeline data. Rather than wait until you have all the data for processing, instead you can process it in micro batches, in seconds or milliseconds (which means you get to drink your beer flight faster!).
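The micro-batch idea from the beer analogy is easy to sketch: instead of waiting for the whole flight, you carve the incoming stream into small fixed-size groups and process each group as soon as it is full. A minimal illustration (not tied to any particular streaming engine):

```python
from itertools import islice
from typing import Iterable, Iterator, List

def micro_batches(stream: Iterable[int], size: int) -> Iterator[List[int]]:
    """Group a continuous stream into fixed-size micro-batches.

    The last batch may be smaller if the stream ends mid-window.
    """
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Ten sensor readings processed three at a time instead of all at once.
print(list(micro_batches(range(10), 3)))  # -> [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

Real engines usually window by time (e.g. everything that arrived in the last second) rather than by count, but the principle is the same: small frequent batches approximate a continuous stream.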

Why Use One Over the Other?

If you don’t have a long history working with streaming processing, you may ask, “Why can’t we just batch like we used to?” You certainly can, but if you have enormous volumes of data, it’s not a matter of when you need to pull data, but when you need to use it.

Companies view real-time data as a game changer, but it can still be a challenge to get there without the proper tools, particularly because businesses need to work with increasing volumes, varieties and types of data from numerous disparate data systems such as social media, web, mobile, sensors, the cloud, etc. At Talend, we’re seeing enterprises typically want to have more agile data processes so they can move from imagination to innovation faster and respond to competitive threats more quickly. For example, data from the sensors on a wind turbine are always-on. So, the stream of data is non-stop and flowing all the time. A typical batch approach to ingest or process this data is obsolete as there is no start or stop of the data. This is a perfect use case where stream processing is the way to go.

The Big Data Debate

It is clear enterprises are shifting priorities toward real-time analytics and data streams to glean actionable information in real time. While outdated tools can’t cope with the speed or scale involved in analyzing data, today’s databases and streaming applications are well equipped to handle today’s business problems.

Here’s the big takeaway from the big data debate: just because you have a hammer doesn’t mean that’s the right tool for the job. Batch and streaming processing are two different models and it’s not a matter of choosing one over the other, it’s about being smart and determining which one is better for your use case.

The post The Big Data Debate: Batch vs. Streaming Processing appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Serverless: A Game Changer for Data Integration

Talend - Fri, 09/14/2018 - 17:46

The concept of cloud computing has been around for years, but cloud services truly became democratized with the advent of virtual machines and the launch of Amazon Elastic Compute Cloud (EC2) in 2006.

Following Amazon, Google launched Google App Engine in 2008, and then Microsoft launched Azure in 2010.

At first, cloud computing offerings were not all that different from each other. But as with nearly every other market, segmentation quickly followed growth.

In recent years, the cloud computing market has grown large enough for companies to develop more specific offers with the certainty that they’ll find a sustainable addressable market. Cloud providers went for ever more differentiation in their offerings, supporting features and capabilities such as artificial intelligence/machine learning, streaming and batch, etc.

The very nature of cloud computing, the abundance of offerings and the relative low cost of services took segmentation to the next level, as customers were able to mix and match cloud solutions in a multi-cloud environment. Hence, instead of niche players addressing the needs of specific market segments, many cloud providers can serve the different needs of the same customers.

Introduction to Serverless

The latest enabler of this ultra-segmentation is serverless computing. Serverless is a model in which the cloud provider acts as the server, dynamically managing the allocation of resources and time. Pricing is based on the actual amount of resources consumed by an application, rather than on pre-purchased units of capacity.

With this model, server management and capacity planning decisions are hidden from users, and serverless code can be used in conjunction with code deployed in microservices.
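The pay-per-use pricing model described above can be made concrete with some back-of-the-envelope arithmetic. The rates below are hypothetical, chosen only to illustrate the shape of the comparison; real cloud pricing varies by provider, region, and tier.

```python
# Hypothetical rates for illustration only -- not any provider's actual prices.
PRICE_PER_GB_SECOND = 0.0000166667   # serverless: billed per GB-second consumed
PRICE_PER_SERVER_HOUR = 0.10         # always-on VM: billed whether used or not

def serverless_monthly_cost(invocations: int, avg_duration_s: float,
                            memory_gb: float) -> float:
    """Serverless bills only for resources actually consumed per invocation."""
    gb_seconds = invocations * avg_duration_s * memory_gb
    return gb_seconds * PRICE_PER_GB_SECOND

def server_monthly_cost(hours: int = 730) -> float:
    """A pre-provisioned server bills for every hour, idle or not."""
    return hours * PRICE_PER_SERVER_HOUR

# A bursty integration job: 100,000 runs/month, 2 s each, 0.5 GB of memory.
print(round(serverless_monthly_cost(100_000, 2.0, 0.5), 2))
print(round(server_monthly_cost(), 2))
```

For spiky, event-driven workloads the consumed GB-seconds are a small fraction of a month of server hours, which is why the pay-as-you-go model changes the economics; for a constantly saturated workload the comparison can flip the other way.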

As research firm Gartner Inc. has pointed out, “serverless computing is an emerging software architecture pattern that promises to eliminate the need for infrastructure provisioning and management.” IT leaders need to adopt an application-centric approach to serverless computing, the firm says, managing application programming interfaces (APIs) and service level agreements (SLAs), rather than physical infrastructures.

The concept of serverless is typically associated with Functions-as-a-Service (FaaS). FaaS is a perfect way to deliver event-based, real-time integrations. FaaS cannot be thought of without container technologies, both because containers power the underlying functions infrastructure and because containers themselves are perfect for long-running, compute-intensive workloads.

The beauty of containers lies in big players such as Google, AWS, Azure, Red Hat, and others working together to create a common container format. This is very different from what happened with virtual machines, where AWS created AMI, VMware created VMDK, Google created Google Image, and so on. With containers, IT architects can work with a single package that runs everywhere. This package can contain a long-running workload or just a single service.

Serverless and Continuous Integration

Serverless must always be used together with continuous integration (CI) and continuous delivery (CD), helping companies reduce time to market. When development time is reduced, companies can deliver new products and new capabilities more quickly, something that's extremely important in today's market. CI/CD absorbs the additional complexity that comes with a fine-grained serverless deployment model. Check out how to go serverless with Talend through CI/CD and containers here.

Talend Cloud supports a serverless environment, enabling organizations to easily access all cloud platforms; leverage native performance; deploy built-in security, quality, and data governance; and put data into the hands of business users when they need it.

Talend’s strategy is to help organizations progress on a journey to serverless, beginning with containers-as-a-service, to function-as-a-service, to data platform-as-a-service, for both batch and streaming. It’s designed to support all the key users within an organization, including data engineers, data scientists, data stewards, and business analysts.

An organization’s data integration backbone has to be native and portable, according to the Talend approach. Code native means there is no additional runtime and no additional development needed. Not even the code becomes proprietary, so there is no lock-in to a specific environment. This enables flexibility, scale and performance.

The benefits of serverless are increased agility, unlimited scalability, simpler maintenance, and reduced costs. It supports a multi-cloud environment and brings the pay-as-you-go model to reality.

The serverless approach makes data-driven strategies more sustainable from a financial point of view. And that’s why serverless is a game changer for data integration. Now there are virtually infinite possibilities for data on-demand. Organizations can decide how, where, and when they process data in a way that’s economically feasible for them.

The post Serverless: A Game Changer for Data Integration appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

How to Spot a DevOps Faker: 5 Questions, 5 Answers

Talend - Thu, 09/13/2018 - 18:55

With the rapid growth of DevOps teams and jobs, it follows that there are candidates out there who are inflating, or flat-out faking, their relevant skills and experience. We sat down with Nick Piette, Director of Product Marketing, API Integration Products here at Talend, to get the inside scoop on how to spot the DevOps fakers in the crowd:

What clues should you look for on a resume or LinkedIn profile that someone is faking their DevOps qualifications?

Nick: For individuals claiming DevOps experience, I tend to look for the enabling technologies we've seen pop up since the concept's inception. What I'm looking for often depends on where they are coming from. If I see they have solid programming experience, I look for complementary examples where the candidate mentions experience with source control management (SCM), build automation, or containerization technologies. I'm also looking at what infrastructure monitors and configuration management tools they have used in the past. The opposite is true when candidates come from an operations background: do they have coding experience, and are they proficient in the latest domain-specific languages?

What signs should you look for in an interview? How should you draw these out?

Nick: DevOps is a methodology. I ask interviewees to provide concrete examples of overcoming some of the challenges many organizations run into, how the candidate’s team reduced the cost of downtime, how they handled the conversion of existing manual tests to automated tests, what plans they implemented to prevent code getting to the main branch, what KPIs were used to measure and dashboard. The key is the detail–individuals who are vague and lack attention to detail raise a red flag from an experience standpoint.

Do you think DevOps know-how is easier to fake (at least up to a point) than technical skills that might be easier caught in the screening/hiring process?

Nick: Yes, if the interviewer is just checking for understanding vs. experience. It’s easier to read up on the methodology and best practices and have book smarts than it is to have the technology experience and street smarts. Asking about both during an interview makes it harder to fake.

How can you coach people who turn out to have DevOps-related deficiencies?

Nick: Every organization is different, so we always expect some sort of deficiency related to the process. We do the best we can to ensure everything here is documented. We’re also practicing what we preach–it’s a mindset and a company policy.

Should we be skeptical of people who describe themselves as “DevOps gurus,” “DevOps ninjas,” or similar in their online profiles?

Nick: Yes. There is a difference between being an early adopter and an expert. While aspects of this methodology have been around for a while, momentum really started over the last couple years. You might be an expert with the technologies, but DevOps is much more than that.



The post How to Spot a DevOps Faker: 5 Questions, 5 Answers appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Three reasons organizations need self-service integration now

SnapLogic - Thu, 09/13/2018 - 14:11

Consumer behavior today is very different than it was 10 years ago – the rise of cloud applications and mobile apps now let people complete their goals and intents with a simple tap or click. Information is more accessible than ever and complex tasks can be completed effortlessly. A simple example worth repeating – anyone[...] Read the full article here.

The post Three reasons organizations need self-service integration now appeared first on SnapLogic.

Categories: ETL

Creating Real-Time Anomaly Detection Pipelines with AWS and Talend Data Streams

Talend - Wed, 09/12/2018 - 13:35

Thanks for continuing to read all of our streaming data use cases during my exploration of Talend Data Streams. For the last article of this series, I wanted to walk you through a complete IoT integration scenario using a low consumption device and leveraging only cloud services.

In my previous posts, I've used a Raspberry Pi and some sensors as my main devices. This single-board computer is pretty powerful, and you can install a light version of Linux on it. But in real life, enterprises will probably use system-on-chip (SoC) devices such as an Arduino, a PLC, or an ESP8266. These SoCs are less powerful, consume less energy, and are mostly programmed in C, C++, or Python. I'll be using an ESP8266, which has embedded Wi-Fi and some GPIO pins to attach sensors. If you want to know more about IoT hardware, have a look at my last article, "Everything You Need to Know About IoT: Hardware".

Our use case is straightforward. First, the IoT device will send sensor values to Amazon Web Services (AWS) IoT using MQTT. Then we will create a rule in AWS IoT to redirect device payload to a Kinesis Stream. Next, from Talend Data Streams we will connect to the Kinesis stream, transform our raw data using standard components. Finally, with the Python processor, we will create an anomaly detection model using Z-Score and all anomalies will be stored in HDFS.

<<Download Talend Data Streams for AWS Now>>


If you want to build your pipelines along with me, here’s what you’ll need:

  • An Amazon Web Services (AWS) account
  • AWS IoT service
  • AWS Kinesis streaming service
  • AWS EMR cluster (version 5.11.1 and Hadoop 2.7.X) on the same VPC and Subnet as your Data Streams AMI.
  • Talend Data Streams from Amazon AMI Marketplace. (If you don’t have one follow this tutorial: Access Data Streams through the AWS Marketplace)
  • An IoT device (can be replaced by any IoT data simulator)
High-Level Architecture

Currently, Talend Data Streams doesn't feature an MQTT connector. To get around this, you'll find an architecture sample below that leverages Talend Data Streams to ingest IoT data in real time and store it in a Hadoop cluster.

Preparing Your IoT Device

As mentioned previously, I'm using an ESP8266 (also called a NodeMCU). It has been programmed to:

  • Connect to a Wi-Fi hotspot
  • Connect securely to AWS IoT broker using the MQTT protocol
  • Read distance, temperature, and humidity sensor values every second
  • Publish the sensor values over MQTT to the IoT topic

If you are interested in how to develop an MQTT client on the ESP8266, take a look at this link. Alternatively, you could use any device simulator.
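If you don't have a device on hand, a simulator can be sketched in a few lines of Python with the paho-mqtt client library. The endpoint, topic, and certificate file names below are placeholders you would replace with your own AWS IoT endpoint and the files downloaded during thing registration; the payload format (semicolon-separated values) is chosen to match the CSV/semicolon dataset configured later in the pipeline.

```python
import ssl
import time

# Hypothetical values -- substitute your AWS IoT endpoint and topic.
ENDPOINT = "xxxxxxxx-ats.iot.us-east-1.amazonaws.com"
TOPIC = "IoT"

def make_payload(distance: float, temperature: float, humidity: float) -> str:
    """Build one semicolon-delimited sensor message (distance;temp;humidity)."""
    return f"{distance};{temperature};{humidity}"

def publish_forever(ca_path: str, cert_path: str, key_path: str) -> None:
    """Connect to AWS IoT over TLS and publish a reading every second."""
    import paho.mqtt.client as mqtt  # pip install paho-mqtt

    client = mqtt.Client()
    client.tls_set(ca_certs=ca_path, certfile=cert_path, keyfile=key_path,
                   tls_version=ssl.PROTOCOL_TLSv1_2)
    client.connect(ENDPOINT, 8883)
    client.loop_start()
    while True:
        client.publish(TOPIC, make_payload(42.0, 21.5, 55.0))
        time.sleep(1)  # one reading per second, like the ESP8266 sketch

if __name__ == "__main__":
    # File names are placeholders for the certificates you downloaded earlier.
    publish_forever("root-CA.pem", "device.cert.pem", "device.private.key")
```

This is a sketch, not production code: a real simulator would vary the sensor values and handle reconnects.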

IoT Infrastructure: AWS IoT and Kinesis



The AWS IoT service is a secure, managed MQTT broker. In this first step, I'll walk you through registering your device and generating the public/private key pair and CA certificate needed to connect securely.

First, log in to your Amazon Web Services account and look for IoT. Then, select IoT Core in the list.

Register your connected thing. From the left-hand side menu click on “Manage”, select “Things” and click on “Create”.

Now, select "Create a single thing" from the list of options (alternatively, you can select "Create many things" for bulk registration of things).

Now give your thing a name (you can also create device types, groups and other searchable attributes). For this example, let’s keep default settings and click on next.

Now secure the device authentication using "One-click certificate creation": click on "Create a Certificate".

Download all the files; they have to be stored on the edge device and used with the MQTT client to connect securely to AWS IoT. Then click on "Activate" and "Done".

In order to allow our device to publish messages and subscribe to topics, we need to attach a policy. From the menu, click on "Secure", select "Policies", then click on "Create".

Give a name to the policy. Under Action, start typing "iot" and select it. To allow all actions, tick the "Allow" box below and click on "Create".

Let's attach this policy to a certificate. From the left menu, click on "Secure", select "Certificates", and click on the certificate of your thing.

If you have multiple certificates, click on "Things" to make sure you have the right certificate. Next, click on "Actions" and select "Attach Policy".

Select the policy we’ve just created and click on “Attach”.

Your thing is now registered and can connect, publish messages and subscribe to topics securely! Let’s test it (it’s now time to turn on the ESP).

Testing Your IoT Connection in AWS

From the menu click on Test, select Subscribe to a topic, type IoT for a topic and click on “Subscribe to Topic”. 

You can see that sensor data is being sent to the IoT topic.

Setting Up AWS Kinesis

On your AWS console search for “Kinesis” and select it.

Click on “Create data stream”.

Give your stream a name and select one shard to start out. Later on, if you add more devices, you'll need to increase the number of shards. Next, click on "Create Kinesis stream".

OK, now we are all set on the Kinesis side. Let's go back to AWS IoT: on the left menu, click on "Act" and press "Create".

Name your rule, select all the attributes by typing "*", and filter on the IoT topic.

Scroll down and click on “Add Action” and select “Sends messages to an Amazon Kinesis Stream”. Then, click “Configure action” at the bottom of the page.

Select the stream you’ve previously created, use an existing role or create a new one that can access to AWS IoT. Click on “Add action” and then “Create Rule”.

We are all set at this point, the sensor data collected from the device through MQTT will be redirected to the Kinesis Stream that will be the input source for our Talend Data Streams pipeline.
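Before wiring up Data Streams, it can be useful to sanity-check that records are actually landing in the stream. A minimal consumer sketch with boto3 is below; the stream name and region are assumptions to replace with your own, and the parsing assumes the semicolon-delimited payload the device publishes.

```python
def parse_record(raw: bytes) -> dict:
    """Decode one semicolon-delimited sensor payload: distance;temp;humidity."""
    distance, temperature, humidity = raw.decode("utf-8").split(";")
    return {"distance": float(distance),
            "temperature": float(temperature),
            "humidity": float(humidity)}

def tail_stream(stream_name: str = "iot-sensor-stream",
                region: str = "us-east-1") -> None:
    """Print new records from the first shard as they arrive (Ctrl-C to stop)."""
    import boto3  # pip install boto3; credentials come from your AWS config

    kinesis = boto3.client("kinesis", region_name=region)
    shard_id = kinesis.describe_stream(StreamName=stream_name)[
        "StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name, ShardId=shard_id,
        ShardIteratorType="LATEST")["ShardIterator"]
    while True:
        out = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for rec in out["Records"]:
            print(parse_record(rec["Data"]))
        iterator = out["NextShardIterator"]

if __name__ == "__main__":
    tail_stream()
```

With one shard this single-shard reader is enough; a multi-shard stream would need one iterator per shard.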

Cloud Data Lake: AWS EMR

Currently, with the Talend Data Streams free version, you can use HDFS only with an EMR cluster. In this part, I'll describe how to provision a cluster and how to set up Data Streams to use HDFS in our pipeline.

Provision your EMR cluster

Continuing on your AWS Console, look for EMR.

Click on “Create cluster”.

Next, go to advanced options.

Let's choose a release that is fully compatible with Talend Data Streams; release 5.11.1 or below will do. Then select the components of your choice (Hadoop, Spark, Livy, Zeppelin, and Hue in my case). We are almost there, but don't click "Next" just yet.

In "Edit software settings", we are going to edit the core-site.xml applied when the cluster is provisioned, in order to use the specific compression codecs required for Data Streams and to allow root impersonation.

Paste the following code to the config:

[
  {
    "Classification": "core-site",
    "Properties": {
      "io.compression.codecs": ",,,",
      "hadoop.proxyuser.root.hosts": "*",
      "hadoop.proxyuser.root.groups": "*"
    }
  }
]

On the next step, select the same VPC and subnet as your Data Streams AMI and click “Next”. Then, name your cluster and click “Next”.

Select an EC2 key pair, go with the default settings for the rest, and click on "Create Cluster". After a few minutes, your cluster should be up and running.

Talend Data Streams and EMR set up

Still on your AWS Console, look for EC2.

You will find three new instances with blank names that we need to rename. Then, by looking at the security groups, you can identify which one is the master node.

Now we need to connect to the master node through SSH (check that your client computer can access port 22; if not, add an inbound security rule to allow your IP). Because we need to retrieve the Hadoop config files, I'm using Cyberduck (alternatively, use FileZilla or any tool that supports SFTP). Use the EC2 DNS for the server, "hadoop" as the user, and the related EC2 key pair to connect.

Now, using your favorite SFTP tool, connect to your Data Streams EC2 machine using the ec2-user (allow your client to access port 22). If you don't have the Data Streams free AMI yet, follow this tutorial to provision one: Access Data Streams through the AWS Marketplace.

Navigate to /opt/data-streams/extras/etc/hadoop and copy in the Hadoop config files retrieved from the master node. NOTE: The folders /etc/hadoop might not exist in /opt/data-streams/extras/, so you may need to create them.

Restart your Data Streams EC2 machine so that it will start to pick up the Hadoop config files.

The last step is to allow all traffic from Data Streams to your EMR cluster and vice versa. To do so, create security rules to allow all traffic inbound on both sides for Data Streams and EMR security groups ID.

Talend Data Streams: IoT Streaming pipeline

<<Download Talend Data Streams for AWS Now>>

Now it's time to finalize our real-time anomaly detection pipeline, which uses a Z-score. This pipeline is based on my previous article, so if you want to understand the math behind the scenes, give it a read.
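The core of the Z-score approach is small enough to sketch in plain Python before translating it into the pipeline's Python processor. This is an illustrative standalone version, not the exact processor code: it computes each value's distance from the mean in units of standard deviation and flags values beyond a threshold. Note that with small samples the maximum attainable Z-score is capped (roughly (n-1)/sqrt(n)), so the threshold here is set below the textbook 3.0.

```python
from statistics import mean, stdev
from typing import List

def zscore_anomalies(values: List[float], threshold: float = 2.5) -> List[float]:
    """Return the values whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; needs >= 2 values
    return [v for v in values
            if sigma > 0 and abs((v - mu) / sigma) > threshold]

# Nine normal temperature readings and one obvious outlier.
readings = [21.0, 21.5, 20.8, 21.2, 21.1, 20.9, 21.3, 21.0, 20.7, 95.0]
print(zscore_anomalies(readings))  # -> [95.0]
```

In the streaming pipeline the same logic would run over a rolling window of recent values rather than a fixed list, so the mean and standard deviation track the sensor's recent behavior.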

All the infrastructure is in place and the required setup is done, so we can now start building some pipelines. Log on to your Data Streams free AMI using the public IP and the instance ID.

Create your Data Sources and add Data Set

In this part, we will create two data sources:

  1. Our Kinesis Input Stream
  2. HDFS using our EMR cluster

From the landing page select Connection on the left-hand side menu and click on “ADD CONNECTION”.

Give a name to your connection, and for the Type select “Amazon Kinesis” in the drop-down box.

Now use an IAM user that has access to Kinesis with an access key. Fill in the connection fields with the Access Key and Secret, click on “Check connection”, and then click on “Validate”. From the left-hand side menu, select Datasets and click on “ADD DATASET”.

Give your dataset a name and select the Kinesis connection we created before from the drop-down box. Select the region of your Kinesis stream, then your stream, CSV for the format, and Semicolon for the delimiter. Once that is done, click on “View Sample”, then “Validate”.

Our input data source is set up and our samples are ready to be used in our pipeline. Let’s create our output data source connection: on the left-hand side menu, select “CONNECTIONS”, click on “ADD CONNECTION”, and give a name to your connection. Then select “HDFS” for the type, use “hadoop” as the user name, and click on “Check Connection”. If it connects successfully, click on “Validate”.

That should do it for now; we will create the dataset within the pipeline. But before going further, make sure the Data Streams AMI has access to the EMR master and slave nodes (add an inbound network security rule to the EMR EC2 machines allowing all traffic from the Data Streams security group), or you will not be able to read from and write to the EMR cluster.

Build your Pipeline

From the left-hand side menu select Pipelines, click on Add Pipeline.

In the pipeline, click Create source on the canvas, select your Kinesis stream, and click on Select Dataset.

Back in the pipeline canvas, you can see the sample data at the bottom. As you’ve noticed, incoming IoT messages are quite raw at this point. Let’s convert the current values from strings to numbers: click on the green + sign next to the Kinesis component and select the Type Converter processor.

Let’s convert all our fields to “Integer”. To do that, select the first field (.field0) and change the output type to Integer. To change the type of the remaining fields, click on NEW ELEMENT. Once you have done this for all fields, click on SAVE.
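As a rough illustration of what the Type Converter step does (the sample values here are made up; Data Streams performs this conversion for you):

```python
# Raw Kinesis samples arrive with every field as a string; the Type Converter
# casts each field to an integer so the downstream math works on numbers.
raw = {"field0": "17", "field1": "41", "field2": "295"}
converted = {name: int(value) for name, value in raw.items()}
print(converted)  # {'field0': 17, 'field1': 41, 'field2': 295}
```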

Next to the Type Converter processor on your canvas, click the green + sign and add a Window processor; in order to calculate a Z-score, we need to define a processing window.

Now let’s set up our window. My ESP8266 sends sensor values every second, and I want to create a fixed-time window that contains roughly 20 values, so I’ll set Window duration = Window slide length = 20000 ms. Don’t forget to click Save.

Since I’m only interested in humidity, which I know is in field1, I’ll make things easier for myself later by converting the humidity row values in my window into a list of values (an array in Python) by aggregating on the field1 (humidity) data. To do this, add an Aggregate processor next to the Window processor. Within the Aggregate processor, choose .field1 as your Field and List as the Operation (since you will be aggregating field1 into a list).
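Conceptually, the Window and Aggregate steps behave like the small Python sketch below. Data Streams handles this for you; the field layout and sensor values are assumed for illustration.

```python
def fixed_windows(records, size=20):
    """Split the record stream into non-overlapping windows of `size` records."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def aggregate_humidity(window):
    """Collect field1 (humidity) from each record in the window into a list."""
    return {"humidity": [r["field1"] for r in window]}

# Fake a stream of 40 one-per-second sensor readings.
stream = [{"field1": 40 + (i % 5)} for i in range(40)]
windows = [aggregate_humidity(w) for w in fixed_windows(stream)]
print(len(windows), len(windows[0]["humidity"]))  # 2 windows of 20 values each
```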

The next step is to calculate the Z-score for the humidity values. In order to create a more advanced transformation, we need to use the Python processor, so next to the Aggregate processor, add a Python Row processor.

Change the Map type from FLATMAP to MAP, click on the four arrows to open the Python editor, paste the code below, and click SAVE. In the Data Preview, you can see what we’ve calculated in the Python processor: the average humidity, the standard deviation, and the Z-score array for the humidity values in the current window.

The code below is fairly simple and self-explanatory, but let me sum up the steps:

  • Calculate the average humidity within the window
  • Find the number of sensor values within the window
  • Calculate the variance
  • Calculate the standard deviation
  • Calculate Z-Score
  • Output Humidity Average, Standard Deviation, Zscore and Humidity values.
#Import standard Python libraries
import math

#average function
def mean(numbers):
    return float(sum(numbers)) / max(len(numbers), 1)

#initialize variables
std=0

#Load input list
#average value for window
avg=mean(input['humidity'])

#length of window
mylist=input['humidity']
lon=len(mylist)

# x100 in order to work around a Python limitation
lon100=100/lon

#Calculate Variance
for i in range(len(mylist)):
    std= std + math.pow(mylist[i]-avg,2)

#Calculate Standard deviation
stdev= math.sqrt(lon100*std/100)

#Re-import all sensor values within the window
myZscore=(input['humidity'])

#Calculate Z-Score for all sensor values within the window
for j in range(len(myZscore)):
    myZscore[j]= (myZscore[j]-avg)/stdev

#Output results
output['HumidityAvg']=avg
output['stdev']=stdev
output['Zscore']=myZscore

If you open up the Z-score array, you’ll see the Z-score for each sensor value.

Next to the Python processor, add a Normalize processor to flatten the Python array into records. In the “column to normalize” field, type Zscore, select the “is list” option, then save.

Let’s now recalculate the initial humidity value from the sensor. To do that, we will add a Python processor and write the code below:

#Output results
output['HumidityAvg']=input['HumidityAvg']
output['stdev']=input['stdev']
output['Zscore']=input['Zscore']
output['humidity']=round(input['Zscore']*input['stdev']+input['HumidityAvg'])


Don’t forget to change the Map type to MAP and click Save. Let’s go one step further and select only the anomalies. If you had a look at my previous article, anomalies are Z-scores that fall outside the -2 to +2 standard deviation range; in our case, the range is around -1.29 to +1.29. Now add a FilterRow processor. The product doesn’t yet allow us to filter on a range of values, so we will filter on the absolute value of the Z-score being greater than 1.29 (we test the absolute value because a Z-score can be negative).
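The filter condition is equivalent to this minimal Python sketch (the record values are invented for illustration):

```python
def is_anomaly(record, threshold=1.29):
    """Keep only records whose absolute Z-score exceeds the threshold."""
    return abs(record["Zscore"]) > threshold

records = [
    {"Zscore": 0.4,  "humidity": 41},
    {"Zscore": -1.8, "humidity": 33},
    {"Zscore": 1.3,  "humidity": 52},
]
anomalies = [r for r in records if is_anomaly(r)]
print(anomalies)  # the two records with |Zscore| > 1.29
```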

The last output shows 5 records that are anomalies out of the 50 sample records. Let’s now store those anomalies in HDFS: click on “Create a Sink” on the canvas and click on “Add Dataset”. Set it up as shown below and click on Validate.

You will end up with an error message; don’t worry, it’s just a warning that Data Streams cannot fetch a sample of a file that has not been created yet. We are now all set. Let’s run the pipeline by clicking on the play button at the top.

Let’s stop the pipeline and have a look at our cluster. Using Hue on EMR, you can easily browse HDFS; go to user/hadoop/anomalies.csv. Each partition file contains the records that are anomalies for each processing window.

There you go! We’ve built our anomaly detection pipeline with Talend Data Streams, reading sensor values from an SoC-based IoT device and using only cloud services. The beauty of Talend Data Streams is that we accomplished all of this without writing any code (apart from the Z-score calculation), using only the web UI.

To sum up, we’ve read data from Kinesis; used the Type Converter, Aggregate and Window processors to transform our raw data; and then used the Python Row processor to calculate the standard deviation, average, and Z-score for each individual humidity sensor reading. Then we filtered out the normal values and stored the anomalies in HDFS on an EMR cluster.

That was my last article on Data Streams for the year. Stay tuned; I’ll write the next episodes when it becomes generally available at the beginning of 2019. Again, Happy Streaming!

The post Creating Real-Time Anomaly Detection Pipelines with AWS and Talend Data Streams appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

A welcome return to the Constellation Research iPaaS ShortList

SnapLogic - Tue, 09/11/2018 - 18:55

Constellation Research, an analyst firm that covers the IT space, has once again named SnapLogic to their Integration Platform as a Service (IPaaS) ShortList. As a perennial member of this ShortList, it remains an honor and a reminder of how competitive the integration space has been and continues to be. Constellation Research founder, Ray Wang,[...] Read the full article here.

The post A welcome return to the Constellation Research iPaaS ShortList appeared first on SnapLogic.

Categories: ETL

Key Considerations for Converting Legacy ETL to Modern ETL

Talend - Tue, 09/11/2018 - 14:25

Recently, there has been a surge in customers who want to move away from legacy data integration platforms and adopt Talend as their one-stop shop for all their integration needs. Some of these organizations have thousands of legacy ETL jobs to convert to Talend before they are fully operational. The big question that lurks in everyone’s mind is how to get past this hurdle.

Defining Your Conversion Strategy

To begin with, every organization undergoing such a change needs to focus on three key aspects:

  1. Will the source and/or target systems change? Is this just an ETL conversion from their legacy system to modern ETL like Talend?
  2. Is the goal to re-platform as well? Will the target system change?
  3. Will the new platform reside on the cloud or continue to be on-premise?

This is where Talend’s Strategic Services can help carve a successful conversion strategy and implementation roadmap for our customers. In the first part of this three-blog series, I will focus on the first aspect of conversion.

Before we dig into it, it’s worth noting a very important point – the architecture of the product itself. Talend is a Java code generator, and unlike its competitors (where code is migrated from one environment to the other), Talend actually builds the code and migrates the built binaries from one environment to the other. In many organizations, it takes a few sprints to fully internalize this fact, as architects and developers are used to the old way of thinking about code migration.

The upside of this architecture is that it enables a continuous integration environment that was not possible with legacy tools. A complete architecture of Talend’s platform includes not only the product itself but also third-party products such as Jenkins, Nexus (an artifact repository) and a source control repository like Git. Compare this to a Java programming environment and you can clearly see the similarities. In short, it is extremely important to understand that Talend works differently, and that’s what sets it apart from the rest of the crowd.

Where Should You Get Started?

Let’s focus on the first aspect, conversion. Assuming that nothing changes except the ETL jobs that integrate, cleanse, transform and load the data, it becomes an attractive opportunity to leverage a conversion tool – something that ingests legacy code and generates Talend code. It is not a good idea to try to replicate the entire business logic of all ETL jobs manually, as there is a great risk of introducing errors, leading to prolonged QA cycles. However, as anyone with a sound technical background knows, it is also not a good idea to rely completely on the automated conversion process, since the comparison may not always be apples to apples. The right approach is to use the automated conversion process as an accelerator, with some manual intervention.

Bright minds bring in success. Keeping that mantra in mind, first build your team:

  • Core Team – Identify architects, senior developers and SMEs (data analysts, business analysts, people who live and breathe data in your organization)
  • Talend Experts – Bring in experts on the tool who can guide you and provide best practices and solutions for all your conversion-related efforts. They will also participate in performance-tuning activities
  • Conversion Team – A team that leverages a conversion tool to automate the conversion process. A solid team with a solid tool and open to enhancing the tool along the way to automate new designs and specifications
  • QA Team – Seasoned QA professionals that help you breeze through your QA testing activities

Now comes the approach – Follow this approach for each sprint:


Analyze the ETL jobs and categorize them depending on the complexity of the jobs based on functionality and components used. Some good conversion tools provide analyzers that can help you determine the complexity of the jobs to be converted. Spread a healthy mix of varying complexity jobs across each sprint.


Leverage a conversion tool to automate the conversion of the jobs. There are certain functionalities such as an “unconnected lookup” that can be achieved through an innovative method in Talend. Seasoned conversion tools will help automate such functionalities


Focus on job design and performance tuning. This is your chance to revisit the design, if required, either to leverage better components or to go for a complete redesign. Also focus on performance optimization: for high-volume jobs, you can increase throughput and performance by leveraging Talend’s big data components. It is not uncommon to see a converted Talend Data Integration job completely redesigned as a Talend Big Data job to drastically improve performance. Another feather in Talend’s cap: you can seamlessly execute standard data integration jobs alongside big data jobs.


Unit test and ensure all functionalities and performance acceptance criteria are satisfied before handing over the job to QA


An automated QA approach to compare the result sets produced by the old and new sets of ETL jobs. At a minimum, focus on:

  • Compare row counts from the old process to that of the new one
  • Compare each data element loaded by the old process to that of the new one
  • Verify “upsert” and “delete” logic work as expected
  • Introduce an element of regression testing to ensure fixes are not breaking other functionalities
  • Performance testing to ensure SLAs are met
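A minimal sketch of such an automated comparison, assuming both outputs share a key column (the table names, keys, and rows here are illustrative; a real project would read these from files or staging tables):

```python
def compare_outputs(old_rows, new_rows, key):
    """Compare row counts and per-row contents between legacy and new output.

    Returns (count_match, list of key values whose rows differ)."""
    count_match = len(old_rows) == len(new_rows)
    old_by_key = {r[key]: r for r in old_rows}
    new_by_key = {r[key]: r for r in new_rows}
    diffs = [k for k in old_by_key if old_by_key[k] != new_by_key.get(k)]
    return count_match, diffs

old = [{"id": 1, "amt": 10.0}, {"id": 2, "amt": 20.0}]
new = [{"id": 1, "amt": 10.0}, {"id": 2, "amt": 21.5}]
print(compare_outputs(old, new, "id"))  # (True, [2])
```

The same pattern extends naturally to the upsert/delete and regression checks listed above, by comparing snapshots taken before and after each load.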

Now, for several reasons, there can be instances where one needs to design a completely new ETL process for a certain functionality in order to continue processing data in the same way as before. For such situations, you should leverage the “Talend Experts” team, which not only liaises with the team doing the automated conversion but also works closely with the core team to ensure that the best solution is proposed; that solution is then converted into a template and provided to the conversion team, who can automate the new design into the affected jobs.

As you can see, these activities can be part of the “Categorize” and “Convert” phases of the approach.

Finally, I would suggest chunking the conversion effort into logical waves. Do not go for a big-bang approach, since the conversion effort could be lengthy depending on the number of legacy ETL jobs in the organization.


This brings me to the end of the first part of the three-blog series. Below are the five key takeaways of this blog:

  1. Define scope and spread the conversion effort across multiple waves
  2. Identify core team, Talend experts, a solid conversion team leveraging a solid conversion tool and seasoned QA professionals
  3. Follow an iterative approach for the conversion effort
  4. Explore Talend’s big data capabilities to enhance performance
  5. Innovate new functionalities, create templates and automate the conversion of these functionalities

Stay tuned for the next two!!

The post Key Considerations for Converting Legacy ETL to Modern ETL appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

ML taking stage at API World and happy hour in Mountain View

SnapLogic - Fri, 09/07/2018 - 12:47

There’s going to be a lot happening next week which I’m really excited to share. The common theme is helping our customers overcome their obstacles, which is something that I, and the rest of the team at SnapLogic, are passionate about. First, I will be at API World in San Jose, California on Monday, September[...] Read the full article here.

The post ML taking stage at API World and happy hour in Mountain View appeared first on SnapLogic.

Categories: ETL

Metadata Management 101: What, Why and How

Talend - Fri, 09/07/2018 - 10:26

Metadata Management has slowly become one of the most important practices for a successful digital initiative strategy. With the rise of distributed architectures such as Big Data and Cloud which can create siloed systems and data, metadata management is now vital for managing the information assets in an organization. The internet has a lot of literature around this concept and readers can easily get confused with the terminology. In this blog, I wanted to give the users a brief overview of metadata management in plain English.

What does metadata management do?

Let’s get started with the basics. Though there are many definitions out there for Metadata Management, the core functionality is enabling a business user to search for and identify information on key attributes in a web-based user interface.

An example of a searchable key attribute could be a Customer ID or a member name. With a proper metadata management system in place, business users will be able to understand where the data for that attribute comes from and how the data in the attribute was calculated. They will be able to visualize which enterprise systems in the organization the attribute is being used in (lineage) and to understand the impact of changing something about the attribute, such as its length, on other systems (impact analysis).

Technical users also have needs for metadata management. By combining business metadata with technical metadata, a technical user will be able to find out which ETL job or database process is used to load data into the attribute. Operational metadata, such as control tables in a data warehouse load, can also be combined into this integrated metadata model. This is powerful information for an end user to have at their fingertips. The end result of metadata management can take the form of another ‘database’ holding the metadata of the company’s key attributes. The industry term for such a database is a data catalog, glossary, or data inventory.
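As an illustration only, lineage can be modeled as a directed graph from source systems to consumers, and impact analysis then becomes a reachability question over that graph. The system and attribute names below are invented for the example:

```python
# Attribute flows: each node maps to the downstream systems that consume it.
lineage = {
    "CRM.customer_id": ["DWH.dim_customer"],
    "DWH.dim_customer": ["Reporting.sales_dashboard", "ML.churn_model"],
}

def impacted(node, graph):
    """All downstream systems affected by a change to `node` (impact analysis)."""
    out, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

print(sorted(impacted("CRM.customer_id", lineage)))
```

Metadata management tools maintain and visualize exactly this kind of graph for you, at enterprise scale.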

How does metadata management work?

Metadata Management is only one of the initiatives in a holistic Data Governance program, but it is the only one that deals with “metadata”. Other initiatives, such as Master Data Management (MDM) or Data Quality (DQ), deal with the actual “data” stored in various systems. Metadata management integrates metadata stores at the enterprise level.

Tools like Talend Metadata Manager provide an automated way to parse and load different types of metadata. The tool also enables you to build an enterprise model based on the metadata generated from different systems, such as your data warehouse, data integration tools, data modeling tools, etc.

Users will be able to resolve conflicts based on, for example, attribute names and types. You will also be able to create custom metadata types to “stitch” metadata between two systems. A completely built metadata management model gives a 360-degree view of how the different systems in your organization are connected. This model can be the starting point for any new Data Governance initiative. Data modelers will now have one place to look for a specific attribute and use it in their own data models. This model is also the foundation of the ‘database’ we talked about in the earlier section. Just like any other Data Governance initiative, as the metadata in individual systems changes, the model needs to be updated following an SDLC methodology that includes versioning, workflows and approvals. Access to the metadata model should also be managed by creating roles, privileges and policies.

Why do we need to manage metadata?

The basic answer is trust. If metadata is not managed during the system lifecycle, silos of inconsistent metadata will be created in the organization that do not meet any team’s full needs and provide conflicting information. Users will not know how much to trust the data, as there is no metadata to indicate how and when the data got into the system and what business rules were applied.

Costs also need to be considered. Without effectively managing metadata, each development project has to go through the effort of defining data requirements, increasing costs and decreasing efficiency. Users are presented with many tools and technologies, creating redundancy and excess cost, and do not get the full value of the investment because the data they are looking for is not available. Data definitions are duplicated across multiple systems, driving higher storage costs.

As a business matures and more and more systems are added, it needs to consider how the metadata (and not just the data) will be governed. Managing metadata provides clear benefits to business and technical users and to the organization as a whole. I hope this has been a useful introduction to the very basics of metadata management. Until next time!

The post Metadata Management 101: What, Why and How appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

[Step-by-Step] How to Load Salesforce Data Into Snowflake in Minutes

Talend - Thu, 09/06/2018 - 16:47

As cloud technologies move into the mainstream at an unprecedented rate, organizations are augmenting existing, on-premise data centers with cloud technologies and solutions—or even replacing on-premise data storage and applications altogether. But in almost any modern environment, data will be gathered from multiple sources at a variety of physical locations (public cloud, private cloud, or on-premise). Talend Cloud is Talend’s Integration Platform-as-a-Service (IPaaS), a modern cloud and on-premises data and application integration platform and is particularly performant for cloud-to-cloud integration projects.

To explore the capabilities and features of Talend Cloud, anyone can start a 30-day free trial. Several sample jobs are available for import as part of the Talend Cloud trial to get you familiar with the Talend IPaaS solution. The video below walks you through two sample jobs to load data from Salesforce into a Snowflake database.

To get started, you obviously need to be a user (or trial user) of Talend Cloud, the Snowflake cloud data warehouse, and Salesforce. Then, there’s a simple two-step process to migrate Salesforce data to Snowflake using Talend Cloud. The first job uses a Snowflake connection to create a user-defined database with three tables in Snowflake. The second job then migrates the Salesforce data into these three tables in the Snowflake cloud data warehouse.

The full step-by-step process is available here with screenshots. Want to try Talend Cloud? Start your trial today!

The post [Step-by-Step] How to Load Salesforce Data Into Snowflake in Minutes appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Moving to Microsoft Azure Cloud? 8 tips for evaluating data integration platforms

SnapLogic - Thu, 09/06/2018 - 12:59

As organizations increasingly move their data and operations to the Microsoft Azure Cloud, they must migrate data from old systems stored on-premises. Unfortunately, IT can’t always meet data integration deadlines by writing custom code and legacy integration technologies drive up the time and costs to migrate. The key to success is finding a data integration[...] Read the full article here.

The post Moving to Microsoft Azure Cloud? 8 tips for evaluating data integration platforms appeared first on SnapLogic.

Categories: ETL

Robots on steroids: RPA is here to stay

SnapLogic - Wed, 09/05/2018 - 13:20

Previously published on ITProPortal.  Few narratives have captured the imaginations of business leaders more than robots—bots tirelessly doing work faster than their human colleagues on a 24/7 constant basis. Leaving aside for the moment the socioeconomic ramifications of robots reducing job opportunities, the fact remains that inventions like Robotic Process Automation, or RPA, make life[...] Read the full article here.

The post Robots on steroids: RPA is here to stay appeared first on SnapLogic.

Categories: ETL

The SnapLogic Patterns Catalog is here!

SnapLogic - Tue, 09/04/2018 - 12:50

Building integration pipelines is quick and easy with SnapLogic’s clicks-not-code approach. Many SnapLogic users build simple and complex pipelines within minutes. Recently made available to all SnapLogic users, the SnapLogic Patterns Catalog is a library of pipeline templates that can be used to shave off even more time by eliminating the need to build pipelines[...] Read the full article here.

The post The SnapLogic Patterns Catalog is here! appeared first on SnapLogic.

Categories: ETL

BDaaS: Taking the pain out of big data deployment

SnapLogic - Fri, 08/31/2018 - 12:49

There’s Software-as-a-Service (SaaS), Platform-as-a-Service (PaaS), even Infrastructure-as-a-Service (IaaS). Now, in the quest to make big data initiatives more accessible to mainstream customers, there’s a new As-A-Service offering that offloads the heavy lifting and capital expenditures associated with Big Data analytics. Organizations from global retail giants to specialty manufacturers are mixing sales data, online data, Internet[...] Read the full article here.

The post BDaaS: Taking the pain out of big data deployment appeared first on SnapLogic.

Categories: ETL

Self-Service Big Data and AI/ML at Intelligent Data Summit

SnapLogic - Thu, 08/30/2018 - 12:57

Recently, I presented at the Intelligent Data Summit, a virtual event hosted by IDevNews. I was joined by SoftwareAG, MapR, among others and one of our technology partners, Reltio. This timely, online event was focused on all things AI/ML, data, IoT, and other modern technologies and it was a pleasure to be a part of[...] Read the full article here.

The post Self-Service Big Data and AI/ML at Intelligent Data Summit appeared first on SnapLogic.

Categories: ETL

Migrate from Salesforce to Microsoft Dynamics 365 for Sales using SnapLogic

SnapLogic - Wed, 08/29/2018 - 12:51

A recent Gartner report announced that Customer Relationship Management (CRM) became the largest software market in 2017 and is the fastest growing software market in 2018. Market growth for cloud CRM increased as well, at twice the rate of the overall market growth, and it will only continue to accelerate at a rapid pace as[...] Read the full article here.

The post Migrate from Salesforce to Microsoft Dynamics 365 for Sales using SnapLogic appeared first on SnapLogic.

Categories: ETL

Six Do’s and Don’ts of Collaborative Data Management

Talend - Tue, 08/28/2018 - 11:29

Data Quality projects are not purely technical projects anymore; they are becoming collaborative and team-driven.

As organizations strive to succeed in their digital transformation, data professionals realize they need to work as teams with business operations, as those are the people who need better data to run their operations. Being in the cockpit, Chief Data Officers need to master some simple but useful do’s and don’ts about running their Data Quality projects.

Let’s list a few of these.

DO’S

Set your expectations from the start.

Why Data Quality? What do you target? How deeply will you impact your organization’s business performance? Find your Data Quality answers among business people. Make sure you know your finish line so you can set intermediate goals and milestones on a project calendar.

Build your interdisciplinary team.

Of course, it’s about having the right technical people on board: people who master Data Management platforms. But it’s also about finding the right people who understand how Data Quality impacts the business, and making them your local champions in their respective departments. For example, Digital Marketing experts often struggle with bad leads and low-performing tactics due to the lack of good contact information. Moreover, new regulations such as GDPR have made marketing professionals aware of how important personal data is. By putting tools such as Data Preparation in their hands, you will give them a way to act on their data without losing control. They will be your allies on your Data Quality journey.

Deliver quick wins.

While it’s key to stretch people’s capabilities and set ambitious objectives, it’s also necessary to prove that your data quality project will deliver positive business value very quickly. Don’t spend too much time on heavy planning; you need to prove business impact with immediate results. Some Talend customers achieved business results very quickly by enabling business people with apps such as Data Prep or Data Stewardship. If you deliver better and faster time to insight, you will gain instant credibility and people will support your project. After gaining credibility and confidence, it will be easier to ask for additional means when presenting your projects to the board. In the end, remember: many small wins make a big one.

DON’TS

Don’t underestimate the power of bad communication

We often think technical projects need technical answers, but Data Quality is a strategic topic, and it would be misleading to treat it as a purely technical challenge. To succeed, your project must be widely known within your organization. Take control of your own project story instead of letting bad communication spread across departments. To do that, you must master the right mix of know-how and communication skills so that your results are known and properly communicated within your organization. Marketing suffers from bad leads, operations suffer from missing information, strategists suffer from biased insights: people may ask you to extend your project and solve their data quality issues, which is a good reason to ask for more budget.

Don’t overengineer your projects, making them too complex and sophisticated.

Talend provides a simple and powerful platform that produces fast results, so you can start small and deliver big. One example of implementing Data Management right from the start is Carhartt, who managed to clean 50,000 records in one day. You don’t necessarily need to wait a long time to see results.

Don’t leave the clock running or leave your team without clear directions

Set and meet deadlines as often as possible; it will bolster your credibility. Time runs fast and your organization may shift to short-term business priorities, so track your route and stay focused on your end goals. Make sure you deliver the project on time, then celebrate success. When finishing a project milestone, take the time to celebrate with your team and within the organization.


To learn more about Data Quality, please download our Definitive Guide to Data Quality.


The post Six Do’s and Don’ts of Collaborative Data Management appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL
Syndicate content