AdoptOS

Assistance with Open Source adoption

ETL

Talend Cloud: A hybrid-friendly, secure Cloud Integration Platform

Talend - Tue, 12/11/2018 - 16:21

As enterprises move towards massively scaled interconnected software systems, they are embracing the cloud like never before. Very few would dispute the notion that the cloud has become one of the biggest drivers of change in the enterprise IT landscape and that the cloud has provided IT a powerful way to deploy services in a timely and cost-effective manner.

However, the tremendous benefits you have tapped into in the cloud have to be balanced against the need to adopt, configure, and manage services in a secure manner. Digital transformation is impossible with an insecure cloud solution.

When we look at the integration world, it is often a fragmented place where applications and data may each exist on multiple platforms; IT teams are frequently supporting a plethora of on-premises and cloud apps while moving petabytes of sensitive data among those apps. Data leakage, corruption, or loss can be devastating to a business.

A cloud integration solution, such as an Integration Platform as a Service (iPaaS), with subpar security mechanisms may make your sensitive data such as customer records, passwords, or personal information susceptible to breach and misuse. One way to avoid this is to choose an iPaaS that has robust security mechanisms such as data encryption, user and access protection, security certifications, and information security standards in place.

Talend Cloud is a unified cloud integration platform (iPaaS) that integrates data, people, and applications across all types of data sources (public cloud, on-premises, and SaaS applications) to deliver fast and reliable big data analytics. Besides natively inheriting AWS's security practices, the platform is built from the ground up with every piece of the infrastructure puzzle taken into account to secure each layer: physical, network, operating system, database, and application. The entire stack is secured for each customer, partner, and workload. As part of this holistic security design, Talend Cloud has earned industry-leading security certifications to ensure that its platform meets exacting standards for a range of customers and industries.

Talend Cloud provides cloud platform security certifications such as:

  • SSAE 16
  • SOC 2 Type II certification
  • ISAE 3402 certification
  • Cloud Security Alliance Certification (Level 1)

It also supports compliance requirements such as GDPR via AWS. Additionally, Talend Cloud provides security features and services such as data encryption, key management for tenant isolation and the remote engine, bi-annual penetration tests, SSO support for Okta, IAM management, and many more.

To learn more about how Talend provides security in iPaaS, take a look at the complete list of security features detailed in the Talend Cloud Security White Paper, or drop us a line at https://www.talend.com/contact/ with any questions.

The post Talend Cloud: A hybrid-friendly, secure Cloud Integration Platform appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Three issues that every organization must consider for optimal data usage

SnapLogic - Tue, 12/11/2018 - 16:03

Originally published on AG Connect.  Companies have an obsession with data. There is more data than ever before, and companies generate large volumes of data about customers, finances, machines, and social media to gain insights and determine business strategies. But it is difficult to extract value from that data. Many organizations have been so busy gathering data[...] Read the full article here.

The post Three issues that every organization must consider for optimal data usage appeared first on SnapLogic.

Categories: ETL

Data Matching with Different Regional Data Sets

Talend - Tue, 12/11/2018 - 11:22

When it comes to Data Matching, there is no 'one size fits all' menu. Different matching routines, different algorithms, and different tuning parameters will all apply to different datasets. You generally can't take a matching setup used for one distinct data set and apply it to another. This proves especially true when matching datasets from different regions or countries. Let me explain.

Data Matching for Attributes that are Unlikely to Change

Data Matching is all about identifying unique attributes that a person or object has, and then using those attributes to match individual members within that set. These attributes should be things that are 'unlikely to change' over time. For a person, these would be things like "Name" and "Date of Birth". Attributes like "Address" are much more likely to change and are therefore of less importance, although this does not mean you should not use them. It's just that they are less unique and therefore of less value, or lend less weight, to the matching process. In the case of objects, they would be attributes that uniquely identify that object; in the case of, say, a cup (if you manufactured cups), those attributes would be things like "Size", "Volume", "Shape", "Color", etc. The attributes themselves are not too important; what matters is that they should be 'things' that are unlikely to change over time.

So, back to data relating to people, which is generally the main use case for data matching. Here comes the challenge: can't we use one set of data matching routines for a 'person database' and simply reuse the same routines for another dataset? Unfortunately, the answer is no. There will always be differences in the data that manifest themselves during matching, and nowhere more so than when using datasets from different geographical regions, such as different countries. Data matching routines are always tuned for a specific dataset, and while there will always be differences from dataset to dataset, the differences become much more distinct when you choose data from different geographical regions. Let us explore this some more.

Data Matching for Regional Data Sets

First, I must mention a caveat. I am going to assume that matching is done in western character sets, using Romanized names, not in languages or character sets such as Japanese or Chinese. This does not mean the data must contain only English or western names, far from it; it just means the matching routines are those we can use for names written in western, or Romanized, characters. I will not consider matching using non-western characters here.

Now, let us consider the matching of names. To do this for the name itself, we use matching routines that do things like phoneticize the names and then look for differences between the results. But first, the methodology involves blocking on names: sorting the data into different piles that share similar attributes. It's the age-old 'matching the socks' problem. You wouldn't match socks in a great pile of fresh laundry by picking one sock at a time from the whole pile and then trying to find its duplicate. That would be very inefficient and take ages to complete. You instinctively know what to do: you sort them first into similar piles, or 'blocks', of similar socks. Say, a pile of black socks, a pile of white socks, a pile of colored socks, and so on, and then you sort through those smaller piles looking for matches. It's the same principle here. We sort the data into blocks of similar attributes, then match within those blocks. Ideally, these blocks should be of a manageable and similar size. Now, here comes the main point.
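
To make the blocking idea concrete, here is a small, hedged sketch in plain Java (not a production matcher and not tied to any particular tool): records are grouped into blocks by a crude surname key, and candidate pairs are only compared within a block. The blocking key, record fields, and sample data are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of blocking: group records by a rough surname key, then
// compare candidate pairs only inside each block instead of across the whole set.
public class BlockingSketch {
    record Person(String firstName, String surname, String dob) {}

    // Crude first-letter block key; real matchers typically use a phonetic
    // algorithm such as Soundex or Metaphone to form blocks.
    static String blockKey(String surname) {
        return surname.substring(0, 1).toUpperCase();
    }

    public static void main(String[] args) {
        List<Person> people = List.of(
            new Person("Alice", "Jones", "1980-01-01"),
            new Person("Alyce", "Jonnes", "1980-01-01"),
            new Person("Bob", "Patel", "1975-06-30"));

        // Sort records into blocks ("piles of socks")
        Map<String, List<Person>> blocks = new HashMap<>();
        for (Person p : people) {
            blocks.computeIfAbsent(blockKey(p.surname()), k -> new ArrayList<>()).add(p);
        }

        // Compare pairs only within each block, using a stable attribute (date of birth)
        for (List<Person> block : blocks.values()) {
            for (int i = 0; i < block.size(); i++) {
                for (int j = i + 1; j < block.size(); j++) {
                    boolean candidateMatch = block.get(i).dob().equals(block.get(j).dob());
                    System.out.println(block.get(i) + " vs " + block.get(j) + " -> " + candidateMatch);
                }
            }
        }
    }
}
```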

Different geographic regions will produce different distributions of block sizes and types, which changes the matching that needs to be done within those blocks, and this can manifest itself in terms of performance, efficiency, accuracy, and overall quality of the matching. The distribution of names within different geographical regions, and therefore within different groups of data, can vary widely and cause big differences in the results obtained.

Let's look specifically at surnames for a moment. In the UK, according to the Office for National Statistics, there are around 270,000 surnames that cover around 95% of the population. Obviously, some surnames are much more common than others. Surnames such as Jones, Brown, and Patel, for example, are amongst the most common, but the important thing is that there is a distribution of these names that follows a specific shape if we choose to plot it. There will be a big cluster of common names at one end, followed by a long tailing-off of names towards the other, and the shape of the curve would be specific to the UK and the UK alone. Different countries or regions would have differently shaped distributions. This is an important point. Some regions would have a much narrower distribution, where names are much more similar or common, whilst in other regions the distribution would be broader and names would be much less common. The overall number of distinct names could be much larger or much smaller, and this would therefore affect the results of any matching we did within datasets emanating from those regions. A smaller distribution of names would result in bigger block sizes and therefore more data to match on within those blocks. This could take longer, be less efficient, and could even affect the accuracy of those matches. A larger distribution of names would result in many more blocks of a smaller size, each of which would need to be processed.

Data Matching Variances Across the Globe

Let’s take a look at how this varies across the globe. A good example of regional differences comes from Taiwan. Roughly forty percent of the population share just six different surnames (when using the Romanised form). Matching within datasets using names from Taiwanese data will, therefore, result in some very large blocks. Thailand, on the other hand, presents a completely different challenge. In Thailand, there are no common surnames. There is actually a law called the ‘Surname Act’ that states surnames cannot be duplicated and families should have unique surnames. In Thailand, it is incredibly rare for any two people to share the same name. In our blocking exercise, this would result in a huge number of very small blocks.

The two examples above may be extreme, but they perfectly illustrate the challenge. Datasets containing names vary from region to region, and therefore the blocking and matching strategy can vary widely from place to place. You cannot simply use the same routines and algorithms for different datasets; each dataset is unique and must be treated as such. Different matching strategies must be adopted for each set, each matching exercise must be 'tuned' for that specific dataset in order to find the most effective strategy, and the results will vary. It doesn't matter what toolset you choose; the same principle applies to all, as the issue lies in the data itself and cannot be changed or ignored.

To summarize, regional, geographic, cultural, and language variations can make big differences to how you go about matching personal data within different datasets. Each dataset must be treated differently: you need a good understanding of the data it contains, and you must tune and optimize your matching routines and strategy for each one. Blocking and matching strategies will vary, and you cannot simply reuse the exact same approach and routines from dataset to dataset, as these can differ widely from region to region. Until next time!

The post Data Matching with Different Regional Data Sets appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Accelerate your machine learning (ML) projects with three new ML Snap Packs

SnapLogic - Mon, 12/10/2018 - 15:59

More and more organizations are searching for ways to drive business value with machine learning (ML), a practical form of artificial intelligence (AI). But conventional approaches to developing and deploying ML models are hampering their efforts. Traditional methods are slow, code-intensive, and require specialized skill sets. What’s more, data scientists and data engineers are forced[...] Read the full article here.

The post Accelerate your machine learning (ML) projects with three new ML Snap Packs appeared first on SnapLogic.

Categories: ETL

Hit the “Easy” Button with Talend & Databricks to Process Data at Scale in the Cloud

Talend - Fri, 12/07/2018 - 15:00

The challenge today is that 85% of on-premises Big Data projects fail to meet expectations, and over two-thirds of Big Data potential is not being realized by organizations. Why is that, you ask? Well, simply put, on-premises "Big Data" programs are not that easy.

Your data engineers will need to be proficient in new programming languages and architectural models, while your system admins will need to learn how to set up and manage a data lake. So you're not really focusing on what you do best; instead, you're paying top dollar for data engineers with the required programming skills while spending (wasting) a lot of time configuring infrastructure, and not reaping the benefits of a big data program.

In short, making big data available at scale is hard and can be very expensive, and this complexity is killing big data projects.

Welcoming modern data engineering in the cloud

Data engineers ensure the data the organization is using is clean, reliable, and prepped for whichever use cases may present themselves. In spite of the challenges with on-premises "big data," technologies like Apache Spark have become a best practice due to their ability to scale as jobs get larger and SLAs become more stringent.

But, as we've highlighted, using Spark on-premises is not that easy. The market and technologies have come to an inflection point where it is agreed that what is needed is the ability to:

  1. Eliminate the complexity of system management to lower operations costs and increase agility
  2. Have automatic scale up/down of processing power, to grow and shrink as needed while only paying for what you use
  3. Enable a broader set of users to utilize these services without requiring a major upgrade in their education or hiring expensive external expertise

To simplify success with big data programs, market leaders have moved from an on-premises model to a cloud model. Cloud-based environments offer the ability to store massive volumes of data of all varieties (structured to unstructured). What is needed now is the ability to process that data for consumption in BI tools, data science, or machine learning.

Databricks, founded by the original creators of Apache Spark, provides the Databricks Unified Analytics Platform. Databricks accelerates innovation by bringing data and ML together. This service solves many of the hard challenges discussed above by automatically handling software provisioning, upgrades, and management. Databricks also manages scaling up and down to ensure that you have the right amount of processing power, saving money by shutting down clusters when they are not needed. By taking this workload off the table for their customers, Databricks allows those customers to focus on the next level of analytics: machine learning and data science.

While Databricks solves two of the three big challenges posed, there is still the third challenge of making the technology more accessible to "regular" data engineers who do not have the programming expertise to support these massively parallel, cloud environments. That is where Talend comes in. Talend provides a familiar data-flow diagram design surface and converts that diagram into an expertly programmed data processing job native to Databricks on Azure or AWS.

The combination of Databricks and Talend then provides a massively scalable environment that has a very low configuration overhead while having a highly productive and easy to learn/use development tool to design and deploy your data engineering jobs.  In essence, do more with less work, expense, and time.

For further explanation and a few examples, keep reading….

Example use case

Watch these videos and see for yourself how easy it is to run serverless Spark in the cloud.

Movie recommendation use case with machine learning and Spark Serverless

Create and connect to a Databricks Spark Cluster with Talend

Click here to learn more about serverless and how to modernize your architecture.

Check out our GigaOM webinar with Databricks and Talend to learn how to accelerate your analytics and machine learning.

The post Hit the “Easy” Button with Talend & Databricks to Process Data at Scale in the Cloud appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Gartner Application Strategies & Solutions Summit recap: How Wendy’s stays competitive

SnapLogic - Thu, 12/06/2018 - 17:46

The SnapLogic team joined thousands of business and technology leaders recently at the Gartner Application Strategies & Solutions Summit (APPS) in Las Vegas. With dozens of workshops, breakout sessions, and meetups, attendees and vendors alike had several opportunities to learn from and network with Gartner analysts, industry experts and peers on API management, AI, DevOps,[...] Read the full article here.

The post Gartner Application Strategies & Solutions Summit recap: How Wendy’s stays competitive appeared first on SnapLogic.

Categories: ETL

Driving self-service analytics with SnapLogic and Workday Prism Analytics

SnapLogic - Wed, 12/05/2018 - 19:03

According to Gartner, “the analytics and business intelligence (BI) software market growth doubled to 11.5 percent in 2017, which is above the overall enterprise software market growth (about 10 percent).” Gartner also states that “modern BI platforms represent the highest-growing segment at 30 percent.” This growth indicates that organizations are keen on empowering their teams[...] Read the full article here.

The post Driving self-service analytics with SnapLogic and Workday Prism Analytics appeared first on SnapLogic.

Categories: ETL

Getting Started with Talend Open Studio: Run and Debug Your Jobs

Talend - Mon, 12/03/2018 - 13:39

In past blogs, we have learned how to install Talend Open Studio, how to build a basic job loading data into Snowflake, and how to use a tMap component to build more complex jobs. In this blog, we will walk you through some helpful debugging techniques and provide additional resources that you can leverage as you continue to learn more about Talend Open Studio.

As with our past blogs, you are welcome to follow along in our on-demand webinar. This blog corresponds with the last video of the webinar.

In this tutorial, we will quickly address how to successfully debug your Talend Jobs, should you run into errors. Talend classifies errors into two main categories: Compile errors and Runtime errors. A Compile error prevents your Java code from compiling properly (this usually includes syntax errors or Java class errors). A Runtime error prevents your job from completing successfully, resulting in the job failing during execution.

In the previous blog, we designed a Talend job that generates a sales report and loads the data into a Snowflake cloud data warehouse environment. For the purposes of this blog, we have altered that job so that when we run it, we will see both types of errors and can illustrate how to resolve each.

Resolving Compile Errors in Talend Open Studio

Let's look at a compile error first. When we execute this job in Talend Studio, it will first attempt to compile; however, the compilation will fail with the error below.

You can review the Java error details within the log, which states that "quantity cannot be resolved or is not a field". Conveniently, it also highlights the component the error is most closely associated with.

To locate the specific source of the problem within the tMap component, you can either dive into the tMap and search yourself, or you can switch to the Code view. Although you cannot directly edit the code here, you can select the red box highlighted to the right of the scroll bar to jump straight to the source of the issue.

In this case, the arithmetic operator is missing from the Unit Price and Quantity equation.

Next, head into the tMap component and make the correction to the Unit Price and Quantity equation by adding a multiplication operator (*) between Transactions.Unit_Price and Transactions.qty. Click Ok and now run the job again.
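
Under the hood, Talend generates Java, so the broken expression is simply invalid Java syntax. The snippet below is a standalone sketch of the fix described above, not the actual generated code; the variable names stand in for the Transactions columns.

```java
// Minimal illustration (not Talend-generated code): omitting the arithmetic
// operator between the two fields is a compile error; the corrected expression
// multiplies unit price by quantity, mirroring the tMap fix described above.
public class ExpressionFix {
    public static void main(String[] args) {
        double unitPrice = 19.99; // stands in for Transactions.Unit_Price
        int qty = 3;              // stands in for Transactions.qty

        // Broken (would not compile): double total = unitPrice qty;
        double total = unitPrice * qty; // fixed: multiplication operator added
        System.out.println("transaction_cost = " + total);
    }
}
```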

And now you see the compile error has been resolved.

Resolving Runtime Errors in Talend Open Studio

Next, the job attempts to send the data out to Snowflake, and a runtime error occurs. Reading the log, it says that the JDBC driver was not able to connect to Snowflake and that an incorrect username or password was specified.

To address this issue, we’ll head to the Snowflake component and review the credentials. It looks like the Snowflake password was incorrect, so re-enter the Snowflake password, and click run again to see if that resolved the issue.
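
Outside of Studio, you can also sanity-check Snowflake credentials with a small JDBC program. This is a hedged sketch, not part of the Talend job: the account, user, password, and warehouse values are placeholders, and it assumes the Snowflake JDBC driver is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

// Quick credential check against Snowflake via JDBC.
// Placeholders (<account>, <user>, <password>, <warehouse>) are illustrative only.
public class SnowflakeConnectionCheck {
    public static void main(String[] args) {
        String url = "jdbc:snowflake://<account>.snowflakecomputing.com";
        Properties props = new Properties();
        props.put("user", "<user>");
        props.put("password", "<password>");
        props.put("warehouse", "<warehouse>");

        try (Connection conn = DriverManager.getConnection(url, props)) {
            System.out.println("Connected: " + !conn.isClosed());
        } catch (SQLException e) {
            // An authentication failure surfaces here, much like the runtime error in the job log.
            System.err.println("Connection failed: " + e.getMessage());
        }
    }
}
```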

And it did! This job has been successfully debugged and the customer data has been published to the Snowflake database.

Conclusion

This was the last of our planned blogs on getting started with Talend Open Studio, but there are other resources you can access to improve your skills. Here are some videos that we recommend to strengthen and add to the skills you have gained from these past four blogs:

Joining Two Data Sources with the tMap Component – This tutorial will give you some extra practice using tMap to join your data complete with downloadable dummy data and PDF instructions.

Adding Condition-Based Filters Using the tMap Component – tMap is an incredibly powerful and versatile component with many uses, and in this tutorial, you will learn how to use tMap and its expression builder to filter data based on certain criteria.

Using Context Variables – Learn how to use context variables, which allow you to run the same job in different environments.

For immediate questions, be sure to visit our Community, and feel free to let us know what types of tutorials would be helpful for you.  

The post Getting Started with Talend Open Studio: Run and Debug Your Jobs appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Accelerate the Move to Cloud Analytics with Talend, Snowflake and Cognizant

Talend - Fri, 11/30/2018 - 18:20

In the last few years, we've seen the concept of the "Cloud Data Lake" gain more traction in the enterprise. When done right, a data lake can provide the agility for digital transformation around customer experience by enabling access to historical and real-time data for analytics.

However, while the data lake is now a widely accepted concept both on-premises and in the cloud, organizations still have trouble making them usable and filling them with clean, reliable data. In fact, Gartner has predicted that through 2018, 90% of deployed data lakes will be useless.  This is largely due to the diverse and complex combinations of data sources and data models that are popping up more than ever before.            

Migrating enterprise analytics from on-premises to the cloud requires significant effort before it delivers value. Cognizant has just accelerated your time to value with a new Data Lake Quickstart solution. In this blog, I want to show you how you can run analytics migration projects to the cloud significantly faster, delivering in weeks instead of months, with lower risk, using this new Quickstart.

Cognizant Data Lake Quickstart with Talend on Snowflake

First, let’s start by going into detail on what this Quickstart solution is comprised of. The Cognizant Data Lake Quickstart Solution includes:

  • A data lake reference architecture based on:
    • Snowflake, the data warehouse built for the cloud
    • Talend Cloud platform
    • Amazon S3 and Amazon RDS
  • Data migration from on-premises data warehouses (Teradata/Exadata/Netezza) to Snowflake using metadata migration
  • Pre-built jobs for data ingestion and processing (pushdown to Snowflake and EMR)

Data Lake Reference Architecture

How It Works

  • Uses Talend to extract data files (structured/semi-structured) from on-premises sources and ingest them into Amazon S3, using a metadata-based approach to store data quality rules and the target layout
  • Stores data on Amazon S3 as an enterprise data lake for processing
  • Leverages the Talend Snowflake data loader to move files from Amazon S3 to Snowflake
  • Runs Talend jobs that connect to Snowflake and process the data

Data Migration from On-premises Data Warehouse (Teradata/Exadata/Netezza) to Snowflake

For data migration projects, the metadata-based migration framework leverages Talend and Snowflake. Both source and target (Snowflake) metadata (schemas, tables, columns, and data types) are captured in a metadata repository using a Talend ETL process. The data migration is then executed using Talend and the Snowflake COPY utility.
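
As a rough illustration of the metadata-driven idea (this is not Cognizant's actual framework; the repository table, stage names, and credentials are hypothetical placeholders), a driver program could read the list of registered tables from a metadata repository and issue a Snowflake COPY INTO statement for each one over JDBC:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical sketch: loop over tables registered in a metadata repository
// and bulk-load each one from an S3 stage using Snowflake's COPY INTO command.
public class MetadataDrivenCopy {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:snowflake://<account>.snowflakecomputing.com";
        try (Connection conn = DriverManager.getConnection(url, "<user>", "<password>");
             Statement meta = conn.createStatement();
             Statement loader = conn.createStatement()) {

            // MIGRATION_METADATA is an assumed repository table listing target tables and stage paths.
            ResultSet rs = meta.executeQuery(
                "SELECT target_table, stage_path FROM MIGRATION_METADATA WHERE active = TRUE");
            while (rs.next()) {
                String copySql = "COPY INTO " + rs.getString("target_table")
                    + " FROM @" + rs.getString("stage_path")
                    + " FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)";
                loader.execute(copySql); // Snowflake performs the bulk load from S3
            }
        }
    }
}
```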

Pre-built Jobs for Data ingestion and Processing

For incremental data loads, Cognizant has included pre-built Talend jobs that support data loads from source systems into the Amazon S3 layer and onward into Snowflake staging. These jobs then transform and load the data into Snowflake presentation layer tables using Snowflake-compatible SQL. Another option is for the pre-built jobs to use the Amazon S3 layer to build a conformed layer in S3 with AWS EMR and Talend Spark components, and then load the conformed data directly into the Snowflake presentation layer tables.

Conclusion

So, what are the benefits of this Quickstart architecture? Let’s review:

  • Cost optimization – Up to 50% reduction in initial setup effort to migrate to Snowflake
  • Simplification – Template based approach to facilitate Infrastructure setup and Talend jobs
  • Faster time to market – Deliver in weeks instead of months.
  • Agility – Changes to a migration mostly involve metadata updates rather than code changes. New sources, configurations, and environments can be onboarded through a self-service mechanism just by providing metadata, with minimal Talend expertise. Maintenance is also easy, since all data migration configurations are kept in a single metadata repository.

Now go out and get your cloud data lake up and running quickly. Comment below and let me know what you think!

The post Accelerate the Move to Cloud Analytics with Talend, Snowflake and Cognizant appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

What the Healthcare Industry Can Teach Companies About Their Data Strategy

Talend - Fri, 11/30/2018 - 14:55

The information revolution, which holds the promise of a supercharged economy through the use of advanced analytics, data management technologies, the cloud, and knowledge, is affecting every industry. Digital transformation requires major IT modernization and the ability to shorten the time from data to insights in order to make the right business decisions. For companies, it means being able to efficiently process and analyze data from a variety of sources at scale. All this in the hope of streamlining operations, enhancing customer relationships, and providing new and improved products and services.

The healthcare and pharmaceutical industries are the perfect embodiment of what is at stake with the data revolution. Opportunities lie at every step of the healthcare value chain for those who succeed in their digital transformation:

  • Prevention: Predicting patients at risk for disease or readmission.
  • Diagnosis: Accurately diagnosing patient conditions, matching treatments with outcomes.
  • Treatment: Providing optimal and personalized health care through the meaningful use of health information.
  • Recovery and reimbursement: Reducing healthcare costs, fraud and avoidable healthcare system overuse. Providing support for reformed value-based incentives for cost effective patient care, effective use of Electronic Health Records (EHR), and other patient information.

Being able to unlock the relevance of healthcare data is the key to having a 360-view of the patient and, ultimately, delivering better care.

Data challenges in the age of connected care

But that's easier said than done. The healthcare industry faces the same challenge as others, in that business insights are often missed due to the speed of change and the complexity of mounting data users and needs. Healthcare organizations have to deal with massive amounts of data housed in a variety of data silos, such as information from insurance companies and patient records from multiple physicians and hospitals. To access this data and quickly analyze healthcare information, it is critical to break down the data silos.

Healthcare organizations are increasingly moving their data warehouses to cloud-based solutions and creating a single, unified platform for modern data integration and management across cloud and on-premises environments. Cloud-based integration solutions provide broad and robust connectivity, data quality and governance tracking, simple pricing, data security, and big data analysis capabilities.

Decision Resources Group (DRG) finds success in the cloud

Decision Resources Group (DRG) is a good example of the transformative power of the cloud for healthcare companies. DRG provides healthcare analytics, data and insight products and services to the world’s leading pharma, biotech and medical technology companies. To extend its competitive edge, DRG made the choice to build a cloud data warehouse to support the creation of its new Real-World Data Platform, a comprehensive claim and electronic health record repository that covers more than 90% of the US healthcare system. With this platform, DRG is tracking the patient journey, identifying influencers in healthcare decision making and segmenting data so that their customers have access to relevant timely data for decision making.

DRG determined that their IT infrastructure could not scale to handle the petabytes of data that needed to be processed and analyzed. They looked for solutions that contained a platform with a SQL engine that works with big data and could run on Amazon Web Services (AWS) in the cloud.

DRG selected data integration provider Talend and the Snowflake cloud data warehouse as the foundation of its new Real-World Data Platform. With integrations with Spark for advanced machine learning and Tableau for analysis, DRG gets scalable compute performance without complications, allowing its developers to build data integration workflows without much coding. DRG now has the infrastructure needed to accommodate and sustain massive growth in data assets and user groups over time and is able to perform big data analytics at the speed of the cloud. This is the real competitive edge.

The right partner for IT modernization

DRG is not the only healthcare company that has chosen to modernize its enterprise information architecture in the cloud. AstraZeneca, the world's seventh-largest pharmaceutical company, chose to build a cloud data lake with Talend and AWS for its digital transformation. This architecture enables it to scale up and down based on business needs.

Healthcare and pharmaceutical companies are at the forefront of a major transformation across all industries, one that requires the use of advanced analytics and big data technologies such as AI and machine learning to process and analyze data and deliver insights. This digital transformation requires IT modernization, using hybrid or multi-cloud environments and providing a way to easily combine and analyze data from various sources and formats. Talend is the right partner for these healthcare companies, but also for any other company going through digital transformation.

Additional Resources: 

Read more about the DRG case study: https://www.talend.com/customers/drg-decision-resources-group/

Read more about the AstraZeneca case study: https://www.talend.com/customers/astrazeneca/

Talend Cloud: https://www.talend.com/products/integration-cloud/

The post What the Healthcare Industry Can Teach Companies About Their Data Strategy appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Three reasons to move your on-premises data architecture to the cloud

SnapLogic - Thu, 11/29/2018 - 15:35

Most companies only use 5 to 10 percent of the data they collect. So estimates Beatriz Sanz Sai, a 20-year veteran in advanced analytics and the head of Ernst & Young’s global data and analytics practice. While it’s impossible to validate such a claim, the fact is many organizations are gathering lots of data but[...] Read the full article here.

The post Three reasons to move your on-premises data architecture to the cloud appeared first on SnapLogic.

Categories: ETL

Talend and Red Hat OpenShift Integration: A Primer

Talend - Wed, 11/28/2018 - 14:41

One of the aspects of Talend that has always fascinated me is its ability to run jobs using multiple execution methodologies. Today I want to give an overview of a new way of executing data integration jobs using Talend and the Red Hat OpenShift Platform.

First and foremost, let us do a quick recap of the standard ways of running Talend jobs. Users usually run Talend jobs using Talend schedulers, which can be either in the cloud or on-premises. Other methods include creating standalone jobs, building web services from Talend jobs, and building OSGi bundles for ESB; the latest entry to this list, from Talend 7.1 onwards, is building the job as a Docker image. For this blog, we are going to focus on the Docker route and show you how Talend Data Integration jobs can be used with the Red Hat OpenShift Platform.

I would also highly recommend reading two other interesting Talend blogs related to the interaction between Talend and Docker, which are:

  1. Going Serverless with Talend through CI/CD and Containers by Thibaut Gourdel
  2. Overview: Talend Server Applications with Docker by Michaël Gainhao 

Before going into other details, let's cover the basics of containers, Docker, and the Red Hat OpenShift Platform. For those who are already proficient in container technology, I recommend skipping ahead to the next section of the blog.

Containers, Docker, Kubernetes and Red Hat OpenShift

What is a container? A container is a standardized unit of software that is quite lightweight and can be executed without environment-related constraints. Docker is the most popular container platform, and it has helped the information technology industry on two major fronts: reducing infrastructure and maintenance costs, and reducing the turnaround time to bring applications to market.

The diagram above shows how the various layers of the Docker container platform and Talend jobs are stacked as application containers. The Docker platform interacts with the underlying infrastructure and host operating system, and it helps the application containers run in a seamless manner without knowing the complexities of the underlying layers.

Kubernetes

Next, let us quickly talk about Kubernetes and how it has helped in the growth of container technology. As we build more and more containers, we need an orchestrator that can control the management, automatic deployment, and scaling of those containers, and Kubernetes is the software platform that does this orchestration in a magical way.

Kubernetes coordinates a cluster of computers as a single unit on which we can deploy containerized applications. It consists of Pods, which act as logical hosts for containers, and these Pods run on worker machines called Nodes. There are many other concepts in Kubernetes, but let us limit ourselves to the context of this blog, since Talend job containers are executed on top of these Pods.

Red Hat OpenShift

OpenShift is the open source container application platform from Red Hat, built on top of Docker containers and the Kubernetes container cluster manager. I am republishing the official OpenShift block diagram from the Red Hat website for your quick reference.

OpenShift comes in a few flavors apart from the free (Red Hat OpenShift Online Starter) version.

  1. Red Hat OpenShift Online Pro
  2. Red Hat OpenShift Dedicated
  3. Red Hat OpenShift Container Platform

OpenShift Online Pro and Dedicated run on Red Hat-hosted infrastructure, while OpenShift Container Platform can be set up on a customer's own infrastructure.

Now let's move to more familiar territory, where we plan to convert a Talend job into a Docker container.

Talend Job Conversion to Container and Image Registry Storage

For customers who are using older versions of Talend, we will first create a Docker image from a sample Talend job. Those who are already on Talend 7.1 can export Talend jobs to Docker directly, as mentioned in the introduction, so you can safely skip to the next section, where the Docker image is already available, and we will meet you there. For those still with me, let us quickly build a Docker image for a sample job.

Categories: ETL

Three ways API management transforms your organization

SnapLogic - Tue, 11/27/2018 - 15:50

In my previous blog post, “Future-proof your API lifecycle strategy,” I took a pretty nuts-and-bolts approach in explaining why companies are rethinking their application programming interface (API) lifecycle strategy for the future. Here I’ll take the discussion up a notch, to talk about three ways that a modern approach to API management can fundamentally change[...] Read the full article here.

The post Three ways API management transforms your organization appeared first on SnapLogic.

Categories: ETL

Getting Started with Talend Open Studio: Building a Complex tMap Job

Talend - Tue, 11/27/2018 - 15:00

In our previous blog, we walked through a simple job moving data from a CSV file into a Snowflake data warehouse.  In this blog, we will explore some of the more advanced features of the tMap component.

Similar to the last blog, you will be working with customer data in a CSV file and writing out to a Snowflake data warehouse; however, you will also be joining your customer CSV file with transaction data. As a result, you will need Talend Open Studio for Data Integration, two CSV data sources that you would like to join (in this example we use customer and transaction data sets), and a Snowflake warehouse for this tutorial. If you would like to follow a video-version of this tutorial, feel free to watch our on-demand webinar and skip to the fourth video.

First, we will join and transform customer data and transaction data.  As you join the customer data with transaction data, any customer data that does not find matching transactions will be pushed out to a tLogRow component (which will present the data in a Studio log following run time). The data that is successfully matched will be used to calculate top grossing customer sales before being pushed out into a Sales Report table within our Snowflake database.

Construct Your Job

Now, before beginning to work on this new job, make sure you have all the necessary metadata configurations in your Studio's Repository. As demonstrated in the previous blog, you will need to import your customer metadata, and you will use the same process to import your transaction metadata. In addition, you will need to import your Snowflake data warehouse connection, as mentioned in the previous blog, if you haven't done so already.

So that you don’t have to start building a new job from scratch, you can take the original job that you created from the last blog (containing your customer data, tMap and Snowflake table) and duplicate it by right-clicking on the job and selecting Duplicate from the dropdown menu. Rename this new job – in this example we will be calling the new job “Generate_SalesReport”.

Now, in the Repository, you can open the duplicated job and begin adjusting it as needed. More specifically, you will need to delete the old Snowflake output component and the Customers table configuration within the tMap.

Once that is done, you can start building out the new flow. 

Start building out your new job by first dragging and dropping your Transactions metadata definition from the Repository onto the Design Window as a tFileInputDelimited component, connecting this new component to the tMap as a lookup.  An important rule-of-thumb to keep in mind when working with the tMap component is that the first source connected to a tMap is the “Main” dataset.  Any dataset linked to the tMap after the “Main” dataset is considered a “Lookup” dataset.

At this point it is a good idea to rename the source connections to the tMap.  Naming connections will come in handy when it’s time to configure the tMap components. To rename connections, perform a slow double-click on the connection arrow. The name will become editable.  Name the “Main” connection (the Customer Dataset) “Customers” and the “Lookup” connection (the Transactions dataset) “Transactions”.  Later, we will come back to this tMap and configure it to perform a full inner join of customer and transaction data.  For now, we will continue to construct the rest of the job flow.

To continue building out the rest of the job flow, connect a tLogRow component as an output from the tMap (in the same way as discussed above, rename this connection Cust_NoTransactions). This tLogRow will capture customer records that have no matching transactions, allowing you to review non-matched customer data within the Studio log after you run your job. In a productionized job flow, this data would be more valuable in a database table, making it available for further analysis, but for simplicity we will just write it out to a log.

The primary output of our tMap consists of customer data that successfully joins to transaction data. Once joined, this data will be collected using a tAggregateRow component to calculate total quantity and sales of items purchased. To add the tAggregateRow component to the design window, either search for it within the Component Pallet and then drag and drop it into the Design Window OR click directly in the design window and begin typing “tAggregateRow” to automatically locate and place it into your job flow. Now, connect your tAggregateRow to the tMap and name the connection “Cust_Transactions”.

Next, you will want to sort your joined, aggregated data, so add the tSortRow component.

In order to map the data to its final destination, your Snowflake target table, you will need one more tMap. To distinguish between the two tMap components and their intended purposes, make sure to rename this tMap to something like "Map to Snowflake".

Finally, drag and drop your Snowflake Sales Report table from within the Repository to your Design window and ensure the Snowflake output is connected to your job. Name that connection “Snowflake” and click “Yes” to get the schema of the target component.

As a best practice, give your job a quick look over and ensure you’ve renamed any connections or components with clear and descriptive labels. With your job constructed, you can now configure your components.

Configuring Your Components

First, double-click to open the Join Data tMap component configuration. On the left, you can see two source tables, each identified by their connection name. To the right, there are two output tables: one for the customers not matched to any transactions and one for the joined data.

Start by joining your customers and transactions data. Click and hold ID from within the Customers table and drag and drop it onto ID from within the Transactions table. The default join type in a tMap component is a Left Outer Join, but you will want to join only those customer IDs that have matching transactions, so switch the Join Model to an "Inner Join".

Within this joined table, we want to include the customer ID in one column and the customer's full name in a separate column. Since our data has first name and last name as two separate columns, we will need to combine them, creating a new "expression". To do this, drag and drop both the "first_name" and "last_name" columns onto the same line within the table. We will complete the expression in a bit.

Similarly, we want the Quantity column from the transaction data on its own line, but we also want to use it to complete a mathematical expression. By dragging and dropping Unit Price and Quantity onto the same line within the new table, we can do just that.

You can now take advantage of the "Expression Builder", which gives you even more control over your data. It offers a list of pre-coded functions that you can apply directly to this expression; I highly recommend that you look through the Expression Builder to see what it can offer. Even better, if you know the Java code for your action, you can enter it manually. In this first case, we want to concatenate the first and last names. After adding the correct syntax within the expression builder, click Ok.

You will want to use the Expression Builder again for your grouped transaction expression. With the Unit Price and Quantity expression, complete an arithmetic action to get the total transaction value by multiplying the Unit Price by the Quantity. Then, click Ok.
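
Since tMap expressions are plain Java, the two expressions above boil down to a string concatenation and a multiplication. The standalone sketch below is only an illustration; in Studio the expressions reference row columns (for example, Customers.first_name and Transactions.qty), whereas here simple variables stand in for those columns.

```java
// Standalone illustration of the two tMap expressions described above.
public class TmapExpressions {
    public static void main(String[] args) {
        String firstName = "Jane";   // stands in for Customers.first_name
        String lastName = "Doe";     // stands in for Customers.last_name
        double unitPrice = 4.50;     // stands in for Transactions.Unit_Price
        int qty = 2;                 // stands in for Transactions.qty

        // Concatenation expression: first and last name joined into one "name" column
        String name = firstName + " " + lastName;

        // Arithmetic expression: unit price multiplied by quantity gives the transaction cost
        double transactionCost = unitPrice * qty;

        System.out.println(name + " -> " + transactionCost);
    }
}
```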

Remember, we set our Join Model to an Inner Join. However, Talend offers a nice way to capture the customers who didn't have transactions. To capture these "rejects" from an Inner Join, first drag and drop ALL the fields from the customers table to the Cust_NoTransactions output table. Then, select the tool icon at the top right of this table definition and switch "Catch lookup inner join reject" to "true".

With the fields properly mapped, it is time to move on and review the data below. Rename the first_name field to be simply “name” (since it now includes the last name) and rename the Unit Price column to “transaction cost” (since it now has the mathematical expression applied). Then, ensure no further adjustments are necessary to the table’s column types to avoid any mismatched type conflicts through the flow. 

With this tMap properly configured, click Ok. And then click “Yes” to propagate the changes.

Next, you will need to configure the Aggregate component. To do this, enter the Component Tab (below the Design Workspace) and edit the schema.

To properly configure the output schema of the tAggregateRow component, first choose the columns on the left that will be grouped. In this case, we want to group by ID and Name, so select "id" and "name" and then click the yellow arrow button pointing to the right. Next, we want to create two new output columns to store our aggregated values. By clicking the green "+" button below the "Aggregate Sales (Output)" section, you can add the desired number of output columns. First, create a new output column for the total quantity ("total_qty") and identify it as an Integer type. Then create another for the total sales ("total_sales") and set it as a Double type. Next, click Ok, making sure to propagate the changes.

With the output schema configured properly within the tAggregateRow component, we can now configure the Group By and Operations sections of the component. To add your two Group By output columns and two Operations output columns, go back to the Component Tab. Click the green plus sign below the Group By section twice and below the Operations section twice to account for the output columns configured in the tAggregateRow schema. Then, in the Operations section, set the "total_qty" column's function to "sum" and identify the input column position as "qty". This configures the tAggregateRow component to add all the quantities from the grouped customer IDs and output the total value in the "total_qty" column. Likewise, set the "total_sales" function to "sum" and its input column position to "transaction_cost".
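
Conceptually, this tAggregateRow step is a group-by with two sums. Here is a rough, non-Talend sketch of the same logic in plain Java; the record type, field names, and sample rows are illustrative only.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java illustration of what the tAggregateRow configuration computes:
// group the joined rows by customer name, summing quantity and transaction cost.
public class AggregateSales {
    record CustTransaction(int id, String name, int qty, double transactionCost) {}

    public static void main(String[] args) {
        List<CustTransaction> rows = List.of(
            new CustTransaction(1, "Jane Doe", 2, 9.00),
            new CustTransaction(1, "Jane Doe", 1, 4.50),
            new CustTransaction(2, "John Smith", 3, 12.00));

        // total_qty per customer
        Map<String, Integer> totalQty = rows.stream()
            .collect(Collectors.groupingBy(CustTransaction::name,
                     Collectors.summingInt(CustTransaction::qty)));

        // total_sales per customer
        Map<String, Double> totalSales = rows.stream()
            .collect(Collectors.groupingBy(CustTransaction::name,
                     Collectors.summingDouble(CustTransaction::transactionCost)));

        System.out.println(totalQty);
        System.out.println(totalSales);
    }
}
```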

Next, head to the sorting component and configure it to sort by total sales to help us identify who our highest paying customers are. To do this, click on the green “+” sign in the Component Tab, select “total_sales” in the Schema Column, and select “num” to ensure that your data is sorted numerically. Last, choose “desc” so your data will be shown to you in descending order.

Now, configure your final tMap component, by matching the customer name, total quantity and total sales. Then click Ok and click Yes to propagate the changes.

Finally, make sure your tLogRow component is set to present your data in table format, making it easier for you to read the inner join reject data.

Running Your Job

At last, you are ready to run your job!

Thanks to the tLogRow component, within the log, you can see the six customers that were NOT matched with transaction data.

If you head to Snowflake, you can view your “sales_report” worksheet and review the top customers in order of highest quantity and sales.

And that’s how to create a job that joins different sources, captures rejects, and presents the data the way you want it. In our next blog, we will be going through running and debugging your jobs. As always, please comment and let us know if there are any other basic skills you would like us to cover in a tutorial.

The post Getting Started with Talend Open Studio: Building a Complex tMap Job appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Web Summit 2018 Recap: From data to insights

SnapLogic - Wed, 11/21/2018 - 16:07

Earlier this month, SnapLogic CEO Gaurav Dhillon joined fellow CEOs Zander Lurie of SurveyMonkey and Jager McConnell of Crunchbase at the annual Web Summit conference in Lisbon, Portugal for a lively panel discussion on the opportunities and challenges around data. The session, “Big Data to Big Insights,” was moderated by Intellyx founder and Forbes contributor[...] Read the full article here.

The post Web Summit 2018 Recap: From data to insights appeared first on SnapLogic.

Categories: ETL

5 Recipes for Not Becoming the Data Turkey of Your Organization

Talend - Wed, 11/21/2018 - 09:24

With Thanksgiving around the corner, it’s a perfect moment to take a step back and get some recipes to be data savvy within your organization. Fortunately, Talend experts have a recipe for data success that will help you to stay above the fray.

As companies become more data driven, being ahead of the curve will obviously be seen as a sign of curiosity and a way to differentiate. It is also a means for your company to anticipate incoming trends and thrive in a changing world where data has become the subject of concern and heavy regulation.

Follow these simple recipes to anticipate trends, follow regulations, and better manage your data.

Recipe #1: Learn more about the Data Kitchen and how to be GDPR Compliant

Recent news is here to remind us that failing to meet data compliance standards can be damaging for any type of organization. As gdprtoday reported, data complaints are widespread, and this won't stop here.

To better understand GDPR, avoid penalties, and build proper data governance, follow the guidance of our GDPR whitepaper, which explains how to regain control of your data and get ready for data protection regulations.

Recipe #2: Open the fridge and discover your Data

While you take GDPR into consideration, it is also the right time to identify how to get more value from the data you do have. To do that, you first need to understand what's inside your data sources and assess it.

Data profiling is the process of examining the data available in different data sources and collecting statistics and information about that data. It helps you assess the quality level of the data against defined goals. If data is of poor quality, or managed in structures that cannot be integrated to meet the needs of the enterprise, business processes and decision-making suffer.
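
As a small, tool-agnostic illustration (this is not Talend Open Studio for Data Quality itself), profiling can be as simple as counting rows, empty values, and distinct values per column of a CSV file. The file name, column index, and simple comma-splitting below are placeholder assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal column-profiling sketch: row count, empty-value count, and distinct
// values for one column of a comma-separated file (quoted fields not handled).
public class SimpleProfiler {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Path.of("customers.csv")); // placeholder file
        int columnIndex = 2;            // column to profile, e.g. "email"
        long rows = 0, empty = 0;
        Set<String> distinct = new HashSet<>();

        for (String line : lines.subList(1, lines.size())) { // skip header row
            String[] fields = line.split(",", -1);
            String value = fields.length > columnIndex ? fields[columnIndex].trim() : "";
            rows++;
            if (value.isEmpty()) empty++; else distinct.add(value);
        }
        System.out.printf("rows=%d empty=%d distinct=%d%n", rows, empty, distinct.size());
    }
}
```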

The best advice would be to read this post, which explains the principles of data profiling. If you're a data engineer, also follow this introduction to Talend Open Studio for Data Quality.

Recipe #3: Engage your guests, cook and enrich data together.

You alone will have a hard time solving all your data quality problems. It is far better to treat data as an organizational priority rather than solely an IT responsibility. Managing data quality beyond IT involves different roles across your organization, making your data strategy an enterprise-wide success. This webinar will explain the very first steps of collaborative data management, and if you don't want to fail, this post will provide you with some good recommendations.

Recipe #4: Set the table and let the trust flow freely

Once your data is cleaned, you will need to provide your teams with a way to share and crawl datasets easily. Follow this webinar about creating a single point of trust with the newly announced Talend Data Catalog. You'll learn why a data catalog benefits your entire company and how to take advantage of it.

Recipe #5: Don't cook solo. Learn from experienced cooks.

You may be looking for customer references or good recipes from companies in your industry. Don't hesitate to download this guide to see how companies fight their data integrity challenges with modern Talend tools.

Want dessert? Why not enjoy a good pecan pie with this thought-leadership IDC whitepaper about intelligent governance?

And if you’re still hungry, don’t hesitate to download our Definitive Guide to Data Quality.

Happy Thanksgiving!

The post 5 Recipes for Not Becoming the Data Turkey of Your Organization appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Removing the integration headache in M&A deals

SnapLogic - Tue, 11/20/2018 - 15:28

Originally published in Finance Monthly.  The merger and acquisition market is on track to hit record levels in 2018. According to Mergermarket, the first half of the year saw 8,560 deals recorded globally at a value of $1.94tn, with 26 deals falling into the megadeals category of over $10bn per deal. The landscape is littered[...] Read the full article here.

The post Removing the integration headache in M&A deals appeared first on SnapLogic.

Categories: ETL

A Serverless Architecture for Big Data

Talend - Mon, 11/19/2018 - 10:53

This post is co-authored by Jorge Villamariona and Anselmo Barrero at Qubole.

A popular term emerging from the software industry over the last few years is serverless computing, more commonly referred to as just “serverless”. So what does it mean? In its simplest form, a serverless architecture is a computing model where a service provider dynamically manages the allocation of computing resources based on a Service Level Agreement (SLA), provisioning and running resources only for the time needed and without requiring end-user involvement. 

With a serverless architecture, the service provider automatically increases computing capacity when demand for resources is high and intelligently downscales when demand goes down. In this architecture, end users only care about the tasks they want to execute (get a report, execute a query, execute a data pipeline, etc.) without the hassle of procuring, provisioning, and managing the underlying infrastructure.

Traditional vs. Serverless Architectures

So, what are some major advantages of going serverless? Cost, scale, and environment options, to start. Traditional architectures rely on the infrastructure administrator's ability to estimate workloads and size hardware and software accordingly. Moving to the cloud represents an improvement over on-premises architectures because it allows the infrastructure to scale on demand.

However, administrators still need to be involved to define the conditions and rules to scale and manage the cloud infrastructure. The next step forward is to leverage a serverless architecture and allow the infrastructure to automatically decide behind the scenes when to provision, scale and decommission resources as workloads change.  Qubole is a great example of a serverless architecture.

The Qubole platform automatically determines the infrastructure needed and scales it intelligently based on the workloads and SLAs.  As a result, Qubole’s serverless architecture saves customers over 50% in annual infrastructure costs compared to traditional and other managed cloud big data architectures.

This intelligent automation allows Qubole to process over an exabyte of data per month for customers deploying AI, machine learning, and analytics, without requiring those customers to provision and manage any infrastructure.

Value of adopting a serverless architecture for Big Data

Big data deals with large volumes of data arriving at high speed, which makes it difficult and inefficient to estimate the infrastructure required to process it ahead of time. On-premises infrastructures impose limits on processing power and are expensive and complex to manage and maintain. Deploying big data in the cloud on your own, or as a managed service from cloud providers (Amazon AWS, Microsoft Azure, Google Cloud, etc.), eases the processing limitations and the capital expense, but it creates overhead in managing and optimizing the infrastructure. Improper utilization, whether underutilization or overutilization in certain periods, can lead to cloud costs that are much higher than on-premises processing. This, combined with scarce skilled resources, results in a very low success rate of only 15% for all big data projects, according to Gartner.

To successfully leverage a serverless platform for big data, you need to look for a solution that addresses the following questions:

  • Will it reduce big data infrastructure costs?
  • Does it provide automation and resources to execute data pipelines and provide analytics at any scale?
  • Will it reduce operational costs?
  • Will it help my data team scale and not be overrun by business demands for data? 

A serverless platform like Qubole is very appealing to teams deploying big data because it addresses the factors that cause big data projects to fail: it reduces infrastructure complexity and costs, as well as the reliance on scarce experts.

Qubole reduces administration overhead by providing a simple interface to define the run-time characteristics of big data engines. Users only need to specify the minimum and maximum cluster size, whether to leverage spot instances (in the case of AWS), and the cluster composition to meet their price-performance objectives. Qubole then takes over and automatically manages the infrastructure based on the business requirements and the workloads' SLAs, without the need for further manual intervention.
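The run-time characteristics described above boil down to a handful of settings. The snippet below is only an illustrative sketch of that configuration surface, expressed as a plain Python dictionary; it is not the actual Qubole API or configuration schema, and every value is a made-up example.

# Illustrative only -- not the actual Qubole API or configuration schema.
cluster_profile = {
    "min_nodes": 2,               # smallest footprint when the cluster is idle
    "max_nodes": 200,             # ceiling the platform may scale up to
    "use_spot_instances": True,   # leverage AWS spot capacity for better price-performance
    "spot_percentage": 50,        # hypothetical mix of spot vs. on-demand nodes
    "node_type": "r5.2xlarge",    # assumed instance type for the workload
}
# The platform, not the administrator, decides when to move between
# min_nodes and max_nodes based on the workloads and their SLAs.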

Qubole's serverless architecture auto-scales to avoid latencies when dealing with large, bursty incoming loads, and it down-scales to avoid idle, wasted resources. Qubole can scale from 5 nodes up to 200 nodes in less than 5 minutes. For reference, Qubole also manages the largest Spark cluster in the cloud (500+ nodes).

TCO of a Serverless Big Data Architecture 

When it comes to pricing, Qubole's serverless architecture offers the best performance by adding computing capacity only when needed and scaling it back down in an orderly fashion as soon as resources become idle.

With Qubole there is no infrastructure administration overhead and no overspend on cloud resources. Additionally, data teams leveraging Qubole don't suffer from delays in provisioning computing resources when workloads suddenly increase.

The combination of Talend Cloud and Qubole not only lowers infrastructure costs but also increases the productivity of the data team, since they don't need to worry about cluster procurement, configuration, and management. Data teams build their data pipelines in Talend Cloud and push their execution to the Qubole serverless platform, all without having to write complex code or manage infrastructure.

This partnership allows these teams to focus on building highly functional end-to-end data pipelines, so data scientists can more quickly deploy the IoT, machine learning, and advanced analytics applications that have a high impact on the business. With Talend and Qubole, data teams build scalable serverless data pipelines that run at low operating cost while often being engineered and maintained by a single developer. This cost reduction makes the benefits of big data accessible to a wider audience.

To learn more about Qubole and test-drive the serverless platform, visit https://www.qubole.com/lp/testdrive/

About the Authors

Jorge Villamariona works for the Product Marketing team at Qubole.  Over the years Mr. Villamariona has acquired extensive experience in relational databases, business intelligence, big data engines, ETL,  and CRM systems. Mr. Villamariona enjoys complex data challenges and helping customers gain greater insight and value from their existing data.

Anselmo Barrero is a Director of Business Development at Qubole with more than 25 years of experience in IT and three patents granted. Mr. Barrero is passionate about building products and strategic partnerships to address market opportunities. He has created products that yield more than 50% YoY growth and established strategic partnerships in areas such as data warehousing that resulted in more than 100% consecutive YoY growth. In his current role, Mr. Barrero is responsible for establishing strategic partnerships in big data and the cloud to allow customers to reduce the cost and time of getting value out of their data.

The post A Serverless Architecture for Big Data appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Getting Started with Talend Open Studio: Building Your First Job

Talend - Sun, 11/18/2018 - 11:02

In the previous blog, we walked through the installation and set-up of Talend Open Studio and briefly demonstrated key features to familiarize you with the Studio interface. In this blog, we will build a simple job to load data from a local file into Snowflake, a cloud data warehouse technology. More specifically, we will build a new job that takes customer data from your local machine and maps it to a target table within Snowflake.

To follow along in this tutorial, you will need Talend Open Studio for Data Integration (download here), some customer data (either use your own customer data or generate some dummy data), and a Snowflake data warehouse with a database already created. If you don’t have access to a Snowflake data warehouse, you can use another relational database technology.

To see this tutorial in action, feel free to watch our step-by-step webinar—just skip to the third video.

To get started, right-click within the Job Designs folder in the repository and create a new folder titled “Data Integration” to house your job. Next, dive into that new folder and choose “create a job” and name your job Customer_Load. As a best practice, enter the job’s intended purpose and a general description of its overall function. Once you click finish, the new job will be available within your new folder.

Bringing Data from a CSV into a Talend Job

Before building out your flow, bring your customer data into the Repository. To do this, create a new file delimited element within your Metadata folder by right-clicking on the File Delimited button and choosing “Create File Delimited.” Then, name the file “Customer” and click Next.

From there, browse to locate your customer data file. Once selected, the data is visible within the File Viewer. To define the settings and your data elements, click “Next”. In the next window, choose to use a comma as a field separator. Because we selected a CSV file, set the Escape Char setting to CSV and the Text Enclosure to be double quotes. Make sure to check “Set heading row as column names” before proceeding.
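If it helps to see what those settings correspond to outside Studio, here is a minimal Python sketch that parses the same kind of file: comma-separated, double-quoted fields, first row used as column names. The file name is a placeholder for your own customer data.

import csv

# "customers.csv" is a placeholder for your own customer data file.
with open("customers.csv", newline="", encoding="utf-8") as f:
    # Comma separator, double-quote text enclosure; doubled quotes inside
    # fields are handled by the csv module's default CSV escaping.
    reader = csv.DictReader(f, delimiter=",", quotechar='"')
    rows = list(reader)        # the first row becomes the column names, as in Studio

print(reader.fieldnames)       # the detected schema (column names)
print(len(rows), "rows read")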

Now the customer data is imported and organized. One final time, click “Next” to confirm your data schema. Talend Open Studio will guess each column’s type based on each column’s contents—be sure to double check that everything is correct.

After checking this data set, you can see that Talend Open Studio guessed that the "phone2" column was a date, which is incorrect; change it to String instead, and then click Finish.

Next, you can drag your Customers delimited file onto the Design window as a tFileInputDelimited component. This brings your customer data into the Talend job.

Creating Your Snowflake Connection

Next, you need to create a new connection to your existing Snowflake table. First, find the Snowflake heading in the Repository and right-click to “Create a new connection”. Give your new connection a name, and then enter your account name (so if your Snowflake URL is talend.snowflake.com, your account name would be talend), User ID, and password. Also identify the Snowflake warehouse, schema and database you will be moving your data to.
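If you want to sanity-check those connection details outside Studio, a minimal sketch using the Snowflake Python connector looks like the following; the account, credentials, and object names are placeholders you would replace with your own.

import snowflake.connector  # pip install snowflake-connector-python

# All values below are placeholders -- use your own account details.
conn = snowflake.connector.connect(
    account="talend",          # if your URL is talend.snowflake.com, the account name is "talend"
    user="MY_USER",
    password="MY_PASSWORD",
    warehouse="MY_WAREHOUSE",
    database="MY_DATABASE",
    schema="PUBLIC",
)
cur = conn.cursor()
cur.execute("SELECT CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA()")
print(cur.fetchone())          # confirms the connection points where you expect
cur.close()
conn.close()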

After you input all of the necessary information, test the connection and make sure it is successful. Following a successful connection to Snowflake, select from the listed tables those you want to be added to the Talend Repository for this connection then click finish. This will import the schema of those tables from Snowflake into the Repository. From there, you can now choose your table of interest from the repository (in this case, Customers) and drag and drop it into the Design window as a tSnowflakeOutput component. As a side note, we have chosen to use an existing table in this tutorial; however, you can also use Talend to create a table in an existing database.

To map the source data (customer file) to the target table (Snowflake), add a tMap component by clicking within the Design window and searching for “tMap”. The tMap component is a very robust component that can be used for a wide range of functions, but for now, we will be using it to simply link the fields between two tables (to learn more about tMap, stay tuned for the next blog in the series). To start using the tMap, connect the CSV file to tMap by dragging the orange arrow from the file delimited component to the tMap.

Next, to connect the tMap to your Snowflake output, right-click on tMap, select Row, and click *New Output* to create a new output connection; give it a name like "Customers". Then, select "Yes" when asked whether you would like to get the schema of the target component.

Your Design window should now show the customer file delimited component connected through the tMap to the Snowflake output component.

To configure the tMap component, double-click on the component itself within the Design window to reveal the input and output tables. Here you must link the columns of the two tables together. You can either drag and drop to link each corresponding field individually, or select Automap, which works well in this case to link the fields between the two tables. Make any adjustments necessary. Once you have ensured the types have been properly auto-selected, click OK to save this configuration.
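Conceptually, that link step is just a source-to-target column mapping. As a rough sketch of what tMap is doing for you (with hypothetical column names; your schemas will differ):

# Hypothetical column names -- your source and target schemas will differ.
column_map = {
    "first_name": "FIRST_NAME",   # source CSV column -> target Snowflake column
    "last_name":  "LAST_NAME",
    "email":      "EMAIL",
    "phone1":     "PHONE1",
    "phone2":     "PHONE2",       # kept as a String, per the schema fix above
}

def map_row(source_row):
    """Rename source fields to the target table's column names."""
    return {target: source_row[source] for source, target in column_map.items()}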

If you haven't installed the additional packages yet, the Snowflake output component will flash an error. If that's the case, simply choose to install the additional packages from the Help drop-down.

You’re now ready to run the job and populate the data tables within Snowflake. Within the Run tab, simply click Run. You can watch the process run from start to finish within Studio, pushing 500 rows out to Snowflake.
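Under the hood, the output component is ultimately issuing inserts against the target table. A rough Python equivalent of that final load step, reusing the conn, rows, and map_row sketches above, might look like this; the CUSTOMERS table and column names are assumptions, and the actual job generates and runs Java rather than Python.

# Assumes the conn, rows, column_map, and map_row sketches from earlier in this post.
mapped = [map_row(r) for r in rows]
columns = list(column_map.values())
placeholders = ", ".join(["%s"] * len(columns))

cur = conn.cursor()
cur.executemany(
    f"INSERT INTO CUSTOMERS ({', '.join(columns)}) VALUES ({placeholders})",
    [tuple(m[c] for c in columns) for m in mapped],
)
conn.commit()
print(cur.rowcount, "rows loaded")   # should report 500 for the sample data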

Once the run has been completed successfully, you can head to your Snowflake account. In this example, you can see that 500 records were successfully processed through Talend Studio and loaded into your Snowflake Cloud Data Warehouse.

And that’s how to build your first job within Talend Open Studio. In our next blog, we will go through some more complex functionalities of tMap, and we will also give a few tips on running and debugging your Talend jobs. Please leave a comment and let us know if there are any other things that would help you get started on Talend Open Studio.

The post Getting Started with Talend Open Studio: Building Your First Job appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Beachbody Gets Data Management in Shape with Talend Solutions

Talend - Thu, 11/15/2018 - 21:20

This post is co-authored by Hari Umapathy, Lead Data Engineer at Beachbody and Aarthi Sridharan, Sr.Director of Data (Enterprise Technology) at Beachbody.

Beachbody is a leading provider of fitness, nutrition, and weight-loss programs that deliver results for our more than 23 million customers. Our 350,000 independent “coach” distributors help people reach their health and financial goals.

The company was founded in 1998, and has more than 800 employees. Digital business and the management of data is a vital part of our success. We average more than 5 million monthly unique visits across our digital platforms, which generates an enormous amount of data that we can leverage to enhance our services, provide greater customer satisfaction, and create new business opportunities.

Building a Big Data Lake

One of our most important decisions with regard to data management was deploying Talend’s Real Time Big Data platform about two years ago. We wanted to build a new data environment, including a cloud-based data lake, that could help us manage the fast-growing volumes of data and the growing number of data sources. We also wanted to glean more and better business insights from all the data we are gathering, and respond more quickly to changes.

We are planning to gradually add at least 40 new data sources, including our own in-house databases as well as external sources such as Google Adwords, Doubleclick, Facebook, and a number of other social media sites.

We have a process in which we ingest data from the various sources, store the data that we ingested into the data lake, process the data and then build the reporting and the visualization layer on top of it. The process is enabled in part by Talend’s ETL (Extract, Transform, Load) solution, which can gather data from an unlimited number of sources, organize the data, and centralize it into a single repository such as a data lake.

We already had a traditional, on-premise data warehouse, which we still use, but we were looking for a new platform that could work well with both cloud and big data-related components, and could enable us to bring on the new data sources without as much need for additional development efforts.

The Talend solution enables us to execute new jobs again and again when we add new data sources to ingest into the data lake, without having to write new code each time. We now have a practice of reusing the existing job via a template and just bringing in a different set of parameters. That saves us time and money, and allows us to shorten the turnaround time for any new data acquisitions we have to do as an organization.
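The pattern described here, one generic job driven by many parameter sets, is easy to picture in code. The sketch below is a conceptual illustration only, not Beachbody's or Talend's actual implementation, and the source names and paths are made up.

# Conceptual illustration of a parameterized ingestion template -- not actual Talend job code.
def ingest_to_data_lake(source_name, connection_uri, target_path, file_format="parquet"):
    """One reusable 'template' job; each new data source is just a new parameter set."""
    print(f"Ingesting {source_name} from {connection_uri} into {target_path} as {file_format}")
    # extract -> land raw data in the lake -> register it for downstream processing

# Adding a new source becomes configuration, not new development:
sources = [
    ("google_adwords", "api://adwords",  "s3://lake/raw/adwords/"),
    ("facebook_ads",   "api://facebook", "s3://lake/raw/facebook/"),
    ("salesforce_mc",  "api://sfmc",     "s3://lake/raw/salesforce_mc/"),
]
for name, uri, path in sources:
    ingest_to_data_lake(name, uri, path)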

The Results of Digital Transformation

For example, whenever a business analytics team or other group comes to us with a request for a new job, we can usually complete it over a two-week sprint. The data will be there for them to write any kind of analytics queries on top of it. That’s a great benefit.

The new data sources we are acquiring allow us to bring all kinds of data into the data lake. For example, we’re adding information such as reports related to the advertisements that we place on Google sites, the user interaction that has taken place on those sites, and the revenue we were able to generate based on those advertisements.

We are also gathering clickstream data from our on-demand streaming platform, and all the activities and transactions related to that. And we are ingesting data from the Salesforce.com marketing cloud, which has all the information related to the email marketing that we do. For instance, there’s data about whether people opened the email, whether they responded to the email and how.

Currently, we have about 60 terabytes of data in the data lake, and as we continue to add data sources we anticipate that the volume will at least double in size within the next year.

Getting Data Management in Shape for GDPR

One of the best use cases we’ve had that’s enabled by the Talend solution relates to our efforts to comply with the General Data Protection Regulation (GDPR). The regulation, a set of rules created by the European Parliament, European Council, and European Commission that took effect in May 2018, is designed to bolster data protection and privacy for individuals within the European Union (EU).

We leverage the data lake whenever we need to quickly access customer data that falls under the domain of GDPR. So when a customer asks us for data specific to that customer we have our team create the files from the data lake.

The entire process is simple, making it much easier to comply with such requests. Without a data lake that provides a single, centralized source of information, we would have to go to individual departments within the company to gather customer information. That’s far more complex and time-consuming.

When we built the data lake it was principally for the analytics team. But when different data projects such as this arise we can now leverage the data lake for those purposes, while still benefiting from the analytics use cases.

Looking to the Future

Our next effort, which will likely take place in 2019, will be to consolidate various data stores within the organization with our data lake. Right now different departments have their own data stores, which are siloed. Having this consolidation, which we will achieve using the Talend solutions and the automation these tools provide, will give us an even more convenient way to access data and run business analytics on the data.

We are also planning to leverage the Talend platform to increase data quality. Now that we’re increasing our data sources and getting much more into data analytics and data science, quality becomes an increasingly important consideration. Members of our organization will be able to use the data quality side of the solution in the upcoming months.

Beachbody has always been an innovative company when it comes to gleaning value from our data. But with the Talend technology we can now take data management to the next level. A variety of processes and functions within the company will see use cases and benefits from this, including sales and marketing, customer service, and others.

About the Authors: 

Hari Umapathy

Hari Umapathy is a Lead Data Engineer at Beachbody working on architecting, designing and developing their Data Lake using AWS, Talend, Hadoop and Redshift.  Hari is a Cloudera Certified Developer for Apache Hadoop.  Previously, he worked at Infosys Limited as a Technical Project Lead managing applications and databases for a huge automotive manufacturer in the United States.  Hari holds a bachelor’s degree in Information Technology from Vellore Institute of Technology, Vellore, India.

 

Aarthi Sridharan

Aarthi Sridharan is the Sr. Director of Data (Enterprise Technology) at Beachbody LLC, a health and fitness company in Santa Monica. Aarthi's leadership drives the organization's ability to make data-driven decisions for accelerated growth and operational excellence. Aarthi and her team are responsible for ingesting and transforming large volumes of data into the traditional enterprise data warehouse and the data lake, and for building analytics on top of them.

The post Beachbody Gets Data Management in Shape with Talend Solutions appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL