Assistance with Open Source adoption


IP Expo Europe 2018 recap: AI and machine learning on display

SnapLogic - Thu, 10/11/2018 - 17:06

Artificial Intelligence (AI) and machine learning (ML) have come to dominate business and IT discussions the world over. From boardrooms to conferences to media headlines, you can’t escape the buzz, the questions, the disruption. And for good reason – more than any other recent development, AI and ML are transformative, era-defining technologies that are fundamentally[...] Read the full article here.

The post IP Expo Europe 2018 recap: AI and machine learning on display appeared first on SnapLogic.

Categories: ETL

Bitwise: Cloud Data Warehouse Modernization – Inside Look at Talend Connect London

Talend - Thu, 10/11/2018 - 09:28

With expectations of business users evolving beyond limitations of traditional BI capabilities, we see a general thrust of organizations developing a cloud-based data strategy that enterprise users can leverage to build better analytics and make better business decisions. While this vision for cloud strategy is fairly straightforward, the journey of identifying and implementing the right technology stack that caters to BI and analytical requirements across the enterprise can create some stumbling blocks if not properly planned from the get-go.

As a data management consulting and services company, Bitwise helps organizations with their modernization efforts. Based on what we see at our customers when helping to consolidate legacy data integration tools to newer platforms, modernize data warehouse architectures or implement enterprise cloud strategy, Talend fits as a key component of a modern data approach that addresses top business drivers and delivers ROI for these efforts.

For this reason, we are very excited to co-present “Modernizing Your Data Warehouse” with Talend at Talend Connect UK in London. If you are exploring cloud as an option to overcome limitations you may be experiencing with your current data warehouse architecture, this session is for you. Our Talend partner is well equipped to address the many challenges with the conventional data warehouse (that will sound all too familiar to you) and walk through the options, innovations, and benefits for moving to cloud in a way that makes sense to the traditional user.

For our part, we aim to show “how” people are moving to cloud by sharing our experiences for building the right business case, identifying the right approach, and putting together the right strategy. Maybe you are considering whether Lift & Shift is the right approach, or if you should do it in one ‘big bang’ or iterate – we’ll share some practical know-how for making these determinations within your organization.

With so many tools and technologies available, how do you know which are the right fit for you? This is where vendor neutral assessment and business case development, as well as ROI assessment associated with the identified business case, becomes essential for getting the migration roadmap and architecture right from the start. We will highlight a real-world example for going from CIO vision to operationalizing cloud assets, with some lessons learned along the way.

Ultimately, our session is geared to demonstrate that by modernizing your data warehouse in the cloud, you not only get the benefits of speed, agility, flexibility, scalability, and cost efficiency – you also land in a framework with inherent Data Governance, Self-Service and Machine Learning capabilities (no need to develop these from scratch on your own), which are the cutting-edge areas where you can show ROI to your business stakeholders…and become a data hero.

Bitwise, a Talend Gold Partner for consulting and services, is proud to be a Gold Sponsor of Talend Connect UK. Be sure to visit our booth to get a demo on how we convert ANY ETL (such as Ab Initio, OWB, Informatica, SSIS, DataStage, and PL/SQL) to Talend with maximum possible automation.

About the author:

Ankur Gupta

EVP Worldwide Sales & Marketing, Bitwise

The post Bitwise: Cloud Data Warehouse Modernization – Inside Look at Talend Connect London appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

New Talend APAC Cloud Data Infrastructure Now Available!

Talend - Wed, 10/10/2018 - 16:37

As businesses in Asia's primary economic hubs such as Tokyo, Bangalore, Sydney, and Singapore grow at a historic pace, they are moving to the cloud like never before. For those companies, the first and foremost priority is to fully leverage the value of their data while meeting strict local data residency, governance, and privacy requirements. Keeping data in a cloud data center on the other side of the globe simply won't be enough.

That’s why Talend is launching a new cloud data infrastructure in Japan, in addition to its US data center and the EU data center across Frankfurt and Dublin, in a secure and highly scalable Amazon Web Services (AWS) environment, to allow APAC customers to get cloud data integration and data management services closer to where the data is stored. This is most beneficial to local enterprise businesses and foreign companies who have plans to open up offices in the local region.

There are several benefits Talend Cloud customers can expect from this launch.

Accelerating Enterprise Cloud Adoption

Whether your cloud-first strategy is about modernizing legacy IT infrastructure, leveraging a hybrid cloud architecture, or building a multi-cloud platform, Talend's new APAC cloud data infrastructure will make your transition to the cloud more seamless. With a Talend Cloud instance independently available in APAC, companies can build a cloud data lake or a cloud data warehouse for faster, more scalable, and more agile analytics with greater ease.

More Robust Performance

For customers using Talend Cloud services in the Asia Pacific region, this new cloud data infrastructure will lead to faster extract, transform, and load times regardless of data volume. Additionally, it will boost performance for customers using AWS services such as Amazon EMR, Amazon Redshift, Amazon Aurora, and Amazon DynamoDB.

Increased Data Security with Proximity

Maintaining data within the local region means the data does not have to make a long trip outside the immediate area, which can reduce the risk of security breaches for data at rest, in transit, and in use, and ease companies' worries about security measures.

Reduced Compliance and Operational Risks

Because the new data infrastructure offers an instance of Talend Cloud that is deployed independently from the US or the EU, companies can maintain higher standards regarding data stewardship, data privacy, and operational best practices.

Customers in Japan are more likely to remain compliant with Japan's stringent data privacy and security standards. And if industry or government regulations change, Talend Cloud customers will still have the flexibility and agility to keep up.

If you are a Talend customer, you will soon have the opportunity to migrate your site to the new APAC data center. Log in or contact your account manager for more information.

Not a current Talend Cloud customer? Test drive Talend Cloud for 30 days free of charge, or learn how Talend Cloud can help you connect your data from 900+ data sources to deliver big data cloud analytics instantly.




The post New Talend APAC Cloud Data Infrastructure Now Available! appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Five questions to ask about data lakes

SnapLogic - Tue, 10/09/2018 - 18:15

Data is increasingly being recognized as the corporate currency of the digital age. Companies want to leverage data to achieve deeper insights leading to competitive advantage over their peers. According to IDC projections, total worldwide data will surge to 163 zettabytes (ZB) by 2025, an increase of 10x the amount of what exists today. The[...] Read the full article here.

The post Five questions to ask about data lakes appeared first on SnapLogic.

Categories: ETL

5 Questions to Ask When Building a Cloud Data Lake Strategy

Talend - Tue, 10/09/2018 - 15:41

In my last blog post, I shared some thoughts on the common pitfalls when building a data lake. As the move to the cloud gets more and more common, I'd like to further discuss some of the best practices when building a cloud data lake strategy. When going beyond the scope of integration tools or platforms for your cloud data lake, here are 5 questions to ask that can serve as a checklist:

1. Does your Cloud Data Lake strategy include a Cloud Data Warehouse?

As many differences as there are between the two, people often compare the two technology approaches: data warehouses centralize structured data, while data lakes are often billed as the holy grail for all types of data. (You can read more about the two approaches here.)

Don't confuse the two, though; these technology approaches should actually be brought together. You need a data lake to accommodate all the types of data your business deals with today, be it structured, semi-structured, or unstructured, on-premises or in the cloud, or newer types of data such as IoT data. The data lake often has a landing zone and a staging zone for raw data; data at this stage is not yet consumable, but you may want to keep it for future discovery or data science projects. A cloud data warehouse, on the other hand, comes into the picture after data is cleansed, mapped, and transformed, so that it is more consumable for business analysts to access and use for reporting or other analytical purposes. Data at this stage is often highly processed to fit the data warehouse.

If your approach works only with a cloud data warehouse, you are often losing the raw data and some formats altogether, which undermines prescriptive or advanced analytics projects and machine learning and AI initiatives, since some of the meaning within the data is already lost. Conversely, if you don't have a data warehouse alongside your data lake strategy, you will end up with a data swamp where all data is kept with no structure and is not consumable by analysts.

From the integration perspective, make sure your integration tool works with both data lake and data warehouse technologies, which leads us to the next question.

Download Data Lakes: Purposes, Practices, Patterns, and Platforms now.

2. Does your integration tool have ETL & ELT?

As much as you may know about ETL in your current on-premises data warehouse, moving it to the cloud is a different story, not to mention in a cloud data lake context. Where and how data is processed really depends on what you need for your business.

Similar to what we described in the first question, sometimes you need to keep more of the raw nature of the data, and other times you need more processing. This requires your integration tool to support both ETL and ELT, where the data transformation can be handled either before the data is loaded to your final target, e.g., a cloud data warehouse, or after the data lands there. ELT is more often leveraged when the speed of data ingestion is key to your project, or when you want to retain more detail about your data. Typically, cloud data lakes have a raw data store, then a refined (or transformed) data store. Data scientists, for example, prefer to access the raw data, whereas business users prefer the normalized data for business intelligence.

Another use of ELT leverages the massively parallel processing capabilities of big data technologies such as Spark and Flink. If your use case requires that kind of processing power, ELT is the better choice because the processing scales further.
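As a rough illustration of the distinction (not Talend-specific; Python's bundled SQLite stands in for a cloud warehouse, and the sample rows are invented), the difference boils down to where the transformation runs:

```python
import sqlite3

rows = [("alice", " NY "), ("bob", "ca"), ("alice", " NY ")]

def etl(rows):
    # ETL: transform (cleanse + dedupe) in the pipeline, before loading the target
    cleaned = {(name.title(), state.strip().upper()) for name, state in rows}
    return sorted(cleaned)

def elt(rows):
    # ELT: load the raw rows first, then transform inside the target engine via SQL
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw (name TEXT, state TEXT)")
    db.executemany("INSERT INTO raw VALUES (?, ?)", rows)
    cur = db.execute(
        "SELECT DISTINCT upper(substr(name,1,1)) || substr(name,2), upper(trim(state)) "
        "FROM raw ORDER BY 1"
    )
    return cur.fetchall()

print(etl(rows))  # [('Alice', 'NY'), ('Bob', 'CA')]
print(elt(rows))  # same result, but the work was pushed down to the engine
```

Both paths produce the same cleansed output; the ELT path keeps the raw table around for later discovery, which is exactly the trade-off discussed above.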

3. Can your cloud data lake handle both simple ETL tasks and complex big data ones?

This may look like an obvious question, but when considering it, put yourself in the users' shoes and really think through whether your tool of choice can meet both requirements.

Not all of your data lake usage will be complex work that requires advanced processing and transformation; much of it can be simple activities such as ingesting new data into the data lake. Often, these tasks extend beyond the data engineering or IT team as well. So ideally, the tool of your choice should handle simple tasks quickly and easily, but also scale to the complexity of advanced use cases. Building a data lake strategy that copes with both helps make your data lake more consumable and practical for various types of users and purposes.

4. How about batch and streaming needs?

You may think your current architecture and technology stack are good enough, and that your business is not in the Netflix business where streaming is a necessity. Well, think again.

Streaming data has become a part of our everyday lives whether you realize it or not. The "me" culture has put everything at the moment of now. If your business is on social media, you are in streaming. If IoT and sensors are the next growth market for your business, you are in streaming. If you have a website for customer interaction, you are in streaming. In IDC's 2018 Data Integration and Integrity End User Survey, 93% of the respondents indicated plans to use streaming technology by 2020. Real-time and streaming analytics have become a must for modern businesses to create a competitive edge. So this naturally raises the questions: can your data lake handle both your batch and streaming needs? Do you have the technology and people to work with streaming, which is fundamentally different from typical batch processing?

Streaming data is particularly challenging to handle because it is continuously generated by an array of sources and devices as well as being delivered in a wide variety of formats.

One prime example of just how complicated streaming data can be comes from the Internet of Things (IoT). With IoT devices, the data is always on; there is no start and no stop, it just keeps flowing. A typical batch processing approach doesn’t work with IoT data because of the continuous stream and the variety of data types it encompasses.
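A tumbling-window sketch in plain Python (a toy stand-in for a real streaming engine; the sample sensor events and the 60-second window are assumptions for illustration) shows why batch thinking breaks down with an always-on feed: results must be emitted as each window closes, not when the input "ends", because it never does:

```python
from collections import defaultdict

def sensor_stream():
    # Stand-in for an unbounded IoT feed of (epoch_second, device_id, reading)
    for event in [(0, "t1", 21.0), (1, "t2", 22.5), (61, "t1", 21.4), (62, "t1", 21.6)]:
        yield event

def windowed_averages(stream, window_seconds=60):
    """Emit per-device averages each time a tumbling window closes."""
    window_start, sums, counts = 0, defaultdict(float), defaultdict(int)
    for ts, device, value in stream:
        if ts >= window_start + window_seconds:
            # Window closed: flush its results before touching the new event
            yield window_start, {d: sums[d] / counts[d] for d in sums}
            window_start = ts - ts % window_seconds
            sums, counts = defaultdict(float), defaultdict(int)
        sums[device] += value
        counts[device] += 1

for start, averages in windowed_averages(sensor_stream()):
    print(start, averages)  # 0 {'t1': 21.0, 't2': 22.5}
```

(The still-open final window is deliberately left unflushed here; production engines handle that with watermarks and triggers.)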

So make sure your data lake strategy and data integration layer can be agile enough to work with both use cases.

You can find more tips on streaming data here.

5. Can your data lake strategy help cultivate a collaborative culture?

Last but not least, collaboration.

It may take one person to implement the technology, but it will take a whole village to implement it successfully. The only way to make sure your data lake is a success is to have people use it, improving the workflow one way or another.

In a smaller scope, the workflows in your data lake should be reusable among data engineers: less re-creation is needed, and operationalization can be much faster. In a bigger scope, the data lake approach can help improve collaboration between IT and business teams. For example, your business teams are the experts on their data; they know its meaning and context better than anyone else. Data quality can improve markedly if the business team works on the data for business-rule transformations while IT still governs that activity. Drawing such a line with governance in place is delicate work and no easy task. But think through whether your data lake approach is governed yet open at the same time, encouraging not only final consumption of the data but also improvement of data quality in the process, so that data can be recycled and made available to the broader organization.

To summarize, those are the 5 questions I would recommend asking when building a cloud data lake strategy. By no means are these the only questions you should consider, but hopefully they spark some thinking outside of your typical technical checklist.

The post 5 Questions to Ask When Building a Cloud Data Lake Strategy appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

How to Implement a Job Metadata Framework using Talend

Talend - Tue, 10/09/2018 - 14:39

Today, data integration projects are not just about moving data from point A to point B; there is much more to them. Ever-growing data volumes and the speed at which data changes present many challenges in managing the end-to-end data integration process. To address these challenges, it is paramount to track the data's journey from source to target in terms of start and end timestamps, job status, business area, subject area, and the individuals responsible for a specific job. In other words, metadata is becoming a major player in data workflows. In this blog, I want to review how to implement a job metadata framework using Talend. Let's get started!

Metadata Framework: What You Need to Know

Centralized management and monitoring of this job metadata are crucial to data management teams. An efficient and flexible job metadata framework architecture requires a number of things, namely a metadata-driven model and job metadata.

A typical Talend Data Integration job performs the following tasks for extracting the data from source systems and loading them into target systems.

  1. Extracting data from source systems
  2. Transforming the data involves:
    • Cleansing source attributes
    • Applying business rules
    • Data Quality
    • Filtering, Sorting, and Deduplication
    • Data aggregations
  3. Loading the data into target systems
  4. Monitoring, Logging, and Tracking the ETL process

Figure 1: ETL process
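The four steps above can be sketched in miniature (plain Python, with in-memory stand-ins for the source and target systems; the sample rows and function names are invented for illustration):

```python
def extract(source):
    # 1. Extract from the source system (a list stands in for a database query)
    return list(source)

def transform(rows):
    # 2. Cleanse, filter out bad records, deduplicate, and aggregate
    cleaned = [(name.strip().lower(), amount) for name, amount in rows if amount is not None]
    totals = {}
    for name, amount in sorted(set(cleaned)):  # dedupe, then aggregate per name
        totals[name] = totals.get(name, 0) + amount
    return totals

def load(target, totals):
    # 3. Load into the target system (a dict stands in for a table)
    target.update(totals)

def run_job(source, target):
    # 4. Monitor/log around the core steps (a real job would write job metadata here)
    print("job started")
    load(target, transform(extract(source)))
    print("job finished, rows loaded:", len(target))

target = {}
run_job([(" Ann ", 10), ("ann", 10), ("Bob", None), ("bob", 5)], target)
print(target)  # {'ann': 10, 'bob': 5}
```

Step 4 is where the metadata framework described below plugs in: the start/end prints would become inserts into the job run tables.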

Over the past few years, job metadata has evolved to become an essential component of any data integration project. What happens when you don't have job metadata in your data integration jobs? It can lead to incorrect ETL statistics and logging, as well as hard-to-handle errors during the data integration process. A successful Talend Data Integration project depends on how well the job metadata framework is integrated with the enterprise data management process.

Job Metadata Framework

The job metadata framework is a metadata-driven model that integrates well with the Talend product suite. Talend provides a set of components for capturing statistics and logging information during the execution of the data integration process.

Remember, the primary objective of this blog is to provide an efficient way to manage the ETL operations with a customizable framework. The framework includes the Job management data model and the Talend components that support the framework.

Figure 2: Job metadata model

Primarily, the Job Metadata Framework model includes:

  • Job Master
  • Job Run Details
  • Job Run Log
  • File Tracker
  • Database High Water Mark Tracker for extracting the incremental changes

This framework is designed to allow production support to monitor job cycle refreshes and look for issues relating to job failures and any discrepancies while processing data loads. Let's go through each piece of the framework step by step.

Talend Jobs

Talend_Jobs is a Job Master Repository table that manages the inventory of all the jobs in the Data Integration domain.




Its fields are:

  • Unique identifier for a specific job
  • Job name, following the naming convention (<type>_<subject area>_<table_name>_<target_destination>)
  • Business unit / department or application area
  • Job author information
  • Additional information related to the job
  • The last updated date
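As a sketch of how such a job master table might be materialized (the column names below are hypothetical, inferred from the field descriptions; the post does not give the actual schema), using SQLite via Python:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
CREATE TABLE Talend_Jobs (
    job_id        INTEGER PRIMARY KEY,  -- unique identifier for a specific job
    job_name      TEXT NOT NULL,        -- <type>_<subject area>_<table_name>_<target_destination>
    business_unit TEXT,                 -- business unit / department or application area
    job_author    TEXT,                 -- job author information
    job_comments  TEXT,                 -- additional information related to the job
    last_updated  TEXT                  -- the last updated date
)
""")
db.execute(
    "INSERT INTO Talend_Jobs VALUES "
    "(1, 'dim_sales_customer_dw', 'Sales', 'A. Gupta', NULL, '2018-10-09')"
)
print(db.execute("SELECT job_name FROM Talend_Jobs").fetchone()[0])
```

The run-details and run-log tables below would reference `job_id` as a foreign key so every execution can be traced back to its master record.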

Talend Job Run Details

Talend_Job_Run_Details registers every run of a job and its sub jobs with statistics and run details such as job status, start time, end time, and total duration of main job and sub jobs.




Its fields are:

  • Unique identifier for a specific job run
  • Business unit / department or application area
  • Job author information
  • Unique identifier for a specific job
  • Job name, following the naming convention (<type>_<subject area>_<table_name>_<target_destination>)
  • Unique identifier for a specific sub job
  • Sub job name, following the naming convention (<type>_<subject area>_<table_name>_<target_destination>)
  • Main job start timestamp
  • Main job end timestamp
  • Main job total execution duration
  • Sub job start timestamp
  • Sub job end timestamp
  • Sub job total execution duration
  • Sub job status (Pending / Complete)
  • Main job status (Pending / Complete)
  • The last updated date

Talend Job Run Log

Talend_Job_Run_Log logs all the errors that occur during a particular job execution. It extracts the details from the Talend components specifically designed for catching logs (tLogCatcher) and statistics (tStatCatcher).

Figure 3: Error logging and Statistics

The tLogCatcher component in Talend operates as a log function triggered during the process by Java exceptions, tDie, or tWarn. In order to catch exceptions coming from the job, the tCatch function needs to be enabled on all the components.

The tStatCatcher component gathers the job processing metadata at the job level.




Its fields are:

  • Unique identifier for a specific job run
  • Unique identifier for a specific job
  • The time when the message is caught
  • The process ID of the job
  • The parent process ID
  • The root process ID
  • The system process ID
  • The name of the project
  • The name of the job
  • The ID of the job file stored in the repository
  • The version of the current job
  • The name of the current context
  • The priority sequence
  • The name of the component, if any
  • Begin or end
  • The error message generated by the component when an error occurs (an After variable; it functions only if the Die on error checkbox is cleared)
  • Time for the execution of a job or a component with the tStatCatcher Statistics checkbox selected
  • Record counts
  • Job references
  • Log thresholds for managing error-handling workflows

Talend High Water Marker Tracker

Talend_HWM_Tracker helps process delta and incremental changes for a particular table. The high-water-mark tracker is helpful when Change Data Capture is not enabled and changes are extracted based on specific conditions such as "last_updated_date_time" or "revision_date_time". In some cases, the high water mark relates to the highest sequence number, when records are processed based on a sequence number.




Its fields are:

  • Unique identifier for a specific source table
  • Unique identifier for a specific job
  • The name of the job
  • The name of the source table
  • The source table environment
  • The source table database type
  • High water field (datetime)
  • High water field (number)
  • High water SQL statement
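A minimal Python sketch of high-water-mark extraction (the `orders` table and its `last_updated_date_time` column are illustrative, borrowing the condition named above; a real job would read and persist the mark in the tracker table):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, last_updated_date_time TEXT)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [(1, "2018-10-01"), (2, "2018-10-05"), (3, "2018-10-09")])

def extract_incremental(db, high_water_mark):
    """Pull only rows changed since the stored high-water mark, then advance it."""
    rows = db.execute(
        "SELECT id, last_updated_date_time FROM orders "
        "WHERE last_updated_date_time > ? ORDER BY last_updated_date_time",
        (high_water_mark,),
    ).fetchall()
    # Advance the mark to the newest change seen; unchanged if nothing new arrived
    new_mark = rows[-1][1] if rows else high_water_mark
    return rows, new_mark

rows, mark = extract_incremental(db, "2018-10-01")
print(rows, mark)  # rows 2 and 3 only; the mark advances to 2018-10-09
```

On the next run the stored mark of `2018-10-09` would make the same query return nothing, which is the whole point: only the delta moves.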

Talend File Tracker

Talend_File_Tracker registers all transactions related to file processing. The transaction details include the source file location, destination location, file name pattern, file name suffix, and the name of the last file processed.




Its fields are:

  • Unique identifier for a specific source file
  • Unique identifier for a specific job
  • The name of the job
  • The file server environment
  • The file name pattern
  • The source file location
  • The target file location
  • The file suffix
  • The name of the last file processed for a specific file
  • The override flag to re-process a file with the same name
  • The last updated date
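A hedged sketch of the last-file-processed logic (the function and file names are invented for illustration; a real tracker would read the suffix, last file, and override flag from the table above):

```python
def files_to_process(available, pattern_suffix, last_file_processed, override=False):
    """Return files matching the pattern that are newer than the last one processed."""
    candidates = sorted(f for f in available if f.endswith(pattern_suffix))
    if override:
        # Override flag set: re-process everything, including the last file
        return candidates
    # Lexicographic comparison works when filenames embed a sortable date
    return [f for f in candidates if f > last_file_processed]

available = ["sales_20181007.csv", "sales_20181008.csv", "sales_20181009.csv", "notes.txt"]
print(files_to_process(available, ".csv", "sales_20181007.csv"))
# ['sales_20181008.csv', 'sales_20181009.csv']
```

After a successful run, the job would write the newest processed name back to the tracker so the next cycle starts from there.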


This brings us to the end of implementing a job metadata framework using Talend. The key takeaways from this blog:

  1. The need for, and importance of, a job metadata framework
  2. The data model to support the framework
  3. The customizable data model to support different types of job patterns

As always – let me know if you have any questions below and happy connecting!

The post How to Implement a Job Metadata Framework using Talend appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Cloudera 2.0: Cloudera and Hortonworks Merge to form a Big Data Super Power

Talend - Thu, 10/04/2018 - 18:39

We’ve all dreamed of going to bed one day and waking up the next with superpowers – stronger, faster, perhaps even able to fly. Yesterday, that is exactly what happened to Tom Reilly and the people at Cloudera and Hortonworks. On October 2nd, they went to bed as two rivals vying for leadership in the big data space. In the morning, they woke up as Cloudera 2.0, a $700M firm with a clear leadership position. “From the edge to AI”…to infinity and beyond! The acquisition has made them bigger, stronger and faster.

Like any good movie, however, the drama is just getting started, innovation in the cloud, big data, IoT and machine learning is simply exploding, transforming our world over and over, faster and faster.  And of course, there are strong villains, new emerging threats and a host of frenemies to navigate.

What’s in Store for Cloudera and Hortonworks 2.0

Overall, this is great news for customers, the Hadoop ecosystem, and the future of the market. Both companies’ customers can now sleep at night knowing that the pace of innovation from Cloudera 2.0 will continue and accelerate. Combining the Cloudera and Hortonworks technologies means that instead of having to pick one stack or the other, customers can now have the best of both worlds. The statement from their press release, “From the Edge to AI,” really sums up how Hortonworks’ investments in IoT complement Cloudera’s investments in machine learning. From an ecosystem and innovation perspective, we’ll see fewer competing Apache projects with much stronger investment behind each. This can only mean better experiences for any user of big data open source technologies.

At the same time, it’s no secret how much our world is changing with innovation coming in so many shapes and sizes.  This is the world that Cloudera 2.0 must navigate.  Today, winning in the cloud is quite simply a matter of survival.  That is just as true for the new Cloudera as it is for every single company in every industry in the world.  The difference is that Cloudera will be competing with a wide range of cloud-native companies both big and small that are experiencing explosive growth.  Carving out their place in this emerging world will be critical.

The company has so many of the right pieces, including connectivity, computing, and machine learning. Their challenge will be making all of it simple to adopt in the cloud while continuing to generate business outcomes. Today we are seeing strong growth from cloud data warehouses like Amazon Redshift, Snowflake, Azure SQL Data Warehouse, and Google BigQuery. Apache Spark and service players like Databricks and Qubole are also seeing strong growth. Cloudera now has decisions to make on how it approaches this ecosystem: who it chooses to compete with and who it chooses to complement.

What’s In Store for the Cloud Players

For cloud platforms like AWS, Azure, and Google, this merger is also a win. The better the cloud services that run on their platforms, the more benefits joint customers will get and the more they will grow their usage of those platforms. There is obviously a question of who will win (for example, EMR, Databricks, or Cloudera 2.0), but at the end of the day the major cloud players win either way as more and more data, and more and more insight, runs through the cloud.

Talend’s Take

From a Talend perspective, this recent move is great news.  At Talend, we are helping our customers modernize their data stacks.  Talend helps stitch together data, computing platforms, databases, machine learning services to shorten the time to insight. 

Ultimately, we are excited to partner with Cloudera to help customers around the world leverage this new union.  For our customers, this partnership means a greater level of alignment for product roadmaps and more tightly integrated products. Also, as the rate of innovation accelerates from Cloudera, our support for what we call “dynamic distributions” means that customers will be able to instantly adopt that innovation even without upgrading Talend.  For Talend, this type of acquisition also reinforces the value of having portable data integration pipelines that can be built for one technology stack and can then quickly move to other stacks.  For Talend and Cloudera 2.0 customers, this means that as they move to the future, unified Cloudera platform, it will be seamless for them to adopt the latest technology regardless of whether they were originally Cloudera or Hortonworks customers. 

You have to hand it to Tom Reilly and the teams at both Cloudera and Hortonworks.  They’ve given themselves a much stronger position to compete in the market at a time when people saw their positions in the market eroding.  It’s going to be really interesting to see what they do with the projected $125 million in annualized cost savings.  They will have a lot of dry powder to invest in or acquire innovation.  They are going to have a breadth in offerings, expertise and customer base that will allow them to do things that no one else in the market can do. 

The post Cloudera 2.0: Cloudera and Hortonworks Merge to form a Big Data Super Power appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Tips for enhancing your data lake strategy

SnapLogic - Thu, 10/04/2018 - 18:10

As organizations grapple with how to effectively manage ever voluminous and varied reservoirs of big data, data lakes are increasingly viewed as a smart approach. However, while the model can deliver the flexibility and scalability lacking in traditional enterprise data management architectures, data lakes also introduce a fresh set of integration and governance challenges that[...] Read the full article here.

The post Tips for enhancing your data lake strategy appeared first on SnapLogic.

Categories: ETL

Why Cloud-native is more than software just running on someone else’s computer

Talend - Thu, 10/04/2018 - 10:17

The cloud is not “just someone else’s computer”, even though that meme has spread fast across the internet. The cloud consists of extremely scalable data centers with highly optimized and automated processes. This makes a huge difference at the level of application software.

So what is “cloud-native” really?

“Cloud-native” is more than just a marketing slogan. And a “cloud-native application” is not simply a conventionally developed application which is running on “someone else’s computer”. It is designed especially for the cloud, for scalable data centers with automated processes.

Software that is really born in the cloud (i.e. cloud-native) automatically leads to a change in thinking and a paradigm shift on many levels. From the outset, cloud-native developed applications are designed with scalability in mind and are optimized with regard to maintainability and agility.

They are based on the “continuous delivery” approach and thus lead to continuously improving applications. The time from development to deployment is reduced considerably and often only takes a few hours or even minutes. This can only be achieved with test-driven developments and highly automated processes.

Rather than some sort of monolithic structure, applications are usually designed as a loosely connected system of comparatively simple components such as microservices. Agile methods are practically always deployed, and the DevOps approach is more or less essential. This, in turn, means that the demands made on developers increase, specifically requiring them to have well-founded “operations” knowledge.

Download The Cloud Data Integration Primer now.
Download Now

Cloud-native = IT agility

With a “cloud-native” approach, organizations expect more agility, and especially more flexibility and speed. Applications can be delivered faster and continuously at high levels of quality; they are also better aligned to real needs, and their time to market shrinks accordingly. In these times of “software is eating the world”, where software is an essential factor of survival for almost all organizations, the significance of these advantages should not be underestimated.

In this context: the cloud certainly is not “just someone else’s computer”. And the “Talend Cloud” is more than just an installation from Talend that runs in the cloud. The Talend Cloud is cloud-native.

In order to achieve the highest levels of agility, in the end, it is just not possible to avoid changing over to the cloud. Potentially there could be a complete change in thinking in the direction of “serverless”, with the prospect of optimizing cost efficiency as well as agility.  As in all things enterprise technology, time will tell. But to be sure, cloud-native is an enabler on the rise.

About the author Dr. Gero Presser

Dr. Gero Presser is a co-founder and managing partner of Quinscape GmbH in Dortmund. Quinscape has positioned itself on the German market as a leading system integrator for the Talend, Jaspersoft/Spotfire, Kony and Intrexx platforms and, with their 100 members of staff, they take care of renowned customers including SMEs, large corporations and the public sector. 

Gero Presser did his doctorate in decision-making theory in the field of artificial intelligence and at Quinscape he is responsible for setting up the business field of Business Intelligence with a focus on analytics and integration.

The post Why Cloud-native is more than software just running on someone else’s computer appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Moving Big Data to the cloud: A big problem?

SnapLogic - Tue, 10/02/2018 - 13:49

Originally published on Data Centre Review. Digital transformation is overhauling the IT approach of many organizations and data is at the center of it all. As a result, organizations are going through a significant shift in where and how they manage, store and process this data. To manage big data in the not so distant[...] Read the full article here.

The post Moving Big Data to the cloud: A big problem? appeared first on SnapLogic.

Categories: ETL

California Leads the US in Online Privacy Rules

Talend - Tue, 10/02/2018 - 12:12

With California often looked to as the state of innovation, the newly enacted California Consumer Privacy Act (CCPA) came as no surprise. This new online privacy law gives consumers the right to know what information companies are collecting about them, why they are collecting that data, and who they are sharing it with.

Some specific industries such as Banking or Health Sciences had already considered this type of compliance at the core of their digital transformation. But as the CCPA applies to potentially any company, no matter its size or industry, anyone serious about personalizing interactions with their visitors, prospects, customers, and employees needs to pay attention.

Similarities to GDPR

Although there are indeed some differences between the GDPR and the CCPA, in terms of the data management and governance frameworks that need to be established, the two are similar. These similarities include:

  • You need to know where your personal data is across your different systems, which means that you need to run a data mapping exercise
  • You need to create a 360° view of your personal data and manage consent at a fine grain, although the CCPA looks more permissive on consent than the GDPR
  • You need to publish a privacy notice where you tell the regulatory authorities, customers and other stakeholders what you are doing with the personal information in your databases. You need to anonymize data (i.e. through data masking) in any other systems that include personal data but that you want to scope out of your compliance effort and privacy notice.
  • You need to foster accountability so that the people in the company who participate in the data processing effort are engaged for compliance
  • You need to know where your data is, including when it is shared or processed through third parties such as business partners or cloud providers. You need to control cross-border data transfers and potential breaches, while communicating transparently in case of a breach.
  • You need to enact data subject access rights, such as the rights to data access, rectification, deletion, and portability. The CCPA allows a little more time to respond to a request: 45 days versus one month under the GDPR.
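As an illustration of the last point, a data subject access request handler has to gather the records held on a person across every system identified in the data mapping exercise, and track the response deadline. The sketch below is a hypothetical minimal example in Python; the system names, lookup functions, and fields are invented for illustration and are not part of any specific product:

```python
from datetime import date, timedelta

# Hypothetical per-system lookups; in practice these would be connectors
# into the CRM, billing, support, and other systems found during data mapping.
SYSTEMS = {
    "crm": lambda email: [{"email": email, "name": "Jane Doe"}],
    "billing": lambda email: [{"email": email, "plan": "pro"}],
}

def data_access_request(email: str, received: date, deadline_days: int = 45):
    """Gather every record held on a data subject, plus the response deadline.

    The CCPA allows 45 days to respond; the GDPR allows one month.
    """
    records = {name: lookup(email) for name, lookup in SYSTEMS.items()}
    return {"due_by": received + timedelta(days=deadline_days), "records": records}

report = data_access_request("jane.doe@example.com", date(2018, 10, 1))
```

The same structure extends naturally to rectification and deletion requests: iterate over the mapped systems and apply the requested operation to each.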

Download Data Governance & Sovereignty: 16 Practical Steps towards Global Data Privacy Compliance now.
Download Now

Key Takeaways from the CCPA

The most important takeaway is that data privacy regulations are burgeoning for companies all over the world, and the stakes are getting higher and higher, from steep fines to reputation risks that can negatively affect the benefits of digital transformation.

While this law in its current state is specific to California, a ripple effect at the federal level might not be far off. So instead of seeing such regulations as a burden, they should be taken as an opportunity. In fact, one side effect of all these regulations, at a time when data scandals negatively impact millions of consumers, is that data privacy now makes the headlines. Consumers now understand how valuable their data can be, and how damaging losing control over personal data could be.

The lesson learned is that, although regulatory compliance is often what triggers a data privacy project, it shouldn’t be the only driver. The goal is rather to establish a system of trust with your customers for their personal data. In a recent benchmark, where we exercised our right of data access against more than 100 companies, we found that most companies are far from mature in achieving that goal. But it also showed that the best in class are setting the standard for turning it into a memorable experience.

The post California Leads the US in Online Privacy Rules appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Making the Bet on Open Source

Talend - Fri, 09/28/2018 - 17:20

Today, Docker and Kubernetes are obvious choices. But back in 2015, these technologies were just emerging and hoping for massive adoption. How do tech companies make the right open source technology choices early?

As a CTO today, if you received an email from your head of Engineering asking, “Can we say that Docker is Enterprise production-ready now?”, your answer would undoubtedly be “yes”. If you hadn’t started leveraging Docker already, you would be eager to move to the technology that Amazon and so many other leading companies now use as the basis of their applications’ architectures. However, what would your reaction have been if you had received that email four years ago, when Docker was still far from stable and lacked integration, support, and tooling with the major operating systems and Enterprise platforms, on-premises or cloud? Well, that is the situation we at Talend were facing in 2015.

By sharing our approach and our learnings from choosing to develop with Docker and Kubernetes, I hope that we can help other CTOs and tech companies’ leaders with their decisions to go all-in with today’s emerging technologies.

Increasing Docker use from demos to enterprise-ready products

Back in 2014, as we were architecting our next generation Cloud Integration platform, micro-services and containerization were two trends that we closely monitored.

Talend, which is dedicated to monitoring emerging projects and technologies, identified Docker as a very promising containerization technology that we could use to run our micro-services. That same year, one of our pre-sales engineers heard about Docker at a Big Data training and learned about its potential to accelerate the delivery of product demos to prospects, as a more efficient alternative to VMWare or VirtualBox images.

From that day, Docker usage across Talend has seen explosive growth, from the pre-sales use case of packaging demos, to providing reproduction environments for tech support and quality engineering, and of course its main usage around service and application containerization for R&D and Talend Cloud.

During our evaluation, we did consider some basic things like we would with any up-and-coming open source technology. First, we needed to determine the state of the security features offered by Docker. Luckily, we found that we didn’t need to build anything on top of what Docker already provided which was a huge plus for us.

Second, like many emerging open source technologies, Docker was not as mature as it is today, so it was still “buggy.” Containers would sometimes fail without any clear explanation, which would mean that we would have to invest time to read through the logs to understand what went wrong—a reality that anyone who has worked with a new technology understands well. Additionally, we had to see how this emerging technology would fit with our existing work and product portfolio, and determine whether they would integrate well. In our case, we had to check how Docker would work with our Java-based applications & services and evaluate if the difficulties that we ran into there would be considered a blocker for future development.

Despite our initial challenges, we found Docker to be extremely valuable and promising, as it greatly improved our development life cycle by facilitating the rapid exchange and reuse of pieces of work between different teams. In the first year of evaluation, Docker quickly became the primary technology used by QA to rapidly set up testing environments at a fraction of the cost, and with better performance, compared to the more traditional virtual environments (VMWare or VirtualBox).

After we successfully used Docker in our R&D processes, we knew we had made the right choice and that it was time to take it to the next level and package our own services for the benefit of our customers. With the support of containers and more specifically Docker by major cloud players such as AWS, Azure or Google, we had the market validation that we needed to completely “dockerize” our entire cloud-based platform, Talend Cloud.

While the choice to containerize our software with Docker was relatively straightforward, the choice to use Kubernetes to orchestrate those containers was not so evident at the start.

Talend’s Road to Success with Kubernetes

In 2015, Talend started to consider technologies that might orchestrate containers which were starting to make up our underlying architectures, but the technology of choice wasn’t clear. At this point, we faced a situation that every company has experienced: deciding what technology to work with and determining how to decide what technology would be the best fit.

At Talend, portability and agility are key concepts, and while Docker was clearly multiplatform, each of the cloud platform vendors had their own flavor of the orchestration layer.

We had to bet on an orchestration layer that would become the de facto standard or be compatible with major cloud players. Would it be Kubernetes, Apache Mesos or Docker Swarm?

Initially, we were evaluating both Mesos and Kubernetes. Although Mesos was more stable than Kubernetes at the time and its offering was consistent with Talend’s Big Data roadmap, we were drawn to the comprehensiveness of the Kubernetes applications. The fact that Google was behind Kubernetes gave us some reassurance around its scalability promises.

Among the container orchestration offerings we were evaluating, Mesos at the time required that we bundle several other applications for it to have the functionality we needed. On the other hand, Kubernetes had everything we needed already bundled together. We also thought about our customers: we wanted to make sure we chose the solution that would be the easiest for them to configure and maintain. Last—but certainly not least—we looked at the activity of the Kubernetes community. We found it promising that many large companies were not only contributing to the project but were also creating standards for it as well. The comprehensive nature of Kubernetes and the vibrancy of its community led us to switch gears and go all-in with Kubernetes.

As with any emerging innovative technology, there are constant updates and project releases with Kubernetes, which results in several iterations of updates in our own applications. However, this was a very small concession to make to use such a promising technology.

Similar to our experience with Docker, I tend to believe that we made the right choice with Kubernetes. Its market adoption (AWS EKS, Azure AKS, OpenShift Kubernetes) proved us right. The technology has now been incorporated into several of our products, including one of our recent offerings, Data Streams.

Watching the team go from exploring a new technology to actually implementing it was a great learning experience that was both exciting and very rewarding.

Our Biggest Lessons in Working with Emerging Technologies

Because we have been working with and contributing to the open source community since we released our open source offering Talend Open Studio for Data Integration in 2006, we are no strangers to working with innovative, but broadly untested technologies or sometimes uncharted territories. However, this experience with Docker and Kubernetes has emphasized some of the key lessons we have learned over the years working with emerging technologies:

  • Keep your focus: During this process, we learned that working with a new, promising technology requires that you keep your end goals in mind at all times. Because the technologies we worked with are in constant flux, it could be easy to get distracted by any new features added to the open source projects. It is incredibly important to make sure that the purpose of working with a particular emerging technology remains clear so that development won’t be derailed by new features that could be irrelevant to the end goal.
  • Look hard at the community: It is incredibly important to look to the community of the project you choose to work with. Be sure to look at the roadmap and the vision of the project to make sure it aligns with your development (or product) vision. Also, pay attention to the way the community is run—you should be confident that it is run in a way that will allow the project to flourish.
  • Invest the time to go deep into the technology: Betting on innovation is hard and does not work overnight. Even if it is buggy, dive into the technology because it can be worth it in the end. From personal experience, I know it can be a lot of work to debug but be sure to keep in mind that the technology’s capabilities—and its community—will grow, allowing your products (and your company) to leverage the innovations that would be very time consuming, expensive and difficult to build on your own.

Since we first implemented Docker and Kubernetes, we have made a new bet on open source: Apache Beam. Will it be the next big thing like Docker and Kubernetes? There’s no way to know at this point—but when you choose to lead with innovation, you can never take the risk-free, well-travelled path. Striving for innovation is a never-ending race, but I wouldn’t have it any other way.

The post Making the Bet on Open Source appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

How to load your Salesforce data into NetSuite

SnapLogic - Thu, 09/27/2018 - 13:06

Connecting customer relationship management (CRM) software with enterprise resource planning (ERP) technology is a fairly common integration requirement for organizations looking to complete a series of business goals from sales forecasting, revenue accounting by product or portfolio, to identifying highest revenue by industry or geography. To achieve these goals, integrators need to synchronize data across[...] Read the full article here.

The post How to load your Salesforce data into NetSuite appeared first on SnapLogic.

Categories: ETL

From Dust to Trust: How to Make Your Salesforce Data Better

Talend - Wed, 09/26/2018 - 17:15

Salesforce is like a goldmine. You own it but it’s up to you to extract gold out of it. Sound complicated? With Dreamforce in full swing, we are reminded that trusted data is the key to success for any organization.

According to a Salesforce survey, “68% of sales professionals say it is absolutely critical or very important to have a single view of the customer across departments/roles. Yet, only 17% of sales teams rate their single view of the customer capabilities as outstanding.”

As sales teams strive to become high-performing trusted advisors, they are still spending most of their time on non-selling activities. The harsh reality is that salespeople cannot wait to get clean, complete, accurate and consistent data into their systems. They often end up spending lots of time on their own correcting bad records and reuniting customer insights. To minimize their time spent on data and boost their sales numbers, they need your help to rely on a single customer view filled with trusted data.

Whether you’re working for a nonprofit that’s looking for more donors or at a company looking to get qualified leads, managing data quality in your prospects or donator CRM pipeline is crucial.

Watch Better Data Quality for All now.
Watch Now

Quick patches won’t solve your data quality problem in the long run

Salesforce was designed to digitally transform your business processes, but it was unfortunately not natively built to process and manage your data. As data explodes, getting trusted data is becoming more and more critical. As a result, lots of incubator apps have started emerging on the Salesforce Marketplace. You may be tempted to use them to patch your data with quick data quality operations.

But you may end up with separate features built by separate companies, with different levels of integration, stability, and performance. You also take the risk of the app not being supported over the long term, putting your data pipeline and operations at risk. This, in turn, will only make things worse by putting all the data quality work on your shoulders, when you could instead rely on your sales representatives to help resolve data. And you do not want to become the bottleneck of your organization.

After the fact Data Quality is not your best option

Some Business Intelligence solutions have started emerging that allow you to prepare your data at the analytical level. But this is often a one-shot option for a single need, and it does not fulfill the full need: you will still have bad data entering Salesforce. Salesforce data can be used in multiple scenarios by multiple people. Operating data quality directly within Salesforce Marketing, Service or Commerce Cloud is the best approach to deliver trusted data at its source, so that everybody can benefit from it.

The Rise of Modern Apps to boost engagement:

Fortunately, data quality has evolved to become a team activity rather than a single isolated job. You need to find ways and tools to engage your sales org in data resolution initiatives. Modern apps are key here to make it a success.

Data Stewardship to delegate errors resolution with business experts

Next-generation data stewardship tools such as Talend Data Stewardship give you the ability to reach the people who know the data best within the organization. In parallel, business experts will be comfortable editing and enriching data within a user-friendly tool that makes the job easier. Once you have captured tacit knowledge from end users, you can scale it to millions of records through the built-in machine learning capabilities of Talend Data Stewardship.

Data Preparation to discover and clean data directly with Salesforce

Self-service is the way to get data quality standards to scale. Data analysts spend 60% of their time cleaning data and getting it ready to use. Reduced time and effort mean more value and more insight extracted from data. Talend Data Preparation addresses this problem. It is a self-service application that allows potentially anyone to access a data set and then cleanse, standardize, transform, or enrich the data. With its ease of use, Data Preparation helps solve the organizational pain point of employees spending so much time crunching data in Excel, or expecting their colleagues to do it on their behalf.

Here are two use cases to learn from:

Use Case 1: Standardizing Contact Data and removing duplicates from Salesforce

Duplicates are the bane of CRM systems. When entering data into Salesforce, sales reps can be in a rush and create duplicates that linger. Let them pollute your CRM and they will undermine every user’s and sales rep’s confidence in your data.

Data Quality here has a real direct business impact on your sales productivity and your marketing campaigns too.

Bad data means unreachable customers and untargeted prospects that escape your customized campaigns, leading to lower conversion rates and lower revenue.

With Talend Data Prep, you can really be a game changer: Data Prep allows you to connect natively and directly to your Salesforce platform and perform some ad-hoc data quality operations.

  • By entering your SFDC credentials, you will get native access to the customer fields you want to clean
  • Once data is displayed in Data Prep, the quality bar and smart assistance allow you to quickly spot your duplicates
  • Click the header of any column containing duplicates in your dataset
  • Click the Table tab of the functions panel to display the list of functions that can be applied to the whole table
  • Point your mouse over the Remove duplicate rows function to preview its result and click to apply it
  • Once you perform this operation, your duplicates are removed
  • You can also save this as a recipe and apply it to other data sources
  • You also have options in Data Prep to certify your dataset so other team members know this data source can be trusted
  • Collaborate with IT to expand your jobs in Talend Studio to fully automate your data quality operations and proceed with advanced matching
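Outside of Data Prep, the same standardize-then-deduplicate logic can be sketched in a few lines of Python. This is a hedged illustration of the technique, not Talend’s implementation, and the contact fields are hypothetical:

```python
def standardize(contact: dict) -> dict:
    """Normalize the fields most likely to cause near-duplicates."""
    return {
        "email": contact["email"].strip().lower(),
        "name": " ".join(contact["name"].split()).title(),
    }

def remove_duplicates(contacts: list) -> list:
    """Keep the first occurrence of each standardized email address."""
    seen, unique = set(), []
    for contact in map(standardize, contacts):
        if contact["email"] not in seen:
            seen.add(contact["email"])
            unique.append(contact)
    return unique

contacts = [
    {"email": "Jane.Doe@Example.com ", "name": "jane doe"},
    {"email": "jane.doe@example.com", "name": "Jane Doe"},
    {"email": "john.smith@example.com", "name": "John  Smith"},
]
deduped = remove_duplicates(contacts)  # two unique contacts remain
```

Matching only on an exact (normalized) email is the simplest rule; production matching would add fuzzy comparisons on names and addresses, which is where advanced matching tools earn their keep.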

Use Case 2: Real-Time Data Masking in Salesforce

The GDPR defines pseudonymization as “the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information.” Pseudonymization or anonymization may therefore significantly reduce the risks associated with data processing, while also maintaining the data’s utility.

Using Talend Cloud, you can do this directly in Salesforce. Talend Data Preparation enables any business user to obfuscate data the easy way. After a native connection to the Salesforce dataset:

  • Click the header of any column containing data to be masked from your dataset
  • Click the Table tab of the functions panel to display the list of functions that can be applied
  • Point your mouse over the Obfuscation function and click to apply it
  • Once you perform this operation, data will be masked and anonymized
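Under the hood, a simple obfuscation function replaces all but the last few characters of a sensitive value, so records stay recognizable without exposing the full identifier. The minimal Python sketch below illustrates the idea; it is not Talend’s actual masking routine:

```python
def mask(value: str, keep_last: int = 4, mask_char: str = "*") -> str:
    """Obfuscate all but the last `keep_last` characters of a sensitive field.

    Values shorter than `keep_last` are fully masked so nothing leaks.
    """
    if len(value) <= keep_last:
        return mask_char * len(value)
    return mask_char * (len(value) - keep_last) + value[-keep_last:]

masked_card = mask("4111111111111111")  # card number, last 4 digits kept
masked_ssn = mask("078-05-1120")        # SSN-style value, same rule
```

Note that masking like this is one-way and destroys joins; when the masked value must still link records across systems, a keyed pseudonymization technique is the usual alternative.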

When more sensitive fields call for more sophisticated data masking techniques, data engineers will take the lead, operating pattern-based data masking directly in Talend Studio and applying it in Salesforce to personal fields such as Social Security numbers or credit cards. You can still easily spot the data to be masked in Data Prep, and ask data engineers to perform anonymization in Talend Studio in a second phase.


Without data quality tools and methodology, you will end up with unqualified, unsegmented or unprotected customer accounts, leading to lower revenue, lower marketing effectiveness and, more importantly, frustrated sales reps spending their time hunting for trusted client data. As strong as it may be, your Salesforce goldmine can easily turn to dust if you don’t put trust into your systems. Only platforms such as Talend Cloud, with powerful data quality solutions, can help you extract hidden gold from your Salesforce data and deliver it, trusted, to the whole organization.

Want to know more? Go to Talend Connect London on October 15th & 16th, or Talend Connect Paris on October 17th & 18th, to learn from real business cases such as Greenpeace and Petit Bateau.

Whatever your background, technical or not, there will be a session that meets your needs. We have plenty of use cases and data quality jobs that we’ll showcase in both the technical and customer tracks.

The post From Dust to Trust: How to Make Your Salesforce Data Better appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Building Agile Data Lakes with Robust Ingestion and Transformation Frameworks – Part 1

Talend - Wed, 09/26/2018 - 10:31

This post was authored by Venkat Sundaram from Talend and Ramu Kalvakuntla from Clarity Insights.

With the advent of Big Data technologies like Hadoop, there has been a major disruption in the information management industry. The excitement around it is not only about the three Vs – volume, velocity and variety – of data but also the ability to provide a single platform to serve all data needs across an organization. This single platform is called the Data Lake. The goal of a data lake initiative is to ingest data from all known systems within an enterprise and store it in this central platform to meet enterprise-wide analytical needs.

However, a few years back Gartner warned that a large percentage of data lake initiatives have failed or will fail, becoming more of a data swamp than a data lake. How do we prevent this? We have teamed up with one of our partners, Clarity Insights, to discuss the data challenges enterprises face, what causes data lakes to become swamps, the characteristics of a robust data ingestion framework, and how such a framework can help make the data lake more agile. We have partnered with Clarity Insights on multiple customer engagements to build these robust ingestion and transformation frameworks for their enterprise data lake solutions.

Download Hadoop and Data Lakes now.
Download Now

Current Data Challenges:

Enterprises face many challenges with data today, from siloed data stores and massive data growth to expensive platforms and lack of business insights. Let’s take a look at these individually:

1. Siloed Data Stores

Nearly every organization is struggling with siloed data stores spread across multiple systems and databases. Many organizations have hundreds, if not thousands, of database servers. They’ve likely created separate data stores for different groups such as Finance, HR, Supply Chain, Marketing and so forth for convenience’s sake, but they’re struggling big time because of inconsistent results.

I have personally seen this across multiple companies: they can’t tell exactly how many active customers they have or what the gross margin per item is because they get varying answers from groups that have their own version of the data, calculations and key metrics.

2. Massive Data Growth

No surprise that data is growing exponentially across all enterprises. Back in 2002 when we first built a Terabyte warehouse, our team was so excited! But today even a Petabyte is still small. Data has grown a thousandfold—in many cases in less than two decades—causing organizations to no longer be able to manage it all with their traditional databases.

Traditional systems scale vertically rather than horizontally, so when a database reaches its capacity, we can’t just add another server to expand; we have to forklift into newer, higher-capacity servers. But even that has its limits. IT has become stuck in this deep web and is unable to manage systems and data efficiently.

Diagram 1: Current Data Challenges


3. Expensive Platforms

 Traditional relational MPP databases are appliance-based and come with very high costs. There are cases where companies are paying more than $100K per terabyte and are unable to keep up with this expense as data volumes rapidly grow from terabytes to exabytes.

4. Lack of Business Insights

Because of all of the above challenges, the business ends up focused on descriptive analytics, a rear-view-mirror look at what happened yesterday, last month, last year, or year over year, instead of focusing on predictive and prescriptive analytics to find key insights on what to do next.

What is the Solution?

One possible solution is consolidating all disparate data sources into a single platform called a data lake. Many organizations have started this path and failed miserably. Their data lakes have morphed into unmanageable data swamps.

What does a data swamp look like? Here’s an analogy: when you go to a public library to borrow a book or video, the first thing you do is search the catalog to find out whether the material you want is available, and if so, where to find it. Usually, you are in and out of the library in a couple of minutes. But instead, let’s say when you go to the library there is no catalog, and books are piled all over the place—fiction in one area and non-fiction in another and so forth. How would you find the book you are looking for? Would you ever go to that library again? Many data lakes are like this, with different groups in the organization loading data into it, without a catalog or proper metadata and governance.

A data lake should be more like a data library, where every dataset is being indexed and cataloged, and there should be a gatekeeper who decides what data should go into the lake to prevent duplicates and other issues. For this to happen properly, we need an ingestion framework, which acts like a funnel as shown below.

Diagram 2: Data Ingestion Framework / Funnel
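The “data library” idea above (every dataset indexed and cataloged, with a gatekeeper rejecting duplicates) can be sketched as a small registry. This is an illustrative Python sketch under invented names, not a real catalog product:

```python
class DataCatalog:
    """Minimal data lake catalog: every dataset must be registered with
    metadata before it lands, so it stays findable and duplicates are refused."""

    def __init__(self):
        self._entries = {}

    def register(self, name: str, owner: str, path: str, tags: list):
        # The "gatekeeper": duplicate dataset names are rejected outright.
        if name in self._entries:
            raise ValueError(f"dataset {name!r} already cataloged")
        self._entries[name] = {"owner": owner, "path": path, "tags": tags}

    def search(self, tag: str):
        """The library catalog lookup: find datasets by tag."""
        return [n for n, e in self._entries.items() if tag in e["tags"]]

catalog = DataCatalog()
catalog.register("sales_orders", "finance", "/lake/raw/sales", ["sales", "raw"])
catalog.register("web_clicks", "marketing", "/lake/raw/clicks", ["web", "raw"])
raw_datasets = catalog.search("raw")
```

Real catalogs add lineage, schemas, and access control on top, but the core contract is the same: no dataset enters the lake without an entry here.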

A data ingestion framework should have the following characteristics:
  • A Single framework to perform all data ingestions consistently into the data lake.
  • Metadata-driven architecture that captures which datasets are to be ingested, when and how often to ingest them, how to capture each dataset’s metadata, and what credentials are needed to connect to the data source systems
  • Template design architecture to build generic templates that can read the metadata supplied in the framework and automate the ingestion process for different formats of data, both in batch and real-time
  • Tracking metrics, events and notifications for all data ingestion activities
  • Single consistent method to capture all data ingestion along with technical metadata, data lineage, and governance
  • Proper data governance with “search and catalog” to find data within the data lake
  • Data Profiling to collect the anomalies in the datasets so data stewards can look at them and come up with data quality and transformation rules

Diagram 3: Data Ingestion Framework Architecture
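The metadata-driven and template-design characteristics above boil down to one idea: the ingestion code is generic, and everything dataset-specific lives in metadata. A minimal Python sketch under that assumption (the source names, formats, and reader functions are invented for illustration):

```python
import json

# Hypothetical metadata describing what to ingest; in a real framework this
# would live in a metadata store, not inline in the code.
INGESTION_METADATA = json.loads("""
[
  {"source": "crm.accounts", "format": "csv",  "schedule": "daily"},
  {"source": "web.clicks",   "format": "json", "schedule": "hourly"}
]
""")

# One generic reader template per format; stubs stand in for real connectors.
READERS = {
    "csv": lambda source: f"read {source} as csv",
    "json": lambda source: f"read {source} as json",
}

def run_ingestion(metadata):
    """Generic template: the same code ingests every dataset,
    driven only by its metadata entry."""
    results = []
    for entry in metadata:
        reader = READERS[entry["format"]]
        results.append(reader(entry["source"]))
    return results

ingested = run_ingestion(INGESTION_METADATA)
```

Adding a new data source then means adding a metadata entry, not writing a new pipeline, which is exactly what keeps the lake agile.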

Modern Data Architecture Reference Architecture

Data lakes are a foundational structure for Modern Data Architecture solutions: a single platform to land all disparate data sources, stage raw data, profile data for data stewards, apply transformations, move data, and run machine learning and advanced analytics, ultimately so organizations can find deep insights and perform what-if analysis.

Unlike traditional data warehouses, where the business won’t see the data until it’s curated, with the modern data architecture businesses can ingest new data sources through the framework and analyze them within hours or days, instead of months or years.

In the next part of this series, we'll discuss "What is Metadata-Driven Architecture?" and see how it enables organizations to build robust ingestion and transformation frameworks for successful Agile data lake solutions. Let me know your thoughts in the comments, and head to Clarity Insights for more info.

The post Building Agile Data Lakes with Robust Ingestion and Transformation Frameworks – Part 1 appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

How code-heavy approaches to integration impede digital transformation

SnapLogic - Tue, 09/25/2018 - 15:30

At the heart of digital transformation lies the urge to survive. Any company, no matter how powerful, can go bankrupt, suffer a wave of layoffs, or get thrust into the bad end of an acquisition deal. Market disruption, led by those who have put digital transformation into practice, contributes heavily to such calamities. Look no[...] Read the full article here.

The post How code-heavy approaches to integration impede digital transformation appeared first on SnapLogic.

Categories: ETL

Overview: Talend Server Applications with Docker

Talend - Tue, 09/25/2018 - 08:53
Talend & Docker

Since the release of Talend 7, a major update to our software, users have been able to build a complete integration flow in a CI/CD pipeline that builds Docker images. For more on this feature, I invite you to read the blog written by Thibault Gourdel on Going serverless with Talend through CI/CD and Containers.

Another major update is the support of Docker for server applications like Talend Administration Center (TAC). In this blog, I want to walk you through how to build these images. Remember, if you want to follow along, you can download your free 30-day trial of Talend Cloud here. Let’s get started!

Getting Started: Talend Installation Modes

Talend provides two different installation modes for the subscription version. Once you receive download access to the Talend applications, you have a choice:

  • Installation using the Talend Installer: The installer packages all applications and offers an installation wizard to help you through the installation.
  • Manual installation: Each application is available as a separate package. It requires deeper knowledge of Talend installation, but it provides a lighter way to install, especially for containers.

Both are valid choices depending on your use case and architecture. For this blog, let's go with the manual installation, because it lets us define an image per application. The container layers will be lighter, and we will avoid overloading them with unnecessary weight. For more information on Talend installation modes, I recommend you look at the Talend documentation Talend Data Fabric Installation Guide for Linux (7.0) and also Architecture of the Talend products.

Container Images: Custom or Generic?

Now that we know a bit more about Talend installation, we can start thinking about how to build our container images. There are two directions when containerizing an application like Talend: a custom image or a generic image.

  • A custom image embeds part or all of the configuration inside the build process. This means that when we run the container, it requires fewer parameters than a generic image. How much depends on the level of customization.
  • A generic image does not include specific configuration; it corresponds to a basic installation of the application. The configuration is loaded at runtime.

To illustrate this, let's look at an example with Talend Administration Center, a central application in charge of managing users, projects, and scheduling. Here is how the two approaches compare for building an image of Talend Administration Center:

  • A custom image can include:
    • A specific JDBC driver (MySQL, Oracle, SQL Server)
    • Logging configuration: Tomcat logging
    • properties: Talend Administration Centre Configuration
    • properties: Clustering configuration
  • A generic image
    • No configuration
    • Driver and configuration files can be loaded with volumes

The benefits and drawbacks of each approach will depend on your configuration, but in general:

  • A custom image:
    • Requires less configuration
    • Low to zero external storage required
    • Bigger images: more space required for your registry
  • A generic image
    • Lighter images
    • Reusability
    • Configuration required to run.
Getting Ready to Deploy

Once we have our images and they are pushed to a registry, we need to deploy them. Of course, we can test them on a single server with a docker run command, but let's face it, that is not a real-world use case. Today, if we want to deploy a containerized application on premises or in the cloud, Kubernetes has become the de facto orchestrator. To deploy on Kubernetes, we can go with standard YAML files or a Helm package. But to give a quick example and a way to test in a local environment, I recommend starting with a docker-compose configuration as in the following example:


version: '3.2'
services:
  mysql:
    image: mysql:5.7
    ports:
    - "3306:3306"
    environment:
      MYSQL_ROOT_PASSWORD: talend
      MYSQL_DATABASE: tac
      MYSQL_USER: talend
      MYSQL_PASSWORD: talend123
    volumes:
      - type: volume
        source: mysql-data
        target: /var/lib/mysql
  tac:
    image: mgainhao/tac:7.0.1
    ports:
    - "8080:8080"
    depends_on:
      - mysql
    volumes:
      - type: bind
        source: ./tac/config/
        target: /opt/tac/webapps/org.talend.administrator/WEB-INF/classes/
      - type: bind
        source: ./tac/lib/mysql-connector-java-5.1.46.jar
        target: /opt/tac/lib/mysql-connector-java-5.1.46.jar
volumes:
  mysql-data:

The first service, mysql, creates a database container with one schema and a user, tac, to access it. For more information, please refer to the official MySQL image documentation on Docker Hub.

The second service is my Talend Administration Center image, aka TAC, a simplified version as it uses only the MySQL database. In this case, I have a generic image that is configured when you run the docker-compose stack. The JDBC driver, like the configuration, is loaded through a volume.

In a future article, I'll go into more detail on how to build and deploy a Talend stack on Kubernetes. For now, enjoy building with Talend and Docker!



The post Overview: Talend Server Applications with Docker appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Data Scientists Never Stop Learning: Q&A Spotlight with Isabelle Nuage of Talend

Talend - Mon, 09/24/2018 - 14:21

Data science programs aren’t just for students anymore. Now, data scientists can turn to open online courses and other resources to boost their skill sets. We sat down with Isabelle Nuage, Director of Product Marketing, Big Data at Talend to get insight on what resources are out there:

Q: How would you characterize the differences between data science research processes and machine learning deployment processes?

Isabelle: In the grand scheme of things, data science is science. Data Scientists go through a lot of iterations, by trial and error, before finding the right model or algorithm that fits their needs, and they typically work on sample data. When IT needs to deploy machine learning at scale, they take the work from the data scientists and try to reproduce it at scale for the enterprise. Unfortunately, it doesn't always work right away: unlike sample data, real-life data has inconsistencies, often missing values, and other data quality issues.

Q: Why is putting machine learning (ML) models into production hard?

Isabelle: Data Scientists work in lab mode, meaning they often operate like lone rangers. They take the time to explore data and try out various models, and sometimes it can take weeks or even months to deploy their data models into production. By that time, the models have already become obsolete for the business, forcing them to go back to the drawing board. Another challenge for Data Scientists is data governance; without it, data becomes a liability. A good example of this is clinical trial data, where sensitive patient information has to be masked so it is not accessible to everyone in the organization.

Q: What are the stumbling blocks?

Isabelle: There is a lack of collaboration between the Data Science team and IT, where each tends to speak its own language and has its own set of skills that the other might not understand. Data Science is often considered a pure technology discipline, disconnected from business needs, even though the asks are often tied to the need for fast decision-making in order to innovate and outsmart the competition. Existing landscapes, such as enterprise data warehouses, are not flexible enough to give Data Science teams access to all the historical and granular information, as some data is stored on tape. IT is needed to create a data lake to store all that historical data to train the models, and to add the real-time data that enables real-time decisions.

Q: How are enterprises overcoming them?

Isabelle: Enterprises are creating cloud data lakes (better suited for big data volumes and processing) and leveraging new services and tools, such as serverless processing, to optimize the cost of machine learning on big data volumes. They are also creating centers of excellence to foster collaboration across teams, and hiring a Chief Data Officer (CDO) to elevate data science to a business discipline.

Q: What advice might you offer enterprises looking to streamline the ML deployment process?

Isabelle: Use tooling to automate manual tasks such as hand-coding, and foster collaboration between the Data Science and IT teams. Let the Data Science team explore and do their research, but let IT govern and deploy data so it is no longer a liability for the organization. Doing this in a continuous iteration and delivery fashion enables continuous smart decision-making throughout the organization.

Q: What new programs for learning data science skills have caught your attention and in what ways do they build on traditional learning programs?

Isabelle: I’m most interested in new tools that democratize data science, provide a graphical, easy-to-use UI and suggest the best algorithms for the dataset, rather than going through a multitude of lengthy trials and errors. These tools make data science accessible to more people, like business analysts, so more people within the enterprise can benefit from the sophisticated advanced analytics for decision-making. These tools help people get a hands-on experience without needing a PhD.

Q: What are some of your favorite courses and certifications?

Isabelle: I’d say Coursera, as it offers online courses where people can learn at their own pace; they even offer some free data science and machine learning courses. Another great option is MIT eLearning, which also offers courses for Data Science and Big Data.

Check out Talend Big Data and Machine Learning Sandbox to get started.


The post Data Scientists Never Stop Learning: Q&A Spotlight with Isabelle Nuage of Talend appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

The 2018 Strata Data Conference and the year of machine learning

SnapLogic - Fri, 09/21/2018 - 15:10

Recently, I represented SnapLogic at the September 2018 Strata Data Conference at the Javits Center in New York City. This annual event is a good indication of trends in data – big and otherwise. If 2017 was the year of Digital Transformation, 2018 is the year of machine learning. Many of the exhibitor’s booth themes[...] Read the full article here.

The post The 2018 Strata Data Conference and the year of machine learning appeared first on SnapLogic.

Categories: ETL

What’s In Store for the Future for Master Data Management (MDM)?

Talend - Fri, 09/21/2018 - 12:37

Master Data Management (MDM) has been around for a long time, and many people like myself, have been involved in MDM for many years. But, like all technologies, it must evolve to be successful. So what are those changes likely to be, and how will they happen?

In my view, MDM will change in two important ways in the coming years. First, there will be technological changes, such as moving MDM into the cloud or toward "software as a service" (SaaS) offerings, which will change the way MDM systems are built and operated. Second, there will be more fundamental changes within MDM itself, such as moving from single-domain models to truly multi-domain models. Let's look at these in more detail.

Waves of MDM Change: Technical and Operational

New and disruptive technologies make fundamental changes to the way we do most things, and in MDM I expect change in two main areas. First comes the cloud. In all areas of data we are seeing moves into the cloud, and I expect MDM to be no different. The reasons are simple and obvious: offering MDM as SaaS brings cost savings in build, support, operation, automation, and maintenance, and is therefore hugely attractive to all businesses. I expect that going forward we will see MDM offered more and more as SaaS.

The second area where I see changes happening is more fundamental. Currently, many MDM systems concentrate on single-domain models. This is the way it has been for many years, and it currently manifests itself in the form of a "customer model" or a "product model". Over time I believe this will change. More and more businesses are looking toward multi-domain models that will, for example, capture the links between customers and partners, products, suppliers, and so on. This is the future for MDM models, and already at Talend, our multi-domain MDM tool allows you to build models of any domain you choose. Going forward, it's clear that linking those multi-domain models together will be the key.

MDM and Data Matching

Another change on the way concerns how MDM systems do matching. Currently, most systems do some type of probabilistic matching on properties within objects. I believe the future will see more MDM systems doing "referential matching": making greater use of a reference database, which may contain datasets like demographic data, in order to match better. Today, many businesses use data that is not updated often enough, so it becomes less and less valuable. Using external databases to, say, get the updated address of your customer or supplier should dramatically improve the value of your matching.
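As a rough illustration of the difference (the records and the reference database below are invented for the example), referential matching resolves each record against up-to-date reference data before comparing, instead of comparing the records' own, possibly stale, properties:

```python
# Hypothetical reference database mapping a person ID to their current details.
REFERENCE_DB = {
    "cust-1": {"name": "Jane Smith", "address": "12 New Street"},
}

def probabilistic_match(rec_a, rec_b):
    # Property-based matching: compare the records' own (possibly stale) fields.
    return rec_a["name"] == rec_b["name"] and rec_a["address"] == rec_b["address"]

def referential_match(rec_a, rec_b, reference):
    # Referential matching: refresh each record from the reference database
    # first, then compare the refreshed versions.
    ref_a = reference.get(rec_a["id"], rec_a)
    ref_b = reference.get(rec_b["id"], rec_b)
    return ref_a["name"] == ref_b["name"] and ref_a["address"] == ref_b["address"]

old = {"id": "cust-1", "name": "Jane Smith", "address": "3 Old Road"}  # stale address
new = {"id": "cust-1", "name": "Jane Smith", "address": "12 New Street"}

print(probabilistic_match(old, new))              # False: the stale address blocks the match
print(referential_match(old, new, REFERENCE_DB))  # True: the reference data reconciles them
```

The stale record fails a straight property comparison, but the reference lookup recognizes both records as the same, current customer.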

Machine Learning to the Rescue

The final big area of change coming for MDM is the introduction of intelligence, or machine learning. In particular, I forecast we will see intelligence in the form of machine learning survivorship. This will likely take the form of algorithms that "learn" how records survive and use those results to predict which records survive and which don't, freeing up a lot of time for the data steward.
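A toy sketch of what learned survivorship could look like (the features, weights, and records here are all invented): score each duplicate with weights that would normally be learned from historical steward decisions, and let the highest-scoring record survive:

```python
# Weights a survivorship model might learn from past steward decisions
# (hypothetical values), over completeness, recency, and source-trust features.
LEARNED_WEIGHTS = [0.5, 0.3, 0.2]

def survivorship_score(record):
    """Weighted score predicting how likely a record is to survive."""
    features = [record["completeness"], record["recency"], record["trust"]]
    return sum(w * f for w, f in zip(LEARNED_WEIGHTS, features))

def pick_survivor(duplicates):
    """Return the duplicate record predicted to survive the merge."""
    return max(duplicates, key=survivorship_score)

duplicates = [
    {"id": "a", "completeness": 0.6, "recency": 0.9, "trust": 0.5},
    {"id": "b", "completeness": 0.9, "recency": 0.4, "trust": 0.8},
]
print(pick_survivor(duplicates)["id"])
```

The steward then only reviews the cases where the model's confidence is low, instead of adjudicating every duplicate group by hand.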


Additional changes will likely also come around the matching of non-Western names and details (such as addresses). At the moment these can be notoriously tricky because, for example, algorithms such as Soundex simply can't be applied to many languages. This will change, and we should see support for more and more languages.
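To make the Soundex limitation concrete, here is a simplified sketch of American Soundex (treating vowels and unknown letters as separators): it collapses Latin consonants into digit classes, so a name written in a non-Latin script never enters the code table at all:

```python
def soundex(name):
    """Simplified American Soundex: first letter plus three digits."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    if not name or name[0] not in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        return ""  # non-Latin input: the algorithm has nothing to say
    digits = [name[0]]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return ("".join(digits) + "000")[:4]

print(soundex("Robert"))  # R163
print(soundex("Rupert"))  # R163: same code, so the two names "match"
print(soundex("Мария"))   # empty: Cyrillic letters fall outside the code table
```

"Robert" and "Rupert" collapse to the same code, which is exactly what phonetic matching wants, but a Cyrillic, Arabic, or CJK name gets no code at all, which is why language-aware matching is needed.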

One thing I am certain of: many of the areas I mentioned are being worked on, all vendors will likely make changes in these areas, and Talend will always be at the forefront of development in the future of Master Data Management. Do you have any predictions for the future of MDM? Let me know in the comments below.

Learn more about MDM with Talend’s Introduction to Talend’s Master Data Management tutorial series, and start putting its power to use today!


The post What’s In Store for the Future for Master Data Management (MDM)? appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL
Syndicate content