AdoptOS

Assistance with Open Source adoption

ETL

Talend and Splunk: Aggregate, Analyze and Get Answers from Your Data Integration Jobs

Talend - Wed, 08/01/2018 - 12:26

Log management solutions play a crucial role in an enterprise’s layered security framework— without them, firms have little visibility into the actions and events occurring inside their infrastructures that could either lead to data breaches or signify a security compromise in progress.

Splunk is often described as the “Google for log files”: a heavyweight enterprise tool that was among the first log analysis products and has been a market leader ever since. Many customers will therefore be interested in seeing how Talend can integrate with their enterprise Splunk deployment and leverage Splunk’s out-of-the-box features.

Splunk captures, indexes, and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards, and visualizations. It has an API that allows for data to be captured in a variety of ways.

Splunk’s core offering collects and analyzes high volumes of machine-generated data. It uses a standard API to connect directly to applications and devices. It was developed in response to the demand for comprehensible and actionable data reporting for executives outside a company’s IT department.

Splunk has several products, but in this blog we will only be working with Splunk Enterprise to aggregate, analyze and get answers from your Talend job logs. I’ll also cover an alternative approach in which developers log customized events to a specific index using the Splunk Java SDK. Let’s get started!

Intro to Talend Server Log

Let’s start by introducing you to the Talend Log Server. Simply put, this is a logging engine based on Elasticsearch, which is developed alongside the data-collection and log-parsing engine Logstash and the analytics and visualization platform Kibana (together, the ELK stack).

These technologies are used to streamline the capture and storage of logs from the Talend Administration Center, MDM Server, ESB Server, and Tasks running through the Job Conductor. It is a tool for managing events and Job logs. Talend supports the basic installation, but features such as high availability and read/write APIs are beyond Talend’s scope of supportability.

To learn how to configure the Talend logging modules with an external Elastic stack, please read this article.

Configure Splunk to Monitor Job Logs

Now that you have a good feel for the Talend Log Server, let’s set up Splunk to actually monitor and collect data integration job logs. After you log in to your Splunk deployment, the Home page appears. To add data, click Add Data; the Add Data page appears. If your Splunk deployment is a self-service Splunk Cloud deployment, click Settings > Add Data from the system bar.

The Monitor option lets you monitor one or more files, directories, network streams, scripts, Event Logs (on Windows hosts only), performance metrics, or any other type of machine data that the Splunk Enterprise instance has access to. When you click Monitor, Splunk Web loads a page that starts the monitoring process.

Select a source from the left pane by clicking it once; the page that is displayed depends on the source you selected. Since we want to monitor Talend job execution logs, select “Files & Directories”. The page updates with a field where you can enter a file or directory name and specify how the Splunk software should monitor it. Follow the on-screen prompts to complete the selection of the source object that you want to monitor, then click Next to proceed to the next step in the Add Data process.
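If you prefer to script this instead of using the UI, the equivalent monitor input can be declared in a Splunk inputs.conf stanza. This is only a minimal sketch: the log path, index name, and sourcetype below are assumptions and should be replaced with the values used in your own Talend environment.

[monitor:///opt/talend/jobserver/logs/*.log]
index = talend_jobs
sourcetype = talend:joblog
disabled = false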

Creating a Simple Talend Spark Job

To start, log in to your Talend Studio and create a simple job that reads a string via a context variable, extracts its first three characters, and displays both the original and the extracted string.
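The job logic itself is trivial. Expressed as plain Java, for example inside a tJava component, it amounts to the sketch below, where the context variable name inputString is an assumption:

// Assumed context variable: inputString, e.g. "TalendSpark"
String original = context.inputString;
// Guard against strings shorter than three characters before extracting
String extracted = original.length() >= 3 ? original.substring(0, 3) : original;
System.out.println("Original string : " + original);
System.out.println("Extracted string: " + extracted);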

Creating Custom Log Events from Talend Spark Job

Now that we’ve got everything set up, we’ll want to leverage the Splunk SDK to create custom events (one per flow in the Talend job) and send them back to the Splunk server. A user routine is written to make the Splunk calls and register each event against an index. The Splunk SDK jar is set up as a dependency of the user routines so that they can leverage the Splunk SDK methods.
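A minimal sketch of such a user routine is shown below, assuming the Splunk Java SDK jar has been added as a dependency of the routine. The host, port, credentials, and index name are placeholders that would normally come from the job’s context variables:

package routines;

import com.splunk.Index;
import com.splunk.Service;
import com.splunk.ServiceArgs;

public class SplunkLogger {

    // Sends a single event string to the given Splunk index.
    public static void logEvent(String host, int port, String user,
                                String password, String indexName, String event) {
        ServiceArgs args = new ServiceArgs();
        args.setHost(host);       // e.g. "splunk.mycompany.com" (placeholder)
        args.setPort(port);       // splunkd management port, typically 8089
        args.setUsername(user);
        args.setPassword(password);

        Service service = Service.connect(args);            // authenticate against splunkd
        Index index = service.getIndexes().get(indexName);   // the index must already exist
        index.submit(event);                                 // push the event to the index
    }
}

A tJava component in the job can then call routines.SplunkLogger.logEvent(...) once per flow, passing values read from the context, which is the pattern used in the steps below.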

Here is how to quickly build the sample Talend Job shown below:

  • The Splunk configuration is created as context variables and passed to the routine via a tJava component
  • The job starts and its respective event is logged
  • Employee data is read and its respective event is logged
  • Department data is read and its respective event is logged
  • Employee and Department datasets are joined to form a denormalized dataset and its respective event is logged

Switch back to Splunk and search on the index used in the job above; you’ll see the events published from the job.
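If you would rather verify the events programmatically than through the search UI, a rough sketch using the same Splunk Java SDK could look like this; the connection details and the index name talend_events are placeholders:

import com.splunk.Job;
import com.splunk.Service;
import com.splunk.ServiceArgs;

public class VerifyEvents {
    public static void main(String[] args) throws InterruptedException {
        ServiceArgs connectArgs = new ServiceArgs();
        connectArgs.setHost("splunk.mycompany.com"); // placeholder host
        connectArgs.setPort(8089);                   // splunkd management port
        connectArgs.setUsername("admin");            // placeholder credentials
        connectArgs.setPassword("changeme");

        Service service = Service.connect(connectArgs);

        // Run a search against the assumed index and wait for it to complete
        Job job = service.getJobs().create("search index=talend_events | head 10");
        while (!job.isDone()) {
            Thread.sleep(500); // poll until the search job finishes
        }
        System.out.println("Events found: " + job.getResultCount());
    }
}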

Conclusion

The exercise and process above show that Talend can seamlessly connect to Splunk Enterprise and push both customized events and complete job log files to it.

The post Talend and Splunk: Aggregate, Analyze and Get Answers from Your Data Integration Jobs appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Making Sense of the 2018 Gartner Magic Quadrant for Data Integration Tools

Talend - Mon, 07/30/2018 - 18:40

It’s an exciting time to be part of the data market.  Never before have we seen so much innovation and change in a market, especially in the areas of cloud, big data, machine learning and real-time data streaming.  With all of this market innovation, we are especially proud that Talend was recognized by Gartner as a leader for the third time in a row in their 2018 Gartner Magic Quadrant for Data Integration Tools and remains the only open source vendor in the leaders quadrant.

According to Gartner’s updated forecast for the Enterprise Infrastructure Software market, data integration and data quality tools are the fastest growing sub-segment, growing at 8.6%. Talend is rapidly taking market share in the space with a 2017 growth rate of 40%, more than 4.5 times faster than the overall market.

The Data Integration Market: 2015 vs. 2018

Making the move from challengers to market leaders from 2015 to today was no easy feat for an emerging leader in cloud and big data integration. It takes a long time to build a sizeable base of trained and skilled users while maturing product stability, support and upgrade experiences. 

While Talend still has room to improve, seeing our score improve like that is exciting recognition of all the investments Talend has made.

Today’s Outlook in the Gartner Magic Quadrant

Mark Byer, Eric Thoo, and Etisham Zaidi are not afraid to change things up in the Gartner Magic Quadrant as the market changes, and their 2018 report is proof of that. Overall, Gartner continued to raise its expectations for the cloud, big data, machine learning, IoT and more. If you read each vendor’s write-up carefully and take close notes, as I did, you start to see some patterns.

In my opinion, the latest report from Gartner indicates that, in general, you have to pick your poison: you can have a point solution with less mature products and support and a very limited base of trained users in the market, or go with a vendor that has product breadth, maturity, and a large base of trained users, but whose solutions are expensive, complex, and hard to deploy.

Talend’s Take on the 2018 Gartner Magic Quadrant for Data Integration Tools

In our minds, this has left a really compelling spot in the market for Talend as the leader in the new cloud and big data use cases that are increasingly becoming the mainstream market needs. For the last 10+ years, we’ve been on a mission to help our customers liberate their data. As data volumes continue to grow exponentially along with the number of business users needing access to that data, this mission has never been more important. This means continuing to invest in our native architecture to enable customers to be the first to adopt new cutting-edge technologies like serverless and containers, which significantly reduce total cost of ownership and can run on any cloud.

Talend also strongly believes that data must become a team sport for businesses to win, which is why governed self-service data access tools like Talend Data Preparation and Talend Data Streams are such important investments for Talend.  It’s because of investments like these that we believe Talend will quickly become the overall market leader in data integration and data quality. As I said at the beginning of the blog, our evolution has been a journey and we invite you to come along with us. I encourage you to download a copy of the report,  try Talend for yourself and become part of the community.

Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose. 
GARTNER is a federally registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally, and is used herein with permission. All rights reserved.

 

The post Making Sense of the 2018 Gartner Magic Quadrant for Data Integration Tools appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Talend & Apache Spark: Debugging & Logging

Talend - Mon, 07/30/2018 - 09:43

So far, our journey on using Apache Spark with Talend has been a fun and exciting one. The first three posts in my series provided an overview of how Talend works with Apache Spark, some similarities between Talend and Spark Submit, the configuration options available for Spark jobs in Talend, and how to tune Spark jobs for performance. If you haven’t already read them, you should do so before getting started here. Start with “Talend & Apache Spark: A Technical Primer”, “Talend vs. Spark Submit Configuration: What’s the Difference?” and “Apache Spark and Talend: Performance and Tuning”.

To finish this series, we’re going to talk about logging and debugging. When starting your journey with Talend and Apache Spark, you may have run into an error like the one below printed in your console log:

“org.apache.spark.SparkContext - Error initializing SparkContext.  org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master”

How do you find out what caused this? Where should you be looking for more information? Did your Spark job even run? In the following sections, I will go over how you can find out what happened with your Spark job, and where you should look for more information that will help you resolve your issue.

Resource Manager Web UI

When you get an error message like the one above, the Resource Manager Web UI is the first place you should visit: locate your application there and see what errors are reported for it.

Once you locate your application, you’ll see in the bottom right corner an option for retrieving the container logs: click the “logs” link provided next to each attempt to get more information about what happened.

In my experience, the Web UI alone never gives me all the logging information I need, so it is better to log in to one of your cluster edge nodes and use the YARN command-line tool to grab all the logging information for your containers and write it to a file, for example: yarn logs -applicationId <application_id> > application_logs.txt.

Interpreting the Spark Logs (Spark Driver)

Once you have retrieved the container logs through the command shown above and have the logs from your Studio, you need to interpret them and see where the job may have failed. The place to start is the Studio log, which contains the logging information for the Apache Spark driver. Its first few lines state that the Spark driver has started:

[INFO ]: org.apache.spark.util.Utils - Successfully started service 'sparkDriver' on port 40238.
[INFO ]: org.apache.spark.util.Utils - Successfully started service 'sparkDriverActorSystem' on port 34737.

The next part to look for in the Studio log is the information about the Spark Web UI that is started by the Spark driver:

[INFO ]: org.apache.spark.ui.SparkUI - Started SparkUI at http://<ip_address>:4040


This is the Spark Web UI launched by the driver. The next step you should see in the logs is the libraries needed by the executors being uploaded to the Spark cache:

[INFO ]: org.apache.spark.SparkContext - Added JAR ../../../cache/lib/1223803203_1491142638/talend-mapred-lib.jar at  spark://<ip_address>:40238/jars/talend-mapred-lib.jar with timestamp 1463407744593

Once all the information the executors will need for the job has been uploaded to the Spark cache, you will see the request for the Application Master logged in the Studio:

[INFO ]: org.apache.spark.deploy.yarn.Client - Will allocate AM container, with 896 MB memory including 384 MB overhead
[INFO ]: org.apache.spark.deploy.yarn.Client - Submitting application 20 to ResourceManager
[INFO ]: org.apache.spark.deploy.yarn.Client - Application report for application_1563062490123_0020 (state: ACCEPTED)
[INFO ]: org.apache.spark.deploy.yarn.Client - Application report for application_1563062490123_0020 (state: RUNNING)

After this, the executors are registered and the processing starts:

[INFO ]: org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Registered executor NettyRpcEndpointRef(null) (hostname2:41992) with ID 2
[INFO ]: org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Registered executor NettyRpcEndpointRef(null) (hostname1:45855) with ID 1
[INFO ]: org.apache.spark.scheduler.TaskSetManager - Finished task 1.0 in stage 1.0 (TID 3) in 59 ms on hostname1 (1/2)
[INFO ]: org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 2) in 79 ms on hostname2 (2/2)

At the end of the Spark driver log you’ll see the last stage, which is the shutdown and cleanup:

[INFO ]: org.apache.spark.util.ShutdownHookManager - Shutdown hook called
[INFO ]: org.apache.spark.util.ShutdownHookManager - Deleting directory /tmp/spark-5b19fa47-96df-47f5-97f0-cf73375e59e1
[INFO ]: akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.

In your log, you may see every stage being reported, but depending on the issue some of them may not get executed.

This exercise should give you a very good indication of where the actual failure happened and whether it was an issue encountered by the Spark driver. Note that the Spark driver logs shown above are only available in the Studio console log if the job is run in YARN-client mode. If YARN-cluster mode is used, this logging information will not be available in the Studio, and you will have to use the YARN command mentioned earlier to get it.

Interpreting the Spark Logs (Container Logs)

Now let’s move on to reviewing the container logs (available if your application actually started running at the cluster level) and start interpreting the information. The first step we see in these logs is the Application Master starting:

05/05/18 16:09:13 INFO ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]

Then you will see the connectivity happening with the Spark Driver:

05/05/18 16:09:14 INFO ApplicationMaster: Waiting for Spark driver to be reachable.
05/05/18 16:09:14 INFO ApplicationMaster: Driver now available: <spark_driver>:40238

It will then proceed with requesting resources:

05/05/18 16:09:15 INFO YarnAllocator: Will request 2 executor containers, each with 1 cores and 1408 MB memory including 384 MB overhead
05/05/18 16:09:15 INFO YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
05/05/18 16:09:15 INFO YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)

Then once it gets the resources it will start launching the containers:

05/05/18 16:09:15 INFO YarnAllocator: Launching container container_e04_1463062490123_0020_01_000002 for on host hostname1
05/05/18 16:09:15 INFO YarnAllocator: Launching container container_e04_1463062490123_0020_01_000003 for on host hostname2

It then proceeds with printing the container classpath:

CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark__.jar<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOOP_HDFS_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$YARN_HOME/*<CPS>$YARN_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*
{{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms1024m -Xmx1024m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.driver.port=40238' -Dspark.yarn.app.container.log.dir=<LOG_DIR> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@163.172.14.98:40238 --executor-id 1 --hostname hostname1 --cores 1 --app-id application_1463062490123_0020 --user-class-path file:$PWD/__app__.jar 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr

You’ll now see our executors starting up:

05/05/18 16:09:19 INFO Remoting: Starting remoting
05/05/18 16:09:19 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@tbd-bench-09:38635]

And then the executor communicating back to the Spark Driver:

05/05/18 16:09:20 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@163.172.14.98:40238
05/05/18 16:09:20 INFO CoarseGrainedExecutorBackend: Successfully registered with driver

It then proceeds to retrieve the libraries from the Spark cache and update the classpath:

05/05/18 16:09:21 INFO Utils: Fetching spark://<ip_address>:40238/jars/snappy-java-1.0.4.1.jar to /data/yarn/nm/usercache /appcache/application_1563062490123_0020/spark-7e7a084d-e9e2-4c94-9174-bb8f4a0f47e9/fetchFileTemp7487566384982349855.tmp
05/05/18 16:09:21 INFO Utils: Copying /data/yarn/nm/usercache /appcache/application_1563062490123_0020/spark-7e7a084d-e9e2-4c94-9174-bb8f4a0f47e9/1589831721463407744593_cache to /data2/yarn/nm/usercache /appcache/application_1563062490123_0020/container_e04_1563062490123_0020_01_000002/./snappy-java-1.0.4.1.jar
05/05/18 16:09:21 INFO Executor: Adding file:/data2/yarn/nm/usercache /appcache/application_1563062490123_0020/container_e04_1563062490123_0020_01_000002/./snappy-java-1.0.4.1.jar to class loader

Next you will see the Spark executor start running:

05/05/18 16:09:22 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)

Finally, the Spark executor shuts down and cleans up once the processing is done:

05/05/18 16:09:23 INFO Remoting: Remoting shut down
05/05/18 16:09:23 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.

As I mentioned in the previous section, your Spark job might go through all of those stages, or it might not, depending on the issue encountered. Understanding the different steps involved, and being able to interpret this information in the logging, will give you a better understanding of why a Spark job may have failed.

Spark History Web UI

In previous blogs, I mentioned that, as a best practice, you should always enable the Spark event logging in your jobs, so that the information in the Spark History Web Interface is available even after the job ends.
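If you are unsure which settings are involved, the two relevant Spark properties are spark.eventLog.enabled and spark.eventLog.dir. In Talend they are normally set through the job’s Spark configuration rather than in code; the snippet below is just a plain SparkConf sketch for clarity, and the history directory shown is an assumption:

import org.apache.spark.SparkConf;

public class EventLogConfig {
    // Returns the given SparkConf with event logging switched on.
    public static SparkConf withEventLogging(SparkConf conf) {
        return conf
            // Write event logs so the Spark History Server can replay the job after it ends
            .set("spark.eventLog.enabled", "true")
            // Assumed HDFS path; it must match the directory the History Server reads from
            .set("spark.eventLog.dir", "hdfs:///user/spark/applicationHistory");
    }
}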

This Web Interface is the next location that you should always check for more information regarding the processing of your job. The Spark History Web UI does a great job of giving a visual representation of the processing of a Spark job. When you navigate to that web interface, you will see several tabs, such as Jobs, Stages, Storage, Environment, and Executors:

First, I suggest starting with the Environment tab, to verify all the environment information that was passed to the job:

The next step is to look at the timeline of events in Spark. When you click on the Jobs tab, you will see how the application was executed after the executors were registered:

As you can see in the image above, it shows the addition of all the executors that were requested, then the execution of the jobs that the application was split into, whether any of them ran in parallel, and whether any of them failed. You can then click on one of those jobs to get further information (example below):

Here you’ll see the stages that run in parallel and don’t depend on each other, as well as the stages that depend on others to finish and don’t start until the first ones are done. The Spark History Web UI also allows you to look further inside those stages and see the different tasks that are executed:

Here you are looking at the different partitions (in this case we have two) and how they are distributed among the different executors. You’ll also see how much of the execution time was spent on shuffling versus actual computation. Furthermore, if your Spark job performs joins, you will notice that for each job the Web UI also shows the execution DAG visualization, which lets you easily determine the type of join that was used:

As a final step, make sure to check the Executors tab, which gives you an overview of all the executors used by the Spark job, how many tasks each of them processed, the amount of data processed, the amount of shuffling, and how much time was spent on tasks and in garbage collection:

All of this information is important, as it will lead you to a better understanding of the root cause of a potential issue and of the corrective action you should take.

Conclusion

This concludes my blog series on Talend with Apache Spark. I hope you enjoyed this journey and had as much fun reading the blogs as I had putting all this information together! I would love to hear about your experience with Spark and Talend, and whether the information in these blogs was useful, so feel free to post your thoughts and comments below.

References

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-history-server.html

The post Talend & Apache Spark: Debugging & Logging appeared first on Talend Real-Time Open Source Data Integration Software.

Categories: ETL

Digital transformation: Three ways HR can use AI more effectively

SnapLogic - Fri, 07/27/2018 - 13:00

Originally published on hrzone.com. AI and analytics tools are only as good as the data they can access, so in order to make the best use of them, HR teams need to ensure that they have the right data integration strategy in place. Data is changing the world, but today’s enterprises remain inundated with an ever-rising[...] Read the full article here.

The post Digital transformation: Three ways HR can use AI more effectively appeared first on SnapLogic.

Categories: ETL

An improved SnapLogic Partner Connect Program – helping to transform business models

SnapLogic - Thu, 07/26/2018 - 13:06

Microsoft, Adobe, IBM, GE, and Netflix have all been through it. In the case of Netflix, they’ve been through it twice. And both IBM and GE have been through it before and are now trying to do it again. The “it” I’m talking about is the fundamental transformation of business models that built each[...] Read the full article here.

The post An improved SnapLogic Partner Connect Program – helping to transform business models appeared first on SnapLogic.

Categories: ETL

Salesforce Connector in CloverETL

CloverETL - Tue, 08/09/2016 - 09:24

In an effort to constantly improve the lives of our users, we’ve enhanced our Salesforce connectivity and added a new, user-friendly (yet powerful) Salesforce connector into CloverETL 4.3. You can now easily read, insert, update and delete Salesforce data with CloverETL, without having to expose yourself to the nuts and bolts of the two systems […]

The post Salesforce Connector in CloverETL appeared first on CloverETL Blog on Data Integration.

Categories: ETL

Building your own components in CloverETL

CloverETL - Tue, 07/19/2016 - 09:51

In this post, I’d like to cover a few things not only related to building components, but also related to: subgraphs and their ability to make your life easier; working with CloverETL public API; …and some other things I consider useful. This should give you a good idea of how to build your own (reusable) components and make them […]

The post Building your own components in CloverETL appeared first on CloverETL Blog on Data Integration.

Categories: ETL

Code Debugging in CloverETL Designer

CloverETL - Mon, 07/04/2016 - 03:53

EDIT: Updated to 4.3 Milestone 2 version, adding conditional breakpoints and watch/inspect options.   Code debugging is a productivity feature well known to developers from various programming environments. It allows you to control the execution of a piece of code line-by-line, and look for problems that are hard to spot during normal runs. Why would […]

The post Code Debugging in CloverETL Designer appeared first on CloverETL Blog on Data Integration.

Categories: ETL

Replacing legacy data software with CloverETL

CloverETL - Thu, 06/30/2016 - 02:56

  What does legacy data software mean to you: old software that’s currently outdated or existing software that works? Or, I should ask, are you a developer or a business stakeholder? No matter which side of the discussion you are on, replacing legacy data software is always a difficult conversation between developers and business stakeholders. On one […]

The post Replacing legacy data software with CloverETL appeared first on CloverETL Blog on Data Integration.

Categories: ETL

Data Partitioning: An Elegant Way To Parallelize Transformations Without Branching

CloverETL - Wed, 06/22/2016 - 05:54

Ever wondered what to do with those annoyingly slow operations inside otherwise healthy and fast transformations? You’ve done everything you could do to meet the processing time window, and now there’s this wicked API call that looks up some data, or a calculation that just sits there and takes ages to complete, record by record, […]

The post Data Partitioning: An Elegant Way To Parallelize Transformations Without Branching appeared first on CloverETL Blog on Data Integration.

Categories: ETL

Data Integration Challenges: Define Your Customer

Data integration blog - Fri, 04/29/2011 - 07:56

IT and business alignment is a widely discussed challenge in data integration. The major data integration problem boils down to this: define “customer”.

Data from different functional areas doesn’t join up: sales orders are associated with the newly contracted customers, but the marketing campaign data is associated with prospects. Is a customer someone who’s actually bought something from you, or is a customer someone who’s interested in buying something from you? Should a definition include a certain demographic factor that reflects your typical buyer? If sales, marketing, service, and finance can all agree on a single definition of customer, then all the associated transactions could be easily integrated.

The thing is that each of these specialists has their own understanding of the word “customer”. That is why it is next to impossible for them to agree on a single definition, and you have to somehow manage data integration without one.

To solve this issue, you can define what each functional area (and each CRM system) means by “customer”. This is how we know that customer data coming from a marketing system includes prospects, as well as existing customers. With this information, you can build a semantic model to understand how the different definitions of customer relate to one another.

Using this model, it would be possible to associate supply data with parts, cost data with product class, marketing data with brands, and so on. The relationships among these entities allow for data integration from different functional areas. This semantic model may be complex, but try to accept it and resist the urge to simplify it. The world is complex. Data integration requires a sophisticated understanding of your business, and standardizing vocabulary is not going to be the right answer to this challenge.

Categories: ETL

iPaaS: A New Trend In Data Integration?

Data integration blog - Wed, 04/20/2011 - 09:51

iPaaS (integration platform-as-a-service) is a development platform for building integration applications. It provides a set of capabilities for data integration and application integration in the Cloud and on-premises.

There are very few vendors offering iPaaS solutions at the moment. Although Gartner recognizes and uses the term, it still sounds confusing to researchers and data integration experts. So how does iPaaS work and can it benefit your data integration efforts?

An integration platform delivers a combination of data integration, governance, security and other capabilities to link applications, SOA services, and Cloud services. In addition to the basic features that a Cloud solution should have, such as multi-tenancy, elasticity, and reliability, there are other capabilities relevant for iPaaS:

    1. Intermediation, the ability to integrate applications and services using the Cloud scenarios, which include SaaS and Cloud services, on-premises apps and resources.
    2. Orchestration between services, which requires connectivity and the ability to map data.
    3. Service containers that enable users to publish their own services using either RESTful or SOAP technologies.
    4. Security covers the ability to authenticate and authorize access to any resource on the platform, as well as to manage this access.
    5. Enterprise Data Gateway installed on-premises and used as a proxy to access enterprise resources.

Data integration and application integration with and within the Cloud is the concept that business owners should consider nowadays. As of today, iPaaS would mostly appeal to companies that don’t mind building their own IT solutions or to ISVs that need to integrate Cloud silos they have created previously. It will be interesting to see whether iPaaS will become the next trend in the data integration discipline.

Categories: ETL

Salesforce Integration with QuickBooks: Out-of-the-box Solution on its Way

Data integration blog - Wed, 04/06/2011 - 05:41

Salesforce.com and Intuit have signed a partnership agreement to provide Salesforce integration with QuickBooks to Intuit’s four million customers. The companies promise to finish developing the integrated solution in summer.

The solution is going to make CRM processes more convenient and transparent by displaying customer data along with financial information. Salesforce integration with QuickBooks will enable businesses to synchronize customer data in Salesforce.com CRM with financial data in QuickBooks and QuickBooks Online. This will solve the issue of double data entry in two different systems.

Salesforce integration with QuickBooks will help small business owners make better decisions. According to Intuit’s survey, more than 50% of small businesses perform CRM activities manually with pen and paper or with software that is not designed for that purpose.

With thousands of small businesses using both QuickBooks and Salesforce.com, the integration of two systems is a great way to leverage the power of cloud computing and data integration strategies to help businesses grow.

Categories: ETL

Is Your Data Integration Technology Outdated?

Data integration blog - Sat, 04/02/2011 - 10:49

Spring is a good time to get rid of old stuff and check out something new. This might as well be the time to upgrade your data integration tools. How can you tell whether your data integration solution is outdated and should be replaced by something more productive? Maybe it just needs a little tuning? Here are the main checkpoints to see if your solution’s performance still fits the industry standards.

Data transformation schemas deal with both data structure and content. If data mappings are not as well-organized as possible, then a single transformation may take twice as long. Mapping problems can cause small delays that add up. The solution to the transformation issue is to make sure that data maps are written as efficiently as possible. You can compare your data integration solution to similar ones to understand whether the data transformation runs at the required speed.

Business rules processing refers to the specific rules against which data has to be validated. Too many rules can stall your data integration processes. You have to make sure that the number of rules in your data integration system is optimal, meaning that there are not too many of them running at the same time.

Network bandwidth and traffic: in many cases performance is hindered not by the data integration tool itself, but by the size of the network you use. To avoid this issue, you need to calculate the predicted performance under various loads and make sure you use the fastest network available for your data integration needs.

A data integration solution is like a car: it runs, but it becomes slow if it is not properly tuned and taken care of. As we become more dependent upon data integration technology, our ability to understand and optimize performance issues will make a substantial difference.

Categories: ETL

The Key Data Integration Strategies for Successful CRM

Data integration blog - Thu, 03/10/2011 - 09:39

One of the great values data integration provides is a possibility to gain a deeper insight into one’s customers. It is not surprising that data integration with CRM (customer relations management) systems is one of the main directions in the industry development. As more companies choose managing customers electronically, it is quite useful to apply the most effective data integration strategies to pay-off for CRM investments.

A recent survey by the data integration experts and authors Christopher Barko, Ashfaaq Moosa, and Hamid Nemati explores the significant role of data integration in electronic customer relationship management (e-CRM) analytics. They reviewed 115 organizations, including both B2B and B2C companies, and identified four data integration initiatives that provide for better CRM:

    1. Integrating more data sources. The research shows that the total value of a CRM project increases when you integrate more data sources. As salespeople are using more channels than ever before to reach prospects and customers, it is no wonder that data integrated from all of these channels is more useful than data stored in isolated silos.

    2. Integrating offline data with online data gives a better picture of customers’ buying habits. 62 percent of respondents said they integrated these data sources, while 30 percent did not. Not surprisingly, those who integrated the online and offline data experienced greater value from their e-CRM projects.

    3. Integrating external data (e.g., from social media sites) into the central repository. 74 percent integrated external data in some form, while 26 percent did not. The companies that practice external data integration in their e-CRM projects enjoy significantly more benefits.

    4. Using a centralized data warehouse or a CRM-specific data repository does provide a deeper customer insight. Those who used a decentralized data repository (legacy databases, operational data stores) experienced significantly fewer benefits than those who centralized their data storage.

As the number of marketing channels used to communicate with customers continues to multiply, so does the number of places used to store the data. The research reveals that the most efficient data integration strategies include integrating different kinds of data from multiple channels and keeping it in the central repository. These data integration best practices help ensure marketing efforts have a positive effect on sales.

Categories: ETL

How Can Data Governance Serve Data Integration Projects?

Data integration blog - Sat, 03/05/2011 - 06:56

Data governance initiatives in an organization are intended to cover data quality, data management, and data policy issues. These activities are carried out by data stewards and a team that develops and implements business rules for administering the use of data.

The focus on data governance is essential when the company has to implement a successful data integration strategy and use it for analysis, reporting, and decision-making. Here are some ways of making data integration projects more efficient with data governance:

    • It brings IT and business teams together. Data governance identifies what is really important to the business and helps establish business rules that are crucial for data integration.

    • A data governance program can help your company define and measure the potential ROI you get from maintaining data. You can use this information to calculate the ROI for data integration projects.

    • It helps you learn who’s responsible for data quality. Data governance provides valuable information that enables you to appoint data stewards and decision makers for data integration projects. Since data governance tells you who’s responsible for the data, you know where to go to resolve data quality issues.

    • Data governance can save you money, because it helps establish best practices and select cost-effective data integration and data quality tools.

Data governance and data integration are tightly connected with each other. You are not likely to enjoy data integration benefits without a strong governance program. On the other hand, data governance is only possible if your data is stored in an integrated system. My advice: make sensible use of both.

Categories: ETL

What Is The Difference Between Data Conversion and Data Migration?

Data integration blog - Thu, 02/24/2011 - 11:28

The terms data conversion and data migration are still sometimes used interchangeably on the internet. However, they do mean different things. Data conversion is the transformation of data from one format to another. It implies extracting data from the source, transforming it and loading the data to the target system based on a set of requirements.

Data migration is the process of transferring data between silos, formats, or systems. Therefore, data conversion is only the first step in this complicated process. In addition to data conversion, data migration includes data profiling, data cleansing, data validation, and an ongoing data quality assurance process in the target system.

Both terms are treated as synonyms by many internet resources. I think the reason might be that there are very few situations in which a company has to convert data without migrating it.

Possible data conversion issues

There are some data conversion issues to consider when data is transferred between different systems. Operating systems have certain alignment requirements that will cause program exceptions if they are not taken into consideration. Converting files to another format can be tricky, as the right way to convert a file depends on how it was created. These are only a few examples of possible conversion issues.

There are some ways to avoid data conversion problems:

    1. Always transform objects into printable character data types, including numeric data.
    2. Devise an operating system-neutral format for an object transformed into a binary data type.
    3. Include sufficient header information in the transformed data type so that the remainder of the encoded object can be correctly interpreted independent of the operating system.
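As a rough Java illustration of the recommendations above (not taken from the original post), numeric data can be written either as printable text or in a fixed, OS-neutral binary layout with a small descriptive header:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ConversionSketch {
    public static void main(String[] args) throws IOException {
        int amount = 1250;

        // Recommendation 1: represent numeric data as printable characters,
        // which any target system can parse regardless of its architecture.
        byte[] asText = Integer.toString(amount).getBytes(StandardCharsets.US_ASCII);

        // Recommendations 2 and 3: if binary is required, use a fixed layout
        // (DataOutputStream always writes big-endian) and prepend a header
        // describing what follows so the target can interpret it correctly.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeUTF("INT32/v1"); // header: type tag and layout version
        out.writeInt(amount);     // 4 bytes, big-endian on every platform
        out.flush();

        System.out.println("Text form  : " + new String(asText, StandardCharsets.US_ASCII));
        System.out.println("Binary form: " + buffer.size() + " bytes including header");
    }
}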

Data conversion is often the most important part of data migration. You have to be very careful during this stage to assure data quality in your target system.

Categories: ETL

Data Integration in SharePoint 2010

Data integration blog - Thu, 02/17/2011 - 09:23

A survey by AIIM (Association for Information and Image Management) states that although SharePoint is being rapidly adopted by organizations, at least half of the companies implementing the platform don’t have a business use in mind.

This might be a reason we don’t see millions of companies shifting their data integration initiatives into SharePoint. It may be only a question of time, as SharePoint 2010 comes with rich integration capabilities. Here are some of the features that can be leveraged for external data integration and application integration:

    1. Business Connectivity Services (BCS) is a new feature of the SharePoint platform that provides new means for integrating external data into SharePoint 2010. It lets you create connections to external data sources through SharePoint Designer or, for more complex scenarios, through custom code development.

    2. Web Services can be leveraged by both SharePoint and external systems for data integration and application integration purposes. Common services include the ability to authenticate, search, and manage content. SharePoint 2010 also includes built-in RESTful Web services, which allow remote systems to integrate with it (a minimal call sketch follows this list).

    3. Client Object Models are used to integrate SharePoint and other systems to provide a better usability. SharePoint 2010 introduces three new client API’s: ECMAScript Client, SilverLight Client, and .NET Managed Client. These object models enable users to access both SharePoint and other data sources from a single interface that does not have to be or look like the SharePoint interface.

    4. The CMIS (Content Management Interoperability Services) connector for SharePoint 2010 makes it possible to perform content management functions between systems that comply with the CMIS specification.
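To give an idea of what the RESTful option looks like from an external application, here is a minimal Java sketch that reads a SharePoint 2010 list through the ListData.svc endpoint. The site URL, list name, credentials, and the use of Basic authentication are all assumptions; many farms use NTLM or claims-based authentication instead:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class SharePointRestSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical site URL and list name; ListData.svc is the SharePoint 2010 REST endpoint.
        URL url = new URL("http://sharepoint.example.com/_vti_bin/ListData.svc/Announcements");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");

        // Assumes the site accepts Basic authentication (placeholder credentials).
        String credentials = Base64.getEncoder()
                .encodeToString("DOMAIN\\user:password".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + credentials);

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // raw JSON feed of list items
            }
        }
    }
}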

There are many ways in which organizations can leverage SharePoint for their data integration needs. Nevertheless, the question on whether companies will start data migration and data integration into SharePoint 2010 in the nearest future remains open.

Categories: ETL