Move data from Oracle to HDFS, process it, and move it from HDFS to Teradata

My requirements are:

  • Migrate data from Oracle to HDFS
  • Process the data on HDFS
  • Move the processed data to Teradata.

This entire cycle also needs to run every 15 minutes. The raw data may be close to 50 GB, and the processed data may be about the same size.

After a lot of searching on the Internet, I found the following options:

  • OraOop to transfer data from Oracle to HDFS (wrap the command in a shell script and schedule it to run at the required interval).
  • Custom MapReduce, Hive, or Pig for the large-scale processing.
  • Sqoop with the Teradata connector to move data from HDFS to Teradata (again wrapped in a shell script and scheduled).

Is this the right approach in the first place, and is it feasible within the required time window (note that this is not a daily batch job or anything like that)?

Other options I came across are:

  • Storm (for real-time data processing). But I cannot find an Oracle spout or a Teradata bolt available out of the box.
  • Any open source ETL tools like Talend or Pentaho.

Please share your thoughts on these options, as well as any other possibilities.

2 answers

It sounds like you have a few questions, so I'll try to break them down.

Import to HDFS

You seem to be looking for Sqoop. Sqoop is a tool that lets you easily transfer data to/from HDFS and can connect to various databases, including Oracle, natively. Sqoop is compatible with the Oracle thin JDBC driver. Here's how you would move data from Oracle to HDFS:

sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy --table tbl --target-dir /path/to/dir

For more information: here and here. Note that you can also import directly into a Hive table using Sqoop, which may be convenient for your analysis.
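As a rough sketch of what a direct Hive import could look like (the Hive table name below is made up; the connection details are just the ones from the example above), assuming a reasonably recent Sqoop with Hive available on the same machine:

 sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy --table tbl --hive-import --hive-table my_hive_table

With --hive-import, Sqoop creates the Hive table if it does not already exist and loads the imported files into it, so the data is queryable right away.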

Processing

As you noted, since your data is relational to begin with, it makes sense to use Hive for your analysis, as you will be more familiar with its SQL-like syntax. Pig is closer to pure relational algebra and its syntax is not SQL-like; it is largely a matter of preference, and both approaches should work fine.
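For illustration only (the table, columns, and paths below are made up), here is roughly how the same simple filter might look in each, run from the command line:

 hive -e "insert overwrite directory '/tmp/out' select id, amount from sales where amount > 100"

 pig -e "A = LOAD '/data/sales' USING PigStorage() AS (id:int, amount:double); B = FILTER A BY amount > 100; STORE B INTO '/tmp/out';"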

Since you can import data into Hive directly with Sqoop, your data should be ready for processing right after the import.

In Hive, you can run your query and tell it to write the results to HDFS:

 hive -e "insert overwrite directory '/path/to/output' select * from mytable ..." 

Export to Teradata

Cloudera released a Sqoop connector for Teradata last year, as described here, so you should check it out, as it seems to be exactly what you want. Here's how you would do it:

 sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/hive/output 
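One detail to double-check (this is an assumption on my part about how your Hive output is written): Hive uses \001 (Ctrl-A) as its default field delimiter when writing to a directory, so you may need to tell Sqoop how to parse the exported files, for example:

 sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/hive/output --input-fields-terminated-by '\001'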

All of this is, of course, feasible within whatever time window you want; in the end, what matters is the size of your cluster, so if you want it to go faster, scale the cluster as needed. The nice thing with Hive and Sqoop is that the processing is distributed across your cluster, so you have full control over the schedule.
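To give an idea of the scheduling side (the script name, paths, and log location below are hypothetical), one minimal approach is a wrapper script that chains the three steps, plus a cron entry that runs it every 15 minutes:

 #!/bin/bash
 # pipeline.sh - hypothetical wrapper chaining import, processing, and export
 set -e
 sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy --table tbl --hive-import --hive-table my_hive_table
 hive -e "insert overwrite directory '/path/to/output' select * from my_hive_table ..."   # replace the ... with your actual query
 sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/output --input-fields-terminated-by '\001'

 # crontab entry: every 15 minutes; flock -n skips a run if the previous one is still going
 */15 * * * * flock -n /tmp/pipeline.lock /path/to/pipeline.sh >> /var/log/pipeline.log 2>&1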


If you run into problems with the overhead or latency of moving data from Oracle to HDFS, Dell Software's SharePlex could be a commercial option. They recently released a connector for Hadoop that lets you copy table data from Oracle to Hadoop. More details here.

I'm not sure whether you need to reprocess the entire data set each time, or whether you could work with just the deltas. SharePlex also supports replicating the change data to a JMS queue, so you could perhaps build a spout that reads from that queue. You could also build your own trigger-based solution, but that would be a bit of work.

Full disclosure: I work for Dell Software.


Source: https://habr.com/ru/post/946377/

