Big Data | Hadoop: 10 Common Mistakes in Sqoop

Jayvardhan Reddy
Nov 19, 2023


If your daily activities involve data ingestion, here are a few issues you are likely to come across.

These are things both beginners and experienced engineers need to take care of.

Problem 1: --split-by is not specified for a free-form query import

sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products p, categories c where p.product_category_id = c.category_id" \
--target-dir /user/jvanchir/sqoop_prac/

O/P:

18/09/14 13:46:17 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
When importing query results in parallel, you must specify --split-by.
Try --help for usage instructions.

Solution: Use --split-by or --autoreset-to-one-mapper. Note that a free-form query must also include $CONDITIONS in its WHERE clause (see Problem 2).

sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products p, categories c where p.product_category_id = c.category_id AND \$CONDITIONS" \
--target-dir /user/jvanchir/sqoop_prac/ \
--split-by product_id
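
If no suitable numeric split column exists, another way out is to fall back to a single mapper; a minimal sketch with the same connection details (note that \$CONDITIONS is still mandatory with --query):

sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products p, categories c where p.product_category_id = c.category_id AND \$CONDITIONS" \
--target-dir /user/jvanchir/sqoop_prac/ \
-m 1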

Problem 2: $CONDITIONS is missing from the --query parameter

sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products p, categories c where p.product_category_id = c.category_id" \
--target-dir /user/jvanchir/sqoop_prac/ \
--split-by product_id

O/P:
ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException:
Query [select * from products p, categories c where p.product_category_id = c.category_id] must contain '$CONDITIONS' in WHERE clause.

Solution: Include $CONDITIONS in the WHERE clause of the query; Sqoop replaces it at run time with the split conditions for each mapper.

sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products p, categories c where p.product_category_id = c.category_id AND \$CONDITIONS" \
--target-dir /user/jvanchir/sqoop_prac/ \
--split-by product_id
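
To illustrate what the substitution looks like: each mapper runs the query with $CONDITIONS replaced by its own range predicate over the split column. The boundary values below are hypothetical, since the real ones depend on the min/max of product_id:

select * from products p, categories c
where p.product_category_id = c.category_id
AND ( product_id >= 1 ) AND ( product_id < 337 )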

Problem 3: A directory with the same name (or the same data) already exists in the HDFS path we are importing into

sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products p, categories c where p.product_category_id = c.category_id AND \$CONDITIONS" \
--target-dir /user/jvanchir/sqoop_prac/ \
--split-by product_id

O/P:
Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory hdfs://nn01.cloudera.com:8020/user/jvanchir/sqoop_prac already exists

Solution: Use --delete-target-dir

sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products p, categories c where p.product_category_id = c.category_id AND \$CONDITIONS" \
--target-dir /user/jvanchir/sqoop_prac/ \
--split-by product_id \
--delete-target-dir
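
Alternatively, the directory can be removed by hand before re-running the import (same path as above):

hdfs dfs -rm -r /user/jvanchir/sqoop_prac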

Problem 4: No space between the table name and the line-continuation backslash

sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table products\
--warehouse-dir /user/jvanchir/sqoop_prac/import_table_dir

O/P:
18/09/14 14:24:34 ERROR tool.BaseSqoopTool: Error parsing arguments for import:
18/09/14 14:24:34 ERROR tool.BaseSqoopTool: Unrecognized argument: /user/jvanchir/sqoop_prac/import_table_dir

Solution: Add a space before the backslash:

--table products \
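
The full command then parses cleanly (identical to the failing one, apart from the added space):

sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table products \
--warehouse-dir /user/jvanchir/sqoop_prac/import_table_dir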

Problem 5: Using --split-by on a non-primary-key, string column

sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table categories \
--split-by category_name \
--warehouse-dir /user/jvanchir/sqoop_prac/import_table_dir

O/P:
Caused by: Generating splits for a textual index column allowed only in case of
"-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter

Solution: Pass -Dorg.apache.sqoop.splitter.allow_text_splitter=true as the first argument after the import keyword; generic Hadoop -D properties must come before the tool-specific arguments. Keep in mind that splitting on a text column may give partial or duplicate results if the database sorts case-insensitively.

sqoop import \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table categories \
--split-by category_name \
--warehouse-dir /user/jvanchir/sqoop_prac/import_table_dir

Problem 6: --split-by cannot be used on multiple columns

sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table products \
--warehouse-dir /user/jvanchir/sqoop_prac/import_table_dir/mul_split \
--split-by product_name,product_category_id

O/P:
Encountered IOException running import job: java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException:
Unknown column 'product_name,product_category_id' in 'field list'

Solution: Use the --query option instead; note that with --query, --target-dir is mandatory (and a split column is still needed, see Problem 1).

sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products where product_name = 'abc' and product_category_id >= '1010' and \$CONDITIONS" \
--target-dir /user/jvanchir/sqoop_prac/import_table_dir/mul_split/products \
--split-by product_id

Problem 7: Parse error during a Sqoop export, i.e. when exporting data from Hive/HDFS into a SQL table

sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_export \
--username retail_dba \
--password cloudera \
--table revenue_prac_col_order \
--export-dir /apps/hive/warehouse/jvanchir_sqoop_prac.db/daily_revenue \
--columns order_date,revenue \
-m 2

O/P:
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.lang.RuntimeException: Cannot parse input data: '2013-08-11 00:00:00.071149.59000000008'
at revenue_prac_col_order.__loadFromFields(revenue_prac_col_order.java:267)

Solution: Hive stores its data with the non-printable \001 (Ctrl+A) character as the default field delimiter, which is why the two fields appear glued together in the error above. Tell Sqoop how to parse the input:

--input-fields-terminated-by '\001'
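
Putting it together, the corrected export (same connection, table, and paths as above) would look like:

sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_export \
--username retail_dba \
--password cloudera \
--table revenue_prac_col_order \
--export-dir /apps/hive/warehouse/jvanchir_sqoop_prac.db/daily_revenue \
--columns order_date,revenue \
--input-fields-terminated-by '\001' \
-m 2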

Problem 8: When we try to export empty or null values from Hive/HDFS into a MySQL column that is declared NOT NULL, we get an error. First, the two table definitions:
mysql> desc revenue_prac_col_order;
+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| revenue     | float       | YES  |     | NULL    |       |
| order_date  | varchar(32) | YES  |     | NULL    |       |
| description | varchar(30) | NO   |     | NULL    |       |
+-------------+-------------+------+-----+---------+-------+

hive (jvanchir_sqoop_import)> desc daily_revenue;
O/P:
OK
order_date string
_c1 double

The Hive table has only 2 fields while the MySQL table has 3. The description column therefore receives no value during the export, and because it is declared NOT NULL the job fails:
org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164) Caused by: java.sql.SQLException:
Field 'description' does not have a default value at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1078)

Solution:
Either export a Hive table that supplies non-null values for all 3 columns,
or remove the NOT NULL constraint from the MySQL column when creating the table.
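
As a sketch of the second option (standard MySQL DDL against the table described above):

mysql> ALTER TABLE revenue_prac_col_order MODIFY description varchar(30) NULL;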

Problem 9: Exporting data multiple times from Hive/HDFS into a SQL table that has a primary-key constraint.

mysql> create table revenue_prac_update(order_date varchar(32) primary key, revenue float);
Execute the below Sqoop command twice to trigger the primary-key constraint error (duplicate insertion):

sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_export \
--username retail_dba \
--password cloudera \
--table revenue_prac_update \
--export-dir /apps/hive/warehouse/jvanchir_sqoop_prac.db/daily_revenue \
--input-fields-terminated-by '\001' \
-m 4

O/P:
ERROR [Thread-12] org.apache.sqoop.mapreduce.AsyncSqlOutputFormat: Got exception in update thread:
com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry '2013-07-29 00:00:00.0' for key 'PRIMARY'

Solution: Use the --update-key argument so that repeated exports update existing rows instead of inserting duplicates:

sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_export \
--username retail_dba \
--password cloudera \
--table revenue_prac_update \
--export-dir /apps/hive/warehouse/jvanchir_sqoop_prac.db/daily_revenue \
--input-fields-terminated-by '\001' \
-m 4 \
--update-key order_date \
--staging-table revenue_prac_update_stage \
--clear-staging-table
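
Note that Sqoop does not create the staging table for you; it must already exist in MySQL with the same structure as the target table, e.g.:

mysql> CREATE TABLE revenue_prac_update_stage LIKE revenue_prac_update;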

Problem 10: The Sqoop job below fails because of how the import keyword is aligned.

sqoop job --create neunJob \
--import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--warehouse-dir /user/jvanchir/sqoop_demo \
--delete-target-dir \
--validate

Solution: Put a space between -- and import. In a saved Sqoop job, the -- separator must stand alone before the tool name; attaching it (--import) makes Sqoop treat it as an unknown option.

sqoop job --create neunJob \
-- import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--warehouse-dir /user/jvanchir/sqoop_demo \
--delete-target-dir \
--validate
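
Once created, the saved job can be listed and executed with the standard sqoop job sub-commands:

sqoop job --list
sqoop job --exec neunJob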

Now that you have made it this far, you can confidently tackle the problems you would face as a beginner while ingesting data with Sqoop.

If you would like to, you can connect with me on LinkedIn: Jayvardhan Reddy.

If you enjoyed reading it, you can clap to let others know about it. If you would like me to add anything else, please feel free to leave a response 💬
