Big Data | Hadoop: 10 Common Mistakes in Sqoop
If your daily work involves data ingestion, here are a few issues you are likely to come across. These are things both beginners and experienced users need to take care of.
Problem 1: --split-by is not specified for a query import
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products p, categories c where p.product_category_id = c.category_id" \
--target-dir /user/jvanchir/sqoop_prac/
O/P:
18/09/14 13:46:17 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
When importing query results in parallel, you must specify --split-by.
Try --help for usage instructions.
Solution: Use --split-by or --autoreset-to-one-mapper
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products p, categories c where p.product_category_id = c.category_id AND \$CONDITIONS" \
--target-dir /user/jvanchir/sqoop_prac/ \
--split-by product_id
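Alternatively, if parallelism is not needed, the query can be imported serially with a single mapper, which removes the need for --split-by altogether; a minimal sketch of the same import:
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products p, categories c where p.product_category_id = c.category_id AND \$CONDITIONS" \
--target-dir /user/jvanchir/sqoop_prac/ \
-m 1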
Problem 2: '$CONDITIONS' is missing from the --query parameter
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products p, categories c where p.product_category_id = c.category_id" \
--target-dir /user/jvanchir/sqoop_prac/ \
--split-by product_id
O/P:
ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException:
Query [select * from products p, categories c where p.product_category_id = c.category_id] must contain '$CONDITIONS' in WHERE clause.
Solution: Add \$CONDITIONS to the WHERE clause. Sqoop replaces this placeholder with a different range predicate for each mapper so the result set can be partitioned.
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products p, categories c where p.product_category_id = c.category_id AND \$CONDITIONS" \
--target-dir /user/jvanchir/sqoop_prac/ \
--split-by product_id
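To see what the placeholder does: at run time each mapper substitutes its own split range for \$CONDITIONS. With --split-by product_id, the query a single mapper executes would look roughly like this (the boundary values here are illustrative, not taken from an actual run):
select * from products p, categories c where p.product_category_id = c.category_id AND ( product_id >= 1 ) AND ( product_id < 338 )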
Problem 3: A directory with the same name (or the same data) already exists in the HDFS path we are importing into
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products p, categories c where p.product_category_id = c.category_id AND \$CONDITIONS" \
--target-dir /user/jvanchir/sqoop_prac/ \
--split-by product_id
O/P:
Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory hdfs://nn01.cloudera.com:8020/user/jvanchir/sqoop_prac already exists
Solution: Use --delete-target-dir to remove the existing directory before the import runs.
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products p, categories c where p.product_category_id = c.category_id AND \$CONDITIONS" \
--target-dir /user/jvanchir/sqoop_prac/ \
--split-by product_id \
--delete-target-dir
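If the files already in the directory should be kept rather than wiped, --append is an alternative to --delete-target-dir; a sketch of the same import that writes new files alongside the existing ones:
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products p, categories c where p.product_category_id = c.category_id AND \$CONDITIONS" \
--target-dir /user/jvanchir/sqoop_prac/ \
--split-by product_id \
--append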
Problem 4: No space between the table name and the line-continuation backslash
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table products\
--warehouse-dir /user/jvanchir/sqoop_prac/import_table_dir
O/P:
18/09/14 14:24:34 ERROR tool.BaseSqoopTool: Error parsing arguments for import:
18/09/14 14:24:34 ERROR tool.BaseSqoopTool: Unrecognized argument: /user/jvanchir/sqoop_prac/import_table_dir
Solution: Put a space between the table name and the backslash so the shell treats the next option as a separate argument:
--table products \
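For completeness, the corrected command in full:
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table products \
--warehouse-dir /user/jvanchir/sqoop_prac/import_table_dir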
Problem 5: Using --split-by on a non-primary-key, string-typed column
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table categories \
--split-by category_name \
--warehouse-dir /user/jvanchir/sqoop_prac/import_table_dir
O/P:
Caused by: Generating splits for a textual index column allowed only in case of
"-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter
Solution: Pass -Dorg.apache.sqoop.splitter.allow_text_splitter=true as the first argument after the tool name. -D options are generic Hadoop arguments and must appear before any Sqoop-specific options.
sqoop import \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table categories \
--split-by category_name \
--warehouse-dir /user/jvanchir/sqoop_prac/import_table_dir
Problem 6: --split-by cannot be used on multiple columns
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table products \
--warehouse-dir /user/jvanchir/sqoop_prac/import_table_dir/mul_split \
--split-by product_name,product_category_id
O/P:
Encountered IOException running import job: java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException:
Unknown column 'product_name,product_category_id' in 'field list'
Solution: Filter on multiple columns through a --query import instead. Note that --query requires --target-dir, and (as in Problem 1) either --split-by on a single column or a single mapper.
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select * from products where product_name = 'abc' and product_category_id >= '1010' and \$CONDITIONS" \
--target-dir /user/jvanchir/sqoop_prac/import_table_dir/mul_split/products \
-m 1
Problem 7: Parse errors during a Sqoop export, i.e. when exporting data from Hive/HDFS to a SQL table
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_export \
--username retail_dba \
--password cloudera \
--table revenue_prac_col_order \
--export-dir /apps/hive/warehouse/jvanchir_sqoop_prac.db/daily_revenue \
--columns order_date,revenue \
-m 2
O/P:
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.lang.RuntimeException: Cannot parse input data: '2013-08-11 00:00:00.071149.59000000008'
at revenue_prac_col_order.__loadFromFields(revenue_prac_col_order.java:267)
Solution: Hive's default field delimiter is the non-printing Ctrl-A character ('\001'), so Sqoop must be told how the input fields are terminated:
--input-fields-terminated-by '\001'
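Putting it together, the full export command with the delimiter added:
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_export \
--username retail_dba \
--password cloudera \
--table revenue_prac_col_order \
--export-dir /apps/hive/warehouse/jvanchir_sqoop_prac.db/daily_revenue \
--columns order_date,revenue \
--input-fields-terminated-by '\001' \
-m 2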
Problem 8: Exporting from Hive/HDFS into a MySQL NOT NULL column when the input contains empty or null data
mysql> desc revenue_prac_col_order;
+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| revenue     | float       | YES  |     | NULL    |       |
| order_date  | varchar(32) | YES  |     | NULL    |       |
| description | varchar(30) | NO   |     | NULL    |       |
+-------------+-------------+------+-----+---------+-------+
hive (jvanchir_sqoop_import)> desc daily_revenue;
O/P:
OK
order_date string
_c1 double
The Hive table has only two columns while the MySQL table has three. During the export the description column receives no value, and because it is declared NOT NULL with no usable default, MySQL rejects the rows:
org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.sql.SQLException: Field 'description' does not have a default value
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1078)
Solution:
Either export a Hive table that supplies non-null values for all three columns,
or drop the NOT NULL constraint on the MySQL column (see the sketch below for relaxing it in place).
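For the second option, the constraint can also be relaxed on the existing table (a sketch against the table above; MySQL's MODIFY clause requires restating the column type):
mysql> ALTER TABLE revenue_prac_col_order MODIFY description varchar(30) NULL;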
Problem 9: Exporting the same data multiple times from Hive/HDFS into a SQL table that has a primary key constraint
mysql> create table revenue_prac_update(order_date varchar(32) primary key, revenue float);
Execute the sqoop command below twice to hit the primary key constraint error (duplicate insertion):
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_export \
--username retail_dba \
--password cloudera \
--table revenue_prac_update \
--export-dir /apps/hive/warehouse/jvanchir_sqoop_prac.db/daily_revenue \
--input-fields-terminated-by '\001' \
-m 4
O/P:
ERROR [Thread-12] org.apache.sqoop.mapreduce.AsyncSqlOutputFormat: Got exception in update thread:
com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry '2013-07-29 00:00:00.0' for key 'PRIMARY'
Solution: Use the --update-key option so a repeated export updates existing rows instead of inserting duplicates.
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_export \
--username retail_dba \
--password cloudera \
--table revenue_prac_update \
--export-dir /apps/hive/warehouse/jvanchir_sqoop_prac.db/daily_revenue \
--input-fields-terminated-by '\001' \
-m 4 \
--update-key order_date
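A note on the staging options (--staging-table, --clear-staging-table) that are sometimes tacked onto this command: per the Sqoop documentation they cannot be combined with --update-key, and staging protects against partially-failed exports rather than duplicate keys. If staging is wanted instead of updates, the auxiliary table must already exist with the same schema, e.g.:
mysql> CREATE TABLE revenue_prac_update_stage LIKE revenue_prac_update;
after which the export takes --staging-table revenue_prac_update_stage --clear-staging-table in place of --update-key.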
Problem 10: The sqoop job below fails because of a spacing issue in its arguments.
sqoop job --create neunJob \
--import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--warehouse-dir /user/jvanchir/sqoop_demo \
--delete-target-dir \
--validate
Solution: Put a space between -- and import. In the sqoop job syntax, the bare -- separates the job definition from the tool arguments, and import is the tool name rather than an option.
sqoop job --create neunJob \
-- import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--warehouse-dir /user/jvanchir/sqoop_demo \
--delete-target-dir \
--validate
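Once created, the saved job can be listed and executed with the standard sqoop job subcommands:
sqoop job --list
sqoop job --exec neunJob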
Now that you have made it this far, you should be able to tackle the problems you will face as a beginner while ingesting data with Sqoop.
If you would like to, you can connect with me on LinkedIn: Jayvardhan Reddy.
If you enjoyed reading it, you can click the clap and let others know about it. If you would like me to add anything else, please feel free to leave a response 💬