
Data Engineer Resume

SUMMARY

  • Over 5 years of professional IT experience, including hands-on experience in Big Data analytics and development.
  • Expertise with tools in the Hadoop ecosystem, including Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Scala, Kafka, YARN, Oozie and ZooKeeper, applied to complex business problems.
  • Advanced T-SQL knowledge, including stored procedures and functions.
  • Experience in developing, designing, testing and maintaining object-oriented applications in Python, Scala and Java.
  • Experience in writing complex Hive queries that work with different file formats such as CSV, SequenceFile, XML, ORC, Parquet and Avro.
  • Performed different optimization techniques such as distributed cache, map-side joins, partitioning and bucketing in Hive.
  • Executed Hive queries on Parquet tables stored in Hive to perform data analysis and meet business requirements.
  • Strong knowledge of Pig and Hive analytical functions; extended Hive and Pig core functionality by writing custom UDFs and macros.
  • Used Python modules such as requests, urllib and urllib2 for web crawling.
  • Experienced in connecting to SQL databases using PySpark and Python.
  • Developed ETL processes to load data from multiple sources into HDFS using Kafka and Sqoop, performed structural modifications using MapReduce and Hive, and analyzed data using visualization/reporting tools.
  • Developed Spark code using Scala and Spark SQL/Streaming for faster testing and data processing (a minimal streaming sketch follows this list).
  • Wrote AWS Lambda functions in Python that invoke scripts to perform various transformations and analytics on large data sets in EMR clusters.
  • Hands-on expertise in running Spark and Spark SQL on Amazon Elastic MapReduce (EMR).
  • Followed an Agile SDLC during project development.
  • Integrated Apache Storm with Kafka to perform web analytics; loaded clickstream data from Kafka into HDFS, HBase and Hive through Storm.
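
As an illustration of the Kafka-to-HDFS streaming work above, the snippet below is a minimal PySpark Structured Streaming sketch, not code from any of the projects listed here. The broker address, topic name and output paths are placeholders, and it assumes the spark-sql-kafka connector is available on the Spark classpath.

    # Minimal sketch: stream clickstream events from a Kafka topic into HDFS as Parquet.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")    # assumed broker
              .option("subscribe", "clickstream")                    # assumed topic
              .load()
              .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp"))

    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/clickstream/")            # assumed output path
             .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
             .start())
    query.awaitTermination()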

TECHNICAL SKILLS

Big Data Technologies: Apache Spark, PySpark, Apache Hadoop, MapReduce, Apache Hive, Apache Pig, Apache Sqoop, Apache Flume, Apache Oozie, Hue, Apache ZooKeeper, HDFS, Amazon S3, EC2, EMR.

Programming Languages: Scala, Python, Java, SQL

IDE: IntelliJ, Eclipse, Jupyter Notebook, Visual Studio 2005/2008/2012/2015/2017

Databases: Oracle 11g/12C, MySQL, MS-SQL Server, SSDT, SSIS, Teradata

Hadoop Distributions: Cloudera, Hortonworks

Operating Systems: macOS, Windows 7/10, Linux (CentOS, Red Hat, Ubuntu).

Methodologies: Agile, UML, Waterfall, CI/CD

Development Tools: IntelliJ, Maven, Scala Test, GitHub, Jenkins

PROFESSIONAL EXPERIENCE

Data Engineer

Confidential

Responsibilities:

  • Proficient in programming with Spark Resilient Distributed Datasets (RDDs).
  • Developed Spark programs using the Python API to compare the performance of Spark with Hive and SQL.
  • Tuned and debugged running Spark applications.
  • Experienced in connecting to SQL databases using PySpark and Python.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Provided data extraction and modeling support from Hive, PySpark and Teradata using complex SQL queries.
  • Fine-tuned SQL queries across big data platforms: Hive, PySpark and Teradata.
  • Implemented Spark jobs using Python and Spark SQL for faster testing and data processing.
  • Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning and bucketing.
  • Integrated Kafka with Spark for real-time data processing.
  • Imported data using Sqoop from DB2 and SQL Server into Hive; tuned performance and designed and executed complex Hive HQL queries.
  • Developed user-defined functions in Scala to support Spark transformations.
  • Loaded and presented data sets from multiple sources and in multiple formats, including JSON, text files and log data.
  • Developed ingestion scripts to load data from SQL Server, DB2 and various flat-file formats into Hive using PySpark.
  • Built ETL processes using Python scripts with PySpark RDDs (a minimal sketch follows this list).
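
A hedged sketch of the RDD-based ETL pattern mentioned in the last bullet: parse delimited records, drop malformed rows, and land the cleaned data as a Hive-queryable table. The input path, delimiter, column names and table name are illustrative assumptions, not details from the project.

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.appName("rdd-etl").enableHiveSupport().getOrCreate()
    sc = spark.sparkContext

    raw = sc.textFile("hdfs:///landing/orders/")           # assumed input location

    def parse(line):
        parts = line.split("|")                            # assumed pipe-delimited layout
        if len(parts) != 3:
            return None                                    # drop malformed records
        order_id, customer, amount = parts
        try:
            return Row(order_id=order_id, customer=customer, amount=float(amount))
        except ValueError:
            return None

    cleaned = raw.map(parse).filter(lambda r: r is not None)

    # Promote the RDD to a DataFrame so the result is queryable from Hive.
    spark.createDataFrame(cleaned).write.mode("overwrite").saveAsTable("orders_clean")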

Environment: Cloudera, HDFS, MapReduce, Scala, Python, Spark, Hive, Pig, Sqoop, Shell Scripting, MySQL, SQL Server, Tableau, HBase

Confidential

Data Engineer (Co-op)

Responsibilities:

  • Handled importing of data from various data sources and performed transformations using Hive and MapReduce.
  • Loaded data into HDFS and extracted the data from MySQL into HDFS using Sqoop.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Managed and scheduled jobs on the Hadoop cluster.
  • Developed simple to complex MapReduce jobs using Hive.
  • Analyzed the data by running Hive queries and Pig scripts to study user behavior.
  • Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms.
  • Implemented partitioning and bucketing in Hive (see the sketch after this list).
  • Mentored the analyst and test teams in writing Hive queries.
  • Managed and reviewed Hadoop log files.
  • Extensively used Pig for data cleansing.
  • Configured Flume to extract the data from the web server output files to load into HDFS.
  • Developed Pig UDFs to pre-process the data for analysis.
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
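
Purely as an illustration of the Hive partitioning and bucketing mentioned above, the sketch below creates a partitioned, bucketed table and loads it with a dynamic-partition insert. It runs the DDL over HiveServer2 with the PyHive client, which is an assumption; the host, database, table and column names are placeholders, and the same statements can be run from the hive or beeline CLI.

    from pyhive import hive

    conn = hive.Connection(host="hiveserver2.example.com", port=10000)   # assumed host
    cur = conn.cursor()

    # Partitioned by date, bucketed by user for more efficient joins and sampling.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS web.page_views (
            user_id STRING,
            url     STRING,
            ts      TIMESTAMP
        )
        PARTITIONED BY (view_date STRING)
        CLUSTERED BY (user_id) INTO 16 BUCKETS
        STORED AS ORC
    """)

    # Dynamic-partition insert from an assumed raw staging table.
    cur.execute("SET hive.exec.dynamic.partition=true")
    cur.execute("SET hive.exec.dynamic.partition.mode=nonstrict")
    cur.execute("SET hive.enforce.bucketing=true")   # needed on older Hive releases
    cur.execute("""
        INSERT OVERWRITE TABLE web.page_views PARTITION (view_date)
        SELECT user_id, url, ts, to_date(ts) AS view_date
        FROM web.page_views_raw
    """)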

Environment: Cloudera Hadoop, MapReduce, HDFS, Hive, Java (JDK 1.7), Pig, Linux, XML, HBase, ZooKeeper, Sqoop, Amazon Web Services (AWS).

Confidential

Data Engineer Intern

Responsibilities:

  • Worked on the Hortonworks Data Platform (HDP 2.4) Hadoop distribution, using Hive to store, query and retrieve data.
  • Involved in data ingestion into HDFS using Sqoop for full loads and Kafka for incremental loads from a variety of sources such as web servers, RDBMS and data APIs.
  • Implemented Spark RDD transformations and actions for business analysis, and worked with accumulators and broadcast variables (see the sketch after this list). Wrote Spark SQL queries using PySpark.
  • Used Spark Streaming through the core Spark API, running Scala and Python scripts to transform raw data from several data sources into baseline data.
  • Developed MapReduce/EMR jobs to analyze the data and produce heuristics and reports; the heuristics were used to improve campaign targeting and efficiency.
  • Performed real-time analysis of incoming data using the Kafka consumer API, Kafka topics and Spark Streaming in Scala.
  • Created Hive external tables and views on the data imported into HDFS, and developed and implemented Hive scripts for transformations such as evaluation, filtering and aggregation.
  • Worked on partitioning and bucketing of Hive tables, running the scripts in parallel to reduce their run time.
  • Developed user-defined functions (UDFs) in Python where required for Hive queries.
  • Worked with data in multiple file formats, including Parquet, SequenceFile, ORC, text (delimited)/CSV and JSON.
  • Created end-to-end data pipeline orchestration using Oozie.
  • Developed Bash scripts to automate the above extraction, transformation and loading process.
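
A minimal sketch of the broadcast-variable and accumulator pattern referenced above: enrich events with a small lookup table shipped once to every executor, while counting records that miss the lookup. The data, field layout and counter semantics are illustrative assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("broadcast-accumulator-demo").getOrCreate()
    sc = spark.sparkContext

    # Small dimension table broadcast once instead of being shuffled in a join.
    country_by_code = sc.broadcast({"US": "United States", "IN": "India", "DE": "Germany"})
    unknown_codes = sc.accumulator(0)    # counts events whose code is not in the lookup

    events = sc.parallelize([("u1", "US"), ("u2", "IN"), ("u3", "XX")])   # toy input

    def enrich(event):
        user, code = event
        name = country_by_code.value.get(code)
        if name is None:
            unknown_codes.add(1)
            name = "unknown"
        return (user, name)

    enriched = events.map(enrich).collect()   # the action forces the accumulator to update
    print(enriched)
    print("events with unknown country codes:", unknown_codes.value)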

Environment: Hadoop 2.2, AWS, MapReduce, Hive, Python, PySpark, Avro, Kafka, Storm, Linux, Sqoop, Shell Scripting, SQL Server, Oozie, Cassandra, Git, XML, Scala, Java, Maven, Eclipse, Hortonworks.

Confidential

Big Data Developer

Responsibilities:

  • Installed and configured Hadoop, Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Installed and configured Pig for ETL jobs.
  • Troubleshot the cluster by reviewing Hadoop log files.
  • Imported data from Teradata using Sqoop with the Teradata connector.
  • Used Oozie to orchestrate the work flow.
  • Created Hive tables and worked on them for data analysis to meet business requirements.
  • Designed and implemented a MapReduce-based, large-scale parallel relation-learning system.
  • Installed and benchmarked Hadoop/HBase clusters for internal use.
  • Experience with data modeling concepts: star-schema dimensional modeling and relational (ER) design.
  • Created tables and stored procedures in SQL for data manipulation and retrieval, and performed database modifications using SQL, PL/SQL, stored procedures, triggers and views in Oracle 9i.
  • Used Python modules such as requests, urllib and urllib2 for web crawling.
  • Used other packages such as Beautiful Soup for data parsing, and wrote to and read from CSV and Excel file formats (see the sketch after this list).
  • Worked on the application's resulting reports and Tableau reports.
  • Utilized PyUnit, the Python unit-test framework, for all Python applications.
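
An illustrative crawl-and-parse sketch in the spirit of the bullets above: fetch a page with requests, extract links with Beautiful Soup, and write them to CSV. The URL is a placeholder, and the Python 2 urllib/urllib2 usage mentioned in the resume is not shown.

    import csv
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com", timeout=10)   # placeholder URL
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]

    # Persist the extracted links for downstream reporting.
    with open("links.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["text", "href"])
        writer.writerows(links)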

Environment: Hadoop, MapReduce, HDFS, Hive, Java, Cloudera, Pig, HBase, Linux, XML, MySQL Workbench, Python, Java 6, Eclipse, Oracle 10g, PL/SQL, SQL*Plus.

Confidential

SQL Developer intern

Responsibilities:

  • Worked as an SQL Developer, applying analytical techniques to write and optimize SQL queries in SQL Server 2008.
  • Used a ColdFusion front end to develop multiple web applications and automation tools.
  • Coordinated automation change testing roll-outs
  • Provided recommendations based on technical issue research
  • Participated in software evaluation, business requirements analysis and application integration.
  • Knowledge of and work experience with RDBMS concepts; produced tables, reports, graphs and listings using various procedures and handled large databases to perform data manipulation.
  • Analyzed components of end-to-end designs for web interfaces
  • Advised on appropriate reporting applications and analysis architecture
  • Collaborated with team members to optimize viable solutions
  • Documented user security-related constraints and functionality

Environment: SQL Server 2008, Visual Basic, Windows 2008, Excel.
