
Jr. Data Engineer Resume


Santa Monica, CA

SUMMARY

  • IT professional with over 5 years of experience in software development using the Big Data/Hadoop ecosystem, Apache Spark, Python, SQL and ETL technologies.
  • Hands-on experience with Hortonworks and Cloudera Hadoop environments; knowledge of Hadoop cluster architecture and cluster monitoring.
  • Hands-on experience in developing Spark jobs using Spark Core, Spark SQL and Spark Streaming, including integrating Spark Streaming with Kafka.
  • Hands-on experience with Amazon Web Services (AWS) such as EMR, S3 and EC2.
  • Hands on experience in importing and exporting data using Sqoop from Relational Database Systems like Oracle, MySQL, Teradata to HDFS and vice versa.
  • Strong work experience across the data mart life cycle; performed ETL to load data from sources such as SQL Server, Oracle, DB2, XML files and flat files into data marts and the data warehouse using Informatica PowerCenter (Repository Manager, Designer, Server Manager, Workflow Manager and Workflow Monitor).
  • Good Understanding of the Hadoop Distributed File System and Hadoop Ecosystem.
  • Strong understanding of NoSQL databases such as HBase and Cassandra; experience in integrating Hive with HBase.
  • Developed scalable and reliable data solutions to move data across systems from multiple sources in real time as well as batch modes.
  • Hands on experience in different phases of big data applications like data ingestion, data analytics and data visualization and building data lake based data marts for supporting data science and Machine Learning.
  • Proficient in writing stored procedures, complex SQL queries, packages, functions and database triggers, and in tuning SQL for performance; strong data analysis skills using Python, Hive, Apache Spark, MS Excel and Access (see the PySpark sketch after this list).
  • Experience in designing and handling of various Data Ingestion patterns (Batch and Near Real Time) using Sqoop, Distcp, Apache Storm, Flume and Apache Kafka.
  • Experience in designing and handling of various Data Transformation/Filtration patterns using Pig, Hive, and Python.
  • Strong knowledge of Hadoop architecture and components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, Secondary NameNode and MapReduce programming.
  • Experience handling different file formats such as Parquet, ORC, Avro, JSON, SequenceFile and flat text files.
  • Experience working with cluster monitoring tools like Ambari and Cloudera Manager. Experience in setting up BDR jobs to copy data from one cluster to another using Cloudera Manager.
  • Experience in Importing and Exporting data using Sqoop from Oracle/Mainframe DB2 to HDFS and Data Lake.
  • Good understanding of end-to-end system processes and their interfaces and dependencies with other processes, application design documents, functionality, data flow and technical aspects.
  • Ability to multi-task across applications, take ownership, and coordinate tasks from implementation through test and production.
  • Hands on experience in DevOps tools like Maven, Git/GitHub and Jenkins.
  • Good interpersonal and communication skills, strong problem-solving skills, ability to adopt new technologies with ease, and a good team player.
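
A minimal PySpark sketch of the Hive-to-Parquet aggregation pattern referenced in the data analysis bullet above; the database, table and column names (sales_db.orders, order_date, amount, customer_id) and the output path are hypothetical placeholders, not actual project objects.

    # Illustrative only: table, column and path names below are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("daily-order-aggregates")
             .enableHiveSupport()          # lets Spark read managed Hive tables directly
             .getOrCreate())

    orders = spark.table("sales_db.orders")   # hypothetical Hive source table

    # Daily aggregates built with the DataFrame API / Spark SQL functions
    daily = (orders
             .groupBy("order_date")
             .agg(F.sum("amount").alias("total_amount"),
                  F.countDistinct("customer_id").alias("distinct_customers")))

    # Write the aggregates back to the data lake as Parquet, partitioned by date
    (daily.write
          .mode("overwrite")
          .partitionBy("order_date")
          .parquet("/data/lake/daily_order_aggs"))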

TECHNICAL SKILLS

Big Data/Hadoop Technologies: Hadoop (Hortonworks, Cloudera): HDFS, MapReduce, Pig, Hive, HBase, Zookeeper, Oozie, Sqoop, Flume, Impala, Spark, Kafka, Cassandra

Cloud Services: Amazon Web Services

Languages: Python, SQL, PL/SQL, Pig Latin, HiveQL

Databases: Oracle, MySQL, MS SQL Server

Methodologies: Agile/Scrum

PROFESSIONAL EXPERIENCE

Confidential - Dallas, TX

Big Data Engineer

Responsibilities:

  • Handled JSON datasets and wrote custom Python functions to parse JSON data using Spark; monitored Spark applications through the Spark UI to identify executor failures, data skew and other runtime issues.
  • Designed and architected Spark programs in Python to compare Spark performance against Hive and SQL; developed Python scripts using RDDs and DataFrames/Spark SQL/Datasets for data aggregation, querying and writing data.
  • Identified data domains, analyzed complex ETLs, performed data modeling and transformed them to Hadoop; created a new PySpark application as an ETL tool and ingested data into the corporate data lake.
  • Developed Sqoop jobs and Hive scripts to import 15 TB of rolling data from Oracle 10g/11g into Hive, and reverse-engineered Hive fact tables to create a raw dataset of 850 columns.
  • Created partitions within Hive tables and transformed data from legacy tables to HDFS and Hive.
  • Worked on migrating Pig scripts to the Spark DataFrames API and Spark SQL to improve performance.
  • Developed Big Data Solutions that enabled the business and technology teams to make data-driven decisions on the best ways to acquire customers and provide them business solutions.
  • Exported the business-required information to RDBMS using Sqoop to make the data available for BI team to generate reports based on data.
  • Migrated the existing data to Hadoop from RDBMS (SQL Server and Oracle) using Sqoop for processing the data.
  • Developed Spark programs for batch processing and Spark Streaming applications for real-time processing.
  • Implemented custom Hive UDFs to achieve comprehensive data analysis.
  • Worked on providing a solution for compacting small files in Hadoop and running stats on Hive tables as a part of maintaining the overall stability of the Hadoop cluster and Impala.
  • Hands on experience in setting up Autosys jobs which trigger Oozie processes.
  • Worked with Sqoop for importing and exporting the calibration data that will be used to run the models from Oracle/Netezza to Hadoop.
  • Developed Kafka producer and consumer programs for message handling, and Spark programs to parse logs and structure them in tabular format for effective querying of the log data.
  • Developed custom Spark/Kafka streaming applications to ingest data from various real-time sources into Hive tables (see the streaming sketch after this list).
  • Worked on resolving the issues for the application team that occur in Hadoop Production Environment.
  • Worked with offshore teams, communicating daily status on issues and roadblocks.
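
A hedged, minimal sketch of the Kafka log-ingestion pattern described above, written here with Spark Structured Streaming; the broker address, topic name, log schema and output/checkpoint paths are placeholders rather than actual project values, and the job assumes the spark-sql-kafka connector is available on the cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("log-stream-ingest").getOrCreate()

    # Hypothetical log schema used to structure the raw JSON messages
    log_schema = StructType([
        StructField("host", StringType()),
        StructField("level", StringType()),
        StructField("message", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
           .option("subscribe", "app-logs")                     # placeholder topic
           .load())

    # Kafka delivers the payload as bytes; cast to string and parse the JSON body
    parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
                 .select(F.from_json("json", log_schema).alias("log"))
                 .select("log.*"))

    # Land the structured records as Parquet files backing a Hive external table
    query = (parsed.writeStream
             .format("parquet")
             .option("path", "/data/lake/app_logs")              # placeholder path
             .option("checkpointLocation", "/checkpoints/app_logs")
             .outputMode("append")
             .start())
    query.awaitTermination()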

Environment: Hadoop, Cloudera, RedHat Linux, HDFS, Python, Spark, Spark SQL, Parquet, Avro, Hive, Sqoop, Oracle, Oozie, SVN, Shell, Autosys, JIRA

Confidential - Las Vegas, NV

Hadoop/Spark Developer

Responsibilities:

  • Involved in the complete SDLC of a big data project, including requirement analysis, design, coding, testing and production.
  • Used Sqoop to import/export data between RDBMS and Hive tables, performed incremental imports and created Sqoop jobs based on the last saved value.
  • Worked with Spark DataFrames to ingest data from flat files into RDDs and transform both structured and unstructured data (see the ingestion sketch after this list).
  • Created a SparkSQL context to load data from Hive tables into RDDs for complex queries and analytics on data in the data lake.
  • Used Spark transformations for data wrangling and ingesting the real-time data of various file formats.
  • Involved in implementing the data preparation solution, responsible for data transformation as well as handling user stories.
  • Developed and tested data ingestion/dispatch jobs.
  • Developed Oozie workflows to automate the entire data pipeline and scheduled them using the Tidal scheduler.
  • Wrote Pig Latin Scripts to perform transformations (ETL) as per the use case requirement.
  • Developed Spark code using Spark SQL for faster testing and data processing.
  • Imported millions of structured records from relational databases using Sqoop, processed them with Spark and stored the data in HDFS in CSV format.
  • Used Spark SQL to process large volumes of structured data.
  • Created dispatcher jobs using Sqoop export to dispatch the data into Teradata target tables.
  • Involved in indexing files using Solr to remove duplicates in Type 1 insert jobs.
  • Created HQL scripts to perform the data validation once transformations are done as per the use case.
  • Closely collaborated with both the onsite and offshore teams.
  • Wrote shell scripts to automate the process by scheduling and calling the scripts from the scheduler.
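
A minimal sketch, under assumed inputs, of the flat-file ingestion and validation flow outlined above; the landing path, column names and target table (curated.transactions) are illustrative placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("flat-file-ingest")
             .enableHiveSupport()
             .getOrCreate())

    # Read delimited flat files from a hypothetical landing zone
    txns = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("/landing/transactions/*.csv"))

    # Basic cleansing before the data reaches the curated zone
    clean = txns.dropDuplicates(["txn_id"]).na.drop(subset=["txn_id", "txn_date"])

    # Register a temp view so validation checks can be written as HQL-style SQL
    clean.createOrReplaceTempView("txns_staged")
    spark.sql("""
        SELECT txn_date, COUNT(*) AS txn_count
        FROM txns_staged
        GROUP BY txn_date
    """).show()

    # Persist to the data lake as a Parquet-backed Hive table
    clean.write.mode("append").format("parquet").saveAsTable("curated.transactions")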

Environment: Hadoop, Ambari, CentOS Linux, HDFS, Parquet, Pig, Hive, Sqoop, Oracle, Oozie, Cassandra, GitHub

Confidential - Charlotte, NC

Hadoop Developer

Responsibilities:

  • Developed Pig Latin scripts to analyze large data sets in areas where extensive hand-written code needed to be reduced.
  • Used Sqoop tool to extract data from a relational database into Hadoop.
  • Involved in performance enhancements of the code and optimization by writing custom comparators and combiner logic.
  • Developed multiple Kafka Producers and Consumers as per the software requirement specifications.
  • Worked closely with data warehouse architect and business intelligence analyst to develop solutions.
  • Involved in creating Hive tables, loading them with data and writing Hive queries, which invoke MapReduce jobs in the backend.
  • Loaded data from AWS S3 buckets into Spark RDDs for processing and analysis (see the sketch after this list).
  • Supported in setting up QA environment and updating configurations for implementing scripts with Pig, Hive and Sqoop.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark SQL and Python.
  • Used Spark API over Hadoop YARN to perform analytics on data in Hive and Spark SQL.
  • Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
  • Developed Spark Streaming jobs to process incoming streams of data from Kafka sources.
  • Extensive knowledge of PySpark functions, transformations and actions on RDDs.
  • Good understanding of DataFrames and Data Source APIs for working with Spark SQL.
  • Performed unit testing of the developed shell, Pig and HQL code in the DEV environment.
  • Worked closely with the application support team to deploy the developed jobs into production.
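
An illustrative sketch of the S3-to-Spark analysis pattern mentioned above; the bucket, prefix and field names are made-up examples, and reading s3a:// paths assumes the hadoop-aws connector and credentials are configured on the cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("s3-clickstream-analysis").getOrCreate()

    # Hypothetical clickstream events stored as JSON in an S3 bucket
    events = spark.read.json("s3a://example-bucket/clickstream/2020/01/*.json")

    # Equivalent of a HiveQL GROUP BY, expressed with the DataFrame API on YARN
    top_pages = (events
                 .filter(F.col("event_type") == "page_view")
                 .groupBy("page_url")
                 .count()
                 .orderBy(F.col("count").desc())
                 .limit(20))

    top_pages.show(truncate=False)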

Environment: Hadoop, Ambari, CentOS Linux, HDFS, MapReduce, Kafka, Spark, Spark Streaming, Spark SQL, Parquet, Pig, Hive, Sqoop, Oracle, Oozie, Cassandra, GitHub

Confidential - Santa Monica, CA

Jr. Data Engineer

Responsibilities:

  • Worked with business analysts to help represent the business domain details and prepared documentation.
  • Helped the Business Insights team with statistical predictions, business intelligence and data science efforts
  • Teamed up with architects to design a Spark model to replace the existing MapReduce model.
  • Created Hive/Impala tables and created Sqoop jobs to import the data from Oracle to HDFS and scheduled them in Autosys to run every day
  • Developed scripts and UDFs using both Spark SQL and Spark Core in Python for aggregation operations (see the sketch after this list).
  • Implemented Spark RDD/DataFrame transformations and actions for business analysis, and worked with Spark accumulators and broadcast variables.
  • Analyzed the SQL scripts and designed the solution to implement them using PySpark.
  • Efficiently joined raw data with the reference data using Pig scripting.
  • Used various file formats like Parquet, Avro, ORC and compression techniques like Snappy, LZO and GZip for efficient management of cluster resources.
  • Developed Oozie workflows to automate the jobs and used SVN for version control on the project.
  • Used Cloudera Manager to monitor the health of jobs running on the cluster.
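
A small, illustrative sketch of the Python UDF and broadcast-variable usage noted above; the region lookup, codes and column names are invented for the example.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-broadcast-demo").getOrCreate()

    # Small reference mapping shipped to every executor as a broadcast variable
    region_lookup = spark.sparkContext.broadcast({"01": "West", "02": "East", "03": "Central"})

    @F.udf(returnType=StringType())
    def region_name(region_code):
        # Resolve a code against the broadcast lookup; unknown codes fall back to "Other"
        return region_lookup.value.get(region_code, "Other")

    # Toy input frame standing in for the Sqoop-imported Hive data
    sales = spark.createDataFrame(
        [("01", 120.0), ("02", 75.5), ("09", 10.0)],
        ["region_code", "amount"],
    )

    # Aggregation using the UDF-derived column
    (sales.withColumn("region", region_name("region_code"))
          .groupBy("region")
          .agg(F.sum("amount").alias("total_amount"))
          .show())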

Environment: Hadoop, Hive, Sqoop, Spark, Oozie, Impala, SQL, Tableau, Python and Autosys

Confidential

Software Engineer

Responsibilities:

  • Involved in the complete Software Development Life Cycle of project starting from requirements gathering phase to deployment and maintenance.
  • Prepared technical specifications for the development of Informatica (ETL) mappings to load data into various target tables and defining ETL standards.
  • Loaded data from different sources such as relational tables and flat files into the target data warehouse.
  • Worked with various Informatica client tools like Source Analyzer, Mapping designer, Mapplet Designer, Transformation Developer, Informatica Repository Manager and Workflow Manager.
  • Developed standard and re-usable Mappings and Mapplets using various transformations like expression, aggregator, joiner, source qualifier, router, lookup, filter.
  • Involved in performance tuning and optimization of Informatica mappings and sessions using features like partitions and data/index cache to manage very large volume of data.
  • Used Informatica debugging techniques to debug the mappings and used session log files and bad files to trace errors occurred while loading.
  • Created, tested and debugged stored procedures, functions and packages using PL/SQL Developer.
  • Involved in Unit testing, System testing to check whether the data loads into target are accurate, which was extracted from different source systems according to the user requirements.
  • Worked heavily on several complex transformations in Informatica like HTTP, Web Service Consumer, and Java.
  • Used the dynamic parameter file concept as an enterprise best practice to write once and use across systems/applications.
  • Created & modified database objects like tables, views, procedures, functions, triggers, packages, indexes, synonyms, materialized views using Oracle tools like TOAD and SQL Navigator.
  • Updated procedures, functions, triggers and packages based on the change request from users.
  • Support activities like Job monitoring, enhancements and resolving defects.
  • Worked with testing teams, performed UAT with business users, and worked with the release team for staging and production moves.
  • Implemented an efficient error-handling process by capturing errors into user-managed tables.
  • Pair-programmed with developers to enhance existing PL/SQL packages, fix production issues, build new functionality and improve processing time through code optimization.
