Big Data Engineer Resume
Irving, TX
SUMMARY
- Over 4 years of experience in the IT industry with a strong emphasis on Object Oriented Analysis; ETL design, development, implementation, testing, and deployment of data warehouses; and Big Data processing covering ingestion, storage, querying, and analysis.
- As a Big Data Developer, contributed to high-quality technology solutions that address business needs by developing data applications for the customer's business lines.
- Developed PIG scripts to transform the raw data into intelligent data as specified by business users.
- Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive (a minimal PySpark sketch follows this summary).
- Developed solutions to process data into HDFS, analyzed the data using MapReduce, Pig, and Hive, and delivered summary results from Hadoop to downstream systems.
- Expertise in Unix/Linux Shell Scripting.
- Responsible for SDLC technical writing, requirements gathering, and documentation.
- Developed shell, Perl, and Python scripts to automate and provide control flow to Pig scripts.
- Created Hive tables and worked on them using HiveQL.
- Knowledgeable in Kafka streaming for consuming and producing data in real time.
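The Spark-on-Hive analytics summarized above typically takes the following shape. This is a minimal sketch only: the database, table, and column names are placeholders, not details from the actual projects.

```python
# Minimal sketch of querying a Hive table with the Spark API on a YARN-managed cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-analytics-sketch")
    .enableHiveSupport()          # lets Spark read tables registered in the Hive metastore
    .getOrCreate()
)

# "sales.daily_orders" and its columns are assumed names for illustration.
daily_totals = spark.sql("""
    SELECT order_date, SUM(order_amount) AS total_amount
    FROM sales.daily_orders
    GROUP BY order_date
""")

daily_totals.show(10)
spark.stop()
```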
TECHNICAL SKILLS
- MapReduce
- HDFS
- Yarn
- Sqoop
- Flume
- Pig
- Hive
- Shell Scripting
- Unix
- Oozie
- Spark python
- Pyspark
- Pyspark SQL
- Kafka
- D-Series
- Hyperion
- Xenon
- HoneyComb
- Grafana
- Hue
- Service-now
- Jupyter Notebook
PROFESSIONAL EXPERIENCE
Confidential, Irving, TX
Big Data Engineer
Responsibilities:
- Maintain active involvement in all phases of SDLC including requirements collection, design and analysis of the customer specifications, as well as development and customization of the application using SOA.
- Interact with the Oracle database by writing SQL queries and stored procedures covering historical data, user security, and data scrubbing.
- Take charge of installing and configuring Spark, Scala, Hive, Pig, Sqoop, and Oozie on the Hadoop cluster, working with the Cloudera distribution of the Hadoop ecosystem.
- Worked on optimizing ETL jobs based on Spark in Scala to improve performance to meet demanding client requirements.
- Involved in converting Cassandra/Hive/SQL queries into Spark transformations using Spark RDD's.
- Scheduled and implemented the Spark workflows using D-series workflow scheduler for various jobs like GDPR Right of Access and Right of Portability and Anonymization.
- Worked with Kafka extensively for writing the streaming data to Kafka topics.
- Created Kafka producers for streaming real-time clickstream events from third-party REST services into our topics (a producer sketch follows this list).
- Wrote Python scripts for different utilities of the application.
- Developed Spark Streaming applications for consuming data from Kafka topics (a consumer sketch follows this list).
- Integrated Tableau with the Hadoop data source to build dashboards that provide various insights into the organization's sales.
- Worked on Spark in building BI reports using Tableau. Integrated Tableau with Spark using Shark.
- Gain in-depth knowledge in writing HDFS and Pig Latin commands.
- Utilize HIVE and IMPALA through the development of complex queries.
- Hold responsibility in designing, implementing and testing data pipelines using Hadoop, Hive and Pig.
- Perform extensive cryptographic operations to encrypt and decrypt data with the internal encryption framework, invoking web services in Java batch jobs and UDFs in shell scripts.
- Expertly create JUnit test cases for elements, web beans, handlers and view helper classes.
- Generate detailed JUnit tests for each piece of functionality before writing the functionality itself, following test-driven development.
- Carry out test data preparation for integration, functional and regression testing including dynamic data and conversion of time formats such as epoch, unix time format and ISO time formats.
- Establish the build pipeline for multiple projects with one-click provisioning, which builds and deploys the codebase using Jenkins and executes the dev code and regression tests.
- Lead the running of database jobs and develop various UNIX shell scripts for the data preparation and data comparison processes.
- Assume accountability for importing and exporting data using Sqoop between HDFS and relational/non-relational database systems.
- Proactively participate in production support to resolve the production job failures, while interacting with the operation support group for resuming the failed jobs.
- Prepare scripts for performing data analysis with PIG and HIVE, while using Maven scripts for creating and deploying .jar, .ear, and .war files.
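A hedged sketch of the Kafka producer work noted above: pulling clickstream events from a REST endpoint and writing them to a topic with the kafka-python client. The endpoint URL, broker list, topic name, and `session_id` field are assumptions for illustration only, not details from the project.

```python
import json
import requests
from kafka import KafkaProducer   # kafka-python client

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092", "broker2:9092"],           # placeholder brokers
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

response = requests.get("https://example.com/clickstream/events")  # placeholder REST URL
response.raise_for_status()

for event in response.json():
    # Key by session id (assumed field) so events for one session stay in one partition.
    producer.send(
        "clickstream-events",
        value=event,
        key=str(event.get("session_id", "")).encode("utf-8"),
    )

producer.flush()
producer.close()
```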
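And a sketch of the consuming side: a Spark Structured Streaming job that reads the same topic and lands parsed events in HDFS. Broker addresses, topic name, event schema, and output paths are assumed for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-consumer-sketch").getOrCreate()

# Assumed shape of the JSON click-stream events.
event_schema = StructType([
    StructField("session_id", StringType()),
    StructField("page", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "clickstream-events")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("event"))
    .select("event.*")
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "/data/clickstream/events")              # placeholder HDFS path
    .option("checkpointLocation", "/checkpoints/clickstream") # placeholder checkpoint dir
    .start()
)
query.awaitTermination()
```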
Environment: Java, J2EE, JavaScript, JDBC, Maven, WebSphere, HTML, CSS, Agile Methodology, Spring, Spring Batch, Spring Boot, Hibernate, Web Services, SOA, SOAP, RESTful, SQL, Linux, Hive, HDFS, MapReduce, Shell Scripting, Spark, Scala, Tableau, Oozie, D-Series, Splunk, SFTP, Connect:Direct, MySQL, Oracle, HQL, IntelliJ, Eclipse, Hue, H2 DB, JIRA, Jenkins, GIT, Log4J, JUnit.
Confidential, Minneapolis, MN
Big Data Engineer
Responsibilities:
- Attended Scrum meetings, Sprint planning, and grooming sessions to collaborate with the Project Manager, Product Owner, Business Analyst, and fellow developers to understand the business requirements to be implemented during project development.
- Involved in writing Spark applications using Scala to perform various data cleansing, validation, transformation, and summarization activities according to the requirement.
- Developed Python scripts to collect data from source systems and store it on HDFS.
- Created Spark scripts to perform data quality (DQ) and data integrity (DI) checks (a sketch follows this list).
- Extensively used Spark SQL parameters for performance tuning and to resolve memory and container issues.
- Also created and used Spark DataFrame objects to extract data for faster computation and reuse.
- Used cache and persist functions to speed up applications that reuse the same RDDs multiple times.
- Extensively worked on creating Hive external and internal tables, then applied HiveQL to transform, join, and aggregate the data.
- Created Hive schemas using performance techniques like partitioning and bucketing. Wrote extensive Hive queries to transform the data used by downstream models.
- Explored Spark and improved the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Developed Spark code using Scala and Spark SQL for batch processing of data. Utilized the in-memory processing capability of Apache Spark to process data with Spark SQL and Spark Streaming using PySpark and Scala scripts.
- Created Spark Scala scripts to load data from source files into RDDs, create DataFrames, perform transformations and aggregations, and collect the output of the process.
- Consumed high volumes of data from multiple sources (such as Hive, MySQL, HBase, and xls files) and performed transformations using Spark.
- Designed an ideal approach for data movement from different sources to HDFS via Apache Kafka.
- Performed optimization when dealing with large datasets using partitions, broadcasts in Spark, efficient joins, and transformations during the ingestion process (a tuning sketch follows this list).
- Implemented Partitioning, Dynamic Partitions and Bucketing in Hive for efficient data access.
- Worked on Sequence files, ORC files, bucketing, partitioning for Hive performance enhancement and storage improvement.
- Responsible for creating test plan, test cases, test data, test execution and reporting status ensuring accurate coverage of requirements and business processes.
- Used Oozie and the Automation Portal to create and schedule workflows and jobs on recurring schedules based on file and time dependencies.
- Scheduled Hadoop jobs such as MapReduce, Hive, and Spark jobs, and automated Sqoop jobs.
- Extensively used Drone and Vela CI/CD tools for automation of Build and Deploy process.
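A sketch of the data quality and data integrity checks mentioned above: row-count reconciliation between a staging extract and its landing table, plus null-rate checks on key columns. The table names, key columns, and the 1% threshold are assumptions, not project specifics.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder.appName("dq-checks-sketch").enableHiveSupport().getOrCreate()
)

source_count = spark.table("staging.orders_raw").count()   # placeholder staging table
landed = spark.table("warehouse.orders")                    # placeholder landing table
landed_count = landed.count()

# DI check: no rows lost between the staging zone and the landing zone.
assert landed_count == source_count, (
    f"Row-count mismatch: staging={source_count}, landed={landed_count}"
)

# DQ check: key columns must stay under an assumed 1% null rate.
for key_column in ["order_id", "customer_id"]:
    null_count = landed.filter(col(key_column).isNull()).count()
    if null_count / max(landed_count, 1) > 0.01:
        raise ValueError(f"{key_column} null rate above threshold: {null_count}")
```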
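The caching, broadcast, and partitioning work described above tends to follow the pattern sketched below: persist a DataFrame that is reused downstream, broadcast the small side of a join to avoid shuffling the large fact table, and write the result partitioned for Hive access. Table names, the join key, and paths are illustrative assumptions.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder.appName("spark-tuning-sketch").enableHiveSupport().getOrCreate()
)

transactions = spark.table("warehouse.transactions")   # large fact table (assumed)
stores = spark.table("warehouse.stores")                # small dimension table (assumed)

# Reused several times downstream, so keep it in memory, spilling to disk if needed.
transactions.persist(StorageLevel.MEMORY_AND_DISK)

# Broadcasting the small side avoids shuffling the large fact table for the join.
enriched = transactions.join(broadcast(stores), on="store_id", how="left")

(
    enriched.write
    .mode("overwrite")
    .partitionBy("transaction_date")                    # mirrors Hive partitioning
    .parquet("/data/warehouse/enriched_transactions")   # placeholder HDFS path
)
```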
Environment: Spark, Spark-SQL, Spark-Streaming, Python, Scala, Java, Sqoop, Kafka, HDFS, Hive, Spark-Shell, Jupyter, Oozie, Automation Portal, Domo, Service-now, Hyperion, Xenon, HoneyComb, Grafana, Hue, Eclipse, IntelliJ, Linux, MapReduce, HBase, Oracle 10g, Red Hat Linux.
Confidential
Hadoop/Big Data Developer
Responsibilities:
- Responsible for writing MapReduce jobs to perform operations like copying data on HDFS and defining job flows on EC2 servers, and for loading and transforming large sets of structured, semi-structured, and unstructured data.
- Developed a process to Sqoop data from multiple sources like SQL Server, Oracle and Teradata.
- Responsible for creation of mapping document from source fields to destination fields mapping.
- Developed a shell script to create staging, landing tables with the same schema as the source and generate the properties used by Oozie jobs.
- Developed Oozie workflows for executing Sqoop and Hive actions.
- Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Performed optimizations on Spark/Scala and diagnosed and resolved performance issues.
- Populated HDFS with huge amounts of data using Apache Kafka.
- Responsible for developing Python wrapper scripts that extract a specific date range using Sqoop by passing the custom properties required for the workflow (a wrapper sketch follows this list).
- Developed scripts to run Oozie workflows, capture the logs of all jobs that run on cluster and create a metadata table, which specifies the execution times of each job.
- Developed Hive scripts for performing transformation logic and loading the data from staging zone to final landing zone.
- Involved in loading transactional data into HDFS using Flume for Fraud Analytics.
- Developed Python utility to validate HDFS tables with source tables.
- Designed and developed UDFs to extend the functionality in both PIG and HIVE (a UDF sketch follows this list).
- Imported and exported data using Sqoop between MySQL and HDFS on a regular basis.
- Responsible for developing multiple Kafka Producers and Consumers from scratch as per the software requirement specifications.
- Automated all the jobs for pulling data from FTP server to load data into Hive tables using Oozie workflows.
- Involved in developing Spark code using Scala and Spark-SQL for faster testing and processing of data, and explored optimizing it using SparkContext, Spark-SQL, pair RDDs, and Spark on YARN.
- Migrated the needed data from Oracle and MySQL into HDFS using Sqoop and imported various formats of flat files into HDFS.
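A sketch of the kind of Python wrapper described above: it builds and runs a Sqoop import for a specific date range via the Sqoop command line. The JDBC URL, credentials file, table, date column, and target directory are placeholders for illustration only.

```python
import subprocess
import sys

def sqoop_import(start_date: str, end_date: str) -> None:
    """Run a Sqoop import limited to the given date range (dates as YYYY-MM-DD)."""
    command = [
        "sqoop", "import",
        "--connect", "jdbc:oracle:thin:@//db-host:1521/ORCL",   # placeholder JDBC URL
        "--username", "etl_user",                                # placeholder credentials
        "--password-file", "/user/etl/.sqoop_password",
        "--table", "ORDERS",                                     # placeholder source table
        "--where", f"ORDER_DATE >= DATE '{start_date}' AND ORDER_DATE < DATE '{end_date}'",
        "--target-dir", f"/data/staging/orders/{start_date}",    # placeholder HDFS dir
        "--num-mappers", "4",
    ]
    subprocess.run(command, check=True)   # fail the wrapper if the Sqoop job fails

if __name__ == "__main__":
    sqoop_import(sys.argv[1], sys.argv[2])   # e.g. 2018-01-01 2018-01-02
```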
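And a sketch of a Pig UDF written in Python (executed by Pig via Jython), in the spirit of the custom UDF work noted above. The field name and cleaning rule are assumptions; Pig injects the `outputSchema` decorator at registration time, so a no-op fallback is defined for standalone testing.

```python
# Fallback so the file can also be imported and unit-tested outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

@outputSchema("normalized_phone:chararray")
def normalize_phone(raw_phone):
    """Strip formatting characters and return the bare digits, or None."""
    if raw_phone is None:
        return None
    digits = "".join(ch for ch in raw_phone if ch.isdigit())
    return digits or None

# Registered and used from Pig Latin roughly as:
#   REGISTER 'udfs.py' USING jython AS my_udfs;
#   cleaned = FOREACH raw_data GENERATE my_udfs.normalize_phone(phone);
```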
Environment: Hadoop, HDFS, MapReduce, Hive, HBase, Kafka, Zookeeper, Oozie, Impala, Java (JDK 1.8), Cloudera, Oracle, Teradata, SQL Server, UNIX Shell Scripting, Flume, Scala, Spark, Sqoop, Python.