Sr Big Data Engineer Resume
Blue Ash, OH
SUMMARY
- 8+ years of IT development experience, including experience in the Big Data ecosystem and related technologies.
- Experience implementing frameworks to import and export data between Hadoop and RDBMS.
- Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
- Experience in manipulating/analysing large datasets and finding patterns and insights within structured and unstructured data.
- Strong experience with ETL and/or orchestration tools (e.g. Talend, Oozie, Airflow)
- Experience in using Teradata ETL tools and utilities such as BTEQ, MLOAD, FASTLOAD, TPT, and FastExport.
- Experience setting up the AWS data platform: AWS CloudFormation, development endpoints, AWS Glue, EMR, Jupyter/SageMaker notebooks, Redshift, S3, and EC2 instances.
- Database design, modeling, migration, and development experience using stored procedures, triggers, cursors, constraints, and functions. Used MySQL, MS SQL Server, DB2, and Oracle.
- Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase.
- Expertise in working with Hive data warehouse infrastructure: creating tables, distributing data by implementing partitioning and bucketing, and developing and tuning HQL queries (see the PySpark sketch after this list).
- Experience with Software development tools such as JIRA, Play, GIT.
- Experienced in using Agile methodologies including extreme programming, SCRUM and Test-Driven Development (TDD)
- Highly experienced in importing and exporting data between HDFS and relational database management systems using Sqoop.
- Good knowledge of querying data from Cassandra for searching, grouping, and sorting.
- Extensive hands-on experience with distributed computing architectures such as AWS products (e.g., EC2, Redshift, EMR, and Elasticsearch), Hadoop, Python, and Spark, and effective use of Azure SQL Database, MapReduce, Hive, SQL, and PySpark to solve big data problems.
- Ability to work effectively in cross-functional team environments; excellent communication and interpersonal skills.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
- Developed custom Kafka producers and consumers for publishing to and subscribing to Kafka topics.
- Good working experience with Spark (Spark Streaming, Spark SQL) with Scala and Kafka; worked on reading multiple data formats from HDFS using Scala.
- Good understanding of NoSQL databases and hands-on experience writing applications on NoSQL databases such as Cassandra and MongoDB.
- Strong experience in core Java, Scala, SQL, PL/SQL, and RESTful web services.
- Extensive knowledge of reporting objects such as facts, attributes, hierarchies, transformations, filters, prompts, calculated fields, sets, groups, and parameters in Tableau.
- Experience working with Flume and NiFi for loading log files into Hadoop.
- Strong experience in Snowflake.
- Creative skills in developing elegant solutions to challenges related to pipeline engineering
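As an illustration of the Hive partitioning and bucketing work mentioned above, the following is a minimal PySpark sketch; the table, column, and path names are hypothetical placeholders rather than actual project artifacts.

```python
# Minimal sketch: write a partitioned, bucketed table and run a partition-pruned query.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partition-bucket-sketch")
    .enableHiveSupport()   # register the table in the Hive metastore
    .getOrCreate()
)

orders = spark.read.parquet("s3://example-bucket/raw/orders/")  # hypothetical source

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

# Partition by load_date so date-bounded queries prune whole directories,
# and bucket by customer_id so joins/aggregations on that key shuffle less.
(orders.write
    .partitionBy("load_date")
    .bucketBy(32, "customer_id")
    .sortBy("customer_id")
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("analytics.sales_orders"))

# A tuned query only reads the partition it needs.
spark.sql("""
    SELECT customer_id, SUM(order_amount) AS total_amount
    FROM analytics.sales_orders
    WHERE load_date = '2021-06-01'
    GROUP BY customer_id
""").show()
```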
TECHNICAL SKILLS
Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, YARN, Kafka, Flume, Sqoop, Impala, Oozie, Zookeeper, Spark, Ambari, Snowflake, Airflow, Mahout, MongoDB, Cassandra, Avro, Storm, Parquet and Snappy.
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR and Apache
Languages: Java, Python, JRuby, SQL, HTML, DHTML, Scala, JavaScript, XML and C/C++
NoSQL Databases: Cassandra, MongoDB and HBase
Java Technologies: Servlets, JavaBeans, JSP, JDBC, JNDI, EJB and Struts
XML Technologies: XML, XSD, DTD, JAXP (SAX, DOM), JAXB
Development Methodology: Agile, waterfall
Development / Build Tools: Eclipse, Ant, Maven, IntelliJ, JUnit and Log4j
Frameworks: Struts, Spring and Hibernate
App/Web servers: WebSphere, WebLogic, JBoss and Tomcat
SQL Databases: MySQL, MS SQL, PL/SQL, and Oracle
Cloud Technologies: AWS, Azure
PROFESSIONAL EXPERIENCE
Confidential, Blue Ash, OH
Sr Big Data Engineer
Responsibilities:
- Used AWS Lambda functions and API Gateway so that data submitted via API Gateway is processed by a Lambda function.
- Managed configuration of Web App and Deploy to AWS cloud server through Chef.
- Built data pipelines with Data Fabric jobs, Sqoop, Spark, Scala, and Kafka; worked in parallel on the database side with Oracle and MySQL Server on source-to-target data design.
- Developed Pig Latin scripts for the analysis of semi-structured data.
- Involved in designing and deploying a Hadoop cluster and different Big Data analytic tools including Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Flume, Spark, Impala, and Cassandra with the Hortonworks distribution.
- Developed AWS CloudFormation templates and set up Auto Scaling for EC2 instances.
- Involved in Importing and exporting data from HDFS using Sqoop, resolution of access issues, performance issues and Patch/upgrade related issues.
- Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Installed Airflow and created a database in PostgreSQL to store metadata from Airflow.
- Configured Airflow to communicate with its PostgreSQL metadata database.
- Experience working with MapReduce programs using Apache Hadoop to work with Big Data.
- Involved in creating HiveQL on HBase tables and importing efficient work order data into Hive tables
- Extended Hive and Pig core functionality by writing custom UDFs, UDTFs, and UDAFs.
- Used Oozie and Zookeeper operational services for coordinating cluster and scheduling workflows.
- Wrote Spark programs to move data from an input storage location to an output location, running data loading, validation, and transformation on the data.
- Created data sharing between two Snowflake accounts.
- Developed Kafka consumer API in Scala for consuming data from Kafka topics.
- Created data pipelines for ingestion, aggregation, and loading of consumer response data from an AWS S3 bucket into Hive external tables in HDFS to serve as feeds for Tableau dashboards.
- Developed Java MapReduce programs for the analysis of sample log files stored in the cluster.
- Working experience with data streaming process with Kafka, Apache Spark, Hive.
- Responsible for Design, Development, and testing of the database and Developed Stored Procedures, Views, and Triggers
- Worked as a developer in Hive and Impala for more parallel data processing on Cloudera systems.
- Worked with big data technologies such as Spark, Scala, Hive, and Hadoop clusters (Cloudera platform).
- Designed and implemented Spark SQL tables and Hive script jobs with Stonebranch for scheduling, and created workflows and task flows.
- Excellent understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Imported documents into HDFS and HBase and created HAR files.
- Expertise in data transformation and analysis using Spark, Pig, and Hive.
- Used partitioning and bucketing on Hive data to speed up queries as part of Hive optimization.
- Converted Talend Joblets to support the snowflake functionality.
- Responsible to manage data coming from different sources through Kafka.
- Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
- Created instances in AWS as well as worked on migration to AWS from data center.
- Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
- Compiled and validated data from all departments and presented it to the Director of Operations.
- Worked on designing MapReduce and YARN flows, writing MapReduce scripts, performance tuning, and debugging.
- Utilized Spark SQL to load data from AWS S3 into Snowflake tables using Databricks (see the Snowflake load sketch after this list).
- Worked with relational SQL and NoSQL databases and related tools, including Oracle, Hive, Sqoop, and HBase.
- Implemented a log producer in Scala that watches application logs, transforms incremental logs, and sends them to a Kafka- and Zookeeper-based log collection platform.
- Created notebooks using Databricks, Scala, and Spark, capturing data from Delta tables in Delta Lake.
- Migrated Map reduce jobs to Spark jobs to achieve better performance.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS and vice versa using Sqoop.
- Used Scala functions, dictionaries, and data structures (arrays, lists, maps) for better code reusability.
- Performed unit testing based on the development work.
- Migrated an existing on-premises application to AWS. Used AWS services such as EC2 and S3 for small-data-set processing and storage, and maintained the Hadoop cluster on AWS EMR.
- Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Developed a PySpark script to encrypt raw data by applying hashing algorithms to client-specified columns (see the hashing sketch after this list).
- Unit tested the data between Redshift and Snowflake.
- Responsible for distributed applications across a hybrid AWS environment.
- Implemented data ingestion and handled clusters in real-time processing using Kafka.
- Created a data model that correlates all the metrics and gives valuable output.
- Worked on the tuning of SQL Queries to bring down run time by working on Indexes and Execution Plan.
- Performed ETL testing activities such as running the jobs, extracting data from the database using the necessary queries, transforming it, and loading it into the data warehouse servers.
- Performed pre-processing using Hive and Pig.
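The hashing sketch referenced above: a minimal PySpark illustration of masking client-specified columns with SHA-256 before landing the data. Column names and paths are hypothetical.

```python
# Minimal sketch: hash sensitive columns so downstream consumers can still
# join/group on the values without seeing the originals.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("column-hashing-sketch").getOrCreate()

SENSITIVE_COLUMNS = ["ssn", "email", "phone_number"]   # supplied by the client (hypothetical)

raw_df = spark.read.parquet("s3://example-bucket/raw/customers/")  # hypothetical source

masked_df = raw_df
for column in SENSITIVE_COLUMNS:
    # Replace each sensitive column with its SHA-256 digest.
    masked_df = masked_df.withColumn(column, F.sha2(F.col(column).cast("string"), 256))

masked_df.write.mode("overwrite").parquet("s3://example-bucket/masked/customers/")
```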
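The Snowflake load sketch referenced above: a minimal PySpark example of writing S3 data into a Snowflake table from a Databricks notebook, assuming the Snowflake Spark connector is available (it ships with Databricks runtimes). Connection values, bucket, and table names are hypothetical placeholders.

```python
# Minimal sketch: read curated data from S3 and write it to Snowflake.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-to-snowflake-sketch").getOrCreate()

source_df = spark.read.parquet("s3://example-bucket/curated/orders/")  # hypothetical source

sf_options = {
    "sfURL": "example_account.snowflakecomputing.com",
    "sfUser": "ETL_USER",
    "sfPassword": "********",          # in practice pulled from a secret scope
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}

(source_df.write
    .format("snowflake")               # Snowflake Spark connector
    .options(**sf_options)
    .option("dbtable", "ORDERS")
    .mode("overwrite")
    .save())
```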
Environment: HDFS, Hive, Spark, Airflow, AWS, EC2, S3, Lambda, Auto Scaling, CloudWatch, Linux, Kafka, Python, Stonebranch, Cloudera, Databricks, Talend, Snowflake, Oracle 12c, PL/SQL, Unix, JSON and Parquet file formats.
Confidential, Topeka, KS
Big Data Engineer
Responsibilities:
- Implemented Sqoop for large dataset transfers between Hadoop and RDBMS.
- Data visualization with Pentaho, Tableau, and D3. Knowledge of numerical optimization, anomaly detection and estimation, A/B testing, statistics, and Maple. Big data analysis using related techniques, i.e., Hadoop, MapReduce, NoSQL, Pig/Hive, Spark/Shark, MLlib, Scala, NumPy, SciPy, Pandas, and scikit-learn.
- Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
- Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
- Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
- Created and maintained optimal data pipeline architecture in the Microsoft Azure cloud using Data Factory and Azure Databricks.
- Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks.
- Implemented IoT streaming with Databricks Delta tables and Delta Lake to enable ACID transaction logging (see the Delta Lake sketch after this list).
- Exposed transformed data in the Azure Databricks Spark platform in Parquet format for efficient data storage.
- Delivered denormalized data to Power BI consumers for modeling and visualization from the produced layer in the data lake.
- Extracted and updated the data into HDFS using Sqoop import and export.
- Utilized Ansible playbooks for code pipeline deployment.
- Used Delta Lake, an open-source data storage layer that delivers reliability to data lakes.
- Created a custom logging framework for ELT pipeline logging using append variables in Data Factory.
- Enabled monitoring and Azure Log Analytics to alert the support team on usage and statistics of the daily runs.
- Took proof-of-concept project ideas from the business, then led, developed, and created production pipelines that deliver business value using Azure Data Factory.
- Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds.
- Implemented data quality checks in the ETL tool Talend; good knowledge of data warehousing.
- Developed Apache Spark applications for data processing from various streaming sources (see the streaming sketch after this list).
- Built the logical and physical data models for Snowflake as per the required changes.
- Implemented continuous integration/continuous delivery best practices using Azure DevOps, ensuring code versioning.
- Developed Hive UDFs to incorporate external business logic into Hive scripts and developed join dataset scripts using Hive join operations.
- Responsible for managing data coming from different sources through Kafka.
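The Delta Lake sketch referenced above: a minimal PySpark example of landing incremental updates into a Delta table with an ACID MERGE, assuming a Databricks runtime (or any Spark session with Delta Lake configured). Paths, table, and key names are hypothetical.

```python
# Minimal sketch: upsert a batch of IoT events into a Delta table.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-upsert-sketch").getOrCreate()

updates_df = spark.read.json("/mnt/landing/iot_readings/")   # hypothetical batch of events
target_path = "/mnt/delta/iot_readings"

if DeltaTable.isDeltaTable(spark, target_path):
    target = DeltaTable.forPath(spark, target_path)
    # MERGE runs as a single ACID transaction recorded in the Delta log.
    (target.alias("t")
        .merge(updates_df.alias("s"),
               "t.device_id = s.device_id AND t.event_ts = s.event_ts")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
else:
    # First load simply creates the Delta table.
    updates_df.write.format("delta").save(target_path)
```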
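The streaming sketch referenced above: a minimal PySpark Structured Streaming example that consumes a Kafka topic and writes Parquet, assuming the Spark Kafka connector package is on the classpath. Broker addresses, topic, schema, and paths are hypothetical.

```python
# Minimal sketch: read JSON events from Kafka and append them to Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

event_schema = StructType([
    StructField("device_id", StringType()),
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
])

raw_stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "device-events")
    .option("startingOffsets", "latest")
    .load())

# Kafka delivers key/value as binary; parse the JSON payload into columns.
events = (raw_stream
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("event"))
    .select("event.*"))

query = (events.writeStream
    .format("parquet")
    .option("path", "/mnt/streams/device_events")
    .option("checkpointLocation", "/mnt/checkpoints/device_events")
    .outputMode("append")
    .start())

query.awaitTermination()
```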
Environment: Hadoop, HDFS, Kafka, Azure, Data Factory, Data Lake, Data Storage, Databricks, Python, Sqoop, Hive, ETL, Snowflake, Power BI
Confidential, Weston, FL
Data Engineer
Responsibilities:
- Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple data sources to HDFS using FLUME and SQOOP, and performed structural modifications using Map Reduce, HIVE.
- Worked with HIVE data warehouse infrastructure-creating tables, data distribution by implementing partitioning and bucketing, writing and optimizing the HQL queries.
- Built and implemented automated procedures to split large files into smaller batches of data to facilitate FTP transfer which reduced 60% of execution time.
- Optimized and tuned ETL processes & SQL Queries for better performance.
- Developed PIG UDFs for manipulating the data according to Business Requirements and also worked on developing custom PIG Loaders.
- Developed Python scripts to back up EBS volumes using AWS Lambda and CloudWatch (see the snapshot sketch after this list).
- Transformed the data using AWS Glue DynamicFrames with PySpark, cataloged the transformed data using crawlers, and scheduled the job and crawler using the Glue workflow feature (see the Glue sketch after this list).
- Worked on installing the cluster, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slots configuration.
- Developed and deployed stacks using AWS CloudFormation templates (CFT) and Terraform.
- Developed data pipeline programs with Spark Scala APIs, data aggregations with Hive, and formatting of data (JSON) for visualization.
- Developed Spark scripts and UDFs using both the Spark DSL and Spark SQL queries for data aggregation and querying, and wrote data back into RDBMS through Sqoop.
- Interacted with business partners, Business Analysts and product owner to understand requirements and build scalable distributed data solutions using Hadoop ecosystem.
- Developed Spark Streaming programs to process near-real-time data from Kafka, processing data with both stateless and stateful transformations.
- Involved in converting Hive/SQL queries into transformations using Python
- Experience in report writing using SQL Server Reporting Services (SSRS) and creating various types of reports like drill down, Parameterized, Cascading, Conditional, Table, Matrix, Chart and Sub Reports.
- Used the DataStax Spark connector to store data in and retrieve data from the Cassandra database.
- Wrote Oozie scripts and set up workflows using the Apache Oozie workflow engine for managing and scheduling Hadoop jobs.
- Worked on the implementation of a log producer in Scala that watches application logs, transforms incremental logs, and sends them to a Kafka- and Zookeeper-based log collection platform.
- Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard.
- Wrote multiple MapReduce jobs using the Java API, Pig, and Hive for data extraction, transformation, and aggregation from multiple file formats including Parquet, Avro, XML, JSON, CSV, and ORC, and compression codecs such as gzip, Snappy, and LZO.
- Strong understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
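The snapshot sketch referenced above: a minimal Lambda handler that snapshots tagged EBS volumes, along the lines of the backup scripts described (triggered by a CloudWatch Events/EventBridge schedule). The tag name is hypothetical.

```python
# Minimal sketch: snapshot every EBS volume tagged Backup=true.
import boto3

ec2 = boto3.client("ec2")

def lambda_handler(event, context):
    # Find volumes explicitly opted in to backups via a tag (hypothetical tag key).
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "tag:Backup", "Values": ["true"]}]
    )["Volumes"]

    snapshot_ids = []
    for volume in volumes:
        snapshot = ec2.create_snapshot(
            VolumeId=volume["VolumeId"],
            Description="Scheduled backup via Lambda",
        )
        snapshot_ids.append(snapshot["SnapshotId"])

    return {"snapshots_created": snapshot_ids}
```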
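The Glue sketch referenced above: a minimal AWS Glue job using DynamicFrames with PySpark, in the spirit of the transformation described. Database, table, and S3 locations are hypothetical; the source table would be registered by a crawler and the job scheduled via a Glue workflow.

```python
# Minimal sketch: read a cataloged table, remap columns, and write curated Parquet.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw table registered in the Glue Data Catalog by the crawler.
raw_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders_raw"
)

# Rename/cast columns with ApplyMapping for the curated layer.
mapped_dyf = ApplyMapping.apply(
    frame=raw_dyf,
    mappings=[
        ("order_id", "string", "order_id", "long"),
        ("order_amount", "string", "order_amount", "double"),
        ("order_date", "string", "order_date", "string"),
    ],
)

glue_context.write_dynamic_frame.from_options(
    frame=mapped_dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet",
)

job.commit()
```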
Environment: Apache Spark, Map Reduce, Apache Pig, Python, SQL, Java, SSRS, HBase, AWS, Cassandra, PySpark, Apache Kafka, HIVE, SQOOP, FLUME, Apache Oozie, Zookeeper, ETL, UDF.
Confidential
Data Engineer
Responsibilities:
- Configured CI/CD tools Jenkins, Maven, Ant, and GitHub repository for the continuous smooth build process.
- Involved in Research of the project application architecture to support/resolve build, compile, and test issues/problems.
- Configured Chef in the environment to quickly deploy and maintain environment settings such as Apache configuration and file-level permissions.
- Releasing code to testing regions or staging areas according to the schedule published.
- Performed user administration and supported the team for parallel development. Wrote Shell/Perl/Ant files to automate processes for the smooth use of build tools like Build Forge.
- Resolved merging issues during rebasing and re-integrating branches by conducting meetings with Development Team Leads.
- Implemented CI/CD pipelines in Bamboo and Jenkins to deploy .Net, Python and PHP applications to Windows and Linux servers.
- Designed a CM solution that used ClearCase UCM integrated with Rational ClearQuest.
- Developed Shell/Perl Scripts for automation purpose.
- Responsible for designing and deploying best SCM processes and procedures.
- Worked with private GitHub repositories and integrated them with Jenkins via plugins.
- Maintained and coordinated environment configuration, controls, code integrity, and code conflict resolution.
- Created and maintained Jenkins jobs that execute Shell scripts for automation.
- Maintained rapid and fluid deployment of infrastructure via CloudFormation templates; authored templates to describe the infrastructure to be deployed (see the deployment sketch below).
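The deployment sketch referenced above: a minimal boto3 example of deploying an authored CloudFormation template from a script. Stack name, template path, and parameters are hypothetical placeholders.

```python
# Minimal sketch: create a CloudFormation stack from a local template and wait for completion.
import boto3

cloudformation = boto3.client("cloudformation")

def deploy_stack(stack_name: str, template_path: str) -> str:
    with open(template_path) as template_file:
        template_body = template_file.read()

    response = cloudformation.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Capabilities=["CAPABILITY_NAMED_IAM"],  # needed when the template creates IAM resources
        Parameters=[{"ParameterKey": "Environment", "ParameterValue": "dev"}],
    )

    # Block until the stack finishes creating (raises if creation fails).
    waiter = cloudformation.get_waiter("stack_create_complete")
    waiter.wait(StackName=stack_name)
    return response["StackId"]

if __name__ == "__main__":
    print(deploy_stack("example-app-stack", "templates/app.yaml"))
```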
Environment: Jenkins, AWS, Git, GitHub, MySQL, Python, PHP, S3, SNS, Maven, Bamboo, .NET, Ant, Chef, Shell scripts
Confidential
Data Analyst
Responsibilities:
- Worked in analyzing requirements, providing appropriate development strategies and implementation plan in the project.
- Designed and developed data cleansing, data validation, and load ETL processes using Oracle SQL, PL/SQL, and UNIX.
- Worked effectively in analyzing Source System data and mapping it to the target system.
- Developed, implemented, and maintained various database objects such as stored procedures, triggers, functions, indexes, and views.
- Created Informatica mappings, mapplets, workflows, and tasks, and handled scheduling and monitoring.
- Closely worked with the client to understand the current and proposed data model and architecture of the system.
- Worked with various SSIS tasks such as Execute SQL Task, Bulk Insert Task, Data Flow Task, File System Task, FTP Task, Send Mail Task, ActiveX Script Task, Message Queue Task, and XML Task.
- Documented the functional flows using MS Visio.
- Underwent initial-level Informatica training and gained hands-on experience.
- Proficient in SQL tuning and query optimization.
Environment: Oracle, PL/SQL, UNIX, ESDR Client Tool, Informatica PowerCenter Designer, Workflow Monitor, Microsoft SharePoint, Subversion