Senior Big Data Engineer Resume
Alpharetta, GA
SUMMARY
- Over 8 years of experience in the software development life cycle: design, development, and support of systems and application architecture.
- Experience with Big Data/Apache Hadoop ecosystem components such as MapReduce, HDFS, Hive, Impala, Sqoop, Pig, Oozie, ZooKeeper, Kafka, and Apache Spark.
- Transformed and retrieved data using Spark, Impala, Pig, Hive, SSIS, and MapReduce.
- Experience with Hadoop Streaming, writing MapReduce jobs in Perl and Python in addition to Java.
- Designed and implemented data distribution mechanisms on SQL Server (transactional, snapshot, and merge replication; SSIS and DTS).
- Experience working with Cloudera, Amazon Web Services (AWS), Microsoft Azure, and Hortonworks.
- Designed and implemented high availability and disaster recovery systems on SQL Server (Always On, mirroring, and log shipping).
- Strong experience with AWS EMR, Spark installation, and the HDFS and MapReduce architecture, along with good knowledge of Spark, Scala, and Hadoop distributions such as Apache Hadoop and Cloudera.
- Experienced with Amazon Web Services (AWS) and Microsoft Azure services such as EC2, S3, RDS, Azure HDInsight, Machine Learning Studio, Azure Storage, and Azure Data Lake.
- Good understanding of Spark architecture with Databricks and Structured Streaming; set up Databricks on AWS and Microsoft Azure.
- Experienced in identifying improvement areas for system stability and providing end-to-end high-availability architectural solutions.
- Good knowledge of Azure Databricks and related services such as Azure Event Hubs and Azure Data Lake, which provide fast and efficient processing of big data.
- Extensive experience in developing MapReduce jobs using Java and Maven, as well as a thorough understanding of the MapReduce framework.
- Extracted and modeled datasets from a variety of data sources such as Hadoop (using Pig, Hive, and Spark), Teradata, and Snowflake for ad-hoc analysis; fair understanding of Agile methodology and practice.
- Experience with ETL workflow management tools like Apache Airflow, with significant experience writing Python scripts to implement workflows (a minimal sketch follows this summary).
- Developed shell and Python scripts to automate and provide control flow to Pig scripts; imported data from the Linux file system to HDFS.
- Strong experience in extracting, transforming, and loading (ETL) data from various sources into data warehouses and Azure Data Lake using Talend and SSIS.
- Extensive experience in text analytics, developing statistical machine learning and data mining solutions for various business problems, and generating data visualizations using R and Python.
- Good knowledge of converting complex RDBMS (Oracle, MySQL, and Teradata) queries into Hive Query Language.
- Experience in extracting source data from sequential, XML, and CSV files, then transforming and loading it into the target data warehouse.
- Knowledge of Hadoop administration activities using Cloudera Manager and Apache Ambari.
- Good knowledge of containers, Docker, and Kubernetes as the runtime environment for CI/CD systems to build, test, and deploy.
- Experienced with job workflow scheduling and monitoring tools like Airflow, Oozie, TWS, Control-M, and ZooKeeper.
- Developed databases using SQL and PL/SQL; experience working with Oracle, SQL Server, and MySQL.
- Worked with build management tools like SBT and Maven and version control tools like Git.
- Vast experience working in data management, including data analysis, gap analysis, and data mapping.
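A minimal, hypothetical sketch of the Airflow workflows referenced above, written in Python (the DAG id, schedule, and task callables are placeholders, not taken from any actual project):

```python
# Hypothetical Airflow DAG: a daily workflow with an ingest step followed by a
# Spark-transform trigger. All names and logic below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest():
    # Placeholder: pull source files into the landing zone.
    print("ingest step")


def transform():
    # Placeholder: kick off the downstream Spark transformation.
    print("transform step")


with DAG(
    dag_id="daily_ingest_and_transform",   # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    ingest_task >> transform_task          # ingest runs before transform
```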
TECHNICAL SKILLS
Big Data Tools/Hadoop Ecosystem: MapReduce, Spark, Airflow, NiFi, HBase, Hive, Pig, Sqoop, Kafka, Oozie, Hadoop
Databases: Oracle 12c/11g/10g, Teradata R15/R14, MySQL, SQL Server, NoSQL (MongoDB, Cassandra, HBase), Snowflake
ETL/Data warehouse Tools: Informatica and Tableau.
BI Tools: SSIS, SSRS, SSAS.
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: SQL, PL/SQL, and UNIX shell scripting.
Cloud Platform: Amazon Web Services (AWS), Microsoft Azure
Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena; MS Azure - Data Lake, Data Storage, Databricks, Data Factory
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
Operating System: Windows, Unix, Sun Solaris
Methodologies: System Development Life Cycle (SDLC), Agile
PROFESSIONAL EXPERIENCE
Confidential, Alpharetta, GA
Senior Big Data Engineer
Responsibilities:
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Developed RDDs/DataFrames in Spark and applied several transformations to load data from Hadoop data lakes.
- Worked with the Hadoop ecosystem; implemented Spark using Scala and utilized DataFrames and the Spark SQL API for faster data processing.
- Utilized Apache Spark with Python to develop and execute big data analytics and machine learning applications; executed machine learning use cases with Spark ML and MLlib.
- Developed a Spark Streaming job to consume data from Kafka topics of different source systems and push the data into HDFS locations (see the sketch at the end of this section).
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
- Responsible for automating the migration of applications from on-premises to the cloud.
- Used Python in Spark to extract data from Snowflake and upload it to Salesforce on a daily basis.
- Wrote an event-based service in Python using AWS Lambda to deliver real-time data to One-Lake (a data lake solution in the Cap-One enterprise).
- Used Talend for Big Data Integration using Spark and Hadoop.
- Responsible for analyzing large data sets and deriving customer usage patterns by developing new MapReduce programs in Java.
- Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, and Spark Streaming, along with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
- Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLTP reporting.
- Developed automated regression scripts in Python to validate ETL processes across multiple databases such as AWS Redshift, Oracle, MongoDB, and SQL Server (T-SQL).
- Used Airflow for scheduling the Hive, Spark and MapReduce jobs.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Extracted, transformed, and loaded data sources to generate CSV data files with Python programming and SQL queries.
- Responsible for the design and development of high-performance data architectures supporting data warehousing, real-time ETL, and batch big data processing.
- Analyzed SQL scripts and designed solutions to implement them using PySpark.
- Exported tables from Teradata to HDFS using Sqoop and built tables in Hive.
- Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables; handled structured data using Spark SQL.
- Developed Spark programs in Python and applied functional programming principles to process complex structured data sets.
- Worked with Hadoop infrastructure to store data in HDFS and used Spark/Hive SQL to migrate the underlying SQL codebase to AWS.
- Filtered and cleaned data using Scala code and SQL queries.
- Troubleshot errors in the HBase shell/API, Pig, Hive, and MapReduce.
- Installed and configured a multi-node cluster in the cloud on Amazon Web Services (AWS) EC2.
- Designed and developed the architecture for a data services ecosystem spanning relational, NoSQL, and big data technologies; extracted data from Amazon Redshift, AWS, and Elasticsearch using SQL queries to create reports.
- Performed structural modifications using MapReduce and Hive and analyzed data using visualization/reporting tools (Tableau).
- Designed Kafka producer client using Confluent Kafka and produced events into Kafka topic.
- Responsible for gathering requirements, system analysis, design, development, testing, and deployment.
- Worked on SQL Server components: SSIS (SQL Server Integration Services), SSAS (Analysis Services), and SSRS (Reporting Services); used Informatica, SSIS, SPSS, and SAS to extract, transform, and load source data from transaction systems.
- Developed reusable objects such as PL/SQL program units and libraries, database procedures and functions, and database triggers to be used by the team while satisfying business rules.
- Wrote scripts against Oracle, SQL Server, and Netezza databases to extract data for reporting and analysis; imported and cleansed high-volume data from various sources such as DB2, Oracle, and flat files into SQL Server.
- Subscribed to Kafka topics with the Kafka consumer client and processed the events in real time using Spark.
- Generated metadata and created Talend ETL jobs and mappings to load the data warehouse and data lake.
- Designed and Developed Real Time Stream Processing Application using Spark, Kafka, Scala and Hive to perform Streaming ETL and apply Machine Learning.
- Evaluated big data technologies and prototyped solutions to improve the data processing architecture; performed data modeling, development, and administration of relational and NoSQL databases.
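A minimal sketch of the Kafka-to-HDFS ingestion described above, assuming PySpark Structured Streaming; the broker, topic, and HDFS paths are placeholders:

```python
# Hypothetical PySpark Structured Streaming job: consume a Kafka topic and land
# the records in HDFS as Parquet. Requires the spark-sql-kafka connector package.
# Brokers, topic, and paths below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
    .option("subscribe", "source_events")                # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    .select(col("key").cast("string"), col("value").cast("string"))
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/landing/source_events")            # placeholder HDFS path
    .option("checkpointLocation", "hdfs:///checkpoints/source_events")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```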
Environment: Hadoop, Spark, Scala, HBase, Hive, Python, PL/SQL, AWS, EC2, S3, Lambda, Auto Scaling, CloudWatch, CloudFormation, IBM InfoSphere DataStage, MapReduce, Oracle 12c, flat files, TOAD, MS SQL Server, XML files, Cassandra, MongoDB, Kafka, MS Access, Autosys, UNIX, Erwin.
Confidential, Rochester, MN
Big Data Engineer
Responsibilities:
- Used Agile Scrum methodology/ Scrum Alliance for development.
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Developed Spark scripts using Python in the PySpark shell during development.
- Performed Hadoop production support tasks by analyzing application and cluster logs.
- Created Hive tables, loaded them with data, and wrote Hive queries to process the data; created partitions and used bucketing on Hive tables with the required parameters to improve performance; developed Pig and Hive UDFs per business use cases.
- Performed big data analysis using Scala, Spark, Spark SQL, Hive, MLlib, and machine learning algorithms.
- Experienced in working with SQL, T-SQL, PL/SQL scripts, views, indexes, stored procedures, and other components of database applications
- Wrote MapReduce programs and Hive UDFs in Java.
- Experienced in working with Hadoop on the Cloudera Data Platform and running services through Cloudera Manager.
- Developed an automation system using PowerShell scripts and JSON templates to remediate Azure services.
- Implemented ETL jobs using NiFi to import data from multiple databases such as Exadata, Teradata, and MS SQL into HDFS for business intelligence.
- Created data pipelines with the Copy activity and custom Azure Data Factory pipeline activities to move and transform data for on-cloud ETL processing.
- Created reports using visualizations such as Bar chart, Clustered Column Chart, Waterfall Chart, Gauge, Pie Chart, Tree map etc. in Power BI.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Created pipelines in ADF using linked services, datasets, and pipeline activities to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back.
- Used the Spark DataFrame API in Scala for analyzing data.
- Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie.
- Used Apache NiFi to automate data movement between different Hadoop components
- Used NiFi to perform conversion of raw XML data into JSON, AVRO
- Designed and published visually rich and intuitive Tableau dashboards and crystal reports for executive decision making
- Utilized SQOOP, Kafka, Flume and Hadoop File system APIs for implementing data ingestion pipelines
- Worked on real time streaming, performed transformations on the data using Kafka and Spark Streaming
- Handled Hadoop cluster installations in various environments such as Unix, Linux and Windows
- Assisted in upgrading, configuring, and maintaining various Hadoop components such as Pig, Hive, and HBase.
- Ingested data into one or more Azure cloud services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and handled cloud migration, processing the data in Azure Databricks.
- Used HBase to store the Kafka topic, partition number, and offset values; also used the Phoenix JAR to connect to HBase tables.
- Hands-on experience in Hadoop administration and support activities, installing and configuring Apache big data tools and Hadoop clusters using Cloudera Manager.
- Used PySpark to create a batch job that merges multiple small files (Kafka stream files) into larger Parquet files (see the sketch at the end of this section).
- Designed and implemented an ETL framework using Java to load data from multiple sources into Hive and from Hive into Vertica
- Extracted data from data lakes and the EDW to relational databases for analysis and more meaningful insights using SQL queries and PySpark.
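A minimal PySpark sketch of the small-file compaction job described above; input/output paths, the source format, and the target file count are placeholder assumptions:

```python
# Hypothetical PySpark compaction job: read many small files written by a Kafka
# stream and rewrite them as a few larger Parquet files. Paths, input format,
# and the target partition count are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-file-compaction").getOrCreate()

# Read the small files (assumed Parquet here) from the raw landing area.
small_files = spark.read.parquet("hdfs:///data/raw/kafka_stream/")

(
    small_files
    .coalesce(8)  # collapse to a handful of larger output files
    .write
    .mode("overwrite")
    .parquet("hdfs:///data/compacted/kafka_stream/")
)
```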
Environment: Hadoop YARN, Azure, Databricks, Data Lake, Data Storage, Power BI, Azure SQL, Spark Core, Spark Streaming, Spark SQL, Spark MLlib, Python, Kafka, Hive, Java, Scala, Sqoop, Impala, Cassandra, Tableau, Talend, Cloudera, MySQL, Linux.
Confidential, Houston, TX
Hadoop Developer/ Data Engineer
Responsibilities:
- Extensive usage of Spark for data streaming and data transformation for real time analytics.
- Experienced in working with various data sources such as Teradata and Oracle; successfully loaded files from Teradata into HDFS and loaded them from HDFS into Hive and Impala.
- Developed Spark applications in Python (PySpark) in a distributed environment to load a large number of CSV files with different schemas into Hive ORC tables.
- Developed Spring Boot applications to read data from Kafka in an event-based manner; these applications ran as microservices, each handling part of the problem, and were deployed in Docker containers built and deployed automatically through Jenkins pipelines.
- Installed and configured Apache Hadoop to test the maintenance of log files in the Hadoop cluster.
- Built a dashboard of all the YARN applications running on the cluster using the YARN API.
- Wrote applications that produced data to Kafka and consumed data from it.
- Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark.
- Implemented Spark solutions to generate reports and to fetch and load data in Hive.
- Wrote AWS Lambda code in Python for nested JSON files: converting, comparing, sorting, etc. (see the sketch at the end of this section).
- Created data pipelines for different events to ingest, aggregate, and load consumer response data from an AWS S3 bucket into Hive external tables in HDFS to serve as feeds for Tableau dashboards.
- Created monitors, alarms, and notifications for EC2 hosts using CloudWatch, CloudTrail, and SNS.
- Achieved near real-time reporting by adopting an event-based processing approach instead of micro-batching to deal with data coming from Kafka.
- Optimized Hive queries using best practices and the right parameters, leveraging technologies like Hadoop, YARN, Python, and PySpark.
- Wrote HiveQL scripts to populate tables and brought in data from various systems using Sqoop.
- Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
- Worked extensively on AWS components such as Elastic MapReduce (EMR).
- Experience in data ingestions techniques for batch and stream processing using AWS Batch, AWS Kinesis, AWS Data Pipeline
- Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark
- Experience in developing custom UDFs in Python to extend Hive and Pig Latin functionality.
- Imported and exported data from various sources through scripts and Sqoop.
- Used the YARN architecture and MapReduce in the development cluster for a POC.
- Supported MapReduce programs running on the cluster; involved in loading data from the UNIX file system to HDFS.
- Wrote Spark applications and mentored other team members on the benefits of Spark.
- Wrote complex logic implementations using Spark to process data present in MapR DB and Hive.
- Implemented Spark using Scala and Python, utilizing DataFrames and the Spark SQL API for faster data processing.
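A minimal, hypothetical AWS Lambda sketch in Python for processing nested JSON from S3, in the spirit of the Lambda work described above; the bucket trigger, output prefix, and flattening rules are assumptions:

```python
# Hypothetical AWS Lambda handler: triggered by an S3 object-created event,
# flatten a nested JSON document, sort its keys, and write the result back to S3.
# The output prefix and flattening rules are assumptions.
import json

import boto3

s3 = boto3.client("s3")


def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. {'a': {'b': 1}} -> {'a.b': 1}."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def lambda_handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    flat = flatten(json.loads(body))

    # Sorted keys give a deterministic layout for downstream comparison.
    output = json.dumps(flat, sort_keys=True)
    s3.put_object(Bucket=bucket, Key=f"processed/{key}", Body=output.encode("utf-8"))
    return {"status": "ok", "fields": len(flat)}
```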
Environment: Hadoop, HDFS, AWS, Apache Spark, PySpark, Impala, MapReduce, Hive, Kafka, HBase, Sqoop, Python, YARN, Spring Boot, NoSQL, Cassandra, relational databases, Oracle 12c, SQL Server, Linux, UNIX.
Confidential
Hadoop Developer
Responsibilities:
- Worked on different data flow and control flow tasks, For Loop containers, Sequence containers, Script tasks, Execute SQL tasks, and package configuration.
- Performed data validation and cleansing of staged input records before loading them into the data warehouse.
- Created batch jobs and configuration files to build automated processes using SSIS.
- Wrote MapReduce jobs using Pig Latin; involved in ETL, data integration, and migration.
- Designed and implemented a MapReduce-based large-scale parallel relation-learning system (see the sketch after this list).
- Setup and benchmarked Hadoop/HBase clusters for internal use
- Extensive use of Expressions, Variables, Row Count in SSIS packages
- Created SSIS packages to pull data from SQL Server and exported to Excel Spreadsheets and vice versa.
- Created Hive tables and worked on them using HiveQL; experienced in defining job flows.
- Imported and exported data between HDFS and Oracle Database using Sqoop.
- Involved in creating Hive tables, loading data, and writing Hive queries that run internally as MapReduce jobs; developed a custom file system plugin for Hadoop so it can access files on the data platform.
- Installed and configured Pig and wrote Pig Latin scripts.
- Loaded data from various sources such as OLE DB and flat files into SQL Server databases using SSIS packages and created data mappings to load data from source to destination.
- Created new stored procedures to handle complex business logic and modified existing stored procedures, functions, views, and tables for new project enhancements and to resolve existing defects.
- Deployed and scheduled SSRS reports to generate daily, weekly, monthly, and quarterly reports.
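The MapReduce jobs above were written in Pig Latin and Java; as an illustration of the same map/reduce pattern, here is a minimal Hadoop Streaming word-count sketch in Python (script name and run command are placeholders):

```python
# Hypothetical Hadoop Streaming job in Python (word count), illustrating the
# map/reduce pattern. Run roughly as:
#   hadoop jar hadoop-streaming.jar -files wordcount.py \
#     -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
#     -input <input_dir> -output <output_dir>
import sys


def mapper():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    # Input arrives sorted by key; sum the counts per word.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```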
Environment: Hadoop, MapReduce, Pig, MS SQL Server, SQL Server Business Intelligence Development Studio, Hive, HBase, SSIS, SSRS, Report Builder, Office, Excel, flat files, T-SQL.
Confidential
Data Analyst
Responsibilities:
- Analyzed the current business process flow by understanding present business rules and conditions.
- Conducted formal interviews, live meetings, and JAD sessions with business users and SMEs.
- Designed and developed Use Cases, Activity Diagrams and Sequence Diagrams using UML.
- Utilized the Rational Unified Process (RUP) to configure and develop processes, Agile standards, and procedures to create a Business Requirements Document (BRD).
- Monitored installation and operations to consistently meet customer requirements.
- Created Technical documentation for the project using tools like Visio, Word, etc.
- Supported and managed RDBMSs and designed SQL Server RDBMS architecture, including the use of system catalogs in Query Analyzer and space management with filegroups.
- Performed Functional and GUI Testing to ensure that the user acceptance criteria are met.
- Responsible for testing schemas, joins, data types, and column values across source systems, staging, and the data mart.
- Extensively worked on Erwin, ER Studio, SQL Server, Oracle, PL/SQL, and Quality Center.
- Coordinated UAT with the SMEs to make sure that all business requirements were addressed in the application.
Environment: PL/SQL, SQL, Agile, FTP, Erwin, Quality Center, MS Office, Windows, Jira