Senior Data Engineer/Big Data Engineer Resume
Scottsdale, Arizona
SUMMARY
- Over nine years of diversified experience in software design and development, including work as a Big Data Engineer solving business use cases for several clients and expertise in backend applications.
- Solid experience developing Spark applications for highly scalable data transformations using RDDs, DataFrames, Spark SQL, and Spark Streaming.
- Experience in MVC and microservices architecture with Spring Boot, Docker, and Docker Swarm.
- Expertise in using Docker and setting up the ELK stack with Docker and Docker Compose. Actively involved in deployments on Docker using Kubernetes.
- Strong experience troubleshooting Spark failures and fine-tuning long-running Spark applications.
- Strong experience working with Spark configurations such as broadcast join thresholds, shuffle partition counts, caching, and repartitioning to improve job performance (see the sketch following this summary).
- Worked on Spark Streaming and Spark Structured Streaming with Kafka for real-time data processing.
- Strong experience operating in cloud environments such as Amazon Web Services (AWS) EC2 and S3.
- Continuous Delivery pipeline deployment experience with Maven, Ant, Jenkins, and AWS.
- Strong understanding of Distributed systems design, HDFS architecture, internal working details of MapReduce and Spark processing frameworks.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience using cloud services such as Amazon EMR, S3, EC2, Redshift, and Athena.
- Strong expertise in building scalable applications using various programming languages (Java, Scala, and Python).
- Proficient in core Java concepts such as multithreading, collections, and exception handling.
- Experience developing applications with Model-View-Controller architecture (MVC2) using the Spring Framework and J2EE design patterns.
- Solid experience in using the various file formats like CSV, TSV, Parquet, ORC, JSON and AVRO.
- Experienced working with various Hadoop Distributions (Cloudera, Hortonworks, Amazon EMR) to fully implement and leverage various Hadoop services.
- In-depth knowledge of importing/exporting data between relational databases and Hadoop using Sqoop.
- Well-versed in writing complex Hive queries using analytical functions.
- Knowledge of writing custom UDFs in Hive to support custom business requirements.
- Experienced in working with structured data using HiveQL, join operations, writing custom UDFs and optimizing Hive queries.
- Experience working with data lakes: repositories of data stored in its natural/raw format, usually object blobs or files, typically serving as a single store of enterprise data including raw copies of source system data and transformed data used for reporting, visualization, advanced analytics, and machine learning.
- Configured Spark Streaming to receive real-time data from Kafka, store the stream data in HDFS, and process it using Spark and Scala.
- Hands-on experience with Kafka and Flume to load log data from multiple sources directly into HDFS.
- Strong experience working with databases such as Oracle, MySQL, Teradata, and Netezza, and proficiency in writing complex SQL queries.
- Experience in version control tools like SVN, GitHub and CVS.
- Experienced working with JIRA for project management, GIT for source code management, JENKINS for continuous integration and Crucible for code reviews.
- Excellent communication and analytical skills; a quick learner with the capacity to work independently and a highly motivated team player.
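A minimal PySpark sketch of the Spark tuning approach noted above; the configuration values, path, and column name are illustrative assumptions, not recommendations.

```python
from pyspark.sql import SparkSession

# Illustrative tuning values only; real settings depend on cluster size and data volume.
spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    .config("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)  # raise broadcast join threshold to ~50 MB
    .config("spark.sql.shuffle.partitions", 400)                       # more shuffle partitions for large joins
    .getOrCreate()
)

# Hypothetical dataset and partition key.
events = spark.read.parquet("s3://bucket/events/")
events = events.repartition(200, "event_date")  # repartition on a frequently joined/grouped key
events.cache()                                  # cache a reused intermediate result
print(events.count())                           # materializes the cache
```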
TECHNICAL SKILLS
Big Data Technologies: Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Flume, Impala, HDFS, MapReduce, Hive, Pig, Sqoop, Oozie, Zookeeper
Hadoop Distribution: Cloudera CDH, Apache, AWS, Hortonworks HDP
Programming Languages: SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, Shell Scripting, Regular Expressions
Spark components: RDD, Spark SQL (Data Frames and Dataset), and Spark Streaming
Cloud Infrastructure: AWS, MS Azure
Databases: Oracle, Teradata, MySQL, SQL Server, NoSQL databases (HBase, MongoDB)
Scripting & Query Languages: Shell scripting, SQL
Version Control: CVS, SVN, ClearCase, Git
Build Tools: Maven, SBT
Containerization Tools: Kubernetes, Docker, Docker Swarm
Development & Reporting Tools: JUnit, Eclipse, Visual Studio, NetBeans, Azure Databricks, CI/CD, Linux, UNIX, Google Shell, Power BI, SAS, Tableau
PROFESSIONAL EXPERIENCE
Confidential, Scottsdale, Arizona
Senior Data Engineer/Big Data Engineer
Responsibilities:
- Extensively used Databricks notebooks for interactive analysis with Spark APIs.
- Developed a data pipeline using Kafka and Spark to store data into HDFS.
- Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables (see the sketch following this list).
- Created database components such as tables, views, and triggers using T-SQL to structure and maintain data effectively.
- Broad experience working with SQL, with deep knowledge of T-SQL (MS SQL Server).
- Worked with the data science group on preprocessing and feature engineering, and helped move machine learning algorithms into production.
- Used Azure Databricks as a fast, easy, and collaborative Spark-based analytics platform on Azure.
- Used Databricks to integrate easily with the whole Microsoft stack.
- Designed and automated custom-built input connectors using Spark, Sqoop, and Oozie to ingest and analyze data from RDBMS sources into Azure Data Lake.
- Involved in building an enterprise data lake using Data Factory and Blob Storage, enabling other teams to work with more complex scenarios and ML solutions.
- Experience working with the Azure cloud platform (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Synapse, SQL DB, and SQL DWH).
- Used Azure Data Factory with the SQL API and MongoDB API to integrate data from MongoDB, MS SQL, and cloud sources (Blob Storage, Azure SQL DB).
- Experience configuring, designing, implementing, and monitoring Kafka clusters and connectors.
- Used Azure Event Grid, a managed event service, to route and manage events across many different Azure services and applications.
- Used machine learning algorithms such as linear regression, multivariate regression, PCA, K-means, and KNN for data analysis.
- Worked with Delta Lake, whose merge, update, and delete operations enable complex use cases.
- Used Azure Synapse to manage processing workloads and serve data for BI and predictions.
- Responsible for the design and deployment of Spark SQL scripts and Scala shell commands based on functional specifications.
- Implemented scalable microservices to handle concurrency and high traffic; optimized existing Scala code and improved cluster performance.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Managed resources and scheduling across the cluster using Azure Kubernetes Service (AKS).
- Performed data cleansing and applied transformations using Databricks and Spark data analysis.
- Extensive knowledge of data transformations, mapping, cleansing, monitoring, debugging, performance tuning, and troubleshooting of Hadoop clusters.
- Created ADF pipelines using Linked Services, Datasets, and Pipelines to extract, transform, and load data to and from sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.
- Used Azure Synapse as a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs.
- Worked on Kafka and Spark integration for real-time data processing.
- Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
- Prepared data for interactive Power BI dashboards and reporting.
- Scripting on Linux and OS X platforms: Bash, GitHub, and the GitHub API.
- Developed Spark Scala scripts for data mining and performed transformations on large datasets to deliver timely insights and reports.
- Supported analytical phases, handled data quality, and improved performance using Scala's higher-order functions, lambda expressions, pattern matching, and collections.
- Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
- Provided guidance to the development team working on PySpark as an ETL platform.
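A minimal sketch of the PySpark CSV-to-Hive-ORC load described above; the paths, table name, and partition column are hypothetical.

```python
from pyspark.sql import SparkSession

# Hive support is needed to write managed Hive tables.
spark = (
    SparkSession.builder
    .appName("csv-to-hive-orc")
    .enableHiveSupport()
    .getOrCreate()
)

# Read one landing feed; schemas differ between feeds, so each feed directory
# is read (and its schema inferred) separately in practice.
raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/landing/customer_feed/*.csv")   # hypothetical path
)

# Persist into a Hive ORC table, partitioned by a load_date column assumed to exist in the feed.
(
    raw.write
    .format("orc")
    .mode("append")
    .partitionBy("load_date")
    .saveAsTable("staging.customer_feed_orc")   # hypothetical table name
)
```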
Environment: Hadoop, Spark, Hive, Sqoop, HBase, Oozie, Talend, Kafka, Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, AKS), Scala, Python, Cosmos DB, MS SQL, MongoDB, Ambari, Power BI, Azure DevOps, Microservices, K-Means, KNN, Ranger, Git
Confidential, Columbia, South Carolina
Senior Data Engineer
Responsibilities:
- Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, and handled structured data using Spark SQL (see the sketch following this list).
- Reduced access time by refactoring data models and streamlining queries, and implemented a Redis cache to support Snowflake.
- Involved in relational and dimensional data modeling to create logical and physical database designs and ER diagrams with all related entities and relationships, based on rules provided by the business manager, using ERwin r9.6.
- Worked in an AWS environment for development and deployment of custom Hadoop applications.
- Strong experience working with Elastic MapReduce (EMR) and setting up environments on Amazon EC2 instances.
- Worked with Hadoop ecosystem and Implemented Spark using Scala and utilized Dataframes and Spark SQL API for faster processing of data.
- Generated metadata and created Talend ETL jobs and mappings to load the data warehouse and data lake.
- Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply machine learning.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Developed RDDs/DataFrames in Spark and applied several transformation steps to load data from Hadoop data lakes.
- Developed Spark programs with Python and applied functional programming principles to process complex structured datasets.
- Worked in a fast-paced agile development environment to quickly analyze, develop, and test potential use cases for the business.
- Responsible for the design and development of high-performance data architectures supporting data warehousing, real-time ETL, and batch big data processing.
- Filtered and cleaned data using Scala code and SQL queries.
- Experience in data processing tasks such as collecting, aggregating, and moving data using Apache Kafka.
- Used Kafka to load data into HDFS and move data back to S3 after data processing
- Developed automated regression scripts for validation of ETL processes between multiple databases such as AWS Redshift, Oracle, MongoDB, T-SQL, and SQL Server using Python.
- Involved as primary on-site ETL developer during the analysis, planning, design, development, and implementation stages of projects using IBM WebSphere software (QualityStage v9.1, Web Service, Information Analyzer, ProfileStage).
- Prepared data mapping documents (DMD) and designed the ETL jobs based on the DMD with the required tables in the dev environment.
- Worked with SQL Server components SSIS (SQL Server Integration Services), SSAS (Analysis Services), and SSRS (Reporting Services); used Informatica, SSIS, SPSS, and SAS to extract, transform, and load source data from transaction systems.
- Worked with Hadoop infrastructure to store data in HDFS and used Spark/Hive SQL to migrate the underlying SQL codebase to AWS.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
- Analyzed SQL scripts and designed solutions implemented with PySpark.
- Exported tables from Teradata to HDFS using Sqoop and built tables in Hive.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/big data concepts.
- Implemented installation and configuration of a multi-node cluster in the cloud using Amazon Web Services (AWS) EC2.
- Used Informatica PowerCenter for ETL (extraction, transformation, and loading) of data from heterogeneous source systems; studied and reviewed application of the Kimball data warehouse methodology and the SDLC across various industries to work successfully with data-handling scenarios.
- Designed and developed the architecture for a data services ecosystem spanning relational, NoSQL, and big data technologies; extracted large datasets from Amazon Redshift, AWS, and Elasticsearch using SQL queries to create reports.
- Used Talend for Big Data Integration using Spark and Hadoop.
- Used Kafka and Kafka brokers, initiated the Spark context, and processed live streaming data with RDDs; used Kafka to load data into HDFS and NoSQL databases.
- Used ZooKeeper to store offsets of messages consumed for a specific topic and partition by a specific consumer group in Kafka.
- Used Kafka capabilities such as distribution, partitioning, and the replicated commit log service for messaging systems by maintaining feeds, and created applications that monitor consumer lag within Apache Kafka clusters.
- Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Extensively worked with Join, Lookup (normal and sparse), and Merge stages.
- Applied data modeling and data design between staging and target layers to create views.
- Responsible for requirements gathering, system analysis, design, development, testing, and deployment.
- Developed reusable objects such as PL/SQL program units and libraries, database procedures, functions, and triggers to be used by the team and to satisfy business rules.
- Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLTP reporting.
- Wrote scripts against Oracle, SQL Server, and Netezza databases to extract data for reporting and analysis; imported and cleansed high-volume data from sources such as DB2, Oracle, and flat files onto SQL Server.
- Evaluated big data technologies and prototyped solutions to improve the data processing architecture; performed data modeling, development, and administration of relational and NoSQL databases (BigQuery, Elasticsearch).
- Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, and Spark Streaming along with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
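A minimal sketch of loading JSON with Spark SQL and persisting it to a Hive table, as described above; the path, view, column, and table names are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("json-to-hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Load semi-structured JSON; Spark infers the schema.
events = spark.read.json("hdfs:///data/raw/events/*.json")   # hypothetical path

# Expose the DataFrame to Spark SQL as a temporary view.
events.createOrReplaceTempView("events_raw")

# Handle the structured data with Spark SQL (column names are illustrative).
cleaned = spark.sql("""
    SELECT event_id, user_id, event_type, CAST(event_ts AS TIMESTAMP) AS event_ts
    FROM events_raw
    WHERE event_id IS NOT NULL
""")

# Persist the result into a Hive table.
cleaned.write.mode("overwrite").saveAsTable("analytics.events_cleaned")
```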
Environment: Hadoop, Spark, Scala, HBase, Hive, UNIX, Erwin, TOAD, MS SQL Server, XML files, AWS, Cassandra, MongoDB, Kafka, IBM InfoSphere DataStage, PL/SQL, Oracle 12c, flat files, Autosys, MS Access.
Confidential, Chicago, Illinois
Big Data Engineer
Responsibilities:
- Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, Zookeeper and Sqoop.
- Effectively communicated plans, project status, project risks, and project metrics to the project team, and planned test strategies in accordance with project scope.
- Migrated the existing architecture to Amazon Web Services, utilizing technologies such as Kinesis, Redshift, AWS Lambda, and CloudWatch metrics.
- Developed data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data into HDFS for analysis.
- Queried alerts landing in S3 buckets with Amazon Athena to identify differences in alert generation between the Kafka cluster and the Kinesis cluster.
- Extensive use of Python for managing services in AWS using the boto library.
- Closely monitored and analyzed MapReduce job executions on the cluster at the task level and optimized Hadoop cluster components to achieve high performance.
- Created Hive tables, loaded data, and wrote Hive UDFs; worked with the Linux server admin team in administering the server hardware and operating system.
- Integrated HDP clusters with Active Directory and enabled Kerberos for Authentication.
- Evaluated existing infrastructure, systems, and technologies and provided gap analysis.
- Installed and Configured Sqoop to import and export the data into Hive from Relational databases.
- Administered large Hadoop environments, including cluster setup, performance tuning, and monitoring in an enterprise environment.
- Used Python and SAS to extract, transform, and load source data from transaction systems and generated reports, insights, and key conclusions.
- Designed and implemented large-scale distributed solutions in AWS.
- Monitored Hadoop cluster health through MCS and worked on NoSQL databases including HBase.
- Implemented Partitioning, Dynamic Partitions, Buckets in HIVE.
- Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, allowing end users to understand the data on the fly with quick filters for on-demand information.
- Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems, as well as RDBMS and NoSQL data stores, for data access and analysis.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS (see the sketch following this list).
- Designed and developed data mapping procedures and the ETL (data extraction, data analysis, and loading) process for integrating data using R programming.
- Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS.
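A minimal Structured Streaming sketch of the Kafka-to-HDFS flow described above; the broker addresses, topic, and paths are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Requires the spark-sql-kafka connector package on the classpath.
spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Subscribe to a Kafka topic (placeholder brokers and topic name).
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "clickstream")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; decode the value before persisting.
decoded = stream.select(col("value").cast("string").alias("payload"))

# Land the stream on HDFS as Parquet with checkpointing for fault tolerance.
query = (
    decoded.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streams/clickstream")
    .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```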
Environment: Spark SQL, Scala, Kafka, Python, Hive, Sqoop, Hadoop YARN, Spark, Spark Streaming, Impala, Tableau, Talend, Oozie, Java, AWS, S3, Oracle 12c, Linux
Confidential
Data Engineer
Responsibilities:
- Developed Spark jobs and Hive Jobs to summarize and transform data.
- Handled importing data from different data sources into HDFS using Sqoop, performed transformations using Hive, and loaded the processed data into HDFS.
- Used Sqoop to transfer data between relational databases and Hadoop.
- Created reports in Tableau to visualize the datasets produced, and tested Spark SQL connectors.
- Implemented complex Hive UDFs to execute business logic within Hive queries.
- Experienced in maintaining the Hadoop cluster on AWS EMR.
- Loaded data into S3 buckets using AWS Glue and PySpark; filtered data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames, Scala, and Python (see the sketch following this list).
- Responsible for building scalable distributed data solutions using Hadoop.
- Developed different kinds of custom filters and handled predefined filters on HBase data using the API.
- Implemented Spark using Scala and utilizing Data frames and Spark SQL API for faster processing of data.
- Experienced in developing Spark scripts for data analysis in both Python and Scala.
- Wrote Scala scripts to make Spark Streaming work with Kafka as part of Spark-Kafka integration efforts.
- Built on-premises data pipelines using Kafka and Spark for real-time data analysis.
- Used Terraform to set up security groups and CloudWatch metrics in AWS.
- Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
- Collected and aggregated large amounts of log data and staged it in HDFS for further analysis.
- Experience in managing and reviewing Hadoop Log files.
- Worked on HDFS to store and access huge datasets within Hadoop.
- Good hands-on experience with GitHub.
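A minimal sketch of rewriting a Hive query as Spark DataFrame transformations, as mentioned above; the table and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("hive-to-dataframe")
    .enableHiveSupport()
    .getOrCreate()
)

# Original Hive query (illustrative):
#   SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
#   FROM sales.orders
#   WHERE order_date >= '2020-01-01'
#   GROUP BY region;

# Equivalent DataFrame transformation chain.
orders = spark.table("sales.orders")  # hypothetical Hive table
summary = (
    orders
    .filter(F.col("order_date") >= "2020-01-01")
    .groupBy("region")
    .agg(F.count("*").alias("orders"), F.sum("amount").alias("revenue"))
)
summary.write.mode("overwrite").saveAsTable("sales.orders_by_region")
```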
Environment: Cloudera Manager (CDH), HDFS, Sqoop, Scala, Oozie, Kafka, Pig, Hive, AWS, Tableau, Python, Flume, MySQL, Java, Git.
Confidential
Data Engineer
Responsibilities:
- Utilized Agile and Scrum methodology for team and project management.
- Provided business intelligence analysis to decision-makers using an interactive OLAP tool
- Used PySpark and Pandas to calculate moving averages and RSI scores for stocks and loaded the results into the data warehouse (see the sketch following this list).
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
- Performed Tableau administration using Tableau admin commands.
- Worked on SQL Server Integration Services (SSIS) to integrate and analyze data from multiple heterogeneous information sources
- Developed complex SQL statements to extract data and to package/encrypt data for delivery to customers.
- Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
- Partnered with ETL developers, using Pig, to ensure that data was well cleaned and the data warehouse stayed up to date for reporting purposes.
- Created T-SQL statements (SELECT, INSERT, UPDATE, DELETE) and stored procedures.
- Defined Data requirements and elements used in XML transactions.
- Created Informatica mappings using various transformations such as Joiner, Aggregator, Expression, Filter, and Update Strategy.
- Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
- Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
- Used Git for version control with colleagues.
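A minimal sketch of the moving-average and RSI calculation described above, combining a PySpark window function with pandas; the table, columns, symbol, and the 14-day period are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("stock-indicators").getOrCreate()

# Hypothetical daily price table with columns: symbol, trade_date, close.
prices = spark.table("market.daily_prices")

# 14-day moving average per symbol via a window over the trailing 14 rows.
w = Window.partitionBy("symbol").orderBy("trade_date").rowsBetween(-13, 0)
with_ma = prices.withColumn("ma_14", F.avg("close").over(w))

# RSI is simpler to express in pandas; compute it for one symbol after collecting it.
pdf = (
    with_ma.filter(F.col("symbol") == "AAPL")
    .orderBy("trade_date")
    .toPandas()
)
delta = pdf["close"].diff()
avg_gain = delta.clip(lower=0).rolling(14).mean()
avg_loss = (-delta.clip(upper=0)).rolling(14).mean()
pdf["rsi_14"] = 100 - 100 / (1 + avg_gain / avg_loss)
```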
Environment: Spark, AWS Redshift, Python, Tableau, Informatica, Pandas, Pig, PySpark, SQL Server, T-SQL, XML, Git.