Senior Big Data Engineer Resume
Englewood, CO
PROFESSIONAL SUMMARY:
- Over 8 years of IT experience in analysis, design, development, implementation, maintenance, and support, with experience in developing strategic methods for deploying big data technologies to efficiently solve Big Data processing requirements.
- Experience using Job scheduling tools like Cron, Tivoli and Automic.
- Experienced in troubleshooting errors in HBase Shell/API, Pig, Hive and MapReduce.
- Implemented various algorithms for analytics using Cassandra with Spark and Scala.
- Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS and other services of the AWS family.
- Expertise in using various Hadoop ecosystem tools such as MapReduce, Pig, Hive, ZooKeeper, HBase, Sqoop, Oozie, Flume, Drill and Spark for data storage and analysis.
- Excellent experience and knowledge of Machine Learning, Mathematical Modeling and Operations Research; comfortable with R, Python, SAS, Weka, MATLAB and relational databases. Deep understanding of and exposure to the Big Data ecosystem.
- Selecting appropriate AWS services to design and deploy an application based on given requirements.
- Experienced in managing Hadoop clusters and services using Cloudera Manager.
- Good knowledge in querying data from Cassandra for searching, grouping and sorting.
- Good knowledge of Amazon AWS concepts like the EMR and EC2 web services, which provide fast and efficient processing of Big Data.
- Highly experienced in importing and exporting data between HDFS and relational database management systems using Sqoop.
- Extensive experience as a Hadoop and Spark engineer and Big Data analyst.
- Excellent understanding of Hadoop architecture and the underlying framework, including storage management.
- Experience in installing, configuring and administering Hadoop clusters for major Hadoop distributions like CDH4 and CDH5.
- Experience in working with the Hive data warehouse tool: creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing HiveQL queries.
- Strong experience in core Java, Scala, SQL, PL/SQL and RESTful web services.
- Experienced in identifying improvement areas for system stability and providing end-to-end high-availability architectural solutions.
- Good experience in generating statistics, extracts and reports from Hadoop.
- Experience in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HQL (HiveQL), and used UDFs from the Piggybank UDF repository (a minimal sketch follows at the end of this summary).
- Experienced in running queries using Impala and used BI tools to run ad-hoc queries directly on Hadoop.
- Good experience in the Oozie framework and automating daily import jobs.
- Experience in developing data pipelines through the Kafka-Spark API.
- Proficient in data processing: collecting, aggregating and moving data from various sources using Apache Flume and Kafka.
- Assisted the deployment team in setting up the Hadoop cluster and services.
- Good knowledge in benchmarking and performance tuning of clusters.
- Designed and implemented a product search service using Apache Solr.
- Experienced in building data warehouses on the Azure platform using Azure Databricks and Data Factory.
- Good understanding of NoSQL databases and hands-on experience in writing applications on NoSQL databases like Cassandra and MongoDB.
- Experienced in creating vizboards for data visualization in Platfora for real-time dashboards on Hadoop.
- Collected log data from various sources and integrated it into HDFS using Flume.
- Determined, committed and hardworking individual with strong communication, interpersonal and organizational skills.
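For illustration, a minimal sketch of the kind of custom Python UDF work mentioned above, wired into HiveQL through Hive's TRANSFORM streaming interface. The script name and the (user_id, raw_amount) column layout are hypothetical, not taken from a specific project described here.

```python
#!/usr/bin/env python
# normalize_amount.py -- hypothetical Python "UDF" used via Hive streaming (TRANSFORM).
# Hive pipes tab-separated rows to stdin; the script writes transformed rows to stdout.
# Invoked from HiveQL roughly as:
#   SELECT TRANSFORM (user_id, raw_amount)
#   USING 'python normalize_amount.py'
#   AS (user_id, amount)
#   FROM sales_raw;
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue
    user_id, raw_amount = fields[0], fields[1]
    try:
        normalized = "{:.2f}".format(float(raw_amount))  # normalize amount to two decimals
    except ValueError:
        normalized = "0.00"                              # default for malformed input
    print("\t".join([user_id, normalized]))
```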
TECHNICAL SKILLS:
Big Data Tools/ Hadoop Ecosystem: MapReduce, Spark, Airflow, NiFi, HBase, Hive, Pig, Sqoop, Kafka, Oozie, Hadoop
Databases: Oracle 12c/11g/10g, Teradata R15/R14, MySQL, SQL Server, NoSQL (MongoDB, Cassandra, HBase), Snowflake
ETL/Data warehouse Tools: Informatica and Tableau.
BI Tools: SSIS, SSRS, SSAS.
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: SQL, PL/SQL, and UNIX shell scripting
Cloud Platform: Amazon Web Services (AWS), Microsoft Azure
Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena; MS Azure - Data Lake, Data Storage, Databricks, Data Factory
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
Operating System: Windows, Unix, Sun Solaris
Methodologies: System Development Life Cycle (SDLC), Agile
EXPERIENCE:
Confidential, Englewood, CO
Senior Big Data Engineer
Responsibilities:
- Developed automated regression scripts for validation of ETL processes between multiple databases like AWS Redshift, Oracle, MongoDB and SQL Server (T-SQL) using Python.
- Used Airflow for scheduling the Hive, Spark and MapReduce jobs.
- Used Spark SQL to load JSON data, create schema RDDs and load them into Hive tables, and handled structured data using Spark SQL.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Developed RDDs/DataFrames in Spark and applied several transformations to load data from Hadoop data lakes.
- Developed Spark programs with Python and applied principles of functional programming to process complex structured data sets.
- Worked with Hadoop infrastructure to store data in HDFS and used Spark/Hive SQL to migrate the underlying SQL codebase to AWS.
- Used Python in Spark to extract data from Snowflake and upload it to Salesforce on a daily basis.
- Worked with the Hadoop ecosystem and implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster processing of data.
- Developed Spark code in Scala and Python for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Extracted, transformed and loaded data sources to generate CSV data files with Python programming and SQL queries.
- Subscribed to Kafka topics with the Kafka consumer client and processed the events in real time using Spark.
- Collected data using Spark Streaming from the AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
- Analyzed SQL scripts and designed the solution to implement them using PySpark.
- Extensively used Terraform with AWS Virtual Private Cloud to automatically set up and modify settings by interfacing with the control layer.
- Practical understanding of data modeling (dimensional and relational) concepts like star schema modeling, snowflake schema modeling, and fact and dimension tables.
- Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications; executed machine learning use cases under Spark ML and MLlib.
- Developed a Spark Streaming job to consume data from Kafka topics of different source systems and push the data into HDFS locations (see the streaming sketch at the end of this section).
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
- Filtered and cleaned data using Scala code and SQL queries.
- Troubleshot errors in HBase Shell/API, Pig, Hive and MapReduce.
- Installed and configured a multi-node cluster on the cloud using Amazon Web Services (AWS) EC2.
- Used Terraform in managing resource scheduling, disposable environments, and multitier applications.
- Expertise in Terraform for multi-cloud deployment using a single configuration.
- Responsible for the design and development of high-performance data architectures that support data warehousing, real-time ETL and batch big data processing.
- Designed and developed a real-time stream processing application using Spark, Kafka, Scala and Hive to perform streaming ETL and apply machine learning.
- Used Python to write an event-based service with AWS Lambda to deliver real-time data to One-Lake (a data lake solution in the Cap-One enterprise).
- Used Talend for Big Data Integration using Spark and Hadoop.
- Responsible for analyzing large data sets and deriving customer usage patterns by developing new MapReduce programs using Java.
- Performed structural modifications using MapReduce and Hive, and analyzed data using visualization/reporting tools (Tableau).
- Designed Kafka producer client using Confluent Kafka and produced events into Kafka topic.
- Responsible for gathering requirements, system analysis, design, development, testing and deployment.
- Worked on SQL Server concepts: SSIS (SQL Server Integration Services), SSAS (Analysis Services) and SSRS (Reporting Services). Used Informatica, SSIS, SPSS and SAS to extract, transform and load source data from transaction systems.
- Developed reusable objects like PL/SQL program units and libraries, database procedures and functions, and database triggers to be used by the team, satisfying the business rules.
- Involved in writing scripts in Oracle, SQL Server and Netezza databases to extract data for reporting and analysis, and worked on importing and cleansing high-volume data from various sources like DB2, Oracle and flat files into SQL Server.
- Generated metadata and created Talend ETL jobs and mappings to load the data warehouse and data lake.
- Involved in relational and dimensional data modeling for creating the logical and physical design of the database and ER diagrams with all related entities and the relationships between them, based on the rules provided by the business manager, using Erwin r9.6.
- Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming and a broad variety of machine learning methods, including classification, regression and dimensionality reduction.
- Used Informatica PowerCenter for ETL (extraction, transformation and loading of data) from heterogeneous source systems, and studied and reviewed the application of the Kimball data warehouse methodology as well as the SDLC across various industries to work successfully with data-handling scenarios.
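The streaming sketch referenced above: a minimal PySpark Structured Streaming job that consumes a Kafka topic and lands the events in HDFS as Parquet. The broker address, topic name and HDFS paths are placeholders, and the job assumes the spark-sql-kafka connector is available on the classpath; the production jobs described here may differ in detail.

```python
# Minimal sketch: Kafka topic -> Spark Structured Streaming -> Parquet files on HDFS.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
          .option("subscribe", "source_events")               # placeholder topic
          .option("startingOffsets", "latest")
          .load()
          .select(col("key").cast("string"), col("value").cast("string")))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/source_events")                # placeholder path
         .option("checkpointLocation", "hdfs:///checkpoints/source_events")
         .outputMode("append")
         .start())

query.awaitTermination()
```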
Environment: Hadoop, Spark, Scala, HBase, Hive, Python, PL/SQL, AWS, EC2, S3, Lambda, Auto Scaling, CloudWatch, CloudFormation, IBM InfoSphere, DataStage, MapReduce, Oracle 12c, Flat files, TOAD, MS SQL Server database, XML files, Cassandra, Snowflake, MongoDB, Kafka, MS Access database, Autosys, UNIX, Erwin.
Confidential, St. Louis, MO
Big Data Engineer
Responsibilities:
- Involved in the complete Big Data flow of the application, from ingesting data from upstream sources into HDFS, through processing the data in HDFS, to analyzing the data.
- Good experience importing and exporting data between HDFS/Hive and relational database systems like MySQL, and vice versa, using Sqoop.
- Implemented sentiment analysis and text analytics on Twitter social media feeds and market news using Scala and Python.
- Extensive experience in loading and analyzing large datasets with the Hadoop framework (MapReduce, HDFS, Pig, Hive, Flume, Sqoop, Spark, Impala, Scala) and NoSQL databases like MongoDB, HBase and Cassandra.
- Responsible for designing and developing data ingestion from Kroger using Apache NiFi/Kafka.
- Experience with scalable architectures using Azure App Service, API management, serverless technologies
- Assisted in upgrading, configuring and maintaining various Hadoop components like Pig, Hive and HBase.
- Hands-on experience with large-scale big data methods including Hadoop (worked with components including HDFS and Oozie), Spark, Hive (data transformation), Impala and Hue.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
- Developed MapReduce programs for data analysis and data cleaning.
- Developed Python, shell/Perl and PowerShell scripts for automation purposes and performed component unit testing using the Azure Emulator.
- Good working experience with Hadoop cluster architecture and monitoring the cluster. In-depth understanding of data structures and algorithms.
- Developed an automation system using PowerShell scripts and JSON templates to remediate the Azure services.
- Performance-tuned HBase, Phoenix and Hive queries and Spark Streaming code.
- The data is ingested into this application using Hadoop technologies like Pig and Hive.
- Experience in using ZooKeeper and Oozie operational services to coordinate clusters and schedule workflows.
- Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the SQL activity. Built an ETL job that utilizes a Spark JAR to execute the business analytical model.
- Migrated several on-premises solutions to the Azure cloud, including infrastructure and network cloud integration (IaaS).
- Developed Oozie workflows to run multiple Hive, Pig, Tealeaf, MongoDB, Git, Sqoop and Spark jobs.
- Wrote core Java code to format XML documents and uploaded them to the Solr server for indexing.
- Worked on Ad hoc queries, Indexing, Replication, Load balancing, Aggregation in MongoDB.
- Extensive Experience on importing and exporting data using Flume and Kafka.
- Installed and configured Hive, wrote Hive UDFs, and used MapReduce and JUnit for unit testing.
- Exploratory Data Analysis and Data wrangling with R and Python.
- Experience in implementing standards and processes for Hadoop-based application design and implementation.
- Wrote scripts for creating, truncating, dropping and altering HBase tables to store the data after execution of MapReduce jobs and to use for later analytics.
- Experience in job management using the Fair Scheduler; developed job-processing scripts using Oozie workflows to run multiple Spark jobs in sequence for processing data.
- Processed web server logs by developing multi-hop Flume agents using the Avro sink, loaded them into MongoDB for further analysis, and also extracted files from MongoDB through Flume and processed them.
- Installed the Oozie workflow engine to run multiple Hive jobs.
- Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
- Migrated databases to the SQL Azure cloud platform and performed performance tuning.
- Worked as a Hadoop Developer on Hadoop eco-systems including Hive, Zookeeper, Spark Streaming with MapR distribution.
- Supported MapReduce programs running on the cluster. Involved in loading data from the UNIX file system to HDFS.
- Experience in migrating data using Sqoop between HDFS/Hive and relational database systems, and vice versa, according to the client's requirements.
- Experience in Hadoop Streaming and writing MR jobs using Perl and Python in addition to Java.
- Used Spark Streaming to receive real-time data from Kafka and stored the stream data in HDFS using Scala and in NoSQL databases such as HBase and Cassandra.
- Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
- Involved in installation, configuration, supporting and managing Hadoop clusters, Hadoop cluster administration.
- Good knowledge in Cluster coordination services through Zookeeper and Kafka.
- Involved in creating Hive tables, loading the data and writing Hive queries that run internally as MapReduce jobs.
- Used Spark DataFrames, Spark SQL and Spark MLlib extensively, designing and developing POCs using Scala, Spark SQL and the MLlib libraries.
- Documented the requirements, including the available code to be implemented using Spark, Hive, HDFS, HBase and Elasticsearch.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
- Wrote PySpark and Spark SQL transformations in Azure Databricks to perform complex transformations for business rule implementation (see the transformation sketch at the end of this section).
- Worked with a team of developers to design, develop and implement BI solutions for multiple projects.
- Expertise in Python and Scala; wrote user-defined functions (UDFs) for Hive and Pig using Python.
- Used ZooKeeper to provide coordination services to the cluster.
- Enabled monitoring and Azure Log Analytics to alert the support team on usage and statistics of the daily runs.
- Created complex dashboard using parameters, sets, groups, and calculations to drill down and drill up in worksheets and customization using filters and actions.
- Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.
- Experience in developing MapReduce programs using Apache Hadoop to analyze big data as per the requirements.
- Responsible for developing a data pipeline using Flume, Sqoop and Pig to extract the data from weblogs and store it in HDFS.
- Involved in converting MapReduce programs into Spark transformations using Spark RDDs with Scala and Python.
- Utilized Spark's parallel processing capabilities to ingest data.
- Created and executed HQL scripts that create external tables in a raw-layer database in Hive.
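The transformation sketch referenced above: a hedged example of the kind of PySpark/Spark SQL business-rule transformation run in Azure Databricks. The table names, columns and rules are illustrative assumptions, not the actual business logic.

```python
# Sketch: read a raw-layer table, apply simple business rules, write a curated table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("business-rule-transform").getOrCreate()

raw = spark.table("raw_db.customer_orders")  # hypothetical raw-layer table

curated = (raw
           .filter(F.col("order_status").isNotNull())            # drop incomplete rows
           .withColumn("order_date", F.to_date("order_ts"))      # derive a date column
           .withColumn("net_amount",
                       F.col("gross_amount") - F.coalesce(F.col("discount"), F.lit(0.0)))
           .groupBy("customer_id", "order_date")
           .agg(F.sum("net_amount").alias("daily_spend")))

(curated.write
        .mode("overwrite")
        .saveAsTable("curated_db.daily_customer_spend"))  # hypothetical curated table
```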
Environment: Hadoop, Hive, Spark, HBase, MapReduce, Snowflake, PL/SQL, Kafka, Unix, Cucumber JVM, MongoDB, GitHub, Bitbucket, SQL, Oracle 12c, NoSQL database, API, Java, Jenkins.
Confidential, Houston, TX
Big Data Engineer
Responsibilities:
- Developed Python scripts to automate data sampling process. Ensured the data integrity by checking for completeness, duplication, accuracy, and consistency
- Worked on Big data on AWS cloud services i.e. EC2, S3, EMR and DynamoDB
- Involved in forward engineering of the logical models to generate the physical model using Erwin, and subsequent deployment to the enterprise data warehouse.
- Worked on publishing interactive data visualization dashboards, reports and workbooks on Tableau and SAS Visual Analytics.
- Wrote various data normalization jobs for new data ingested into Redshift.
- Developed SSRS reports, SSIS packages to Extract, Transform and Load data from various source systems
- Implemented and managed ETL solutions and automated operational processes.
- Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
- Used Hive SQL, Presto SQL and Spark SQL for ETL jobs, choosing the right technology to get the job done.
- Defined and deployed monitoring, metrics, and logging systems on AWS.
- Developed code to handle exceptions and push the code into the exception Kafka topic.
- Was responsible for ETL and data validation using SQL Server Integration Services.
- Created and maintained documents related to business processes, mapping design, data profiles and tools.
- Defined facts and dimensions and designed the data marts using Ralph Kimball's dimensional data mart modeling methodology in Erwin.
- Integrated Kafka with Spark Streaming for real time data processing
- Managed security groups on AWS, focusing on high availability, fault tolerance and auto scaling using Terraform templates, along with continuous integration and continuous deployment using AWS Lambda and AWS CodePipeline.
- Created Entity Relationship Diagrams (ERD), functional diagrams and data flow diagrams, enforced referential integrity constraints, and created logical and physical models using Erwin.
- Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad-hoc queries. This allowed for a more reliable and faster reporting interface, giving sub-second query response for basic queries.
- Applied various machine learning algorithms and statistical modeling techniques, such as decision trees, logistic regression and Gradient Boosting Machines, to build predictive models using the scikit-learn package in Python (see the sketch at the end of this section).
- Designed and built multi-terabyte, full end-to-end data warehouse infrastructure from the ground up on Confidential Redshift for large-scale data, handling millions of records every day.
- Created various complex SSIS/ETL packages to Extract, Transform and Load data
- Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce jobs that extract the data in a timely manner.
- Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
- Analyzed the existing application programs and tuned SQL queries using the execution plan, Query Analyzer, SQL Profiler and the Database Engine Tuning Advisor to enhance performance.
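The sketch referenced above: a minimal scikit-learn workflow of the kind described (here a logistic regression with a train/test split and AUC evaluation). The data is synthetic; the real models, features and tuning are not reproduced here.

```python
# Sketch: fit a predictive model with scikit-learn and score it on a holdout set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                                   # synthetic features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print("Holdout AUC: %.3f" % auc)
```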
Environment: SQL Server, Erwin, Kafka, Python, MapReduce, Oracle, AWS, Redshift, Informatica, RDS, NoSQL, MySQL, PostgreSQL.
Confidential
Data Engineer
Responsibilities:
- Created HBase tables to load large sets of structured data.
- Managed and reviewed Hadoop log files.
- Worked extensively with Hive DDLs and the Hive Query Language (HQL).
- Analyzed the data using MapReduce, Pig and Hive, and produced summary results from Hadoop for downstream systems (a MapReduce-style sketch follows this list).
- Used Pig as ETL tool to do transformations, event joins and some pre-aggregations before storing the data onto HDFS.
- Developed data pipeline using flume, Sqoop and pig to extract the data from weblogs and store in HDFS.
- Used Sqoop to import and export data from HDFS to RDBMS and vice-versa.
- Exported the analyzed data to the relational database MySQL using Sqoop for visualization and to generate reports.
- Used AWS Glue for data transformation, validation and cleansing.
- Used Sqoop widely in order to import data from various systems/sources (like MySQL) into HDFS.
- Created components like Hive UDFs for missing functionality in HIVE for analytics.
- Developed scripts and batch jobs to schedule a bundle (a group of coordinators).
- Used different file formats like text files, SequenceFiles and Avro.
- Provided cluster coordination services through ZooKeeper.
- Developed UDF, UDAF and UDTF functions and implemented them in Hive queries.
- Implemented SQOOP for large dataset transfer between Hadoop and RDBMs.
- Processed data into HDFS by developing solutions.
- Created MapReduce jobs to convert periodic XML messages into partitioned Avro data.
- Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
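The MapReduce-style sketch referenced above, expressed as Python scripts for Hadoop Streaming (a word/record-count pattern). The production jobs described here may well have been written in Java, and the file names are hypothetical.

```python
#!/usr/bin/env python
# mapper.py -- emits (word, 1) pairs for each whitespace-separated token on stdin.
# A streaming job would run roughly as:
#   hadoop jar hadoop-streaming.jar -input <in> -output <out> \
#     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)
```

```python
#!/usr/bin/env python
# reducer.py -- sums counts per word; Hadoop Streaming delivers input sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```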
Environment: Hadoop, HDFS, MapReduce, AWS, Hive, Pig, Sqoop, HBase, Shell Scripting, Oozie, Oracle 11g, Ad-hoc Queries, MS Excel, Windows
Confidential
Associate Data Engineer
Responsibilities:
- Experience in developing scalable & secure data pipelines for large datasets.
- Gathered requirements for ingestion of new data sources including life cycle, data quality check, transformations, and metadata enrichment.
- Created YAML files for each data source, including Glue table stack creation.
- Wrote UNIX shell scripts to automate the jobs and scheduled cron jobs for job automation using crontab.
- Manipulated and built datasets from given data to support analyses through the use of efficient SQL code.
- Wrote complex SQL queries, stored procedures, triggers, views, cursors, joins, constraints, DDL, DML and user-defined functions to implement the business logic, and created clustered and non-clustered indexes.
- Supported data quality management by implementing proper data quality checks in data pipelines (see the sketch at the end of this section).
- Tuned performance of Informatica mappings and sessions for improving the process and making it efficient after eliminating bottlenecks.
- Worked on complex SQL queries and PL/SQL procedures and converted them to ETL tasks.
- Generated reports on predictive analytics using Python and Tableau, including visualizing model performance and prediction results.
- Performed cross-validation on a single holdout set to evaluate the model's performance on data sets and fine-tuned the model upon the arrival of new data.
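The sketch referenced above: a hedged example of a simple data quality check implemented in PySpark. The dataset path, column names and thresholds are illustrative assumptions, not project values.

```python
# Sketch: fail the pipeline when basic completeness/uniqueness checks do not pass.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

df = spark.read.parquet("hdfs:///data/curated/transactions")  # hypothetical dataset

total = df.count()
null_ids = df.filter(F.col("transaction_id").isNull()).count()
duplicate_ids = total - df.select("transaction_id").distinct().count()

# Thresholds below are arbitrary examples.
if total == 0 or null_ids > 0 or duplicate_ids > 0.01 * total:
    raise ValueError("Data quality check failed: rows=%d, null_ids=%d, duplicates=%d"
                     % (total, null_ids, duplicate_ids))
```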
Environment: XML files, JSON files, Java, PL/SQL, SQL, Tableau, Python, MS Office, Windows, Unix.