Senior Big Data Engineer Resume
Las Vegas, NV
SUMMARY
- 7+ years of IT experience as a Big Data Developer, Designer, and QA Engineer, with cross-platform integration experience using the Hadoop ecosystem, Java, and functional automation
- Practiced Agile Scrum methodology and contributed to TDD, CI/CD, and all aspects of the SDLC
- Hands-on experience installing, configuring, and architecting Hadoop and Hortonworks clusters and services: HDFS, MapReduce, YARN, Pig, Hive, Oozie, Flume, HBase, Spark, and Sqoop.
- Scheduled all Hadoop/Hive/Sqoop/HBase jobs using Oozie
- Completed application builds for web applications, web services, Windows services, console applications, and client GUI applications.
- Experienced in troubleshooting and automating deployments to web and application servers such as WebSphere, WebLogic, JBoss, and Tomcat.
- Solid experience with cloud platforms such as Amazon Web Services (AWS) and Microsoft Azure.
- Experienced in integrating deployments with multiple build systems and providing an application model that handles multiple projects.
- Hands-on experience integrating REST APIs with cloud environments to access resources.
- Developed Spark programs, created DataFrames, and implemented transformations.
- Performed data processing, transformations, and actions in Spark using Python (PySpark).
- Developed a framework for converting existing PowerCenter mappings to Spark (PySpark) jobs.
- Experienced in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation across multiple file formats, uncovering insights into customer usage patterns (a brief illustrative sketch follows this summary).
- Set up clusters on Amazon EC2 and S3, including automating cluster provisioning and scaling in AWS
- Experienced in defining detailed application software test plans, including organization, participants, schedule, and test and application coverage scope
- Gathered and defined functional and UI requirements for software applications
- Experienced in real-time analytics with Apache Spark RDDs, DataFrames, and the Streaming API
- Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data
- Experienced in integrating Hadoop with Kafka and uploading clickstream data to HDFS.
- Expert in utilizing Kafka as a publish-subscribe messaging system.
- Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize their applications (CI/CD) to deploying them on public or private clouds.
- Monitored tasks and DAGs, triggering task instances once their dependencies were complete.
- Spun up a scheduler subprocess that monitors and stays in sync with all DAGs in the specified DAG directory; once per minute, by default, the scheduler collects DAG parsing results and checks whether any active tasks can be triggered.
- Installed and configured the OpenShift platform for managing Docker containers and Kubernetes clusters.
- Practiced DevOps for microservices using Kubernetes as the orchestrator.
- Created templates and wrote Bash, Ruby, Python, and PowerShell scripts to automate tasks.
- Good knowledge of and hands-on experience with monitoring tools such as Splunk and Nagios.
- Knowledge of network protocols and services such as FTP, SSH, HTTP, HTTPS, TCP/IP, DNS, VPNs, and firewall groups.
- Responsible for writing MapReduce programs.
- Experienced in loading data into Hive partitions, creating Hive buckets, and developing MapReduce jobs to automate data transfer from HBase
- Experienced in developing Java UDFs for Hive and Pig
- Experienced with NoSQL databases such as HBase, MongoDB, and Cassandra; wrote advanced queries and sub-queries
- Able to run multiple schedulers concurrently for performance, efficiency, and resiliency.
- Developed Informatica mappings to load data, including facts and dimensions, from various sources into the data warehouse using transformations such as Source Qualifier, Java, Expression, Lookup, Aggregator, Update Strategy, and Joiner.
- Loaded flat-file data into the staging area using Informatica.
- Reviewed existing code and led efforts to tune the performance of existing Informatica processes.
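A minimal PySpark sketch of the extract/transform/aggregate pattern referenced in the summary above. The file paths, column names, and the usage_events view are hypothetical placeholders for illustration, not details from an actual engagement.

# Sketch: multi-format extraction, transformation, and aggregation with Spark SQL.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Extract: read the same logical feed from two different file formats (placeholder paths).
csv_df = spark.read.option("header", "true").csv("/data/landing/usage_csv/")
parquet_df = spark.read.parquet("/data/landing/usage_parquet/")

# Transform: align the columns, union the sources, and derive an event date.
events = (
    csv_df.select("customer_id", "event_type", "event_ts")
    .unionByName(parquet_df.select("customer_id", "event_type", "event_ts"))
    .withColumn("event_date", F.to_date("event_ts"))
)

# Aggregate with Spark SQL to surface customer usage patterns.
events.createOrReplaceTempView("usage_events")
daily_usage = spark.sql("""
    SELECT customer_id, event_date, event_type, COUNT(*) AS event_count
    FROM usage_events
    GROUP BY customer_id, event_date, event_type
""")

# Persist the curated output, partitioned by date.
daily_usage.write.mode("overwrite").partitionBy("event_date").parquet("/data/curated/daily_usage/")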
TECHNICAL SKILLS
Hadoop/Big Data: Hadoop, MapReduce, HDFS, Zookeeper, Kafka, Hive, Pig, Sqoop, Oozie, Flume, YARN, HBase, Spark with Scala
NoSQL Databases: HBase, Cassandra, MongoDB
Languages: Java, Python, Scala, PySpark, UNIX shell scripts
Java/J2EE Technologies: Applets, Swing, JDBC, JNDI, JSON, JSTL
Frameworks: Spring, Hibernate
Operating Systems: Red Hat Linux, Ubuntu Linux, and Windows XP/Vista/7/8
Web/Application servers: Apache Tomcat, WebLogic, JBoss
Databases: SQL Server, MySQL
IDE: Eclipse, IntelliJ
PROFESSIONAL EXPERIENCE
Confidential, Las Vegas, NV
Senior Big Data Engineer
Responsibilities:
- Configured Spark Streaming to consume ongoing data from Kafka and store the streamed data in HDFS (see the sketch at the end of this role).
- Used Spark and Spark SQL with the Scala API to read Parquet data and create Hive tables.
- Used the Spark application master to monitor Spark jobs and capture their logs.
- Developed multiple Kafka Producers and Consumers as per the software requirement specifications.
- Supported continuous storage in AWS using Elastic Block Store (EBS), S3, and Glacier; created volumes and configured snapshots for EC2 instances
- Developed efficient MapReduce programs for filtering out the unstructured data and developed multiple MapReduce jobs to perform data cleaning and pre-processing on Hortonworks.
- Used the DataFrame API in Scala to work with distributed collections of data organized into named columns, and developed predictive analytics using the Apache Spark Scala APIs
- Worked with Spark to improve the performance and optimization of existing Hadoop algorithms using Spark Context, Spark SQL, DataFrames, and pair RDDs.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that receives data from Kafka in near real time and persists it to Cassandra.
- Involved in Cassandra cluster planning, with a good understanding of Cassandra cluster mechanisms including replication strategies, snitches, gossip, consistent hashing, and consistency levels.
- Developed a Spark-Cassandra connector job to load data from flat files into Cassandra for analysis; modified cassandra.yaml and cassandra-env.sh to set various configuration properties.
- Used Sqoop to import data into Cassandra tables from relational databases such as Oracle and MySQL; designed Cassandra column families, performed data transformations, and exported the transformed data to Cassandra per business requirements.
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation
- Experience in building Real-time Data Pipelines with Kafka Connect and Spark Streaming.
- Developed Scala scripts using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark for data aggregation and queries, writing data back into the OLTP system through Sqoop
- Developed Hive queries to pre-process the data required for running the business process
- Developed a Spark job in Java that indexes data into Elasticsearch from external Hive tables stored in HDFS.
- Used Impala and Presto for querying the datasets.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
- Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Used Kafka brokers, initialized the Spark context, and processed live streaming data with RDDs; used Kafka to load data into HDFS and NoSQL databases.
- Developed Java code that creates Elasticsearch mappings before data is indexed.
- Developed end-to-end data processing pipelines that receive data through the Kafka distributed messaging system and persist it into Cassandra.
- Implemented a data interface to retrieve customer information using REST APIs, pre-processed the data with MapReduce 2.0, and stored it in HDFS (Hortonworks).
- Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Worked on Hortonworks-HDP distribution.
- Used Hortonworks Apache Falcon for data management and pipeline process in the Hadoop cluster.
- Used Impala to query data in the publish layer, where other teams and business users can access it for faster processing.
- Maintained the ELK stack (Elasticsearch, Logstash, Kibana) and wrote Spark scripts using the Scala shell.
- Worked in AWS environment for development and deployment of custom Hadoop applications.
- Developed shell scripts to generate the hive create statements from the data and load data to the table.
- Involved in writing custom Map-Reduce programs using java API for data processing.
- Created Hive tables as internal or external tables per requirements, defined with appropriate static or dynamic partitions and bucketing for efficiency.
- Developed Hive queries for the analysts by loading and transforming large sets of structured, semi structured data using hive.
- Developed automated regression scripts in Python to validate ETL processes across multiple databases, including AWS Redshift, Oracle, MongoDB, and SQL Server (T-SQL).
- Worked with Apache NiFi: executed Spark and Sqoop scripts through NiFi, created scatter-and-gather patterns, ingested data from Postgres into HDFS, fetched Hive metadata and stored it in HDFS, and built a custom NiFi processor for filtering text from flow files.
- Provided cluster coordination services through ZooKeeper.
Environment: HDP, Hadoop, AWS, EC2, S3, Redshift, Cassandra, Hive, HDFS, Spark, Spark SQL, Spark Streaming, Scala, Kafka, Hortonworks, MapReduce, Apache NiFi, Impala, Zookeeper, ELK, Sqoop, Java, Oracle 12c, SQL Server, T-SQL, MongoDB, HBase, Python, and Agile methodologies.
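Hedged illustration of the Kafka-to-HDFS flow described in this role, written as a PySpark Structured Streaming job (the role itself also used Scala and the DStream-based Spark Streaming API). The broker addresses, topic name, schema, and paths are assumptions made for the example.

# Sketch: consume a Kafka topic and persist it to HDFS as Parquet.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

event_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Consume the Kafka topic as a streaming DataFrame (placeholder brokers/topic).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "learner-events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes; parse the JSON value into columns.
parsed = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", event_schema).alias("e"))
    .select("e.*")
)

# Persist to HDFS as Parquet in near real time, checkpointing for recovery.
query = (
    parsed.writeStream.format("parquet")
    .option("path", "hdfs:///data/streams/learner_events/")
    .option("checkpointLocation", "hdfs:///checkpoints/learner_events/")
    .outputMode("append")
    .start()
)
query.awaitTermination()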
Confidential, Chicago, IL
Big Data Engineer
Responsibilities:
- Involved in all steps and the scope of the project's reference data approach to MDM; created a data dictionary and mappings from sources to targets in the MDM data model.
- Created and maintained SQL Server scheduled jobs, executing stored procedures to extract data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization.
- Experience managing Azure Data Lake Storage (ADLS) and Data Lake Analytics, with an understanding of how to integrate them with other Azure services; knowledge of U-SQL.
- Monitored cluster health by Setting up alerts using Nagios and Ganglia
- Working on tickets opened by users regarding various incidents, requests
- Writing UNIX shell scripts to automate the jobs and scheduling Cron jobs for job automation using commands with Crontab.
- Transforming business problems into Big Data solutions and defining Big Data strategy and roadmap; installing, configuring, and maintaining data pipelines.
- Primarily involved in Data Migration using SQL, SQL Azure, Azure Storage, and Azure Data Factory, SSIS, PowerShell
- Designing the business requirement collection approach based on the project scope and SDLC methodology.
- Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data to and from different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and a write-back tool.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure SQL Data Warehouse) and processed the data in Azure Databricks.
- Implemented Copy activities and custom Azure Data Factory pipeline activities.
- Hands-on experience with Kafka and Storm on the HDP platform for real-time analysis.
- Developed Kafka producers and consumers that efficiently ingested data from various data sources.
- Responsible for wide-ranging data ingestion using Sqoop and HDFS commands; accumulated partitioned data in storage formats such as text, JSON, and Parquet; involved in loading data from the Linux file system into HDFS.
- Wrote production-level machine learning classification models and ensemble classifiers from scratch using Python and PySpark to predict binary values for certain attributes within a given time frame (see the sketch at the end of this role).
- Extracted, transformed, and loaded data from source systems into Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and MLlib libraries.
- Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
- Applied various machine learning algorithms and statistical models (decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, clustering) to identify volume, using the scikit-learn package in Python as well as R and MATLAB; collaborated with data engineers and software developers to develop experiments and deploy solutions to production.
- Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer.
- Created data aggregation and pipelining using Kafka and Storm
- Used SQL Server Integration Services (SSIS) for extracting, transforming, and loading data into the target system from multiple sources
- Involved in Unit Testing the code and provided the feedback to the developers. Performed Unit Testing of the application by using NUnit.
- Wrote research reports describing the experiments conducted, results, and findings, and made strategic recommendations to technology, product, and senior management; worked closely with regulatory delivery leads to ensure robustness in prop trading control frameworks using Hadoop, Python, Jupyter Notebook, Hive, and NoSQL.
- Performed all necessary day-to-day Git support for different projects; responsible for the design and maintenance of Git repositories and access control strategies.
Environment: Hadoop, Kafka, Spark, Sqoop, Spark SQL, Spark Streaming, Hive, Scala, Pig, NoSQL, Impala, Oozie, HBase, Zookeeper, Azure, Databricks, Data Lake, Data Factory, Unix/Linux shell scripting, Python, PyCharm, Informatica PowerCenter.
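A sketch, for illustration only, of a PySpark binary-classification pipeline of the kind mentioned above. The feature columns, input path, and choice of logistic regression are assumptions for the example, not details from the engagement.

# Sketch: binary classification with the PySpark ML Pipeline API.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("binary-classifier").getOrCreate()

# Assume a curated feature table with numeric features and a 0/1 label column (placeholder path).
data = spark.read.parquet("/mnt/curated/customer_features/")
train, test = data.randomSplit([0.8, 0.2], seed=42)

assembler = VectorAssembler(
    inputCols=["tenure_days", "monthly_spend", "support_tickets"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Fit the assembled pipeline on the training split.
model = Pipeline(stages=[assembler, lr]).fit(train)

# Evaluate with area under the ROC curve on the held-out split.
predictions = model.transform(test)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print(f"Test AUC: {auc:.3f}")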
Confidential, Patskala, OH
Data Engineer
Responsibilities:
- Worked on big data with AWS cloud services including EC2, S3, EMR, and DynamoDB
- Created Entity Relationship Diagrams (ERD), Functional diagrams, Data flow diagrams and enforced referential integrity constraints and created logical and physical models using Erwin.
- Advanced knowledge of Confidential Redshift and MPP database concepts.
- Migrated on-premises database structures to the Confidential Redshift data warehouse
- Was responsible for ETL and data validation using SQL Server Integration Services.
- Defined and deployed monitoring, metrics, and logging systems on AWS.
- Connected to Amazon Redshift through Tableau to extract live data for real time analysis.
- Designed and built a multi-terabyte, end-to-end data warehouse infrastructure from the ground up on Confidential Redshift, handling millions of records every day
- Strong understanding of AWS components such as EC2 and S3
- Managed security groups on AWS, focusing on high availability, fault tolerance, and auto-scaling using Terraform templates, along with continuous integration and continuous deployment using AWS Lambda and AWS CodePipeline.
- Developed SSRS reports, SSIS packages to Extract, Transform and Load data from various source systems
- Implementing and Managing ETL solutions and automating operational processes.
- Optimizing and tuning the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics
- Designed solutions for high-volume data stream ingestion, processing, and low-latency data provisioning using Hadoop ecosystem tools: Hive, Pig, Sqoop, Kafka, Python, Spark, Scala, NoSQL, NiFi, and Druid.
- Defined facts, dimensions and designed the data marts using the Ralph Kimball's Dimensional Data Mart modeling methodology using Erwin.
- Designed and implemented big data ingestion pipelines to ingest multi-terabyte data from various sources using Kafka and Spark Streaming, including data quality checks and transformations, stored in efficient storage formats; performed data wrangling on multi-terabyte datasets from various data sources for a variety of downstream purposes, such as analytics, using PySpark.
- Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries, which allowed for a more reliable and faster reporting interface with sub-second response for basic queries.
- Analyzed the system for new enhancements/functionalities and perform Impact analysis of the application for implementing ETL changes
- Implemented a continuous delivery pipeline with Docker, GitHub, and AWS
- Built performant, scalable ETL processes to load, cleanse, and validate data (a representative load sketch follows this role)
- Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
- Analyze the existing application programs and tune SQL queries using execution plan, query analyzer, SQL Profiler and database engine tuning advisor to enhance performance.
- Involved in forward-engineering the logical models into physical models using Erwin, and deployed the resulting data models to the enterprise data warehouse.
- Wrote various data normalization jobs for new data ingested into Redshift.
- Created various complex SSIS/ETL packages to Extract, Transform and Load data
- Collaborate with team members and stakeholders in design and development of data environment
- Preparing associated documentation for specifications, requirements, and testing
- Optimized the TensorFlow model for efficiency
- Published interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
- Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology for each job.
- Compiled data from various sources to perform complex analysis for actionable results
- Measured Efficiency of Hadoop/Hive environment ensuring SLA is met
- Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies.
Environment: Oracle, Kafka, Python, Redshift, Informatica, AWS, EC2, S3, SQL Server, Erwin, RDS, NoSQL, Snowflake Schema, MySQL, DynamoDB, Docker, PostgreSQL, Tableau, GitHub.
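A minimal sketch of the S3-to-Redshift load pattern referenced above, using psycopg2 and Redshift's COPY command. The cluster endpoint, schema/table names, bucket, and IAM role are placeholders; credentials would normally come from a secrets store rather than literals.

# Sketch: load staged Parquet files from S3 into Redshift, then promote them.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="****",
)

copy_sql = """
    COPY staging.daily_events
    FROM 's3://example-bucket/curated/daily_events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

with conn, conn.cursor() as cur:
    # Load the staged files, then insert into the reporting table in one transaction.
    cur.execute(copy_sql)
    cur.execute("""
        INSERT INTO reporting.daily_events
        SELECT * FROM staging.daily_events;
    """)

conn.close()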
Confidential
Hadoop Developer
Responsibilities:
- Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
- Developed Sqoop scripts to load data into HDFS from DB2 and pre-processed it with Pig.
- Automated the tasks of loading the data into HDFS and pre-processing with Pig by developing workflows using Oozie.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
- Loaded data from the UNIX file system into HDFS and wrote Hive user-defined functions.
- Used Sqoop to load data from DB2 to HBase for faster querying and performance optimization.
- Worked on streaming to collect data from Flume and performed real-time batch processing.
- Developed Hive scripts for implementing dynamic partitions.
- Developed MapReduce jobs in both Pig and Hive for data cleaning and pre-processing.
- Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark (see the sketch at the end of this role).
- Wrote various SQL and PL/SQL queries and stored procedures for data retrieval.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Implemented Spark RDD transformations and actions to support business analysis.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Developed Python scripts to find SQL injection vulnerabilities in SQL queries.
- Developed a suite of unit test cases for Mapper, Reducer, and Driver classes using a testing library.
- Wrote ETL jobs to visualize the data and generate reports from a MySQL database using DataStage.
Environment: Hadoop, HDFS, Hive, Pig, Flume, MapReduce, AWS, ETL workflows, HBase, Python, Sqoop, Oozie, DataStage, Linux, relational databases, SQL Server, DB2.
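An illustrative AWS Glue (PySpark) job in the spirit of the campaign-data migration and Adobe-data aggregation described above. The job parameters, S3 paths, Glue connection name, and target table are assumptions made for the sketch.

# Sketch: read campaign data from S3, aggregate with Spark, and write to Redshift via a Glue connection.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read campaign files from S3 as a DynamicFrame (placeholder bucket/path).
campaigns = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/campaign-data/"]},
    format="parquet",
)

# Aggregate with Spark, then convert back for the Glue Redshift writer.
daily = (
    campaigns.toDF()
    .groupBy("campaign_id", F.to_date("event_ts").alias("event_date"))
    .agg(F.count("*").alias("impressions"))
)
daily_dyf = DynamicFrame.fromDF(daily, glue_context, "daily_dyf")

# Write into Redshift through a pre-defined Glue JDBC connection (placeholder names).
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=daily_dyf,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "analytics.campaign_daily", "database": "analytics"},
    redshift_tmp_dir=args["TempDir"],
)

job.commit()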