Senior Big Data Engineer Resume
Bloomfield, CT
SUMMARY
- Big Data/Data Engineer with 8+ years of overall experience as a software developer in designing, developing, deploying and supporting large-scale distributed systems.
- In-depth understanding of Hadoop architecture and its components such as HDFS, MapReduce, Hadoop Gen2 Federation, High Availability and YARN, and good understanding of workload management, scalability and distributed platform architectures.
- Implemented various algorithms for analytics using Cassandra with Spark and Scala.
- Experienced in building a Data Warehouse on the Azure platform using Azure Databricks and Data Factory.
- Expertise in using various Hadoop ecosystem components such as MapReduce, Pig, Hive, ZooKeeper, HBase, Sqoop, Oozie, Flume, Drill and Spark for data storage and analysis.
- Experienced in managing Hadoop clusters and services using Cloudera Manager.
- Experience in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HQL (HiveQL), and used UDFs from the Piggybank UDF repository.
- Experienced in running queries using Impala and using BI tools to run ad-hoc queries directly on Hadoop.
- Experience in writing PL/SQL statements: stored procedures, functions, triggers and packages.
- Good knowledge of and experience with Amazon Web Services (AWS) concepts like EMR and EC2 web services; successfully loaded files to HDFS from Oracle, SQL Server, Teradata and Netezza using Sqoop.
- Capable of using AWS utilities such as EMR, S3 and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
- Good knowledge of AWS CloudFormation templates; configured the SQS service through the Java API to send and receive information.
- Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS and other services of the AWS family.
- Experience in developing a data pipeline through Kafka-Spark API.
- Skilled in Tableau Desktop 10.x for data visualization, reporting and analysis.
- Developed reports and dashboards using Tableau for quick reviews to be presented to Business and IT users.
- Analyzed data and provided insights with R Programming and Python Pandas.
- Good knowledge in implementing various data processing techniques using Apache HBase for handling the data and formatting it as required.
- Excellent experience with and knowledge of Machine Learning, Mathematical Modeling and Operations Research. Comfortable with R, Python, SAS, Weka, MATLAB and relational databases. Deep understanding of and exposure to the Big Data ecosystem.
- Assisted the deployment team in setting up the Hadoop cluster and services.
- Good knowledge of benchmarking and performance tuning of clusters.
- Designed and implemented a product search service using Apache Solr.
- Good experience in generating statistics, extracts and reports from Hadoop.
- Determined, committed and hardworking individual with strong communication, interpersonal and organizational skills.
TECHNICAL SKILLS
Languages: Python, Scala, PySpark, Shell Scripting, SQL, PL/SQL and UNIX Bash
Operating Systems: UNIX, LINUX, Solaris, Mainframes
Big Data: Hadoop, Sqoop, Apache Spark, NiFi, Kafka, Snowflake, Cloudera, Hortonworks, PySpark, Spark SQL
Databases: Oracle, SQL Server, MySQL, DB2, Sybase, Netezza, Hive, Impala
Cloud Technologies: AWS, Azure
IDE Tools: Aginity for Hadoop, PyCharm, Toad, SQL Developer, SQL*Plus, Sublime Text, VI Editor
PROFESSIONAL EXPERIENCE
Confidential
Senior Big Data Engineer
Responsibilities:
- Used Agile methodology in developing the application, which included iterative application development, weekly sprints, stand-up meetings and customer reporting backlogs.
- Handled importing of data from various data sources, performed transformations using Hive, MapReduce, Spark and loaded data into HDFS.
- Understanding of the AWS product and service suite, primarily EC2, S3, VPC, Lambda, Redshift, Spectrum, Athena, EMR (Hadoop) and other monitoring services, along with their applicable use cases, best practices, implementation and support considerations.
- Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
- Automated the ETL tasks and data workflows for the ingest data pipeline through the UC4 scheduling tool.
- Experience in change implementation, monitoring and troubleshooting of AWS Snowflake databases and cluster-related issues.
- Assisted with the analysis of data used for the Tableau reports and the creation of dashboards.
- Design and implement large scale distributed solutions in AWS.
- Analyze and develop programs for Hadoop ingest processes, considering the extract logic and the data load type, using relevant tools such as Sqoop, Spark, Scala, Kafka, Unix shell scripts and others.
- Design the incremental and historical extract logic to load data from flat files on various servers into the Massive Event Logging Database (MELD).
- Developed UDFs in Java as and when necessary for use in Pig and Hive queries.
- Automated the cloud deployments using Chef, Python and AWS CloudFormation templates.
- Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
- Used ORC and Parquet file formats in Hive.
- Developed efficient Pig and Hive scripts with joins on datasets using various techniques.
- Write documentation of program development, subsequent revisions and coded instructions in the project related GitHub repository.
- Deployment support including change management and preparation of deployment instructions.
- Prepare release notes, validation document for user stories to be deployed to production as part of release.
- Created and managed cloud VMs with AWS EC2 command-line clients and the AWS Management Console.
- Migrated on-premise database structure to Confidential Redshift data warehouse. Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Wrote technical design documents based on the data mapping functional details of the tables.
- Extracted batch and real-time data from DB2, Oracle, SQL Server, Teradata and Netezza to Hadoop (HDFS) using Teradata TPT, Sqoop, Apache Kafka and Apache Storm.
- Developed Apache Spark jobs for data cleansing and pre-processing.
Environment: RHEL, HDFS, MapReduce, Hive, AWS, EC2, S3, Lambda, Redshift, Pig, Sqoop, Oozie, Teradata, Oracle SQL, UC4, Kafka, GitHub, Hortonworks Data Platform distribution, Spark, Scala.
Confidential, Bloomfield, CT
Big Data Engineer
Responsibilities:
- Involved in managing and monitoring the Hadoop cluster using Cloudera Manager.
- Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
- Involved in HBase setup and storing data into HBase for further analysis.
- Wrote a Python script using boto3 that automates launching the EMR cluster and configuring the Hadoop applications.
- Worked with Amazon Web Services (AWS) using Elastic MapReduce (EMR), Redshift and EC2 for data processing.
- Experienced in analyzing and optimizing RDDs by controlling partitions for the given data.
- Experienced in writing real-time processing jobs using Spark Streaming with Kafka.
- Implemented Spark using Python and Spark SQL for faster processing of data, and worked on migrating MapReduce programs into Spark transformations using Spark and Scala (initially done using PySpark).
- Wrote Spark SQL and embedded the SQL in Scala files to generate JAR files for submission to the Hadoop cluster.
- Worked extensively with Avro, Parquet, XML and JSON files and converted data between formats. Parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark.
- Assisted in cluster maintenance, cluster monitoring, and adding and removing cluster nodes. Installed and configured Hadoop, MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Stored and retrieved data from data warehouses using Amazon Redshift.
- Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
- Converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.
- Involved in configuring the Hadoop cluster and load balancing across the nodes.
- Involved in Hadoop installation, commissioning, decommissioning, balancing, troubleshooting, monitoring and debugging the configuration of multiple nodes using the Hortonworks platform.
- Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data.
- Created various types of data visualizations using Python and Tableau.
- Automated and monitored the complete AWS infrastructure with Terraform.
- Created data partitions on large data sets in S3 and DDL on partitioned data.
- Used Python and Shell scripting to build pipelines.
- Developed a data pipeline using Sqoop, HQL, Spark and Kafka to ingest Enterprise message delivery data into HDFS.
Environment: HDFS, Hive, AWS (EC2, S3, EMR, Redshift, ECS, Glue, VPC, RDS etc.), Spark, Tableau, YARN, Cloudera, Scala, Sqoop, DataStage, SQL, Terraform, Splunk, RDBMS, Python, Elasticsearch, Data Lake, Kerberos, Jira, Confluence, Shell/Perl Scripting, ZooKeeper, NiFi, Ranger, Git, Kafka, CI/CD (Jenkins), Kubernetes.
Confidential, Englewood, CO
Big Data Engineer
Responsibilities:
- Analyze the existing application programs and tune SQL queries using execution plan, query analyzer, SQL Profiler, and database engine tuning advisor to enhance performance.
- Used Hive SQL, Presto SQL and Spark SQL for ETL jobs, choosing the right technology for each job.
- Created various complex SSIS/ETL packages to Extract, Transform and Load data.
- Used ZooKeeper to store offsets of messages consumed for a specific topic and partition by a specific Consumer Group in Kafka.
- Involved in forward engineering of logical models to generate physical models using Erwin, and in their subsequent deployment to the Enterprise Data Warehouse.
- Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
- Migrated on-premise database structure to Confidential Redshift data warehouse.
- Was responsible for ETL and data validation using SQL Server Integration Services.
- Worked on Big Data on AWS cloud services, i.e. EC2, S3, EMR and DynamoDB.
- Defined and deployed monitoring, metrics, and logging systems on AWS.
- Wrote various data normalization jobs for new data ingested into Redshift.
- Developed SSRS reports and SSIS packages to Extract, Transform and Load data from various source systems.
- Implemented and managed ETL solutions and automated operational processes.
- Used Kafka functionalities like distribution, partitioning and the replicated commit log service for messaging systems by maintaining feeds, and created applications using Kafka that monitor consumer lag within Apache Kafka clusters.
- Managed security groups on AWS, focusing on high availability, fault tolerance and auto scaling using Terraform templates, along with Continuous Integration and Continuous Deployment with AWS Lambda and AWS CodePipeline.
- Worked on data pre-processing and cleaning the data to perform feature engineering and performed data imputation techniques for the missing values in the dataset using Python.
- Created Entity Relationship Diagrams (ERD), Functional diagrams, Data flow diagrams and enforced referential integrity constraints and created logical and physical models using Erwin.
Environment: Informatica, RDS, NoSQL, Snowflake Schema, AWS, Apache Kafka, Python, ZooKeeper, SQL Server, Erwin, Oracle, Redshift, MySQL, PostgreSQL
Confidential, Englewood, CO
Big Data Engineer
Responsibilities:
- Involved in design and development phases of Software Development Life Cycle (SDLC) using Scrum methodology.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig and Sqoop.
- Performed data synchronization between EC2 and S3, Hive stand-up and AWS profiling.
- Created a data pipeline of MapReduce programs using Chained Mappers.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
- Imported and exported data into HDFS from databases and vice versa using Sqoop.
- Loaded the aggregated data onto DB2 for reporting on the dashboard.
- Implemented optimization and performance tuning in Hive and Pig.
- Implemented optimized joins across different data sets to get top claims by state using MapReduce.
- Developed a data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest behavioral data into HDFS for analysis.
- Worked with NoSQL databases like HBase, Cassandra, DynamoDB (AWS) and MongoDB.
- Developed job flows in Oozie to automate the workflow for extraction of data from warehouses and weblogs.
- Used Maven extensively for building JAR files of MapReduce programs and deployed them to the cluster.
- Created a customized BI tool for the management team that performs query analytics using HiveQL.
- Migrated high-volume OLTP transactions from Oracle to Cassandra.
- Installed the Oozie workflow engine to run multiple Hive jobs.
- Used Pig as ETL tool to do transformations, event joins, filters and some pre-aggregations before storing the data onto HDFS.
Environment: RHEL, HDFS, MapReduce, Hive, Pig, Sqoop, Flume, AWS, EC2, S3, Oozie, Mahout, HBase, Hortonworks Data Platform distribution, GCP, GCS, BigQuery, Dataproc, Cassandra, MongoDB.
Confidential
Hadoop Developer/ Data Engineer
Responsibilities:
- Imported and exported data into HDFS from Oracle Database and vice versa using Sqoop.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs.
- Installed and configured Pig and wrote Pig Latin scripts.
- Designed and implemented MapReduce-based large-scale parallel relation-learning system
- Created SSIS packages to pull data from SQL Server and exported to Excel Spreadsheets and vice versa.
- Set up and benchmarked Hadoop/HBase clusters for internal use.
- Wrote MapReduce jobs using Pig Latin. Involved in ETL, data integration and migration.
- Created Hive tables and worked on them using HiveQL. Experienced in defining job flows.
- Designed and implemented data transfer from and to Hadoop and AWS.
- Deployed and scheduled reports using SSRS to generate daily, weekly, monthly and quarterly reports.
- Involved in creating Hive tables, loading the data and writing Hive queries that run internally as MapReduce jobs. Developed a custom FileSystem plugin for Hadoop so it can access files on Data Platform.
Environment: Hadoop, MapReduce, AWS, Amazon S3, Pig, SQL Server, Hive, HBase, SSIS, SSRS, Report Builder, MS Office, Excel, Flat Files, T-SQL.
