Senior Big Data Engineer Resume
SUMMARY:
- 8+ years of professional IT experience in project development, implementation, deployment, and maintenance using Big Data technologies, designing and implementing complete end-to-end Hadoop-based data analytics solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, HBase, Spark integration with Cassandra, Avro, Solr, and ZooKeeper.
- 7+ years of experience as a developer using Big Data technologies such as Databricks/Spark and the Hadoop ecosystem.
- Hands-on experience with Unified Data Analytics with Databricks, the Databricks Workspace user interface, managing Databricks notebooks, and Delta Lake with Python and Spark SQL.
- Good understanding of Spark architecture with Databricks and Structured Streaming, setting up AWS and Microsoft Azure with Databricks, Databricks Workspace for business analytics, managing clusters in Databricks, and managing the machine learning lifecycle.
- Experience in developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda functions, Step functions, CloudWatch, SNS, DynamoDB, SQS.
- Proficient in multiple databases, including MongoDB, Cassandra, MySQL, Oracle, and MS SQL Server.
- Extensive knowledge of QlikView Enterprise Management Console (QEMC), QlikView Publisher, and QlikView Web Server.
- Implemented a batch process for heavy-volume data loading using the Apache NiFi dataflow framework, following Agile development methodology.
- Worked as team JIRA administrator providing access, working assigned tickets, and teaming with project developers to test product requirements/bugs/new improvements.
- Created snowflake schemas by normalizing the dimension tables as appropriate and creating a sub-dimension named Demographic as a subset of the Customer dimension.
- Experienced in Pivotal Cloud Foundry (PCF) on Azure VMs to manage the containers created by PCF.
- Hands-on experience in test-driven development (TDD), behavior-driven development (BDD), and accepta
PROFESSIONAL EXPERIENCE:
Confidential
Senior Big Data Engineer
Responsibilities:
- Built scalable, reliable ETL systems to efficiently bring together large, complex data from different systems.
- Experienced in designing and deploying Hadoop clusters and Big Data analytics tools including Pig, Hive, HBase, Oozie, Sqoop, Kafka, and Spark on the Cloudera distribution.
- Used Amazon Web Services (AWS), including EC2, S3, CloudFront, Elastic File System, RDS, VPC, Direct Connect, Route 53, CloudWatch, CloudTrail, CloudFormation, and IAM, which allowed automated operations.
- Worked on the Cloudera distribution and deployed it on AWS EC2 instances.
- Hands-on experience using Cloudera Hue to import data through the GUI.
- Integrated Apache Kafka with a Spark Streaming process to consume data from external REST APIs and run custom functions.
- Tuned the performance of Spark jobs using caching and taking full advantage of the cluster environment.
- Developed Spark scripts using Scala shell commands as required.
- Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
- Scheduled Oozie workflows to run multiple Hive and Pig jobs.
- Ran Hadoop streaming jobs to process terabytes of text data.
- Worked with different file formats such as text, sequence files, Avro, ORC, and Parquet.
- Configured, supported, and maintained all network, firewall, storage, load balancers, operating systems, and software in AWS EC2.
- Used Amazon EMR for Big Data processing across a Hadoop cluster of virtual servers on Amazon EC2 and S3.
- Worked on custom Pig loaders and storage classes to handle a variety of data formats such as JSON and XML.
- Designed and deployed multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
- Supported continuous storage in AWS using Elastic Block Store (EBS), S3, and Glacier.
- Created volumes and configured snapshots for EC2 instances.
- Implemented a generalized solution model using AWS SageMaker.
- Extensive expertise using the core Spark APIs and processing data on an EMR cluster.
- Worked on ETL migration services by developing and deploying AWS Lambda functions for a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena.
- Created S3 buckets, managed bucket policies, and used S3 and Glacier for storage and backup on AWS.
- Managed IAM users by creating new users, granting limited access as needed, and assigning roles and policies to specific users.
- Developed an analytical component using Scala, Spark, and Spark Streaming.
- Acted as technical liaison between the customer and the team on all AWS technical aspects.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Involved in converting Hive/SQL queri
Confidential
Sr. Data Engineer
Responsibilities:
- Experienced in development using the Cloudera distribution.
- Hands-on experience in Azure Cloud Services (PaaS & IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
- Created tabular models on Azure Analysis Services to meet business reporting requirements.
- Good experience working with Azure Blob and Data Lake storage and loading data into Azure SQL Synapse Analytics (DW).
- As a Hadoop Developer, responsible for managing the data pipelines and the data lake.
- Experience working with the Snowflake data warehouse.
- Developed Spark code using Scala and Spark SQL/Streaming for faster processing of data.
- Designed a custom Spark REPL application to handle similar datasets.
- Used Hadoop scripts for HDFS (Hadoop Distributed File System) data loading and manipulation.
- Performed Hive test queries on local sample files and HDFS files.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Analyzed the Hadoop cluster and different Big Data analytics tools including Pig, Hive, HBase, Spark, and Sqoop.
- Exported data from HDFS to RDBMS via Sqoop for business intelligence, visualization, and user report generation.
- Developed Spark applications using Scala and Java, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Developed Spark programs using the Scala and Java APIs and performed transformations and actions on RDDs.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python.
- Developed ETL processes using Spark, Scala, Hive, and HBase.
- Developed REST APIs using Scala, the Play framework, and Akka.
- Analyzed user request patterns and implemented various performance optimization measures, including partitions and buckets in HiveQL.
- Assigned names to each of the columns using the case class option in Scala.
- Assisted in loading large sets of data (structured, semi-structured, and unstructured) into HDFS.
- Developed Spark SQL to load tables into HDFS and run select queries on top of them.
- Developed an analytical component using Scala, Spark, and Spark Streaming.
- Used visualization tools such as Power View for Excel and Tableau for visualizing data and generating reports.
- Worked on the NoSQL databases HBase and MongoDB.
- Performed software validation and verification across all testing phases, including Functional Testing, System Integration Testing, End to End Testing, Regression Testing, Sanity Testing, User Acceptance Testing, Smoke Testing, Disaster Recovery Testing, Production Acceptance Testing, and Pre-prod Testing.
- Good experience logging defects in Jira and Azure DevOps.
- Experienced in installation, configuration, and administration of Informatica Data Quality and Informatica Data Analyst.
- Expertise in address data cleansing using Infor
Confidential
Data Engineer/ Data Scientist
Responsibilities:
- Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Experienced Data Scientist with over 1 year of experience in Data Extraction, Data Modelling, Data Wrangling, Statistical Modeling, Data Mining, Machine Learning, and Data Visualization.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Used Sqoop to import and export data between Oracle/PostgreSQL and HDFS for analysis.
- Migrated existing MapReduce programs to Spark models using Python.
- Migrated data from the data lake (Hive) into an S3 bucket.
- Validated data between the data lake and the S3 bucket.
- Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data.
- Designed batch processing jobs using Apache Spark that ran ten times faster than the equivalent MapReduce jobs.
- Used Kafka for real-time data ingestion.
- Created different Kafka topics for reading data.
- Read data from different topics in Kafka.
- Moved data from the S3 bucket to the Snowflake data warehouse for generating reports.
- Created database objects such as stored procedures, UDFs, triggers, indexes, and views using T-SQL in both OLTP and relational data warehouse environments in support of ETL.
- Developed complex ETL packages using SQL Server 2008 Integration Services to load data from various sources such as Oracle, SQL Server, and DB2 to a staging database and then to the data warehouse.
- Created report models from cubes as well as the relational data warehouse to create ad-hoc reports and chart reports.
- Wrote Hive queries for data analysis to meet business requirements.
- Migrated an existing on-premises application to AWS.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Created many Spark UDFs and UDAFs in Hive for functions that did not already exist in Hive and Spark SQL.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Implemented different performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Good knowledge of Spark platform parameters such as memory, cores, and executors.
- By using Zookeeper implementation in the cluster, provided concurrent access for hive tables with shared and ex
Confidential
Hadoop Developer
Responsibilities:
- Experience in Big Data analytics and design in the Hadoop ecosystem using MapReduce programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, and Kafka.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleansing and preprocessing.
- Involved in loading data from the UNIX file system to HDFS.
- Installed and configured Hive and wrote Hive UDFs.
- Imported and exported data into HDFS and Hive using Sqoop.
- Used Cassandra CQL and Java APIs to retrieve data from Cassandra tables.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, managing and reviewing data backups, and managing and reviewing Hadoop log files.
- Worked hands-on with ETL processes using Informatica.
- Imported data from various data sources, performed transformations using Hive and MapReduce, and loaded the data into HDFS.
- Extracted data from Teradata into HDFS using Sqoop.
- Analyzed the data by running Hive queries and Pig scripts to understand user behavior.
- Exported the analyzed patterns back into Teradata using Sqoop.
- Developed and implemented an R and Shiny application that showcases machine learning for business forecasting.
- Developed predictive models using Python and R to predict customer churn and classify customers.
- Worked with R, SPSS, and Python to develop neural network algorithms and cluster analysis, and with ggplot2 and Shiny in R to understand data and develop applications.
- Partnered with technical and non-technical resources across the business to leverage their support and integrate efforts.
- Partnered with infrastructure and platform teams to configure and tune tools, automate tasks, and guide the evolution of the internal Big Data ecosystem; served as a bridge between data scientists and infrastructure/platform teams.
- Implemented Big Data analytics and advanced data science techniques to identify trends, patterns, and discrepancies in petabytes of data using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce, and Azure Machine Learning.
- Performed data analysis using regressions, data cleaning, Excel VLOOKUP, histograms, and the TOAD client; presented the analysis and suggested solutions for investors.
Environment: Hadoop, MapReduce, HDFS, UNIX, Hive, Sqoop, Cassandra, ETL, Oozie, Big Data ecosystems, Pig, Cloudera, Python, Informatica Cloud Services, Salesforce, Unix scripts, flat files, XML files
Confidential
Data Analyst/ Python Developer
Responsibilities:
- Documented the complete process flow to describe program development, logic, testing, implementation, application integration, and coding.
- Recommended structural changes and enhancements to systems and databases.
- Conducted design reviews and technical reviews with other project stakeholders.
- Took part in the complete life cycle of the project, from requirements to production support.
- Created test plan documents for all back-end database modules.
- Used MS Excel, MS Access, and SQL to write and run various queries.
- Worked extensively on creating tables, views, and SQL queries in MS SQL Server.
- Worked with internal architects and assisted in the development of current and target state data architectures.
- Coordinated with business users to design appropriate, effective, and efficient solutions for new reporting needs based on existing functionality.
- Remained knowledgeable in all areas of business operations to identify system needs and requirements.
- Troubleshot, fixed, and deployed many Python bug fixes for the two main applications that were a main source of data for both customers and the internal customer service team.
- Wrote Python scripts to parse JSON documents and load the data into a database.
- Generated various graphical capacity planning reports using Python packages such as NumPy and matplotlib.
- Analyzed generated logs and predicted/forecasted the next occurrence of events using various Python libraries.
- Performed exploratory data analysis to find trends and clusters.
- Built models using techniques such as regression, tree-based ensemble methods, time series forecasting, KNN, clustering, and Isolation Forest.
- Worked on data that combined unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
- Performed extensive large-volume reads and writes to and from CSV and Excel files using pandas.
- Maintained RDDs using Spark SQL.
- Communicated and coordinated with other departments to collect business requirements.
- Created Autosys batch processes to fully automate the model to pick the latest and best-fitting bond for each market.
- Created a framework using Plotly, Dash, and Flask for visualizing trends and understanding patterns for each market using historical data.
- Used Python APIs to extract daily data from multiple vendors.
- Used Spark and Spark SQL for data integration and manipulation.
- Worked on a POC for creating a Docker image on Azure to run the model.
Environment: SQL, SQL Server, MS Office, MS Visio, SQL Server 2012, Jupyter, R 3.1.2, Python, SSRS, SSIS, SSAS, MongoDB, HBase, HDFS, Hive, Pig, Microsoft Office, SQL Server Management Studio, Business Intelligence Development Studio.
