Sr. Data Engineer Resume
Bothell, WA
SUMMARY
- 8+ years of professional experience in project development, implementation, deployment and maintenance using Big Data technologies, designing and implementing complete end-to-end Hadoop-based data analytics solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala and HBase.
- Experience in working with different Hadoop distributions like CDH and Hortonworks.
- Experienced in using Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
- Experience developing Pig Latin and HiveQL scripts for data analysis and ETL purposes, and extending the default functionality by writing User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) for custom data-specific processing.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Strong knowledge of the architecture of distributed systems and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
- Good experience in creating data ingestion pipelines, data transformations, data management, data governance and real time streaming at an enterprise level.
- Solid experience in using the various file formats like CSV, TSV, Parquet, ORC, JSON and AVRO.
- Experience working with data lakes, i.e. repositories of data stored in its natural/raw format, usually as object blobs or files.
- Excellent understanding of Data Ingestion, Transformation and Filtering.
- Provided output for multiple stakeholders at the same time.
- Coordinated with the Machine Learning team to perform Data Visualization
- Developed Spark and Scala applications for performing event enrichment, data aggregation and de-normalization for different stakeholders.
- Designed new data pipelines and made existing data pipelines more efficient.
- Hands-on experience in designing and developing applications in Spark using Scala and PySpark to compare the performance of Spark with Hive and SQL/Oracle.
- Expert in working with the Hive data warehouse tool: creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing HiveQL queries.
- In depth understanding of Hadoop Architecture and its various components such as Resource Manager, Application Master, Name Node, Data Node, HBase design principles etc.
- Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near real-time dashboards.
- Experience with migrating data to and from RDBMS and unstructured sources into HDFS using Sqoop.
- Experience with job workflow scheduling and monitoring tools like Oozie, and good knowledge of ZooKeeper.
- Profound understanding of Partitions and Bucketing concepts in Hive and designed both Managed and External tables in Hive to optimize performance.
- Worked on NoSQL databases including HBase and MongoDB.
- Experienced in performing CRUD operations using the HBase Java Client API and Solr API.
- Good experience in working with cloud environments such as Amazon Web Services (AWS) EC2 and S3.
- Experience in Implementing Continuous Delivery pipeline with Maven, Ant, Jenkins and AWS.
- Experience writing Shell scripts in Linux OS and integrating them with other solutions.
- Strong Experience in working with Databases like Oracle 10g, DB2, SQL Server 2008 and MySQL and proficiency in writing complex SQL queries.
- Experience in using PL/SQL to write Stored Procedures, Functions and Triggers.
- Excellent communication, interpersonal and analytical skills and a highly motivated team player with the ability to work independently.
TECHNICAL SKILLS
Hadoop Technologies: HDFS, MapReduce, YARN, Hive, Pig, Tez, Kafka, HBASE, Impala, Zookeeper, Sqoop, OOZIE, Apache Cassandra, Flume, Spark, Azure, AWS, EC2
Web Technologies: HTML, CSS, JavaScript
Languages: C, Java, SQL, PL/SQL, Python, Scala, Shell Scripting, PySpark
Operating Systems: Linux, UNIX, Windows
Databases: NoSQL, Oracle, DB2, MySQL, PostgreSQL, SQL Server, Snowflake, MS Access, HBase, MongoDB, Apache NiFi
Application Servers: WebLogic, WebSphere, Apache Tomcat, JBOSS
IDEs: Eclipse, Tableau, Visual Studio Code, NetBeans, JDeveloper, IntelliJ IDEA.
Version Control: Git, GitHub
Reporting Tools: Jaspersoft, Qlik Sense, Tableau, JUnit
PROFESSIONAL EXPERIENCE
Sr. Data Engineer
Confidential, Bothell, WA
Responsibilities:
- Developed ETL data pipelines using Spark, Spark streaming and Scala.
- Loaded data from RDBMS to Hadoop using Sqoop
- Worked collaboratively to manage build outs of large data clusters and real time streaming with Spark.
- Responsible for loading data from web servers using Sqoop, Kafka and the Spark Streaming API.
- Worked with the Snowflake data warehouse.
- Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL databases for large volumes of data.
- Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, Azure Data Catalog, HDInsight, Azure SQL Server, Azure ML and Power BI.
- Used Azure Databricks as a fast, easy and collaborative Spark-based platform on Azure.
- Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
- Data Processing: Processed data using MapReduce and YARN. Worked on Kafka as a proof of concept for log processing.
- Designed and developed Apache NiFi jobs to move files from transaction systems into the data lake raw zone.
- Tested Apache Tez, an extensible framework for building high performance batch and interactive data processing applications, on Pig and Hive jobs
- Monitored the Hive Metastore and the cluster nodes with the help of Hue.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Used Azure Data Catalog to organize data assets and get more value from existing investments.
- Developed various UDFs in Map-Reduce and Python for Pig and Hive.
- Handled data integrity checks using Hive queries, Hadoop and Spark.
- Performed transformations and actions on RDDs and Spark Streaming data with Scala.
- Implemented machine learning algorithms using Spark with Python.
- Defined job flows and developed simple to complex Map Reduce jobs as per the requirement.
- Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms
- Developed PIG UDFs for manipulating the data according to Business Requirements and also worked on developing custom PIG Loaders.
- Responsible for handling streaming data from web server console logs.
- Installed Oozie workflow engine to run multiple Hive and Pig Jobs.
- Developed Pig Latin scripts for the analysis of semi-structured data.
- Created Hive tables, loaded data into them and wrote Hive UDFs.
- Used Sqoop to import data into HDFS and Hive from other data systems.
- Installed and configured Apache Hadoop to test the maintenance of log files in Hadoop cluster.
- Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
- Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple data sources to HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive. Provided guidance to the development team working on PySpark as an ETL platform.
- Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.
- Involved in NoSQL database design, integration and implementation
- Loaded data into NoSQL database HBase.
- Used Azure Data Factory, the SQL API and the MongoDB API to integrate data from MongoDB, MS SQL and cloud sources (Blob, Azure SQL DB, Cosmos DB).
- Developed Kafka producer and consumers, HBase clients, Spark and Hadoop MapReduce jobs along with components on HDFS, Hive.
- Applied partitioning and bucketing concepts in Hive and designed both managed and external tables to optimize performance.
Environment: Spark, Spark Streaming, Apache Kafka, Apache NiFi, Hive, Tez, Azure, Azure data grid, Azure Synapse Analytics, Azure Data Catalog, ETL, Pig, PySpark, UNIX, Linux, Tableau, Teradata, Sqoop, Hue, Oozie, Java, Scala, Python, Git, GitHub
Sr. Big Data Engineer
Confidential, New York, NY
Responsibilities:
- Experienced in development using Cloudera distribution system.
- Managed the data pipelines and data lake as a Hadoop developer.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Designed custom Spark REPL application to handle similar datasets
- Used Hadoop scripts for HDFS (Hadoop Distributed File System) data loading and manipulation.
- Performed Hive test queries on local sample files and HDFS files
- Used AWS services like EC2 and S3 for small data sets.
- Created AWS EC2 instances and used JIT servers.
- Developed the batch scripts to fetch the data from AWS S3 storage and do required transformations in Scala using Spark framework.
- Developed the application on Eclipse IDE
- Developed Hive queries to analyze data and generate results
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Analyzed the Hadoop cluster and different Big Data analytics tools including Pig, Hive, HBase, Spark and Sqoop.
- Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization and user report generation.
- Used Scala to write code for all Spark use cases.
- Analyzed user request patterns and implemented various performance optimization measures including implementing partitions and buckets in HiveQL
- Assigned names to each of the columns using the case class option in Scala.
- Developed multiple Spark SQL jobs for data cleaning.
- Created Hive tables and worked on them using HiveQL.
- Assisted in loading large sets of data (structured, semi-structured and unstructured) to HDFS.
- Developed Spark SQL to load tables into HDFS to run select queries on top.
- Developed an analytical component using Scala, Spark and Spark Streaming.
- Used visualization tools such as Power View for Excel and Tableau for visualizing and generating reports.
- Worked on the NoSQL databases HBase and MongoDB.
Environment: Hadoop, Hive, Oozie, Java, Linux, Maven, Oracle 11g/10g, Zookeeper, Agile, MySQL, Spark.
Big Data Engineer
Confidential, Irving, TX
Responsibilities:
- Used Sqoop to import and export data between Oracle, PostgreSQL and HDFS for analysis.
- Migrated Existing MapReduce programs to Spark Models using Python.
- Migrated data from the data lake (Hive) into an S3 bucket.
- Validated data between the data lake and the S3 bucket.
- Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data.
- Designed batch processing jobs using Apache Spark that increased speed tenfold compared to MapReduce jobs.
- Analyzed the SQL scripts and designed the solution for implementation using PySpark.
- Used Kafka for real time data ingestion.
- Created different topics for reading data in Kafka.
- Read data from different topics in Kafka.
- Moved data from the S3 bucket to the Snowflake data warehouse for generating reports.
- Wrote Hive queries for data analysis to meet the business requirements.
- Migrated an existing on-premises application to AWS.
- Developed PIG Latin scripts to extract the data from the web server output files and to load into HDFS.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Created many Spark UDFs and UDAFs in Hive for functions that were not already available in Hive and Spark SQL.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Implemented different performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Good knowledge of Spark platform parameters like memory, cores and executors.
- Used ZooKeeper in the cluster to provide concurrent access to Hive tables with shared and exclusive locking.
Environment: Linux, Apache Hadoop Framework, HDFS, YARN, HIVE, HBASE, AWS (S3, EMR), Scala, Spark, SQOOP, SQL, PySpark.
Hadoop Developer
Confidential
Responsibilities:
- Installed and configured Hadoop MapReduce, HDFS and developed multiple MapReduce jobs in Java for data cleansing and preprocessing.
- Involved in loading data from UNIX file system to HDFS.
- Installed and configured Hive and wrote Hive UDFs.
- Imported and exported data into HDFS and Hive using Sqoop.
- Used Cassandra CQL and Java APIs to retrieve data from Cassandra tables.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, managing and reviewing data backups, and managing and reviewing Hadoop log files.
- Worked hands-on with ETL processes.
- Handled importing of data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
- Extracted the data from Teradata into HDFS using Sqoop.
- Analyzed the data by running Hive queries and Pig scripts to understand user behavior.
- Exported the patterns analyzed back into Teradata using Sqoop.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Installed the Oozie workflow engine to run multiple Hive jobs.
- Developed Hive queries to process the data and generate the data cubes for visualizing.
Environment: Hadoop, MapReduce, HDFS, UNIX, Hive, Sqoop, Cassandra, ETL, Pig Script, Cloudera, Oozie.
SQL Developer
Confidential
Responsibilities:
- Documented the complete process flow to describe program development, logic, testing, implementation, application integration and coding.
- Recommended structural changes and enhancements to systems and databases.
- Conducted Design reviews and technical reviews with other project stakeholders.
- Participated in the complete life cycle of the project, from requirements to production support.
- Created test plan documents for all back-end database modules.
- Used MS Excel, MS Access and SQL to write and run various queries.
- Worked extensively on creating tables, views and SQL queries in MS SQL Server.
- Worked with internal architects and assisted in the development of current and target state data architectures.
- Coordinated with business users to design appropriate, effective and efficient solutions for new reporting needs, building on the existing functionality.
- Remained knowledgeable in all areas of business operations to identify system needs and requirements.
Environment: SQL, SQL Server, MS Office and MS Visio.
