Sr. Data Engineer Resume
CA
SUMMARY
- 8 years of experience in Hadoop components like MapReduce, Flume, Kafka, Pig, Hive, Spark, HBase, Oozie, Sqoop and Zookeeper.
- Experience in working with different Hadoop distributions like CDH and Hortonworks.
- Good knowledge of the MapR distribution and Amazon EMR.
- Experienced in working with various Python IDEs, including PyCharm, PyScripter, Spyder, PyStudio, PyDev, IDLE, NetBeans, and Sublime Text.
- Experience with the Requests, ReportLab, NumPy, SciPy, PyTables, cv2, imageio, python-twitter, Matplotlib, httplib2, urllib2, Beautiful Soup, and Pandas (DataFrame) Python libraries during the development lifecycle.
- Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, Cassandra, Redis, CouchDB, and DynamoDB by installing and configuring the appropriate Python packages.
- Experienced in using Spark to improve the performance and optimization of existing algorithms in Hadoop, leveraging SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Experience with Snowflake multi-cluster warehouses.
- Solid understanding of Snowflake cloud technology.
- Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
- Hands-on experience with Snowflake utilities such as SnowSQL and Snowpipe, and with big data modeling techniques using Python and Java.
- Built ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, and wrote SQL queries against Snowflake (a minimal Python sketch follows this summary).
- Experience developing Pig Latin and HiveQL scripts for data analysis and ETL, extending default functionality by writing User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) for custom, data-specific processing.
- Strong knowledge of distributed systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
- Good experience in creating data ingestion pipelines, data transformations, data management, data governance and real time streaming at an enterprise level.
- Solid experience using various file formats such as CSV, TSV, Parquet, ORC, JSON, and Avro.
- Experience working with data lakes, i.e., repositories of data stored in its natural/raw format, usually as object blobs or files.
- Excellent understanding of Data Ingestion, Transformation and Filtering.
- Provided output for multiple stakeholders at the same time.
- Coordinated with the Machine Learning team to perform Data Visualization using Cognos TM1, PowerBI and Tableau.
- Developed Spark and Scala applications for performing event enrichment, data aggregation, and de-normalization for different stakeholders.
- Designed new data pipelines and made existing data pipelines more efficient.
- Expert in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
- In depth understanding of Hadoop Architecture and its various components such as YARN, Resource Manager, Application Master, Name Node, Data Node, HBase design principles etc.
- Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near real-time dashboards.
- Extensive working knowledge and experience in building and automating processes using Airflow.
- Experience with migrating data to and from RDBMS and unstructured sources into HDFS using Sqoop.
- Experience in job workflow scheduling and monitoring tools like Oozie and good knowledge on Zookeeper to coordinate the servers in clusters and to maintain the data consistency.
- Profound understanding of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
- Worked on NoSQL databases like HBase, Cassandra and MongoDB.
- Experienced with performing CRUD operations using HBase Java Client API and Solr API
- Good experience in working with cloud environment like Amazon Web Services (AWS) EC2 and S3.
- Experience in Implementing Continuous Delivery pipeline with Maven, Ant, Jenkins, and AWS.
- Experience writing Shell scripts in Linux OS and integrating them with other solutions.
- Strong experience working with databases such as Oracle 10g, DB2, SQL Server 2008, and MySQL, with proficiency in writing complex SQL queries.
- Experience in using PL/SQL to write Stored Procedures, Functions and Triggers.
- Hands-on experience streaming live data from DB2 into HBase tables using Spark Streaming and Apache Kafka (see the streaming sketch after this summary).
- Good knowledge of creating data pipelines in Spark using Scala.
- Experience developing Spark programs for batch and real-time processing; developed Spark Streaming applications for real-time processing.
- Good knowledge on Spark components like Spark SQL, MLlib, Spark Streaming and GraphX.
- Expertise in integrating the data from multiple data sources using Kafka.
- Knowledge of unifying data platforms using Kafka producers/consumers and implementing pre-processing with Storm topologies.
- Kafka Deployment and Integration with Oracle databases.
- Experience with data processing tasks such as collecting, aggregating, and moving data from various sources using Apache Kafka.
- Experienced in moving data from Hive tables into Cassandra for real-time analytics and in using Cassandra Query Language (CQL) to perform analytics on time-series data.
- Good knowledge of custom UDFs in Hive and Pig for data filtering.
- Experience with Apache NiFi in the Hadoop ecosystem, including integrating Apache NiFi with Apache Kafka.
- Hands-on experience in configuring and working with Flume to load the data from multiple sources directly into HDFS.
- Excellent communication, interpersonal and analytical skills. Also, a highly motivated team player with the ability to work independently.
- Comprehensive knowledge of Software Development Life Cycle (SDLC).
- Exposure to Waterfall, Agile and Scrum models.
- Highly adept at promptly and thoroughly mastering new technologies with a keen awareness of new industry developments and the evolution of next generation programming solutions.
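Below is a minimal sketch of the Python-plus-SnowSQL load pattern referenced in the summary above: nested JSON staged in S3 is copied into a Snowflake table through the snowflake-connector-python package. The warehouse, database, stage, and table names are placeholders, and this is illustrative rather than an exact production pipeline.

    # Minimal sketch: load nested JSON landed in S3 into a Snowflake table via an
    # external stage, using the snowflake-connector-python package.
    # Names (MY_WH, RAW_DB, my_s3_stage, raw_events) are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        user="<user>",
        password="<password>",
        account="<account_identifier>",
        warehouse="MY_WH",        # hypothetical warehouse
        database="RAW_DB",        # hypothetical database
        schema="PUBLIC",
    )

    try:
        cur = conn.cursor()
        # Target table with a VARIANT column to hold the nested JSON as-is.
        cur.execute("CREATE TABLE IF NOT EXISTS raw_events (payload VARIANT)")
        # COPY from an external stage that points at the S3 bucket.
        cur.execute("""
            COPY INTO raw_events
            FROM @my_s3_stage/events/
            FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE)
        """)
        # Query one level of the nested JSON as a quick sanity check.
        cur.execute("""
            SELECT payload:userId::STRING AS user_id, COUNT(*) AS events
            FROM raw_events
            GROUP BY 1
            LIMIT 10
        """)
        for row in cur.fetchall():
            print(row)
    finally:
        conn.close()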
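And a minimal sketch of the Kafka-to-Spark Streaming leg referenced above (DB2 change events flowing through Kafka toward HBase). The topic name, event schema, and staging sink are assumptions; a real job would write each micro-batch through an HBase connector.

    # Minimal sketch of the Kafka -> Spark Streaming leg of a DB2 -> Kafka -> HBase
    # pipeline. Topic name, schema, and the batch sink are placeholders; in practice
    # the foreachBatch write would go through an HBase/Phoenix connector.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("db2-cdc-to-hbase-sketch").getOrCreate()

    schema = StructType([                      # assumed shape of the DB2 change events
        StructField("row_key", StringType()),
        StructField("column", StringType()),
        StructField("value", StringType()),
        StructField("updated_at", TimestampType()),
    ])

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
        .option("subscribe", "db2.change.events")             # placeholder topic
        .option("startingOffsets", "latest")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    def write_batch(batch_df, batch_id):
        # Placeholder sink: a real job would write this batch via an HBase connector.
        batch_df.write.mode("append").parquet("/tmp/hbase_staging")

    query = (events.writeStream.foreachBatch(write_batch)
             .option("checkpointLocation", "/tmp/checkpoints/db2_events")
             .start())
    query.awaitTermination()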
TECHNICAL SKILLS
Languages: C, C++, J2SE 1.6, 1.7, 1.8, JEE, Scala, JUnit & Shell Scripting, Python
Big Data Technologies: Apache Hadoop, Apache Spark, Apache Kafka, Apache Sqoop, Apache Crunch, Apache Hive, Map Reduce, Oozie, Apache NiFi and Apache Pig.
Frameworks: Spring, Spring Boot.
Web Services: RESTFUL.
Data Formats: JSON, AVRO, ORC, CSV, XML and Proto Buffer.
Data Indexing Technology: Apache SOLR.
Deployments: Pivotal Cloud Foundry, Chef.
Integration Tools: Jenkins, Team City.
Operating Systems: Mac OS, Windows XP/Vista/7
Packages & Tools: MS Office Suite (Word, Excel, PowerPoint, SharePoint, Outlook, Project)
Development Tools: Eclipse Juno, IntelliJ
Database: JDBC, MySQL, SQL Server, Oracle 10g
NoSQL Database: HBase and MongoDB.
UML Modeling Tools: Visual Paradigm for UML 10.1, Visio
BI Tools: SAP Business Objects 4.1, Information Design Tool and Web Intelligence.
PROFESSIONAL EXPERIENCE
Sr. Data Engineer
Confidential, CA
Responsibilities:
- Installing, configuring and testing Hadoop ecosystem components like MapReduce, HDFS, Pig, Hive, Sqoop, Flume, Oozie, Hue and HBase.
- Imported data from various sources into HDFS and Hive using Sqoop.
- Exported data from HDFS into PostgreSQL using the Python-based HAWQ framework.
- Involved in writing custom MapReduce, Pig and Hive programs.
- Built data platforms for analytics and advanced analytics in Azure Databricks.
- Designed and implemented end-to-end systems for Data Analytics and Automation, integrating custom visualization tools using R, Tableau, and Power BI.
- Utilized Spark, Snowflake, Presto, Scala, Hadoop, HQL, VQL, Oozie, PySpark, Data Lake, TensorFlow, HBase, Cassandra, Athena, Redshift, MongoDB, Kafka, and Kinesis.
- Created reports in Looker based on Snowflake connections.
- Unit-tested the data between Redshift and Snowflake.
- Consulted on Snowflake data platform solution architecture, design, development, and deployment, with a focus on bringing a data-driven culture across the enterprise.
- Strong experience implementing data warehouse solutions in Amazon Redshift; worked on various projects to migrate data from on-premises databases to Redshift, RDS, and S3.
- Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
- Collaborated with Business Analysts, SMEs across departments to gather business requirements, and identify workable items for further development.
- Partnered with ETL developers to ensure that data was well cleaned and the data warehouse was up to date for reporting purposes, using Pig.
- Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and loaded the data into AWS Redshift.
- Performed simple statistical data-profiling analyses, such as cancel rate, variance, skewness, and kurtosis of trades and runs for each stock per day, grouped by 1-, 5-, and 15-minute intervals.
- Used PySpark and Pandas to calculate the moving average and RSI score of the stocks and loaded the results into the data warehouse (see the sketch following this list).
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
- Implemented Amazon EMR for big data processing on a Hadoop cluster of virtual servers on Amazon EC2 and S3.
- Developed PySpark and Spark SQL code to process the data in Apache Spark on Amazon EMR and perform the necessary transformations based on the STMs developed.
- Worked on custom Pig Loaders and storage classes to work with variety of data formats such as JSON and XML file formats.
- Encoded and decoded JSON objects using PySpark to create and modify the data frames in Apache Spark
- Developed PIG Latin scripts to extract the data from the web server output files and to load into HDFS.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Worked on creating Hive tables and writing Hive queries for data analysis to meet business requirements; experienced with Sqoop for importing and exporting data from Oracle and MySQL.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala.
- Good knowledge of data manipulation, tombstones, and compactions in Cassandra; well experienced in avoiding faulty writes and reads in Cassandra.
- Performed data analysis with Cassandra using Hive External tables.
- Worked on data processing, transformations, and actions in Spark using Python (PySpark).
- Designed the Column families in Cassandra.
- Experienced in running Hadoop streaming jobs to process terabytes of xml format data.
- Configured the above jobs in Airflow.
- Helped develop validation framework using Airflow for the data processing.
- Used Spark API over Hadoop YARN as execution engine for data analytics using Hive.
- Implemented YARN Capacity Scheduler on various environments and tuned configurations according to the application wise job loads.
- Configured Continuous Integration system to execute suites of automated test on desired frequencies using Jenkins, Maven & GIT.
- Involved in loading data from LINUX file system to HDFS.
- Followed Agile Methodologies while working on the project.
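A minimal sketch of the moving-average and RSI computation referenced above, using PySpark window functions; the column names (symbol, trade_date, close) and the input/output paths are placeholders.

    # Minimal sketch: 14-day moving average and RSI per stock with PySpark windows.
    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("ma-rsi-sketch").getOrCreate()

    prices = spark.read.parquet("/data/daily_prices")    # placeholder input path

    w_order = Window.partitionBy("symbol").orderBy("trade_date")
    w_14 = w_order.rowsBetween(-13, 0)                    # trailing 14-row window

    enriched = (
        prices
        .withColumn("ma_14", F.avg("close").over(w_14))
        .withColumn("change", F.col("close") - F.lag("close", 1).over(w_order))
        .withColumn("gain", F.when(F.col("change") > 0, F.col("change")).otherwise(0.0))
        .withColumn("loss", F.when(F.col("change") < 0, -F.col("change")).otherwise(0.0))
        .withColumn("avg_gain", F.avg("gain").over(w_14))
        .withColumn("avg_loss", F.avg("loss").over(w_14))
        .withColumn(
            "rsi_14",
            F.when(F.col("avg_loss") == 0, F.lit(100.0))
             .otherwise(100.0 - 100.0 / (1.0 + F.col("avg_gain") / F.col("avg_loss"))),
        )
        .drop("change", "gain", "loss", "avg_gain", "avg_loss")
    )

    # Land the enriched daily metrics back in the warehouse staging area.
    enriched.write.mode("overwrite").parquet("/warehouse/stock_indicators")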
Environment: Hadoop, HDFS, Hive, Spark, Cloudera, AWS EC2, S3, EMR, Sqoop, Kafka, YARN, Shell Scripting, PySpark, Scala, Pig, Cassandra, Oozie, Agile methods, MySQL, Snowflake
Sr. Data Engineer
Confidential, Charlotte, NC
Responsibilities:
- Experienced in development using Cloudera distribution system.
- As a Hadoop developer, my responsibilities included managing the data pipelines and the data lake.
- Have experience working with the Snowflake data warehouse.
- Evaluated Snowflake design considerations for any change in the application.
- Built the logical and physical data models for Snowflake as per the required changes.
- Defined virtual warehouse sizing in Snowflake for different types of workloads.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data
- Designed custom Spark REPL application to handle similar datasets
- Used Hadoop scripts for HDFS (Hadoop Distributed File System) data loading and manipulation.
- Performed Hive test queries on local sample files and HDFS files
- Performed data cleaning and feature selection using the MLlib package in PySpark and worked with deep learning frameworks such as Caffe and Neon.
- Designed and Implemented Error-Free Data Warehouse-ETL and Hadoop Integration.
- Proficient in data modeling with Hive partitioning, bucketing, and other Hive optimization techniques (see the sketch following this list).
- Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and preprocessing with Pig.
- Used MLlib, Spark's machine learning library, to build and evaluate different models, and used AWS Rekognition for image analysis.
- Set up standards and processes for Hadoop based application design and implementation.
- Wrote Shell scripts for several day-to-day processes and worked on its automation.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Implemented the Fair Scheduler on the JobTracker to share cluster resources among the MapReduce jobs submitted by users.
- Worked on establishing connectivity between Tableau and Spotfire.
- Used Scala to write code for all Spark use cases.
- Analyzed user request patterns and implemented various performance optimization measures including implementing partitions and buckets in HiveQL
- Assigned name to each of the columns using case class option in Scala.
- Developed multiple Spark SQL jobs for data cleaning.
- Created Hive tables and worked on them using HiveQL.
- Assisted in loading large sets of data (structured, semi-structured, and unstructured) into HDFS.
- Developed Spark SQL to load tables into HDFS to run select queries on top.
- Developed an analytical component using Scala, Spark, and Spark Streaming.
- Used visualization tools such as Power View for Excel and Tableau for visualizing data and generating reports.
- Worked on the NoSQL databases HBase and MongoDB.
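A minimal sketch of the Hive partitioning and bucketing pattern referenced above, driven from PySpark with Hive support enabled. The database, table, and column names are placeholders, and on older Spark versions the bucketed INSERT is typically executed in Hive itself so bucketing is enforced.

    # Minimal sketch of Hive-style partitioning and bucketing driven from PySpark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("hive-partition-bucket-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Partition by date, bucket by user_id so joins/sampling on user_id are cheaper.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.events (
            user_id    STRING,
            event_type STRING,
            payload    STRING
        )
        PARTITIONED BY (event_dt STRING)
        CLUSTERED BY (user_id) INTO 32 BUCKETS
        STORED AS ORC
    """)

    # Dynamic partition insert from a raw staging table (placeholder name).
    # Note: older Spark releases do not enforce Hive bucketing on write, so this
    # INSERT may instead be run through Hive/Tez in practice.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT OVERWRITE TABLE analytics.events PARTITION (event_dt)
        SELECT user_id, event_type, payload,
               date_format(event_ts, 'yyyy-MM-dd') AS event_dt
        FROM staging.raw_events
    """)

    # Partition pruning: only the requested date's files are scanned.
    spark.sql("SELECT event_type, COUNT(*) FROM analytics.events "
              "WHERE event_dt = '2017-06-01' GROUP BY event_type").show()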
Environment: Hadoop, Hive, Oozie, Java/JEE 7, Linux, Maven, Oracle 11g/10g, ZooKeeper, MySQL, Spark, Azure, PySpark, VMware 5.1, Eclipse, Pig, HBase, Sqoop, Flume, UNIX.
Data Engineer
Confidential, Bentonville, AR
Responsibilities:
- Utilized AWS Lambda in creating a user-friendly interface for a quick view of reports using C#, JSP, and XML, and developed an expandable menu that shows drill-down data on graph click.
- In-depth knowledge of Scala and experienced in building the Spark applications using Scala.
- Configured Flume to stream data into HDFS and Hive using HDFS Sinks and Hive sinks.
- Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
- Involved in scheduling Oozie workflow engine to run multiple Hive, Pig and Spark jobs.
- Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
- Performed data analysis by using Hive to retrieve the data from Hadoop cluster, SQL to retrieve data from the Oracle database, and used ETL for data transformation.
- Performed Hive performance tuning at all phases.
- Developed Pig UDFs in Java for cleaning bad records/data.
- Coordinated in all testing phases and worked closely with Performance testing team to create a baseline for the new application.
- Experience in commissioning and decommissioning cluster nodes; involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Implemented performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Good knowledge of Spark platform parameters such as memory, cores, and executors (see the configuration sketch following this list).
- Provided concurrent access to Hive tables with shared and exclusive locking by using a ZooKeeper implementation in the cluster.
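A minimal sketch of sizing Spark executor memory, cores, and executor count programmatically; the values shown are illustrative placeholders, not tuned recommendations.

    # Minimal sketch of configuring Spark executor resources from a PySpark driver.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("resource-sizing-sketch")
        .config("spark.executor.memory", "8g")                # heap per executor
        .config("spark.executor.cores", "4")                  # concurrent tasks per executor
        .config("spark.executor.instances", "10")             # fixed executor count...
        .config("spark.dynamicAllocation.enabled", "false")   # ...so disable dynamic allocation
        .config("spark.sql.shuffle.partitions", "200")        # match partitions to cluster size
        .getOrCreate()
    )

    # Rough sanity check: total parallel tasks = executors * cores per executor.
    executors = int(spark.conf.get("spark.executor.instances"))
    cores = int(spark.conf.get("spark.executor.cores"))
    print(f"Cluster can run about {executors * cores} tasks in parallel")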
Environment: Linux, Apache Hadoop Framework, JEE 8, MongoDB 3.5.9, HDFS, AWS, PySpark, Power BI, Qlik, Azure, Pig, Hive, MapReduce, Sqoop, Big Data, Scala, Spark
Hadoop Developer
Confidential
Responsibilities:
- Participated in all the phases of the Software development life cycle (SDLC) which includes Development, Testing, Implementation and Maintenance.
- Installed and configured Hadoop MapReduce, HDFS and developed multiple MapReduce jobs in Java for data cleansing and preprocessing.
- Involved in loading data from UNIX file system to HDFS.
- Installed and configured Hive and wrote Hive UDFs.
- Imported and exported data into HDFS and Hive using Sqoop.
- Used Cassandra CQL and Java APIs to retrieve data from Cassandra tables.
- Responsible for cluster maintenance: adding and removing cluster nodes, cluster monitoring and troubleshooting, and managing and reviewing data backups and Hadoop log files.
- Worked hands on with ETL process using Informatica.
- Handled importing of data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
- Extracted the data from Teradata into HDFS using Sqoop.
- Analyzed the data by performing Hive queries and running Pig scripts to understand user behavior.
- Exported the patterns analyzed back into Teradata using Sqoop.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Installed the Oozie workflow engine to run multiple Hive jobs.
- Developed Hive queries to process the data and generate the data cubes for visualizing.
- Built various graphs for business decision-making using the Python Matplotlib library (see the charting sketch following this list).
- Implemented code in Python to retrieve and manipulate data.
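A minimal Matplotlib sketch of the kind of decision-support chart referenced above; the metric and the monthly figures are made-up placeholders.

    # Minimal sketch of a decision-support bar chart with Matplotlib.
    import matplotlib
    matplotlib.use("Agg")            # render off-screen, e.g. on a cluster edge node
    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    load_volume_tb = [1.2, 1.4, 1.9, 2.3, 2.2, 2.8]   # placeholder monthly ingest volume

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(months, load_volume_tb, color="steelblue")
    ax.set_title("Monthly data ingested into HDFS (TB)")
    ax.set_ylabel("Terabytes")
    for m, v in zip(months, load_volume_tb):
        ax.annotate(f"{v:.1f}", (m, v), ha="center", va="bottom")

    fig.tight_layout()
    fig.savefig("monthly_ingest.png", dpi=150)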
Environment: Hadoop, MapReduce, HDFS, UNIX, Hive, Sqoop, Cassandra, ETL, Pig Script, Cloudera, Oozie.
Data Analyst
Confidential
Responsibilities:
- Involved in gathering, understanding and documenting the data mapping requirements from the user through daily calls and continuous follow-ups.
- Involved in designing and building high-level and low-level ETL technical design documents.
- Developed complex ETL mappings and packages/procedures/functions to extract data from flat files and load it into the Oracle database (see the load sketch following this list).
- Performed error validation of data moving from flat files to the Oracle database.
- Involved in unit testing, self-review, and peer review.
- Worked with the team to achieve timely resolution of all production issues, meeting or exceeding Service Level Agreements.
- Used Bulk Collections for better performance and easy retrieval of data, by reducing context switching between SQL and PL/SQL engines.
- Extensively participated in translating business needs into Business Intelligence reporting solutions by ensuring the correct selection of toolset available across the Tableau BI suite.
- Conducted code reviews to ensure the work delivered by the team met high-quality standards.
- Maintained relationships with assigned customers post-integration, supported their needs, and built relationships to encourage future growth of business with the customer.
- Used shell scripts and PMCMD commands to conduct basic ETL functionalities.
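A minimal sketch of the flat-file-to-Oracle load-and-validate step described above, shown here in Python with the cx_Oracle driver rather than the original ETL tool; the file layout, staging table, and DSN are placeholders.

    # Minimal sketch: validate flat-file records and batch-insert them into Oracle,
    # capturing both pre-load rejects and row-level database errors.
    import csv
    import cx_Oracle

    conn = cx_Oracle.connect(user="etl_user", password="<password>",
                             dsn="dbhost:1521/ORCLPDB1")       # placeholder DSN
    cur = conn.cursor()

    rows, rejects = [], []
    with open("customers.dat", newline="") as f:               # placeholder flat file
        for lineno, rec in enumerate(csv.reader(f, delimiter="|"), start=1):
            # Basic validation: expected column count and a numeric customer id.
            if len(rec) != 3 or not rec[0].isdigit():
                rejects.append((lineno, rec))
                continue
            rows.append((int(rec[0]), rec[1].strip(), rec[2].strip()))

    # Batch insert; batcherrors=True lets good rows load while bad ones are reported.
    cur.executemany(
        "INSERT INTO stg_customers (customer_id, name, city) VALUES (:1, :2, :3)",
        rows,
        batcherrors=True,
    )
    errors = cur.getbatcherrors()
    for err in errors:
        rejects.append((err.offset + 1, rows[err.offset]))

    conn.commit()
    print(f"Loaded {len(rows) - len(errors)} rows, rejected {len(rejects)} records")
    cur.close()
    conn.close()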
Environment: SQL, SQL Server, MS Office, MS Visio, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile Methodology.