Sr. Data Engineer/Python Developer Resume
Washington, DC
SUMMARY
- Data Engineer/Python Developer with 8 years of experience implementing Big Data/Cloud Engineering, Snowflake, Data Warehouse, Data Modeling, Data Mart, Data Visualization, Reporting, Data Quality, Data Virtualization, and Data Science solutions. Strong experience architecting, designing, and operationalizing large-scale data and analytics solutions on the Snowflake Cloud Data Warehouse.
- Experience in Big Data using the Hadoop framework and in the analysis, design, development, documentation, deployment, and integration of Big Data and Java/J2EE technologies with AWS and Azure; specialized in the data ecosystem, including data aggregation, querying, storage, analysis, and the development and implementation of data models.
- Experience in Data Warehousing, ETL, Reporting, Visualization, and Database-Driven Solutions, with expertise in development, testing, debugging, implementation, documentation, and production support.
- Experienced in building automation in Python to validate ETL processes across multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
- Experience in Hadoop Ecosystem components such as Hive, HDFS, Sqoop, Spark, Kafka, and Pig.
- Good experience in developing applications and implementing Model View Controller (MVC) architecture using server-side Python frameworks (Django, Flask, and Pyramid).
- Experience in architecting, designing, installation, configuration and management of Apache Hadoop Clusters, MapR, Hortonworks & Cloudera Hadoop Distribution.
- Good understanding of Hadoop architecture and Hands on experience with Hadoop components such as Resource Manager, Node Manager, Name Node, Data Node and Map Reduce concepts and HDFS Framework.
- Expertise in Data Migration, Data Profiling, Data Ingestion, Data Cleansing, Transformation, Data Import, and Data Export using multiple ETL tools such as Informatica PowerCenter.
- Working knowledge of Spark RDD, DataFrame API, Dataset API, Data Source API, Spark SQL, and Spark Streaming.
- Experience importing and exporting data between HDFS and relational database systems using Sqoop, and loading it into partitioned Hive tables.
- Worked on HQL for data extraction and join operations, with good experience optimizing Hive queries.
- Experience with partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
- Experience in working with business intelligence and data warehouse software, including SSAS/SSRS/SSIS, Business Objects, Amazon Redshift, Azure Data Warehouse and Teradata.
- Developed Spark code using Scala, Python, and Spark SQL/Streaming for faster processing of data.
- Implemented Spark Streaming jobs in Scala by developing RDDs (Resilient Distributed Datasets), and used PySpark and spark-shell accordingly.
- Profound experience in creating real time data streaming solutions using Apache Spark /Spark Streaming, Kafka and Flume.
- Used Apache NiFi to automate the data movement between different Hadoop Systems.
- Good experience in handling messaging services using Apache Kafka.
- Knowledge of data mining and data warehousing using ETL tools; proficient in building reports and dashboards in Tableau (BI tool).
- Excellent knowledge of job workflow scheduling and locking tools/services like Oozie and Zookeeper.
- Good understanding and knowledge of NoSQL databases like HBase and Cassandra.
- Good understanding of Amazon Web Services (AWS), including EC2 for compute, S3 for storage, EMR, Step Functions, Lambda, Redshift, and DynamoDB (a minimal Boto3 sketch follows this summary).
- Hands-on working experience with Microsoft Azure services such as HDInsight clusters, Blob storage, ADLS, Data Factory, and Logic Apps.
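As a brief illustration of the AWS usage summarized above, the following is a minimal Boto3 sketch of a common S3 pattern: list objects under a prefix and upload a processed file. The bucket, prefix, and file names are hypothetical placeholders rather than values from any specific engagement.

```python
# Minimal Boto3 sketch: enumerate staged files under an S3 prefix and upload a
# processed extract. Bucket, prefix, and key names are illustrative placeholders.
import boto3

s3 = boto3.client("s3")

# List raw files staged under a landing prefix
resp = s3.list_objects_v2(Bucket="example-data-lake", Prefix="landing/2020/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Push a locally produced extract back to the curated area
s3.upload_file("daily_extract.parquet",
               "example-data-lake",
               "curated/daily_extract.parquet")
```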
TECHNICAL SKILLS
Hadoop Components / Big Data: HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Impala, Zookeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, PySpark, Airflow, Snowflake
Languages: Scala, Python, SQL, HiveQL, KSQL, Boto3
IDE Tools: Eclipse, IntelliJ, PyCharm.
Cloud platform: AWS, Azure
Reporting and ETL Tools: Tableau, Power BI, Talend, AWS Glue.
Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL databases (HBase, Cassandra, MongoDB)
Big Data Technologies: Hadoop, HDFS, Hive, Pig, Oozie, Sqoop, Spark, Machine Learning, Pandas, NumPy, Seaborn, Impala, Zookeeper, Flume, Airflow, Informatica, Snowflake, Databricks, Kafka, Cloudera
Data Analysis Libraries: Pandas, NumPy, SciPy, Scikit-learn, NLTK, Plotly, Matplotlib
BI Tools: Alteryx, Tableau, Power BI, Sisense
Containerization: Docker, Kubernetes
CI/CD Tools: Jenkins, Bamboo, GitLab
Operating Systems: UNIX, LINUX, Ubuntu, CentOS.
Software Methodologies: Agile, Scrum, Waterfall
Reporting Tools: Power BI, QlikView, Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0.
PROFESSIONAL EXPERIENCE
Confidential, Washington, DC
Sr. Data Engineer/Python Developer
Responsibilities:
- Programming in Python and Scala with the Hadoop framework, utilizing Cloudera Hadoop ecosystem projects (HDFS, Spark, Sqoop, Hive, HBase, Oozie, Impala, Zookeeper, etc.).
- Involved in developing Spark applications using Scala and Python for data transformation, cleansing, and validation using the Spark API.
- Worked with all the Spark APIs (RDD, DataFrame, Data Source, and Dataset) to transform the data.
- Worked on both batch and streaming data sources; used Spark Streaming and Kafka with Python scripts for streaming data processing.
- Worked on predictive analytics use-cases using Python language.
- Developed Python batch processors to consume and produce various feeds.
- Performed testing using Django’s Test Module.
- Cleaned and processed third-party spending data into manageable deliverables in specific formats using Python libraries such as NumPy, SQLAlchemy, and Matplotlib.
- Used Python and Django for creating graphics, XML processing, data exchange, and business logic implementation.
- Developed Python APIs to dump the array structures in the processor at the failure point for debugging.
- Developed a Spark Streaming script that consumes topics from the distributed messaging source Kafka and periodically pushes batches of data to Spark for real-time processing (see the sketch after this list).
- Built data pipelines for reporting, alerting, and data mining. Experienced with table design and data management using HDFS, Hive, Impala, Sqoop, MySQL, and Kafka.
- Worked on Apache NiFi to automate data movement between RDBMS and HDFS.
- Created shell scripts to handle various jobs such as MapReduce, Hive, Pig, and Spark, based on the requirement.
- Used Hive techniques such as bucketing and partitioning to create the tables.
- Experience with Spark SQL for processing large amounts of structured data.
- Experienced working with source formats including CSV, JSON, Avro, and Parquet.
- Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
- Worked on AWS to aggregate clean files in Amazon S3 and on Amazon EC2 clusters to deploy files into S3 buckets.
- Worked on Python OpenStack APIs; used Python scripts to update content in the database and manipulate files.
- Involved in data modeling using Star Schema and Snowflake Schema.
- Used AWS EMR to create Hadoop and Spark clusters, which were used for submitting and executing Scala and Python applications in production.
- Responsible for developing a data pipeline on AWS to extract data from weblogs and store it in HDFS.
- Migrated the data from AWS S3 to HDFS using Kafka.
- Integrated Kubernetes with networking, storage, and security to provide comprehensive infrastructure, and orchestrated Kubernetes containers across multiple hosts.
- Implemented Jenkins and built pipelines to drive all microservice builds out to the Docker registry and deploy them to Kubernetes.
- Experienced in loading and transforming large sets of structured and semi-structured data using the ingestion tool Talend.
- Worked with NoSQL databases such as HBase and Cassandra to retrieve and load data for real-time processing using REST APIs.
- Worked on creating data models for Cassandra from the existing Oracle data model.
- Responsible for transforming and loading large sets of structured, semi-structured, and unstructured data.
- Used Python with Boto3 to supplement automation provided by Ansible and Terraform for tasks such as encrypting EBS volumes backing AMIs and scheduling Lambda functions for routine AWS tasks.
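A minimal sketch of the Kafka-to-Spark streaming pattern referenced above, written in PySpark against the DStream (Spark Streaming) API; the broker address, topic name, batch interval, and output path are hypothetical placeholders, and the job assumes the spark-streaming-kafka-0-8 connector is on the classpath.

```python
# Minimal PySpark Spark Streaming sketch: consume a Kafka topic in micro-batches
# and land the parsed records on HDFS. Broker, topic, and paths are placeholders.
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-streaming-sketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Direct stream against the Kafka brokers (placeholder broker list and topic)
stream = KafkaUtils.createDirectStream(
    ssc, ["spend_events"], {"metadata.broker.list": "broker1:9092"})

# Each record arrives as a (key, value) pair; parse the JSON payload
events = stream.map(lambda kv: json.loads(kv[1]))

# Persist each micro-batch to HDFS for downstream Hive/Spark jobs
events.map(json.dumps).saveAsTextFiles("hdfs:///data/streams/spend_events/batch")

ssc.start()
ssc.awaitTermination()
```

In practice a job like this would be submitted with spark-submit along with the matching Kafka connector package, and the text output would typically be compacted into Parquet by a downstream batch job.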
Environment: Hadoop 2.7.7, HDFS 2.7.7, Apache Hive 2.3, Apache Kafka 0.8.2.x, Apache Spark 2.3, Spark SQL, Spark Streaming, Zookeeper, Pig, Oozie, Java 8, Python 3, S3, EMR, EC2, Redshift, Cassandra, NiFi, Talend, HBase, Cloudera (CDH 5.x), Snowflake, Power BI, Tableau.
Confidential, Oakland, CA
Sr. Data Engineer/Python Developer
Responsibilities:
- Developed dataset processes for data mining and data modeling, and recommended ways to improve data quality, efficiency, and reliability.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
- Responsible for writing Hive Queries to analyze the data in Hive warehouse using Hive Query Language (HQL). Involved in developing Hive DDLs to create, drop and alter tables.
- Extracted data from various sources such as Oracle, Teradata, and SQL Server and loaded it into HDFS using Sqoop import.
- Created Hive staging tables and external tables and joined them as required (a sketch of this pattern follows this list).
- Implemented dynamic partitioning, static partitioning, and bucketing.
- Installed and configured Hadoop MapReduce, Hive, HDFS, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
- Worked on Microsoft Azure services such as HDInsight clusters, Blob storage, ADLS, Data Factory, and Logic Apps, and completed a POC on Azure Databricks.
- Implemented Sqoop jobs for data ingestion from the Oracle to Hive.
- Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML files. Skilled in using columnar file formats such as RCFile, ORC, and Parquet.
- Extensively used Python libraries such as Pandas, Matplotlib, and NumPy to manipulate and visualize the data with interactive charts.
- Extensively used Python modules such as requests, urllib, and Beautiful Soup for web crawling.
- Developed custom Unix/Bash shell scripts for pre- and post-validation of the master and slave nodes, before and after configuring the NameNode and DataNodes.
- Developed job workflows in Oozie for automating the tasks of loading the data into HDFS.
- Implemented compact and efficient file storage of big data by using various file formats like Avro, Parquet, JSON and using compression methods like GZip, Snappy on top of the files.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
- Worked on Spark using Python as well as Scala, and on Spark SQL, for faster testing and processing of data.
- Wrote and executed various MySQL database queries from Python using the Python MySQL Connector and the MySQLdb package.
- Worked with data modeling concepts such as star schema and snowflake schema in the project.
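The sketch below illustrates the Hive external-table and dynamic-partitioning pattern mentioned above, driven from PySpark with Hive support. Database, table, column, and path names are hypothetical placeholders, and the bucketed variants of these tables are omitted for brevity.

```python
# Sketch: create an external, partitioned Hive table and load it with dynamic
# partitioning from a staging table. All names and paths are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# External table partitioned by load date; data files live outside the warehouse
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.orders_ext (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
    LOCATION 'hdfs:///data/external/orders'
""")

# Allow dynamic partitioning so the partition value comes from the data itself
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

# Load from a staging table; each distinct load_date becomes its own partition
spark.sql("""
    INSERT OVERWRITE TABLE analytics.orders_ext PARTITION (load_date)
    SELECT order_id, customer_id, amount, load_date
    FROM staging.orders_raw
""")
```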
Environment: Hadoop 2.7, HDFS, Microsoft Azure services (HDInsight, Blob storage, ADLS, Logic Apps, etc.), Hive 2.2, Sqoop 1.4.6, Snowflake, Apache Spark 2.3, Spark SQL, ETL, Maven, Oozie, Java 8, Python 3, Unix.
Confidential, Wilmington, DE
Data Engineer
Responsibilities:
- Handled importing of data from various data sources, performed data control checks using PySpark and loaded data into HDFS.
- Involved in converting Hive/SQL queries into PySpark transformations using Spark RDDs and Python.
- Used PySpark SQL to load JSON data, create SchemaRDDs, and load them into Hive tables, and handled structured data using Spark SQL (see the sketch after this list).
- Developed PySpark programs in Python and performed transformations and actions on RDDs.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Used PySpark and Spark SQL to read Parquet data and create tables in Hive using the Python API.
- Implemented PySpark in Python, utilizing DataFrames and the PySpark SQL API for faster data processing.
- Developed Python scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
- Experienced in handling large datasets using partitions, PySpark in-memory capabilities, broadcast variables, and effective and efficient joins and transformations during the ingestion process itself.
- Processed schema-oriented and non-schema-oriented data using Python and Spark.
- Involved in writing real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in HDFS.
- Worked on a streaming pipeline that uses PySpark to read data from Kafka, transform it, and write it to HDFS.
- Designed Data Marts by following Star Schema and Snowflake Schema Methodology, using industry leading Data Modeling tools.
- Worked on the Snowflake database, writing queries and stored procedures for normalization.
- Worked with Snowflake stored procedures, using them with corresponding DDL statements and the JavaScript API to wrap and execute numerous SQL queries.
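A minimal sketch of the JSON-to-Hive flow described above, written against the Spark 1.x HiveContext API to match the Spark 1.6 work in this role; the input path, view name, query columns, and table name are hypothetical placeholders.

```python
# Sketch: load JSON into a DataFrame, query it with Spark SQL, and persist the
# result as a Hive table. Paths and table names are illustrative placeholders.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="json-to-hive-sketch")
hive_ctx = HiveContext(sc)

# Read semi-structured JSON; Spark infers the schema from the data
events = hive_ctx.read.json("hdfs:///landing/events/*.json")

# Expose the DataFrame to Spark SQL as a temporary table
events.registerTempTable("events_raw")
daily = hive_ctx.sql("""
    SELECT to_date(event_ts) AS event_date, event_type, COUNT(*) AS cnt
    FROM events_raw
    GROUP BY to_date(event_ts), event_type
""")

# Persist the aggregate as a Hive table for downstream reporting
daily.write.mode("overwrite").saveAsTable("daily_event_counts")
```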
Environment: Cloudera (CDH3), AWS, Snowflake, HDFS, Pig 0.15.0, Hive 2.2.0, Kafka, Sqoop, Shell Scripting, Spark 1.6, Linux CentOS, MapReduce, Python 2, Eclipse 4.6.
Confidential
Data Modeler/Engineer
Responsibilities:
- When working with the open-source Apache distribution, Hadoop admins have to manually set up all the configuration files (core-site, hdfs-site, yarn-site, and mapred-site). However, when working with a popular Hadoop distribution such as Hortonworks, Cloudera, or MapR, the configuration files are set up on startup and the Hadoop admin need not configure them manually.
- Used Sqoop to import data from Relational Databases like MySQL, Oracle.
- Involved in importing structured and unstructured data into HDFS and AWS.
- Responsible for fetching real-time data using Kafka and processing using Spark and Scala.
- Worked on Kafka to import real-time weblogs and ingested the data to Spark Streaming.
- Developed business logic using Kafka Direct Stream in Spark Streaming and implemented business transformations.
- Worked on Building and implementing real-time streaming ETL pipeline using Kafka Streams API.
- Worked on Hive to implement web interfacing and stored the data in Hive tables.
- Migrated MapReduce programs into Spark transformations using Spark and Scala.
- Experienced with Spark Context, Spark SQL, and Spark on YARN.
- Implemented Spark scripts using Scala, Python, and Spark SQL to access Hive tables in Spark for faster data processing.
- Implemented data quality checks using Spark Streaming, flagging records as passable or bad.
- Implemented Hive Partitioning and Bucketing on the collected data in HDFS.
- Implemented Sqoop jobs for large data exchanges between RDBMS and Hive clusters.
- Extensively used Zookeeper as a backup server and job scheduler for Spark jobs.
- Developed Spark scripts using Scala shell commands as per the business requirement.
- Worked on Cloudera distribution and deployed on AWS EC2 Instances.
- Experienced in loading real-time data into a NoSQL database such as Cassandra.
- Experience retrieving data from the Cassandra cluster by running CQL (Cassandra Query Language) queries (see the sketch after this list).
- Worked on connecting the Cassandra database to the Amazon EMR File System for storing the data in S3.
- Implemented usage of Amazon EMR for processing Big Data across a Hadoop Cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Deployed the project on Amazon EMR with S3 connectivity to set up backup storage using Python.
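The snippet below is a minimal sketch of querying a Cassandra cluster from Python with the DataStax driver, as referenced above; the contact points, keyspace, table, and query are hypothetical placeholders.

```python
# Sketch: read recent records back from a Cassandra cluster via CQL using the
# DataStax Python driver. Hosts, keyspace, and table are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.11", "10.0.0.12"])   # placeholder contact points
session = cluster.connect("weblogs")            # placeholder keyspace

# CQL query against a table partitioned by day for efficient lookups
rows = session.execute(
    "SELECT event_time, url, status FROM page_hits WHERE day = %s LIMIT 100",
    ("2019-03-01",),
)
for row in rows:
    print(row.event_time, row.url, row.status)

cluster.shutdown()
```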
Environment: Hadoop, Map Reduce, Hive, Spark, Oracle, GitHub, Tableau, UNIX, Cloudera, Kafka, Sqoop, Scala, NIFI, HBase, Amazon EC2, S3.
Confidential
Data Engineer
Responsibilities:
- Responsible for building scalable distributed data solution using Hadoop Cluster environment with Hortonworks distribution.
- Converted raw data to compact serialized/columnar formats such as Avro and Parquet to reduce data processing time and increase data transfer efficiency over the network (see the sketch after this list).
- Worked on building end to end data pipelines on Hadoop Data Platforms.
- Worked on Normalization and De-normalization techniques for optimum performance in relational and dimensional databases environments.
- Designed, developed, and tested Extract, Transform, Load (ETL) applications with different types of sources.
- Created files and tuned SQL queries in Hive using Hue. Implemented MapReduce jobs in Hive by querying the available data.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
- Experience with PySpark, using Spark libraries through Python scripting for data analysis.
- Involved in converting HiveQL into Spark transformations using Spark RDD and through Scala programming.
- Created User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) in Pig and Hive.
- Worked on building custom ETL workflows using Spark/Hive to perform data cleaning and mapping.
- Implemented custom Kafka encoders for custom input formats to load data into Kafka partitions.
- Provided support for the cluster and topics via Kafka Manager; worked on CloudFormation scripting, security, and resource automation.
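The sketch below shows the kind of raw-to-columnar conversion referenced above, turning delimited files into Snappy-compressed, date-partitioned Parquet with PySpark (assuming Spark 2.x); the paths, partition column, and schema-inference settings are hypothetical placeholders, and writing Avro would additionally require the spark-avro package.

```python
# Sketch: convert raw delimited files to partitioned, Snappy-compressed Parquet
# to cut processing time and network transfer. Paths and columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-parquet-sketch").getOrCreate()

# Raw delimited input with a header row; schema is inferred for brevity
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("hdfs:///raw/transactions/*.csv"))

# Columnar output, partitioned by date so downstream queries can prune partitions
(raw.write
    .mode("overwrite")
    .option("compression", "snappy")
    .partitionBy("txn_date")
    .parquet("hdfs:///curated/transactions"))
```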
Environment: Python, HDFS, MapReduce, Flume, Kafka, Zookeeper, Pig, Hive, HQL, HBase, Spark, ETL, Web Services, Linux Red Hat, Unix.