
Data Engineer - Python Resume


TX

SUMMARY

  • 7+ years of IT development experience, including the Big Data ecosystem and related technologies.
  • Expertise in Hadoop ecosystem components such as Spark, HDFS, MapReduce, YARN, HBase, Pig, Sqoop, Flume, Oozie, Impala, ZooKeeper, Hive, NiFi and Kafka for scalability, distributed computing, and high-performance computing.
  • Excellent understanding of Hadoop architecture, Hadoop daemons and various components such as HDFS, YARN, ResourceManager, NodeManager, NameNode, DataNode and MapReduce programming paradigm.
  • Good understanding of Apache Spark, Kafka, Storm, NiFi, Talend, RabbitMQ, Elasticsearch, Apache Solr, Splunk and BI tools such as Tableau.
  • Knowledge of Hadoop administration activities using Cloudera Manager and Apache Ambari.
  • Experience working with Cloudera, Amazon Web Services (AWS), Microsoft Azure and Hortonworks.
  • Imported and exported data between RDBMS and HDFS using Sqoop.
  • Good knowledge of containers, Docker and Kubernetes as the runtime environment for CI/CD systems that build, test, and deploy.
  • Created machine learning models using Python and scikit-learn.
  • Hands-on experience loading data (log files, XML, JSON) into HDFS using Flume/Kafka.
  • Extensive experience programming in PySpark with Spark Core and other Spark modules.
  • Experience in dealing with data formats ORC, Parquet, JSON and CSV.
  • Built ETL data pipelines using Python, MySQL, Spark, Hadoop, Hive and UDFs.
  • Experience analyzing data using HiveQL, Pig Latin, HBase, Spark, RStudio and custom MapReduce programs in Python. Extended Hive and Pig core functionality by writing custom UDFs.
  • Used packages such as NumPy, Pandas, Matplotlib and Plotly in Python for exploratory data analysis.
  • Hands-on experience with cloud technologies such as Azure HDInsight, Azure Data Lake, AWS EMR, Athena, Glue and S3.
  • Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
  • Experience in performance tuning by using Partitioning, Bucketing and Indexing in Hive.
  • Experienced in job workflow scheduling and monitoring tools like Airflow, Oozie, TWS, Control-M and Zookeeper.
  • Experience with software development tools such as JIRA, Git and SVN.
  • Comfortable working across operating systems such as Unix/Linux (CentOS, Red Hat, Ubuntu) and Windows environments.
  • Hands-on development experience with RDBMS, including writing complex SQL scripts, stored procedures, and triggers.
  • Experience writing complex SQL queries involving inner and outer joins across multiple tables.
  • Strong in databases like DB2, Oracle, MS SQL.

TECHNICAL SKILLS

Hadoop and Big Data Technologies: HDFS, MapReduce, Flume, Sqoop, Pig, Hive, Morphline, Kafka, Oozie, Spark, NiFi, ZooKeeper, Elasticsearch, Apache Solr, Talend, Cloudera Manager, RStudio, Confluent, Grafana

NoSQL: HBase, Couchbase, MongoDB, Cassandra

Programming and Scripting Languages: C, SQL, Python, C++, Shell scripting, R

Web Services: XML, SOAP, REST APIs

Databases: Oracle, DB2, MS-SQL Server, MySQL, MS-Access, Teradata

Web Development Technologies: JavaScript, CSS, CSS3, HTML, HTML5, Bootstrap, XHTML, jQuery, PHP

Operating Systems: Windows, Unix (Red Hat Linux, CentOS, Ubuntu), macOS

Development Tools: Eclipse, NetBeans, IntelliJ, RStudio

Build Tools: Maven, Scala Build Tool (SBT), Ant

PROFESSIONAL EXPERIENCE

Confidential, TX

Data Engineer - Python

Responsibilities:

  • Analyzed business requirements and prepared detailed specifications that follow the project guidelines required for project development.
  • Communicated regularly with business and IT leadership.
  • Built and deployed jobs using Airflow.
  • Responsible for data extraction and data ingestion from different data sources into S3 by creating ETL pipelines using Spark and Hive.
  • Used PySpark for DataFrames, ETL, data mapping, transformation and loading in a complex, high-volume environment.
  • Worked extensively with PySpark/Spark SQL for data cleansing and for generating DataFrames and RDDs.
  • Coordinated with other team members to write test scripts and test cases for numerous user stories.
  • Used Pandas to calculate moving averages and RSI scores for stocks and loaded the results into the data warehouse.
  • Worked on AWS EMR clusters to process Big Data across a Hadoop cluster of virtual servers.
  • Developed Spark Programs for Batch Processing.
  • Developed Spark code in Python using PySpark and Spark SQL for faster testing and processing of data.
  • Analyzed issues and provided solutions and workarounds to users and end clients.
  • Designed and built data processing applications using Spark on an AWS EMR cluster that consume data from AWS S3 buckets, apply the necessary transformations, and store the curated, business-ready datasets in the Snowflake analytical environment.
  • Developed auditing and threshold checks for error handling, enabling easier debugging and data profiling.
  • Built a data quality framework that runs data rules, generates reports, and emails daily notifications of business-critical job successes and failures to business users.
  • Used Spark to build tables that require multiple computations and non-equi joins.
  • Scheduled various Spark jobs to run daily and weekly.
  • Modelled Hive partitions extensively for faster data processing.
  • Implemented various UDFs in Python as required.
  • Used Bitbucket to collaborate with other team members.
  • Involved in Agile methodologies, daily scrum meetings and sprint planning.

Confidential, CA

Data Engineer

Responsibilities:

  • Executed all phases of the Big Data project lifecycle, from scoping study, requirements gathering, estimation, design, development and implementation through quality assurance and application support.
  • Built frameworks for data curation pipelines using Spark and Hive, and migrated Hive-based applications to Spark.
  • Designed and built data processing applications using Spark on an AWS EMR cluster that consume data from AWS S3 buckets, apply the necessary transformations, and store the curated, business-ready datasets in Snowflake tables.
  • Analyzed issues and provided solutions and workarounds to users and end clients.
  • Developed Spark jobs extensively in Python (Spark SQL) using Spark APIs.
  • Performed data screening and profiling through accuracy checks, fixing missing data, and removing outliers; examined historical data to detect patterns, correlations, or relationships and extrapolated them forward in time.
  • Performed exploratory data analysis (EDA), hypothesis testing and predictive analysis using R/RStudio to analyse customer behavior.
  • Wrote PySpark scripts and wrapper shell scripts to automate data validations.
  • Orchestrated and built schedules/workflows on Tivoli Workload Scheduler (TWS) and Oozie.
  • Developed auditing and threshold checks for error handling, enabling easier debugging and data profiling.
  • Built visualizations in Looker on top of the business-ready datasets loaded in Snowflake.
  • Prepared unit test cases during development.
  • Created Hive tables, loaded data in ORC, JSON and CSV formats, and wrote Hive queries to analyse data using Spark SQL.
  • Built a data quality framework that runs data rules, generates reports, and emails daily notifications of business-critical job successes and failures to business users.
  • Designed and implemented a data quality monitoring and reporting framework in PySpark.
  • Built pipelines to send data extracts and reports via Data Router and SFTP, and to AWS S3 buckets.

Confidential | Jacksonville, FL

Data Engineer

Responsibilities:

  • Used Kafka to collect log data from various sources such as web servers, network devices and mobile handsets.
  • Used Python with Big Data technologies to develop an ingestion framework backed by DynamoDB and Cassandra data stores.
  • Good knowledge of using Spark to create RDDs and DataFrames for faster execution and to perform data transformations and actions.
  • Used Talend tools to create workflows for processing data from multiple source systems.
  • Designed and developed strategies for optimal distribution of web log data across the cluster.
  • Incorporated business logic into Hive queries using Hive generic UDFs.
  • Configured Spark Streaming to receive real-time data from Kafka for high-speed processing and to store the stream data in HDFS.
  • Used Scala to read data (CSV, text, and image data) from HDFS, S3 and Hive.
  • Good knowledge of using the Spark SQLContext for faster execution of Hive queries.
  • Expertise in using the Hadoop ecosystem.
  • Set up Spark on EMR to process data stored in Amazon S3.
  • Designed and implemented complex big data solutions with a focus on collecting, parsing, managing, analyzing, and visualizing large sets of data.
  • Participated in automating tasks such as preprocessing data with Pig and loading it into HDFS by developing Oozie workflows.
  • Involved in data analysis, source system analysis, and data modeling for ETL (Extract, Transform and Load).
  • Wrote various Spark programs for extraction, transformation, and aggregation of data from multiple file formats such as XML, CSV and JSON, as well as various compressed file formats.
  • Wrote Sqoop scripts to import data from RDBMS into Hive/HDFS and loaded the processed data into HBase tables.
  • Developed SQL scripts using Spark to handle different data sets and verified their performance against MapReduce jobs.
  • Good knowledge of writing MapReduce jobs using the Java API to support MapReduce programs running on the cluster.
  • Participated in designing and testing test-data models and applications for the data analytics solution on streaming data.

Confidential

Data Engineer

Responsibilities:

  • Analyzed business requirements and prepared detailed specifications that follow the project guidelines required for project development.
  • Communicated regularly with business and IT leadership.
  • Developed PySpark/Spark SQL scripts to analyze various customer behaviors.
  • Responsible for data extraction and data ingestion from different data sources into Azure Data Lake Store by creating ETL pipelines using Spark and Hive.
  • Worked extensively with PySpark/Spark SQL for data cleansing and for generating DataFrames and RDDs.
  • Coordinated with other team members to write test scripts and test cases for numerous user stories.
  • Worked on Microsoft Azure HDInsight clusters to process Big Data across a Hadoop cluster of virtual servers.
  • Used Sqoop to export data to relational databases.
  • Used Bitbucket to collaborate with other team members.
  • Created Hive tables, loaded data in formats such as Avro, JSON, CSV, TXT and Parquet, and wrote Hive queries to analyze data using HQL.
  • Scheduled various Spark jobs to run daily and weekly.
  • Monitored various cluster activities using Apache Ambari.
  • Created data visualizations using Microsoft Power BI and Tableau.
  • Modelled Hive partitions extensively for faster data processing.
  • Wrote UDFs in Python as scenarios required.
  • Involved in data movement between two clouds.

Confidential

Python Engineer

Responsibilities:

  • Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
  • Used Sqoop to pull data from databases and other sources.
  • Optimized MapReduce jobs to use HDFS efficiently by applying various compression mechanisms.
  • Responsible for data extraction and data ingestion from different data sources into Hadoop Data Lake by creating ETL pipelines using Spark and Hive.
  • Created Hive tables, loaded them with data, and wrote Hive queries that run internally as MapReduce jobs. Loaded data in formats such as Avro and Parquet into these tables and analyzed it using HQL.
  • Used the Oozie workflow engine to run multiple independent jobs.
  • Used Kafka to handle raw data, transforming it into new Kafka topics for further consumption.
  • Developed Kafka producers and consumers, HBase clients, Spark and Hadoop MapReduce jobs, along with components on HDFS and Hive.
  • Developed Spark Programs for Batch Processing.
  • Developed Spark code using PySpark and Spark SQL/Streaming for faster testing and processing of data.
  • Imported data from sources such as HDFS and HBase into Spark RDDs and applied RDD transformations and actions for business analysis.
  • Tuned the performance of Spark applications by setting the right batch interval, level of parallelism, and memory configuration.
  • Used Spark SQL to create structured data with DataFrames and to query other data sources through JDBC and Hive.
  • Used Hive during the data exploration stage to gain insights into the data.
  • Used Git (version control) effectively to collaborate with other team members.
  • Worked on large datasets to generate insights by using Tableau.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Developed components for the reporting dashboard with Spring MVC, Spring Framework and Hibernate, and implemented RESTful web services.
  • Worked hands-on with the ETL process: imported data from various data sources and performed transformations.
  • Worked on batch processing of data sources using Apache Spark and Elasticsearch.
  • Implemented log indexing in Elasticsearch and analyzed the integration of Kibana with Elasticsearch.
  • Wrote data pipelines in Python to extract data from Hive, MySQL and Presto.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Used Apache NiFi to copy the data from local file system to HDFS.
  • Used Apache NiFi to automate the data movement between different Hadoop systems.
  • Developed and maintained the continuous integration and deployment systems using Jenkins and Maven.
  • Accessed Hive tables from Java applications using JDBC to perform analytics.
  • Used Hive to analyze dynamically partitioned and bucketed data and compute various metrics for reporting.
  • Used a Git repository to check out code.
  • Followed Agile methodology for development (TDD, continuous integration).
