Data Engineer Resume
Jacksonville, FL
SUMMARY
- 8+ years of professional software development experience in designing and building applications with Python, Java and Big Data technologies, including development of web-based applications.
- Hands-on experience with major components in Hadoop Ecosystem like MapReduce, HDFS, YARN, Hive, Pig, HBase, Sqoop, Oozie, Cassandra, Impala and Flume.
- Excellent Experience in real-time data processing using Spark Streaming and Kafka.
- Experience in performing ETL using Spark, Spark SQL.
- Worked with Big Data distributions like Cloudera (CDH 3 and 4) with Cloudera Manager.
- Developed and configured Kafka brokers to pipeline server log data into Spark Streaming (a minimal producer sketch follows this summary).
- Experience in design, coding, debugging, reporting, data analysis and web application development using Python.
- Proficient in Object-Oriented Programming concepts like multi-threading, exception handling and collections using Python.
- Worked with web frameworks and front-end technologies like Django, AngularJS, HTML, CSS, XML, JavaScript, jQuery and Bootstrap.
- Experience in writing JSON REST APIs using Golang.
- Strong experience in software development in Python (libraries used: Beautiful Soup, NumPy, SciPy, matplotlib, python-twitter, pandas DataFrames, NetworkX, urllib2, MySQLdb for database connectivity) and IDEs such as Sublime Text, Spyder and PyCharm.
- Hands-on experience working with various Relational Database Management Systems (RDBMS) like MySQL, Microsoft SQL Server, Oracle & non-relational databases (NoSQL) like MongoDB and Apache Cassandra.
- Good knowledge of writing SQL queries, stored procedures, functions, tables, views and triggers on databases such as Oracle and MySQL.
- Experienced in developing Web Services with Python programming language - implementing JSON based RESTful and XML based SOAP web services.
- Proficient in Python OpenStack APIs and GUI frameworks like Pyjamas and Jython (for web).
- Proficient in performing data analysis and data visualization using Python libraries.
- Experience in using Version Control Systems like GIT, SVN and CVS to keep the versions and configurations of the code organized.
- Experience in Unix/Linux shell scripting for job scheduling, batch-job scheduling, automating batch programs, forking and cloning jobs.
- Exposure to CI/CD tools - Jenkins for Continuous Integration, Ansible for continuous deployment.
- Experienced with containerizing applications using Docker.
- Experience with Apache Spark’s Core, Spark SQL, Streaming and MLlib components.
- Experience in Amazon Web Services (AWS) cloud platform like EC2, Virtual Private Clouds (VPCs), Storage models (EBS, S3, and instance storage), Elastic Load Balancers (ELBs).
- Strong experience in analyzing large amounts of data sets writing PySpark scripts and Hive queries.
- Experience in deploying applications on heterogeneous application servers such as Tomcat, WebLogic and Oracle Application Server.
- Experienced in working with various stages of Software Development Life Cycle (SDLC), Software Testing Life Cycle (STLC) and QA methodologies from project definition to post-deployment documentation.
- Good experience with bug-tracking tools like JIRA and Bugzilla.
- Experience in working with Agile Methodologies (Scrum).
- Excellent interpersonal and communication skills, efficient time management and organization skills, ability to handle multiple tasks and work well in team environment.
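A minimal sketch of the kind of Kafka log pipelining mentioned above, using the kafka-python client; the broker addresses, topic name and log path are illustrative assumptions only.

    from kafka import KafkaProducer

    # Hypothetical brokers and topic; a Spark Streaming job would consume
    # the "server-logs" topic downstream.
    producer = KafkaProducer(bootstrap_servers=["broker1:9092", "broker2:9092"])

    with open("/var/log/app/server.log", "rb") as log_file:
        for line in log_file:
            # Each log line becomes one message on the topic.
            producer.send("server-logs", value=line.rstrip())

    producer.flush()
    producer.close()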
TECHNICAL SKILLS
Hadoop Technologies: HDFS, YARN, Spark, MapReduce, HBase, Phoenix, Solr, Hive, Impala, Pig, NiFi, Sqoop, HUE UI, Cloudera, Kerberos
Programming Languages: Java, Python, C, PL/SQL, XML
Web Technologies: JSP, Servlets, Struts, Spring, JavaScript, HTML/HTML5, CSS/CSS3, jQuery, Bootstrap, Ajax
Analysis/Design Tools: Ab Initio ETL, Data Modelling, Design Patterns, UML, Axure, Photoshop
Cloud Tools: Azure Blob Storage, Azure Databricks, Azure Functions, Azure Key Vault, Azure SQL Database, Google Cloud Storage, BigQuery, Dataproc, Spanner
Testing/Logging Tools: JUnit, Mockito, Log4J
Build/Deploy Tools: Ant, Maven, Gradle, TeamCity, Jenkins, uDeploy, Docker
Database Technologies: Oracle, DB2, MySQL, MongoDB, Informix, MS SQL, Cassandra
Web Services: REST, SOAP
Version Control: Git, SVN, CVS
Platforms: Windows, Mac OS X, Linux
Scheduler Tools: Oozie, Autosys, Apache Airflow
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential, Jacksonville, FL
Responsibilities:
- Created Sqoop jobs to bring data from Oracle into HDFS and created external tables in Hive.
- Developed Spark Applications by using Java and implemented Apache Spark data Processing Project to handle data from various RDBMS and streaming sources.
- Used PySpark and Hive to analyze sensor data and cluster users based on their behavior in the events.
- Created external tables in Hive and saved them in ORC file format.
- Ingested data from RDBMS sources, performed data transformations, and then exported the transformed data to Cassandra as per business requirements.
- Built data pipeline using Pig to store onto HDFS.
- Worked on HiveQL for data analysis for importing the structured data to specific tables for reporting.
- Wrote Python scripts to parse XML documents and load the data into the database.
- Worked with Hive to create value-added procedures and wrote Hive UDFs to make the functions reusable across different models.
- Loaded the dataset into Hive for ETL (Extract, Transform and Load) operations.
- Designed and implemented MapReduce based large-scale parallel relation-learning system.
- Implemented best offer logic using Pig scripts and Pig UDFs.
- Developed a Spark script with Apache NiFi to perform source-to-target mapping according to the design document developed by designers.
- Worked extensively on AWS components like Elastic MapReduce (EMR), Elastic Compute Cloud (EC2), and Simple Storage Service (S3).
- Involved in creating Hive tables, loading with data and writing Hive queries which will run internally in MapReduce way.
- Experienced in extending Hive and Pig core functionality by writing custom UDFs using Java.
- Developed parser and loader MapReduce applications to retrieve data from HDFS and store it in HBase and Hive.
- Used Oozie to orchestrate the MapReduce jobs that extract the data in a timely manner.
- Used Amazon Cloud Watch to monitor and track resources on AWS.
- Developed DataFrames for data transformation rules.
- Developed Spark SQL queries to join source tables with multiple driving tables and created a target table in Hive (see the sketch after this list).
- Optimized the code using PySpark for better performance.
- Developed a Spark application to do the source to target mapping.
- Involved in running Hadoop streaming jobs to process terabytes of text data. Worked with different file formats such as Text, SequenceFile, Avro, ORC and Parquet.
- Used the HBase Java API within a Java application.
- Collected data using Spark Streaming and dumped it into HBase.
- Developed Python scripts to start and end jobs smoothly within a UC4 workflow.
- Worked on a NiFi data pipeline to process large data sets and configured lookups for data validation and integrity.
- Responsible for setting up QA environment and updating configurations for implementing scripts with Pig and Sqoop.
- Developed Python scripts to clean the raw data.
- Experienced in writing Spark Applications in Python.
- Developed and analyzed the SQL scripts and designed the solution to implement using PySpark.
- Provided production support to monitor and debug issues arising while scheduled jobs were running.
- Created data pipelines for different events to load the data from DynamoDB to AWS S3 bucket and then into HDFS location.
- Worked extensively on importing metadata into Hive using Python and migrated existing tables and applications to work on the AWS cloud (S3).
- Proficient in AWS services like VPC, EC2, S3, ELB, Redshift, Auto Scaling Groups (ASG), EMR, RDS, IAM, CloudWatch, CloudFront, CloudTrail.
- Experience in building and configuring a virtual data center in the AWS cloud to support Enterprise Data Warehousing including VPC, Route tables and Elastic load balancing.
- Created, modified and executed DDL and ETL scripts for de-normalized tables to load data into Hive and AWS Redshift tables.
- Fetched and generated monthly reports and visualized them using Tableau.
- Containerized NiFi pipelines on EC2 nodes integrated with Spark and Postgres.
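A minimal PySpark sketch of the source-to-target Hive mapping described in this list; the database, table and column names (staging.orders_raw, curated.orders_enriched, order_total) are hypothetical placeholders, not from any actual project.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("source-to-target-mapping")
             .enableHiveSupport()
             .getOrCreate())

    # Read the driving and lookup tables that Sqoop landed in Hive.
    orders = spark.table("staging.orders_raw")
    customers = spark.table("staging.customers")

    # Apply the mapping: join, filter and rename columns per the design document.
    target = (orders.join(customers, "customer_id", "left")
                    .filter(orders.order_status == "COMPLETE")
                    .selectExpr("order_id",
                                "customer_id",
                                "customer_name",
                                "cast(order_total as decimal(12,2)) as order_total"))

    # Persist the result as an ORC-backed Hive table for reporting.
    (target.write
           .mode("overwrite")
           .format("orc")
           .saveAsTable("curated.orders_enriched"))

    spark.stop()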
Environment: Hadoop, Hive, Linux, Sqoop, Oracle, Spark, PySpark, Pig, MapReduce, Python, Shell Scripting, Agile methodology, Azure, Hortonworks, HBase, JIRA, NiFi, Tableau, Java, Jupyter Notebook, AWS.
Big Data Developer
Confidential - Phoenix, AZ
Responsibilities:
- Implemented data ingestion using Sqoop and Spark, loading data from various RDBMS sources, CSV and XML files.
- Handled data cleansing and transformation tasks using Spark and Hive.
- Implemented data consolidation using Spark and Hive to generate data in the required formats, applying ETL tasks for data repair, massaging data to identify its source for audit purposes, and data filtering, then stored the results back to HDFS.
- Loaded all datasets into Hive and Cassandra from source CSV files using Spark/PySpark.
- Migrated computational code from HQL to PySpark.
- Completed data extraction, aggregation and analysis in HDFS using PySpark and stored the required data in Hive.
- Developed Python code to gather data from HBase (Cornerstone) and designed the solution to implement it using PySpark.
- Developed microservices to retrieve data from the frontend system for various data retrieval patterns.
- Created Hive tables on top of the clean data and created Tableau reports on the final tables.
- For incremental loads, handled duplicate and source-updated records using Sqoop merge and a three-step Hive process to eliminate duplicates and keep the most up-to-date data for reporting.
- Built an ingestion framework using Apache NiFi to ingest financial data files from SFTP into HDFS.
- Developed scripts to load log data using Flume and store it in HDFS on a daily basis.
- Created Hive tables for the data loaded into HDFS, applied context n-gram functionality and generated trigram frequencies for the given datasets.
- Submitted the trigram frequency data to data science teams, who applied various NLP algorithms to refine models for understanding near-failure behavior of hardware.
- Once the data was consumed by the data science teams, the Hive tables were dropped and the process was repeated on a daily basis.
- Converted Informatica ETL logic to Spark, rewriting data transformations and ETL jobs with the Spark DataFrames API and using Spark SQL to process data for BI aggregation and reporting needs (see the sketch after this list).
- Stored reporting-ready data (dimensions, facts) in HDFS and moved it to the BI reporting database with Sqoop.
- Uploaded data to Hadoop Hive and combined new tables with existing databases.
- Conducted data comparison tests between Informatica and the newly designed solution.
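An illustrative PySpark sketch of the Informatica-to-Spark rewrite referenced in this list; the HDFS paths and column names (txn_id, txn_amount, txn_date) are assumptions made for the example.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("informatica-to-spark-etl")
             .enableHiveSupport()
             .getOrCreate())

    # Ingest the raw CSV extract and repair obvious data issues.
    raw = (spark.read
                .option("header", True)
                .option("inferSchema", True)
                .csv("hdfs:///data/landing/transactions/"))

    clean = (raw.dropDuplicates(["txn_id"])
                .withColumn("txn_amount", F.col("txn_amount").cast("decimal(14,2)"))
                .filter(F.col("txn_amount").isNotNull()))

    # Spark SQL aggregation standing in for the old Informatica mapping.
    clean.createOrReplaceTempView("transactions")
    fact = spark.sql("""
        SELECT account_id,
               date_format(txn_date, 'yyyy-MM') AS txn_month,
               SUM(txn_amount)                  AS monthly_total,
               COUNT(*)                         AS txn_count
        FROM transactions
        GROUP BY account_id, date_format(txn_date, 'yyyy-MM')
    """)

    # Store reporting-ready data back to HDFS for the downstream Sqoop export.
    fact.write.mode("overwrite").orc("hdfs:///data/curated/fact_monthly_txn/")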
Environment: Spark, Python/PySpark, Spark Streaming, Spark SQL, Java, Hive, Hortonworks (HDP 2.0), Microservices, Oozie, CDH 5, HDFS, Flume, Pig, HBase, NiFi, Sqoop, AWS (EC2, EMR and S3), Informatica, Linux.
Big Data Developer
Confidential - VA
Responsibilities:
- Developed Spark scripts using Python on Azure HDInsight for Data Aggregation, Validation and verified its performance over MR jobs.
- Built pipelines to move hashed and un-hashed data from Azure Blob Storage to Data Lake (see the sketch after this list).
- Utilized Azure HDInsight to monitor and manage the Hadoop Cluster.
- Collaborated on insights with Data Scientists, Business Analysts and Partners.
- Performed advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark with Python.
- Created pipelines to move data from on-premise servers to Azure Data Lake.
- Utilized Python pandas DataFrames for data analysis.
- Enhanced and optimized Spark scripts to aggregate, group and run data mining tasks.
- Loaded the data into Spark RDDs and performed in-memory computation to generate the output response.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
- Used the Spark API over Hadoop YARN to perform analytics on data and monitor job scheduling.
- Implemented schema extraction for Parquet and Avro file formats.
- Experienced in performance tuning of Spark applications by setting the right batch interval time, the correct level of parallelism and appropriate memory settings.
- Developed Hive queries to process the data and generate the data cubes for visualization.
- Built specific functions to ingest columns into Schemas for Spark Applications.
- Experienced in handling large datasets using partitioning, Spark in-memory capabilities, and effective and efficient joins and transformations during the ingestion process itself.
- Developed data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional data sources for data access and analysis.
- Analyzed SQL scripts and designed the solution to implement using PySpark.
- Used reporting tools like Power BI for generating data reports daily.
- Handled several techno-functional responsibilities including estimates, identifying functional and technical gaps, requirements gathering, designing solutions, development, developing documentation and production support.
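A hedged PySpark-on-HDInsight sketch of the Blob-to-Data-Lake pipeline referenced in this list; the storage account, container and column names are invented for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("blob-to-datalake").getOrCreate()

    # Read the raw extract landing in the Blob container (wasbs scheme).
    events = (spark.read
                   .option("header", True)
                   .csv("wasbs://raw@examplestorage.blob.core.windows.net/events/"))

    # Hash the customer identifier before the data leaves the raw zone.
    hashed = (events.withColumn("customer_id_hash", F.sha2(F.col("customer_id"), 256))
                    .drop("customer_id"))

    # Write the curated output to Azure Data Lake in Parquet for analytics.
    (hashed.write
           .mode("overwrite")
           .parquet("adl://exampleadls.azuredatalakestore.net/curated/events/"))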
Environment: Hadoop (HDFS/Azure HDInsight), Hive, YARN, Python/Spark, MapReduce, Pig, Sqoop, Linux, MS SQL Server.
Big Data Developer
Confidential
Responsibilities:
- Performed various POCs in data ingestion, data analysis and reporting using Hadoop, MapReduce, Hive, Sqoop, Flume and Elasticsearch.
- Installed and configured Hadoop.
- Implemented Spark using Python and Spark SQL for faster testing and processing of data.
- Developed Spark scripts using the Python shell as per requirements.
- Developed multiple MapReduce jobs using Java for data cleaning and preprocessing.
- Imported/exported data using Sqoop to load data from Teradata to HDFS/Hive on a regular basis.
- Written Hive queries for ad-hoc reporting to the business.
- Experienced in defining job flows using Oozie.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them (see the sketch after this list).
- Maintained a Hadoop cluster on AWS EMR. Used AWS services like EC2 and S3 for small data set processing and storage.
- Hands-on experience setting up an HBase column-based storage repository for archival and retro data.
- Setup and benchmarked Hadoop clusters for internal use.
- Involved in managing and reviewing Hadoop log files.
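A small sketch of the S3-to-Spark-RDD processing referenced in this list; the bucket name, log layout and output path are assumptions.

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("s3-log-cleaning")
    sc = SparkContext(conf=conf)

    # Load raw text logs from S3 into an RDD (one element per line).
    lines = sc.textFile("s3a://example-bucket/logs/")

    # Transformations: drop malformed rows and keep only error events.
    errors = (lines.map(lambda line: line.split("\t"))
                   .filter(lambda fields: len(fields) >= 3)
                   .filter(lambda fields: fields[2] == "ERROR"))

    # Action: count errors per host and write the result back to HDFS.
    per_host = errors.map(lambda fields: (fields[0], 1)).reduceByKey(lambda a, b: a + b)
    per_host.saveAsTextFile("hdfs:///data/clean/error_counts/")

    sc.stop()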
Environment: Hadoop, MapReduce, Spark, Spark SQL, YARN, Hive, Sqoop, Flume, HBase, Elasticsearch, Cloudera Manager, AWS, Jenkins, Maven.
Hadoop/Spark Developer
Confidential
Responsibilities:
- Built a real-time data pipeline to store data for real-time analysis and batch processing.
- Developed Spark jobs to summarize and transform data in Hive.
- Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to Hive (see the sketch after this list).
- Ran trials connecting Kafka to storage layers such as HBase, MongoDB and HDFS/Hive for other analytics.
- Performed data migration from legacy RDBMS databases to HDFS using Sqoop.
- Developed Hive scripts in HiveQL to de-normalize and aggregate the data.
- Developed Shell Scripts for writing data to HDFS.
- Developed customized UDFs in Java to extend Hive and Pig functionality.
- Expertise in creating Hive tables and loading and analyzing data using Hive queries.
- Performed transformations, cleaning and filtering on imported data using Hive and loaded final data into HDFS.
- Involved in developing a MapReduce framework that filters bad and unnecessary records.
- Developed Hive queries on different tables to find insights. Automated the process of building data pipelines for data scientists for predictive, classification, descriptive and prescriptive analytics.
- Built a data warehouse on Hadoop/Hive from different RDBMS systems using the Apache NiFi data flow engine to replicate the whole database.
- Developed NiFi Workflows to pick up the data from Data Lake as well as from server logs and send that to Kafka broker.
- Created NiFi flows to trigger Spark jobs and used PutEmail processors to get notifications about any failures.
- Led the offshore team in automating the NiFi workflows using the NiFi REST API.
- Designed and implemented complex workflows in the Oozie scheduler.
- Developed Python scripts for Data validation.
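A minimal Spark Structured Streaming sketch of the Kafka-to-Hive flow referenced in this list (the original applications may well have used DStream-based Spark Streaming instead); broker addresses, topic, checkpoint path and table names are hypothetical.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("kafka-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    # Consume raw events from a Kafka topic.
    stream = (spark.readStream
                   .format("kafka")
                   .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
                   .option("subscribe", "server-logs")
                   .load())

    # Kafka delivers bytes; extract the message value as a string.
    events = stream.selectExpr("CAST(value AS STRING) AS message",
                               "timestamp AS event_time")

    # Append each micro-batch to an ORC-backed Hive table.
    def write_batch(batch_df, batch_id):
        batch_df.write.mode("append").format("orc").saveAsTable("logs.server_events")

    query = (events.writeStream
                   .foreachBatch(write_batch)
                   .option("checkpointLocation", "hdfs:///checkpoints/kafka_to_hive/")
                   .start())

    query.awaitTermination()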
Environment: Spark, Spark SQL, Spark Streaming, Kafka, Hadoop, HDFS, Hive, Oozie, MapReduce, Pig, Sqoop, Shell Scripting, HBase, Apache NiFi, Tableau, Oracle, MySQL, Teradata and DB.
Software Engineer
Confidential
Responsibilities:
- Designed, developed, tested, deployed and maintained the website.
- Designed and developed the UI of the website using HTML, AJAX, CSS and JavaScript.
- Developed entire frontend and backend modules using Python on Django Web Framework.
- Designed and developed data management system using MySQL.
- Rewrote existing Python/Django/Java modules to deliver data in specific formats.
- Used Django database APIs to access database objects.
- Wrote Python scripts to parse XML documents and load the data into the database (see the sketch after this list).
- Generated property lists for every application dynamically using Python.
- Responsible for search engine optimization to improve the visibility of the website.
- Handled all the client-side validation using JavaScript.
- Created unit test/regression test frameworks for working and new code.
- Responsible for debugging and troubleshooting the web application.
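A hypothetical Django-style sketch of the XML-parsing and data-loading scripts referenced in this list, assuming a configured Django project with the model living in an installed app; the Product model and XML layout are made up for illustration.

    import json
    import xml.etree.ElementTree as ET

    from django.db import models
    from django.http import HttpResponse

    class Product(models.Model):
        name = models.CharField(max_length=100)
        price = models.DecimalField(max_digits=10, decimal_places=2)

    def load_products(xml_path):
        """Parse an XML document and load each <product> element into MySQL via the ORM."""
        tree = ET.parse(xml_path)
        for node in tree.getroot().findall("product"):
            Product.objects.create(name=node.findtext("name"),
                                   price=node.findtext("price"))

    def product_list(request):
        """Return the stored products as JSON (works with Django 1.3's HttpResponse)."""
        data = [{"name": p.name, "price": str(p.price)} for p in Product.objects.all()]
        return HttpResponse(json.dumps({"products": data}),
                            content_type="application/json")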
Environment: Python, Django 1.3, MySQL, Linux, HTML, XHTML, CSS, AJAX, JavaScript, Apache