Data Engineer Resume
New York, NY
SUMMARY
- Over 8 years of experience as a Data Engineer and Data Analyst, covering data integration, big data, logical and physical data modeling, and implementation of business applications using the Oracle Relational Database Management System (RDBMS).
- Strong experience in the analysis, design, development, testing, and implementation of database applications in client/server environments using Oracle 12c/11g/10g/9i/8i, SQL, SQL*Loader, and Open Interface.
- Experienced in database conversion from Oracle and SQL Server to PostgreSQL and MySQL.
- Extensive knowledge of client/server technology, GUI design, relational database management systems (RDBMS), and Rapid Application Development methodology.
- Worked extensively in PL/SQL, creating stored procedures, clusters, packages, database triggers, exception handlers, cursors, and cursor variables.
- In-depth understanding of AWS monitoring and auditing tools such as CloudWatch and CloudTrail.
- Expert understanding of AWS DNS services through Route 53, including the Simple, Weighted, Latency, Failover, and Geolocation routing policies.
- Hands-on experience installing, configuring, monitoring, and using Hadoop ecosystem components such as Hadoop MapReduce, HDFS, HBase, Hive, Sqoop, Pig, ZooKeeper, Hortonworks, and Flume.
- Expert in Amazon EMR, Spark, Kinesis, S3, Boto3, Elastic Beanstalk, ECS, CloudWatch, Lambda, ELB, VPC, ElastiCache, DynamoDB, Redshift, RDS, Athena, Zeppelin, and Airflow.
- Experience in handling, configuring, and administering relational databases such as MySQL and NoSQL databases such as MongoDB and Cassandra.
- Good knowledge of AWS CloudFormation templates; configured the SQS service through the Java API to send and receive information.
- Experience in creating separate virtual data warehouses with different size classes in Snowflake on AWS.
- Worked on data virtualization using Teiid and Spark, RDF graph data, Solr search, and fuzzy algorithms.
- Strong knowledge of Massively Parallel Processing (MPP) databases, in which data is partitioned across multiple servers or nodes, each with its own memory and processors to process data locally.
- Data modeling and database development for OLTP and OLAP (Star Schema, Snowflake Schema, Data Warehouse, Data Marts, Multi-Dimensional Modeling, and Cube design), Business Intelligence, and data mining.
- Extensively used SQL, NumPy, pandas, scikit-learn, Spark, and Hive for data analysis and model building.
- Developed and maintained multiple Power BI dashboards, reports, and content packs.
- Created Power BI visualizations and dashboards as per the requirements.
- Hands-on expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB, and ElastiCache (Memcached and Redis).
- Responsible for designing and building a data lake using Hadoop and its ecosystem components.
- Working experience creating real-time data streaming solutions using Apache Spark/Spark Streaming and Kafka; built Spark DataFrames using Python.
- Used AWS Lambda to develop APIs to manage servers and run code in AWS.
- Experience with ETL workflow management tools such as Apache Airflow, with significant experience writing Python scripts to implement workflows.
- Experience in working with databases like MongoDB, MySQL and Cassandra.
- Working knowledge of SQL Trace, TKPROF, Explain Plan, and SQL*Loader for performance tuning and database optimization.
- Provided regional MySQL database migrations and hot standby servers via asynchronous replication, including on Amazon EC2 and RDS (with solutions tailored for managing RDS).
- Extensive experience in dynamic SQL, records, arrays, exception handling, data sharing, data caching, and data pipelining, including complex processing using nested arrays and collections.
- Experience integrating databases such as MongoDB and MySQL with web pages built with HTML, PHP, and CSS to update, insert, delete, and retrieve data with simple ad hoc queries.
- Developed heavy-load Spark batch processing on top of Hadoop for massively parallel computing.
- Strong knowledge of Extraction, Transformation, and Loading (ETL) processes using UNIX shell scripting, SQL, PL/SQL, and SQL*Loader.
- Developed distributed data processing applications using the Spark RDD and DataFrame APIs.
TECHNICAL SKILLS
Tools and Technologies: Hadoop/Big Data technologies (Sqoop, Flume, Hive, Impala)
NoSQL Databases: HBase, Cassandra, MongoDB
Monitoring and Reporting: Tableau, Custom Shell Scripts
Hadoop Distributions: Hortonworks, Cloudera, MapR, Spark
Build and Deployment Tools: Maven, sbt, Git, SVN, Jenkins
Programming and Scripting: Scala, SQL, Shell Scripting, Python, Pig Latin, HiveQL
Databases: Oracle, MySQL, MS SQL Server
Analytics Tools: Tableau, Microsoft SSIS, SSAS and SSRS
Web Dev. Technologies: HTML, XML, JSON, CSS
ETL Tools: Informatica PowerCenter
Operating Systems: Linux, Unix, Windows 8, Windows 7, Windows Server 2008/2003
AWS Services: EC2, EMR, S3, Redshift, Lambda, Athena
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential, New York, NY
Responsibilities:
- Connected Spark Scala code to DB2 over SQL to select, insert, and update data in the database.
- Used broadcast joins in Spark to join smaller datasets to large datasets without shuffling data across nodes.
- Designed and implemented an incremental Sqoop job to read data from DB2 and load it into Hive tables, and connected Tableau through HiveServer2 to generate interactive reports.
- Developed a Spark application that loads CSV file data and applies business validation on the DataFrame to separate valid and invalid records. Wrote the valid DataFrame into the actual Hive partitioned table and the invalid DataFrame into an error table, partitioned by load date and load type.
- Implemented Spark scripts using SparkSession, Python, and Spark SQL to access Hive tables in Spark for faster data processing.
- Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
- Developed Spark programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
- Implemented Spark using Python and Spark SQL for faster testing and processing of data.
- Used Spark Streaming with Python to receive real-time data from Kafka and store the stream data in HDFS and in NoSQL databases such as HBase and Cassandra.
- Developed data processing applications in Scala using Spark RDDs as well as DataFrames using the Spark SQL APIs.
- Worked with the SparkSession object on Spark SQL and DataFrames for faster execution of Hive queries.
- Imported data from different sources such as SQL Server into Spark RDDs and developed a data pipeline using Kafka and Spark to store data in HDFS.
- Used Spark SQL to load JSON data, create a schema RDD, and load it into Hive tables; handled structured data using Spark SQL.
- Worked closely with business analysts and enterprise architects to understand the rules provided by the business.
- Created shell scripts that access the staging location on edge nodes and move specified inbound files to the HDFS publish location, and used D-series to invoke the invoker code (Spring Boot) as scheduled.
- Worked extensively with Sqoop to import data from relational database systems/mainframe into HDFS and to export data from HDFS back to them.
- Wrote unit test cases for the Spark Scala code using FunSuite.
- Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Configured multiple AWS services like EMR and EC2 to maintain compliance with organization standards.
- Implemented NiFi flow topologies to perform cleansing operations before moving data into HDFS.
- Used Apache NiFi to copy data from local file system to HDP.
- Worked on big data integration and analytics based on Hadoop, Solr, Spark, Kafka, Storm, and webMethods technologies.
- Scheduled daily batch jobs with Tidal Enterprise Scheduler.
Environment: Scala, Spark Core, Spark SQL, Apache Hadoop 2.7.6, Spark 2.3, Hive SQL, Spring Boot, CDH5, HDFS, Cassandra, ZooKeeper, Kafka, Oracle 19c, MySQL, Shell Script, AWS, EC2, Tomcat 8
Data Engineer
Confidential, Bothell, WA
Responsibilities:
- Gathered business and user requirements from the client to deliver better documentation.
- Configured event-based logging for ELK to push application errors and, specifically, EMR errors.
- Worked extensively on disaster recovery for applications to maintain their stability when a regional disaster occurs.
- Configured EMR to process the data of millions of customers using Spark applications in less than half an hour.
- Created custom UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation queries, writing results back into OLTP through Sqoop.
- Customized Hive UDFs to derive structured data from unstructured customer data and loaded it into the HBase environment from the database using Sqoop.
- Rewrote Hive/SQL queries in Scala on Spark RDDs for faster data processing.
- Developed serverless infrastructure using multiple AWS services.
- Configured multiple AWS services like EMR, EC2 and S3 to maintain compliance with organization standards
- Configured Lambda functions using parameterized CloudFormation templates (CFT) in YAML and JSON.
- Configured event notification subscriptions on S3, SNS topics, and Lambda to process the data based on the required marker files.
- Worked on MongoDB (NoSQL database) to store the unstructured data before processing with HiveQL.
- Pushed queued messages to mobile devices using Storm and Kafka; pushed application and transformational incremental logs to Kafka and ZooKeeper using a marker file watched by a log producer written in Scala.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python.
- Worked on Sequence files, RC files, map-side joins, bucketing, and partitioning for Hive performance enhancement and storage improvement.
- Delivered different visualization patterns for business analysts based on the structured, transformed data.
- Used AWS dev and pre-prod environments to test the application with simulated data and obtain performance results, and maintained a stable production environment for better application services.
- Maintained delivery quality to regularly meet the business requirements.
- Performed business analysis to deliver the client requirements and produced detailed documentation explaining the functional requirements after thoroughly reviewing the business requirements.
Environment: Spark, Scala, Python, AWS, Kafka, Hive, Sqoop, Storm, ELK, Jenkins
Data Engineer
Confidential, Charlotte, NC
Responsibilities:
- Involved in architecture design, development, and implementation of Hadoop deployment, backup, and recovery systems.
- Developed MapReduce programs in Python using Hadoop to parse the raw data, populate staging tables, and store the refined data in partitioned Hive tables.
- Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data.
- Converted applications that were on MapReduce to PySpark, which performed the business logic.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Imported Teradata datasets onto the Hive platform using Teradata JDBC connectors.
- Wrote FastLoad and MultiLoad scripts to load the tables.
- Worked with SQL Assistant to ingest data, execute queries and stored procedures, and update the tables.
- Extracted data from XML files using XPath and stored it in Hive tables.
- Developed multiple Kafka producers and consumers as per the software requirement specifications.
- Involved in designing the tables in Teradata while importing the data.
- Developed UNIX shell scripts for creating reports from Hive data.
- Experienced in managing and reviewing Hadoop log files.
- Main duties included resolving incidents and, for code-related issues, migrating code from lower environments to production.
- Responsible for code deployment into the production environment.
- Developed Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying of the log data.
- Developed Scala scripts and UDFs using DataFrames in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
- Analyzed production issues to determine root cause and provided fix recommendations to the Support team. Created, developed, and tracked solutions to reported application errors.
- Noted interruptions or bugs in operation and carried out mitigation and problem management.
- Assisted with troubleshooting and issue resolution for current applications, providing assistance to the development team.
- Coordinated with Support teams during application deployments.
- Worked on system issues on production clusters, such as file system issues, connection issues, and system slowness, and monitored the HDFS file system for all digital analytics.
- Extensively used UNIX shell scripting to pull logs from the server.
- Used Solr/Lucene for indexing and querying the JSON-formatted data.
- Worked on different file formats such as Sequence files, XML files, and Map files using MapReduce programs.
- Used the Avro data serialization system to work with JSON data formats.
- Implemented the workflows using the Apache Oozie framework to automate tasks.
- Completed integration testing and tracked and resolved defects.
- Worked on AWS services like EC2 and S3 for small data sets.
- Involved in loading data from the UNIX file system to HDFS.
- Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce extraction jobs, and ZooKeeper to provide coordination services to the cluster.
Environment: Hadoop Hortonworks 2.2, Hive, Pig, HBase, Scala, Sqoop, Flume, Oozie, AWS, S3, EC2, EMR, Spring, Kafka, SQL Assistant, Python, UNIX, Teradata
Data Engineer
Confidential
Responsibilities:
- Extracted data from the Oracle database and loaded it into the Hive database.
- Used Spark Structured Streaming to perform the necessary transformations and actions on the fly on data from Kafka topics in real time and persisted it to Cassandra using the required connectors and drivers.
- Integrated Kafka, Spark, and Cassandra for streamlined analytics to create a predictive model.
- Modified and executed UNIX shell scripts for processing data and loading it to HDFS.
- Worked extensively on optimizing transformations for better performance.
- Involved in important design decisions in creating UDFs and partitioning the data in Hive tables at two different levels based on the related columns for efficient retrieval and processing of queries.
- Tuned many options to get a performance boost, such as trying different executor counts and memory options.
- Took part with the team in maintenance work, adding stable time zone handling across all records in the database.
- Uploaded and processed more than 20 terabytes of data from various structured and unstructured, heterogeneous sources into HDFS using Sqoop and Flume, enforcing and maintaining uniformity across all the tables.
- Developed complex transformations using HiveQL to build aggregate/summary tables.
- Developed UDFs in Java to implement functions according to the specifications.
- Developed Spark scripts configured according to business logic, with good knowledge of the available actions.
- Well versed in the HL7 international standards, as the data was organized according to this format.
- Formatted and built analytics on top of the data sets that complied with HL7 standards.
- Analyzed the JSON data using the Hive SerDe API to deserialize and convert it into a readable format.
- Used Pig to do transformations, event joins, and some pre-aggregations before storing the data in HDFS.
- Involved in improving and optimizing the performance of the application using partitioning and bucketing in Hive tables and developing efficient queries using map-side joins and indexes.
- Worked with the downstream team in generating reports in Tableau.
- Conducted code reviews to ensure systems operations
Environment: CDH 5.1.x, Hadoop, HDFS, MapReduce, Sqoop, Flume, Hive, SQL Server, TOAD, Oracle, Solr/Lucene, PL/SQL, Eclipse, Java, Shell scripting, Vertica, Unix, Cassandra
Python Developer
Confidential
Responsibilities:
- Worked as part of an Integration Development team.
- Coordinated with internal teams to understand user requirements and provide technical solutions.
- Developed back-end components to improve responsiveness and overall performance
- Prepared the low-level design document.
- Developed a database for integration by writing SQL Queries and Stored Procedures.
- Set up the master data in different environments.
- Wrote and executed unit test cases for inbound and outbound interfaces.
- Trained new joiners on the Integration team.
- Wrote SQL queries and stored procedures.
- Wrote unit test cases and performed unit testing for the same.
Environment: Python, Oracle DB, HTML, ML
