Sr. Data Engineer Resume
O'Fallon, MO
SUMMARY
- IT experience in all phases of the Software Development Life Cycle (SDLC), with skills in data analysis, design, development, testing, and deployment of software systems.
- 8 years of experience as a Software Developer, with a strong emphasis on building Big Data applications using Hadoop ecosystem tools and REST applications using Java.
- Strong experience working with HDFS, Spark, MapReduce, Hive, Pig, YARN, Oozie, Sqoop, Flume, Kafka, and NoSQL databases such as HBase and Cassandra.
- Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Expertise in Python and Scala; developed user-defined functions (UDFs) for Hive and Pig in Python (a minimal sketch appears after this summary).
- Experience developing MapReduce programs on Apache Hadoop to analyse big data per requirements.
- Hands-on experience with Spark MLlib utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
- Experience in working with Flume and NiFi for loading log files into Hadoop.
- Experience in working with NoSQL databases like HBase and Cassandra.
- Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto HDFS.
- Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
- Worked with Cloudera and Hortonworks distributions.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse, controlling and granting database access, and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience working with data lakes: repositories of data stored in its natural/raw format, usually as object blobs or files.
- Experience in using cloud services such as Amazon EMR, S3, EC2, Redshift, and Athena.
- Continuous Delivery pipeline deployment experience with Maven, Ant, Jenkins, and AWS.
- Strong expertise in building scalable applications using various programming languages (Java, Scala, and Python).
- Strong understanding of Distributed systems design, HDFS architecture, internal working details of Map Reduce and Spark processing frameworks.
- Solid experience developing Spark Applications for performing highly scalable data transformations using RDD, Data frame, Spark-SQL, and Spark Streaming.
- Strong experience troubleshooting Spark failures and fine-tuning long running Spark applications.
- Strong experience working with various configurations of Spark like broadcast thresholds, increasing shuffle partitions, caching, repartitioning etc., to improve the performance of the jobs.
- Worked on Spark Streaming and Spark Structured Streaming with Kafka for real-time data processing.
- In depth knowledge on import/export of data from Databases using Sqoop.
- Well versed in writing complex hive queries using analytical functions.
- Knowledge in writing custom UDFs in Hive to support custom business requirements.
- Experienced in working with structured data using HiveQL, join operations, writing custom UDFs and optimizing Hive queries.
- Solid experience in using the various file formats like CSV, TSV, Parquet, ORC, JSON and AVRO.
- Experienced working with various Hadoop Distributions (Cloudera, Hortonworks, Amazon EMR) to fully implement and leverage various Hadoop services.
- Strong experience working with databases such as Oracle, MySQL, Teradata, and Netezza, and proficiency in writing complex SQL queries.
- Proficient in Core Java concepts like Multi-threading, Collections and Exception Handling concepts.
- Strong team player with good communication, analytical, presentation and inter-personal skills.
- Experienced working with JIRA for project management, GIT for source code management, JENKINS for continuous integration and Crucible for code reviews.
- Excellent communication and analytical skills; a quick learner with the capacity to work independently and a highly motivated team player.
- Experience in version control tools like SVN, GitHub and CVS.
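To illustrate the Python UDFs for Hive mentioned above, here is a minimal, hypothetical sketch of a streaming script invoked through HiveQL TRANSFORM; the table and column names are assumptions made for the example, not values from any actual project.

```python
#!/usr/bin/env python
# clean_country.py - a minimal, hypothetical Hive "UDF" implemented as a
# streaming script. Hive pipes tab-separated rows to stdin; cleaned rows are
# emitted to stdout. Invoked from HiveQL (names assumed), e.g.:
#   ADD FILE clean_country.py;
#   SELECT TRANSFORM (user_id, country)
#     USING 'python clean_country.py' AS (user_id, country_clean)
#   FROM user_events;
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        continue                      # skip malformed rows
    user_id, country = fields
    country_clean = country.strip().upper() or "UNKNOWN"
    print("\t".join([user_id, country_clean]))
```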
TECHNICAL SKILLS
Big Data Tools: Hadoop 3.0 ecosystem, MapReduce, Spark 2.3, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3
BI Tools: SSIS, SSRS, SSAS.
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: Java, Scala, Python, SQL, PL/SQL, and UNIX shell scripting.
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
Cloud Platform: AWS, Azure, Google Cloud.
Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena
Databases: Oracle 12c/11g, Teradata R15/R14.
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.
Operating System: Windows, Unix, Sun Solaris
PROFESSIONAL EXPERIENCE
Confidential, O'Fallon, MO
Sr. Data Engineer
Responsibilities:
- Responsible for ingesting large volumes of user behavioural data and customer profile data to Analytics Data store.
- Developed custom multi-threaded Java based ingestion jobs as well as Sqoop jobs for ingesting from FTP servers and data warehouses.
- Developed Scala based Spark applications for performing data cleansing, event enrichment, data aggregation, de-normalization and data preparation needed for machine learning and reporting teams to consume.
- Worked on troubleshooting Spark applications to make them more fault tolerant.
- Integrated Jenkins with tools such as Maven (build), Git (repository), SonarQube (code quality), and Nexus (artifact repository); implemented CI/CD automation by creating Jenkins pipelines programmatically, architecting Jenkins clusters, and scheduling daytime and overnight builds to support development needs.
- Programmatically created CI/CD pipelines in Jenkins using Groovy scripts and Jenkinsfiles, integrating a variety of enterprise tools and testing frameworks into Jenkins for fully automated pipelines that move code from developer workstations all the way to the production environment.
- Worked on Docker Hub, Docker Swarm, and Docker container networking, creating image files primarily for middleware installations and domain configurations; evaluated Kubernetes for Docker container orchestration.
- Installed a Docker Registry for local upload and download of Docker images (in addition to Docker Hub) and created Dockerfiles to automate the process of capturing and using the images.
- Leveraged AWS cloud services such as EC2, auto-scaling and VPC to build secure, highly scalable and flexible systems that handled expected and unexpected load bursts.
- Experience with AWS cloud services such as EC2, S3, RDS, ELB, EBS, VPC, Route 53, Auto Scaling groups, CloudWatch, CloudFront, and IAM to build, configure, and troubleshoot physical-to-cloud server migrations using Amazon machine images.
- Involved in building a data pipeline and performing analysis using the AWS stack (EMR, EC2, S3, RDS, Lambda, Glue, SQS, and Redshift).
- Worked on fine-tuning Spark applications to improve the overall processing time of the pipelines.
- Wrote Kafka producers to stream data from external REST APIs to Kafka topics (see the producer sketch after this list).
- Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase.
- Architected and implemented medium- to large-scale BI solutions on Azure using Azure data platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics).
- Worked extensively with Sqoop for importing data from Oracle.
- Created batch scripts to retrieve data from AWS S3 storage and apply the appropriate transformations in Scala using the Spark framework.
- Involved in creating Hive tables and loading and analysing data using Hive scripts; implemented partitioning, dynamic partitions, and buckets in Hive.
- Developed a Python script to extract data from on-premises systems and REST APIs and transfer it to AWS S3 (a minimal transfer sketch appears at the end of this section). Implemented a microservices-based cloud architecture using Spring Boot.
- Good experience with continuous integration of applications using Bamboo.
- Documented operational problems by following standards and procedures using JIRA.
- Wrote Pig Scripts to generate Map Reduce jobs and performed ETL procedures on the data in HDFS.
- Developed Oozie workflows for scheduling and orchestrating the ETL cycle; involved in writing Python scripts to automate the process of extracting weblogs using Airflow DAGs.
- Developed and programmed an ETL pipeline in Python to collect data from the Redshift data warehouse.
- Created workflows, mappings using Informatica ETL and worked with different transformations such as lookup, source qualifier, update strategy, router, sequence generator, aggregator, rank, stored procedure, filter, joiner, sorter.
- Worked on SSIS, creating all the interfaces between the front-end application and the SQL Server database, and between the legacy database and the SQL Server database and vice versa.
- Good hands-on participation in the development and modification of SQL stored procedures, functions, views, indexes, and triggers.
- Migrated data into the RV Data Pipeline using Databricks, Spark SQL, and Scala.
- Experience with Snowflake multi-cluster and virtual warehouses.
- Involved in troubleshooting, performance tuning of reports, and resolving issues with Tableau Server and reports.
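A minimal sketch of the Kafka producers described in this section, assuming the kafka-python and requests libraries; the API endpoint, topic name, and broker addresses are placeholders, not the actual project values.

```python
# rest_to_kafka.py - hedged sketch of streaming records from an external
# REST API into a Kafka topic (endpoint, topic, and brokers are assumed).
import json
import time

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    resp = requests.get("https://api.example.com/events", timeout=30)
    resp.raise_for_status()
    for event in resp.json():        # assumes the API returns a JSON list of events
        producer.send("user-events", value=event)
    producer.flush()                 # block until the batch has been delivered
    time.sleep(60)                   # poll the API once a minute
```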
Environment: AWS, Azure, Jenkins, EMR, Spark, Hive, S3, Athena, Sqoop, Kafka, HBase, Redshift, ETL, Pig, Oozie, Spark Streaming, Docker, Kubernetes, Hue, Scala, Python, Apache NiFi, Git, Microservices, Snowflake.
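A hedged sketch of the Python on-premises-to-S3 transfer script referenced above, using boto3 and requests; the bucket, key prefix, and API endpoint are illustrative assumptions.

```python
# onprem_to_s3.py - illustrative sketch of extracting data from a REST API
# and landing it in S3 (bucket, prefix, and endpoint are assumed names).
import json
from datetime import datetime, timezone

import boto3
import requests

s3 = boto3.client("s3")

def extract_and_upload(api_url: str, bucket: str, prefix: str) -> str:
    """Pull a JSON payload from the API and write it to S3 as a dated object."""
    payload = requests.get(api_url, timeout=60).json()
    key = f"{prefix}/dt={datetime.now(timezone.utc):%Y-%m-%d}/extract.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(payload).encode("utf-8"))
    return key

if __name__ == "__main__":
    extract_and_upload("https://api.example.com/customers", "analytics-raw", "customers")
```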
Confidential, Houston, TX
Data Engineer
Responsibilities:
- Gathered data and business requirements from end users and management; designed and built data solutions to migrate existing source data from Teradata and DB2 to BigQuery (Google Cloud Platform).
- Performed data manipulation on extracted data using Python pandas (see the pandas sketch after this list).
- Work with subject matter experts and project team to identify, define, collate, document and communicate the data migration requirements.
- Built custom Tableau / SAP BusinessObjects dashboards for Salesforce, accepting parameters from Salesforce to show the relevant data for the selected object.
- Hands-on Ab Initio ETL, data mapping, transformation, and loading in a complex, high-volume environment.
- Designed Sqoop scripts to load data from Teradata and DB2 into the Hadoop environment, and shell scripts to transfer data from Hadoop to Google Cloud Storage (GCS) and from GCS to BigQuery (a Python analogue of the BigQuery load step is sketched at the end of this section).
- Validated Sqoop jobs and shell scripts and performed data validation to check that data was loaded correctly without discrepancies; performed migration and testing of static data and transaction data from one core system to another.
- Develop best practice, processes, and standards for effectively carrying out data migration activities. Work across multiple functional projects to understand data usage and implications for data migration.
- Prepare data migration plans including migration risk, milestones, quality and business sign-off details.
- Oversee the migration process from a business perspective. Coordinate between leads, process manager and project manager. Perform business validation of uploaded data.
- Worked on retrieving data from the file system to S3 using Spark commands.
- Built S3 buckets, managed their policies, and used S3 and Glacier for storage and backup on AWS.
- Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
- Generated custom SQL to verify dependencies for the daily, weekly, and monthly jobs.
- Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Closely involved in scheduling Daily, Monthly jobs with Precondition/Postcondition based on the requirement.
- Monitor the Daily, Weekly, Monthly jobs and provide support in case of failures/issues.
- Optimizing the performance of dashboards and workbooks in Tableau desktop and server.
- Proposed EDW architecture changes to the team and highlighted the benefits to improve the performance and enable troubleshooting without affecting the Analytical systems
- Involved in debugging, monitoring, and troubleshooting issues.
- Developed a reconciliation dashboard for the claims sent to and received from DHS (837 vs. 835).
- Processed HIPAA EDI 835/837I/837P/837P EW/NCPDP files (DHS and CMS); developed scripts to parse EDI X12 files and automated test suites in UNIX.
- Analyzed data, identified anomalies, and provided usable insights to customers.
- Ensured accuracy and integrity of data through analysis, Testing and profiling using Ataccama.
- As a Data Analyst, involved in the development and execution of SQL scripts in Teradata; developed Data Quality Framework (DQF) scripts to capture data quality issues and to ensure sensitive data (SSNs) is encrypted, and shared the corresponding DQF alert tracking reports with the client on a weekly basis.
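A minimal pandas sketch of the kind of data manipulation described in this section; the file names and columns are assumptions for illustration.

```python
# clean_extract.py - hypothetical example of shaping an extracted dataset
# with pandas before migration (paths and column names are assumptions).
import pandas as pd

df = pd.read_csv("teradata_extract.csv", dtype=str)

df = (
    df.drop_duplicates()                                   # remove exact duplicate rows
      .dropna(subset=["customer_id"])                      # the key column must be present
      .assign(load_date=pd.Timestamp.today().normalize())  # stamp the load date
)
df["state"] = df["state"].str.strip().str.upper()          # normalize a code column

df.to_parquet("customer_stage.parquet", index=False)       # hand off to the next step
```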
Environment: Python, Hadoop, Teradata, UNIX, Google Cloud, DB2, PL/SQL, MS SQL Server, Ab Initio ETL, data mapping, Spark, Tableau, Nebula Metadata, Scala, Git.
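The GCS-to-BigQuery loads above were implemented as shell scripts; a rough Python equivalent using the google-cloud-bigquery client is sketched below, with the project, dataset, table, and GCS URI as placeholders.

```python
# gcs_to_bq.py - hedged Python analogue of the GCS -> BigQuery load step
# (project, dataset, table, and GCS URI are assumed for illustration).
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,          # the extract carries a header row
    autodetect=True,              # let BigQuery infer the schema
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://migration-staging/customers/*.csv",
    "my-gcp-project.warehouse.customers",
    job_config=job_config,
)
load_job.result()                 # wait for the load job to finish
print(client.get_table("my-gcp-project.warehouse.customers").num_rows, "rows loaded")
```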
Confidential, Agawam, MA
Big Data Developer
Responsibilities:
- Used Agile methodology in developing the application, which included iterative application development, weekly Sprints, stand up meetings and customer reporting backlogs.
- Writing technical design document based on the data mapping functional details of the tables.
- Extracted batch and real-time data from DB2, Oracle, SQL Server, Teradata, and Netezza into Hadoop (HDFS) using Teradata TPT, Sqoop, Apache Kafka, and Apache Storm.
- Handled importing of data from various data sources, performed transformations using Hive, Map Reduce, Spark and loaded data into HDFS.
- Designed and built ETL workflows, leading the effort to program data extraction from various sources into the Hadoop file system; implemented end-to-end ETL workflows using Teradata, SQL, TPT, and Sqoop and loaded data into Hive data stores.
- Analysed and developed programs based on the extract logic and the data load type, using Hadoop ingest processes and relevant tools such as Sqoop, Spark, Scala, Kafka, UNIX shell scripts, and others.
- Designed the incremental and historical extract logic to load data from flat files on various servers into the Massive Event Logging Database (MELD).
- Developed Apache Spark jobs for data cleansing and pre-processing (an illustrative PySpark sketch follows this list).
- Wrote Spark programs to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Used Scala to write programs for faster testing and processing of data.
- Writing code and creating hive jobs to parse the logs and structure them in tabular format to facilitate effective querying on the log data.
- Automated the ETL tasks and data workflows for the ingest data pipeline using the UC4 scheduling tool.
- Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
- Working on ingestion of structured, semi structured and unstructured data into Hadoop ecosystem using Big Data Tools.
- Selected and integrated the Big Data tools and frameworks required to bring new software engineering tools into existing structures; completed modifications, refactoring, and bug fixes to existing functionality.
- Developed UDFs in Java as and when necessary to use in PIG and HIVE queries.
- Implemented partitioning and bucketing in Hive code, designing both managed and external tables in Hive to optimize performance.
- Optimized Map Reduce Jobs to use HDFS efficiently by using various compression mechanisms.
- Used ORC and Parquet file formats in Hive.
- Development of efficient pig and hive scripts with joins on datasets using various techniques.
- Write documentation of program development, subsequent revisions and coded instructions in the project related GitHub repository.
- Working closely with Data science team to analyse large data sets to gain an understanding of the data, discover data anomalies by writing the relevant code, and look for ways to leverage data.
- Assist with the analysis of data used for the tableau reports and creation of dashboards.
- Participate with deployment teams to implement BI code and to validate code implementation in different environments (Dev, Stage and Production).
- Deployment support including change management and preparation of deployment instructions.
- Prepare release notes, validation document for user stories to be deployed to production as part of release.
- Updating RALLY regularly to reflect the current status of the project at any point of time.
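The cleansing and pre-processing jobs in this section were written in Scala; the sketch below shows the same shape of work as a PySpark analogue, with the input path, columns, and partition key assumed for illustration.

```python
# cleanse_events.py - illustrative PySpark analogue of the Scala cleansing
# jobs described above (input path, columns, and partition key are assumed).
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("event-cleansing")
         .enableHiveSupport()
         .getOrCreate())

raw = spark.read.json("hdfs:///data/raw/events/")

cleansed = (
    raw.dropDuplicates(["event_id"])                        # de-dupe on the business key
       .filter(F.col("event_ts").isNotNull())               # drop rows without a timestamp
       .withColumn("event_date", F.to_date("event_ts"))     # derive the partition column
       .withColumn("channel", F.upper(F.trim(F.col("channel"))))
)

(cleansed.write
    .mode("overwrite")
    .partitionBy("event_date")       # Hive-style partitions, stored as ORC
    .format("orc")
    .saveAsTable("analytics.events_cleansed"))
```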
Environment: RHEL, HDFS, MapReduce, Hive, Pig, Sqoop, Oozie, Teradata, Oracle SQL, UC4, Kafka, GitHub, Hortonworks Data Platform distribution, Spark, Scala.
Confidential
Hadoop Developer
Responsibilities:
- Involved in design and development phases of Software Development Life Cycle (SDLC) using Scrum methodology.
- Involved in Requirement gathering, Business Analysis and translated business requirements into Technical design in Hadoop and Big Data.
- Imported and exported data between HDFS and databases using Sqoop.
- Developed a data pipeline using Flume, Sqoop, Pig, and Java MapReduce to ingest behavioural data into HDFS for analysis.
- Used Maven extensively for building JAR files of MapReduce programs and deployed them to the cluster.
- Created a customized BI tool for the manager team that performs query analytics using HiveQL.
- Created partitions and buckets based on state for further processing using bucket-based Hive joins.
- Developed a suite of unit test cases for Mapper, Reducer, and Driver classes using the MRUnit testing library.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
- Designed and implemented a Cassandra NoSQL database that persists high-volume user profile data (see the driver sketch after this list).
- Migrated high-volume OLTP transactions from Oracle to Cassandra.
- Created a data pipeline of MapReduce programs using chained mappers.
- Implemented optimized joins of different data sets to get top claims by state using MapReduce.
- Modelled Hive partitions extensively for data separation and faster data processing, and followed Pig and Hive best practices for tuning.
- Used Hive to analyse the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
- Loaded the aggregated data onto DB2 for reporting on the dashboard.
- Used Pig as ETL tool to do transformations, event joins, filters and some pre-aggregations before storing the data onto HDFS.
- Implemented optimization and performance tuning in Hive and Pig.
- Developed job flows in Oozie to automate the workflow for extraction of data from warehouses and weblogs.
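A hedged sketch of persisting user-profile rows to the Cassandra store mentioned above, using the DataStax Python driver; the keyspace, table, contact points, and schema are assumptions.

```python
# profile_store.py - illustrative use of the DataStax Python driver for the
# user-profile Cassandra store (keyspace, table, and hosts are assumed).
from datetime import datetime

from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-node1", "cassandra-node2"])
session = cluster.connect("user_keyspace")

# Prepared statements keep high-volume writes cheap on the coordinator.
insert_stmt = session.prepare(
    "INSERT INTO user_profiles (user_id, email, last_login) VALUES (?, ?, ?)"
)

def save_profile(user_id, email, last_login):
    session.execute(insert_stmt, (user_id, email, last_login))

save_profile("u-1001", "jane@example.com", datetime.utcnow())
```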
Environment: RHEL, HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Mahout, HBase, Hortonworks Data Platform distribution, Cassandra.
Confidential
Data Analyst
Responsibilities:
- Worked for Internet Marketing - Paid Search channels.
- Created performance dashboards in Tableau, Excel, and PowerPoint for the key stakeholders.
- Incorporated predictive modeling (a rule engine) to evaluate the customer/seller health score using Python scripts, performed computations, and integrated the results with the Tableau viz (a simplified sketch follows this list).
- Worked with stakeholders to communicate campaign results, strategy, issues or needs.
- Analysed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
- Worked with the business to identify gaps in mobile tracking and come up with solutions.
- Analysed click events on the hybrid landing page, including bounce rate, conversion rate, jump-back rate, and list/gallery view, and provided valuable information for landing page optimization.
- Evaluated the traffic and performance of Daily Deals PLA ads and compared those items with non-Daily Deal items to assess the possibility of increasing ROI.
- Suggested improvements and modified existing BI components (reports, stored procedures).
- Understood business requirements thoroughly and came up with a test strategy based on business rules.
- Prepared a test plan to ensure the QA and development phases ran in parallel.
- Wrote and executed test cases and reviewed them with the business and development teams.
- Implemented a defect tracking process using the JIRA tool by assigning bugs to the development team.
- Automated regression testing with the Qute tool, reducing manual effort and increasing team productivity.
- Involved in functional, integration, regression, smoke, and performance testing; tested Hadoop MapReduce jobs developed in Python, Pig, and Hive.
- Involved in development and performance tuning of ETL and database code and deployed the code across environments (Dev, QA, Prod) using Synergy and Service Centre tickets after data validation.
- Worked extensively on Informatica Partitioning when dealing with huge volumes of data.
- Used Teradata external loaders (MultiLoad, TPump, FastLoad) in Informatica to load data into the Teradata database.
- Wrote Teradata SQL queries, procedures, and macros to join tables or make modifications to them.
- Involved in writing UNIX shell scripts (pre/post-session commands) for the sessions, and wrote shell scripts to kick off workflows and packages, delete old files, back up source files, and FTP files.
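A heavily simplified, hypothetical sketch of the rule-engine health score that fed the Tableau viz mentioned above; the columns, thresholds, and weights are illustrative only.

```python
# health_score.py - hypothetical rule-engine sketch for the seller health
# score fed into Tableau (columns, thresholds, and weights are assumptions).
import pandas as pd

sellers = pd.read_csv("seller_metrics.csv")

def score(row) -> int:
    """Simple additive rules; each satisfied rule contributes to the score."""
    points = 0
    if row["defect_rate"] < 0.02:
        points += 40
    if row["late_shipment_rate"] < 0.05:
        points += 30
    if row["response_hours"] <= 24:
        points += 30
    return points

sellers["health_score"] = sellers.apply(score, axis=1)
sellers.to_csv("seller_health_scores.csv", index=False)   # extract consumed by Tableau
```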
Environment: Teradata, Informatica, Tableau, MySQL, Java, Spark, SSIS, UNIX, Shell Scripting.