Data Engineer Resume
SUMMARY
- 8.5+ years of experience on teams performing analysis, design, development, implementation, enhancement, and testing of applications in healthcare, entertainment, internal corporate applications, finance, geotechnical, and marketing and sales.
TECHNICAL SKILLS
- Hive, Python 3.6, NiFi, Hadoop, Pig, Spark, Spark streaming, Kafka, Spark SQL, MapReduce
- HBase 1.2, HDFS, Sqoop 1.4, Hadoop 2.0, Hive 2.3, Pig, Impala 2.1, Oracle, DB2, MySQL, Tableau
- Amazon S3, Amazon Athena, Glue, Redshift, EC2, Hadoop, Netezza, Data warehousing, PySpark
- Jenkins, Docker, JavaScript, CSS3, HTML5, IBM InfoSphere DataStage
PROFESSIONAL EXPERIENCE
Confidential
Data Engineer
Responsibilities:
- Implemented solutions for ingesting data from various sources and processing data at rest using Big Data technologies such as Hadoop, MapReduce frameworks, HBase, and Hive with cloud architecture.
- Constructed and maintained an appropriate, scalable, and easy-to-use infrastructure with various tools to support the development of actionable reports used in decision-making across the strategy team.
- Worked on AWS, implementing solutions using services such as EC2, S3, RDS, Redshift, and VPC.
- Extracted data from Netezza and AWS Redshift into HDFS using Sqoop.
- Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
- Performed data profiling and transformation on the raw data using Pig, Python, and Java.
- Used Apache Spark for batch processing to source the data.
- Expert in writing business analytics scripts using Hive SQL.
- Imported and exported data to and from HDFS and Hive using Sqoop.
- Built data ingestion pipelines with NiFi and Kafka for data sources such as LMS, MVP, and RDBMS.
- Analyzed, designed, developed, implemented, and maintained parallel jobs using IBM InfoSphere DataStage.
- Involved in the design of dimensional data models (star schema and snowflake schema).
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Pulled data from the data lake (HDFS) and massaged it using various RDD transformations.
- Generated DB scripts from the data modeling tool and created physical tables in the database.
- Used DataStage Director to schedule and run jobs, test and debug their components, and monitor performance statistics.
- Experience with Hadoop distributions including Cloudera (CDH3 and CDH4), Hortonworks (HDP), and MapR.
- Experienced in PX file stages, including the Complex Flat File, Data Set, Lookup File, and Sequential File stages.
- Created routines (before/after and transform functions) used across the project.
- Experienced in developing parallel jobs using development/debug stages (Peek, Head, Tail, Row Generator, Column Generator, Sample) and processing stages (Aggregator, Change Capture, Change Apply, Filter, Sort, Merge, Funnel, Remove Duplicates).
- Repartitioned job flows based on the best available resource consumption in DataStage PX.
- Successfully implemented pipeline and partitioning parallelism techniques and ensured load balancing of data.
- Involved in creating UNIX shell scripts for database connectivity and for running queries in parallel jobs.
- Documented all changes implemented across all systems and components using Confluence and Atlassian Jira. Documentation covered technical, infrastructure, and business process changes. Post-release documentation also included known issues from the production implementation and deferred defects.
Environment: DataStage, Netezza, E3 Framework, Unix scripting, Hadoop 3.0, HBase 1.2, Hive 2.3, AWS, EC2, S3, RDS, VPC, MySQL, Redshift, Sqoop, HDFS, Spark, ETL, YARN, Python, UDF, HQL, NoSQL, Cassandra 3.11, Hortonworks, MapR, NiFi
Confidential
Data Engineer
Responsibilities:
- Performed data profiling to learn about behavior across features such as traffic pattern, location, and date and time. Integrated with external data sources and APIs to discover interesting trends.
- Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
- Worked on Cloudera distribution for Hadoop ecosystem and installed and configured Flume, Hive, Pig, Sqoop and Oozie, Automatic on the Hadoop cluster.
- Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Good working knowledge of Snowflake and Teradata databases.
- Worked extensively with Spark using Scala on a cluster for computational analytics; installed it on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle/Snowflake.
- Worked with IT security auditors to resolve security vulnerabilities in Linux, UNIX, and Apache.
- Developed Spark jobs using PySpark and Scala to create a generic framework for processing files of various types, such as JSON, TXT, and CSV.
- Delivered zero-defect code for three large projects involving changes to both the front end (web services) and the back end (Oracle, Snowflake, Teradata).
- Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Performed data cleaning, feature scaling, featurization, and feature engineering, and deployed the data to Amazon S3 and Athena.
- Created data pipelines migrating data from on-premises servers to S3, then Glue, then Athena, for use by Amazon QuickSight and Tableau.
- Utilized the AWS CLI to automate backups of ephemeral data stores to S3 buckets and migrated applications from the internal data center to AWS Athena and Glue.
- Strong experience implementing data warehouse solutions in Amazon Redshift; worked on various projects to migrate data from on-premises databases to Amazon Redshift, RDS, and S3.
- Implemented continuous integration using Git and GitHub from scratch.
- Involved in all stages of the software development life cycle, primarily in database architecture, logical and physical modeling, data warehouse/ETL development using MS SQL Server 2012/2008 R2/2008 and Oracle 11g/10g, and ETL solutions/analytics application development.
- Experience with Unix/Linux systems, including scripting and building data pipelines.
- Managed and reviewed Hadoop log files to identify issues when jobs fail, and used Hue for UI-based Pig script execution and automatic scheduling.
- Involved in creating a data lake by extracting customer data from various sources into HDFS, including data from Excel, databases, and server logs.
- Hands-on experience writing Python and Bash scripts.
- Extensive experience designing and implementing continuous integration, continuous delivery, and continuous deployment through Jenkins.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 and ORC/Parquet/text files into AWS Redshift.
- Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
- Experience with cloud databases and data warehouses (Redshift/RDS).
- Used various Spark transformations and actions to cleanse the input data, and used the Spark application master to monitor Spark jobs and capture their logs.
- Experience refactoring existing Spark batch processes, written in Scala, for different logs.
- Hands-on work developing in SAS, SQL, Python, and Java with Eclipse to extract patterns from very large datasets and transform data into an informational advantage for decision support. Performed and assisted in the design, development, and testing of predictive analytics models, including large-scale data collection, data organization, text segmentation, categorization, summarization, and topic modeling. Performed advanced statistical analysis and built predictive solutions in SAS.
- Implemented Big Data tools such as Spark using Scala, utilizing the DataFrame and Spark SQL APIs for faster data processing, and worked on an extensible framework for building high-performance batch and interactive data processing applications on Hive.
- Debugged and maintained automation test scripts in batch mode and implemented a sprint-based plan for automation scripts.
- Developed Oozie workflows to schedule the scripts on a daily basis.
Environment: Hadoop/Big Data technologies: Spark (Scala), Kafka, Spark Streaming, MLlib, Sqoop, HBase, HDFS, MapReduce, Pig, Hive, Zeppelin (distributions: Databricks, Hortonworks, and Cloudera), Cassandra, Flume, Oozie, JDBC, Tomcat, Apache, shell scripting.
Confidential
Data Engineer
Responsibilities:
- Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts from enterprise data, automated using Oozie.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Worked on cloud deployments using Maven, Docker, and Jenkins.
- Coordinated with the data science team to design and implement advanced analytical models on the Hadoop cluster over large datasets.
- Developed the PySpark code for AWS Glue jobs and for EMR.
- Involved in developing various ETL jobs to load, extract, and map data from flat files and heterogeneous database sources such as Oracle, SQL Server, MySQL, and DB2.
- Created DDLs for tables and executed them to create tables in the warehouse for ETL data loads.
- Exported the analyzed and processed data to the RDBMS using Sqoop for visualization and report generation for the BI team.
- Good knowledge of web services using SOAP and REST.
- Expertise in developing data-driven applications using Python 2.7 and Python 3.0 in the PyCharm and Anaconda Spyder IDEs.
- Wrote technical documents and mentored the global UNIX team.
- Proficient in all aspects of the software life cycle, such as build/release/deploy, and specialized in cloud automation through open-source DevOps tools such as Jenkins.
- Hands-on experience writing Python and Bash scripts.
- Dockerized applications by creating Docker images from Dockerfiles.
- Extensive experience designing and implementing continuous integration, continuous delivery, and continuous deployment through Jenkins.
- Performed periodic patch management in Unix/Linux environments.
- Created reports using Tableau and Power BI to help forecast the provider information.
- Used Postman and SoapUI for REST service testing.
- Created SQL scripts to insert, update, and delete data in MS SQL databases. Created database tables, wrote stored procedures to update and clean old data, and helped front-end application developers with their queries.
- Extracted data from the legacy system and loaded/integrated into another database through the ETL process.
- Executed SQL queries to validate back-end data in DB2 database tables based on business requirements.
- Recommended controls by identifying problems and writing improved procedures for the portal.
- Designed and created different ETL packages using SSIS and transferred data from an Oracle source to an MS SQL Server destination.
- Tuned the performance of SQL queries and stored procedures using SQL Profiler and Index Tuning Advisor.
- Created T-SQL queries for schemas, views, stored procedures, triggers, and functions for data migration. Involved in the project from the planning stage through pushing code to production.
- Scheduled SSAS cube processing from staging database tables using SQL Server Agent.
- Translated technical application specifications into functional and non-functional business requirements and created user stories based on those requirements in Rally.
- Created dashboards, worksheets, and storyboards for stakeholders using Tableau and Excel.
Environment: BigQuery, GCS buckets, Google Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, bq command-line utilities, Dataproc, VM instances, Cloud SQL, MySQL, Postgres, SQL Server, Salesforce SOQL, Python, ETL (Extract, Transform, and Load), SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), Business Intelligence (BI), BCP, Scala, Spark, Hive, Sqoop, MS SQL Server 2005/2008.
