
Sr Data Engineer Resume


Portland, OR

PROFESSIONAL SUMMARY:

  • Highly dedicated and inspiring Sr Data Engineer with over 12 years of IT industry experience across technologies, tools and databases including Big Data, AWS, S3, Snowflake, Hadoop, Hive, Spark, Python, Sqoop, CDL (Cassandra), Teradata, E-R (Confidential), Tableau, SQL, PL/SQL, Ab Initio (ACE), and Redshift, while always staying rooted in the world I cherish most: the data world.
  • Over 12 years of overall IT experience across a variety of industries, including hands-on experience in Big Data and data warehouse ETL technologies.
  • 4+ years of comprehensive experience in Big Data processing using Hadoop and its ecosystem (MapReduce, Pig, Hive, Sqoop, Flume, Spark).
  • Good working experience with Spark (Spark Streaming, Spark SQL), Scala and Kafka.
  • Good working knowledge on Snowflake and Teradata databases.
  • Extensively worked on Spark with Scala on the cluster for computational analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle/Snowflake.
  • Excellent Programming skills at a higher level of abstraction using Scala and Python.
  • Hands-on experience in developing Spark applications using RDD transformations, Spark Core, Spark MLlib, Spark Streaming and Spark SQL.
  • Strong experience and knowledge of real time data analytics using Spark Streaming, Kafka and Flume.
  • Working knowledge of Amazon's Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as a storage mechanism.
  • Worked on reading multiple data formats on HDFS using Scala.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs (a brief Scala sketch of this conversion follows this list).
  • Good experience in creating and designing data ingest pipelines using technologies such as Apache Storm and Kafka.
  • Experienced in working with in-memory processing frameworks such as Spark (transformations, Spark SQL, MLlib and Spark Streaming).
  • Good working experience on using Sqoop to import data into HDFS from RDBMS and vice-versa.
  • Experienced in implementing POCs using Spark SQL and MLlib libraries.
  • Improved the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs and YARN.
  • Hands on experience in handling Hive tables using Spark SQL.
  • Efficient in writing MapReduce Programs and using Apache Hadoop API for analyzing the structured and unstructured data.
  • Expert in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries (a partitioning and bucketing sketch follows this list).
  • Debugged Pig and Hive scripts and optimized and debugged MapReduce jobs.
  • Hands-on experience in managing and reviewing Hadoop logs.
  • Good knowledge about YARN configuration.
  • Extending Hive and Pig core functionality by writing custom UDFs.
  • Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems and vice-versa.
  • Hands on experience in configuring and working with Flume to load the data from multiple sources directly into HDFS.
  • Good working knowledge on NoSQL databases such as Hbase and Cassandra.
  • Knowledge of job workflow scheduling and monitoring tools such as Oozie (Hive, Pig) and DAG-based scheduling (Lambda).
  • Integrated Apache Storm with Kafka to perform web analytics; uploaded clickstream data from Kafka to HDFS, HBase and Hive by integrating with Storm.
  • Developed various shell scripts and python scripts to address various production issues.
  • Developed and designed automation framework using Python and Shell scripting
  • Experience in AWS EC2, configuring the servers for Auto scaling and Elastic load balancing.
  • Good Knowledge of data compression formats like Snappy, Avro.
  • Developed automated workflows in the Bedrock tool and Ab Initio to monitor the landing zone for files and ingest them into HDFS.
  • Created Ab Initio jobs to compare data between tables across different databases and to identify and report discrepancies to the respective teams.
  • Delivered zero-defect code for three large projects involving changes to both the front end (web services) and the back end (Oracle, Snowflake, Teradata).
  • Experience with all stages of the SDLC and Agile Development model right from the requirement gathering to Deployment and production support.
  • Involved in daily SCRUM meetings to discuss the development/progress and was active in making scrum meetings more productive.
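
Illustrative example (not project code): a minimal Scala sketch of the Hive/SQL-to-Spark conversion referenced in the list above. The sales table and its columns are hypothetical; the point is that the same aggregation can be expressed through Spark SQL or as DataFrame transformations.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HiveToSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveToSparkSketch")
      .enableHiveSupport()              // lets Spark read existing Hive tables
      .getOrCreate()

    // Original HiveQL:
    //   SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region;
    // Same query run through Spark SQL against the Hive metastore:
    val viaSql = spark.sql(
      "SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region")

    // Same logic expressed as DataFrame transformations:
    val viaDf = spark.table("sales")
      .groupBy("region")
      .agg(sum("amount").alias("total_amount"))

    viaSql.show()
    viaDf.show()
    spark.stop()
  }
}
```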
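
Likewise, a hedged sketch of the Hive partitioning and bucketing pattern mentioned above, issued through spark.sql() to stay in Scala; the orders_curated table, its columns, the bucket count and the partition value are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession

object HivePartitionBucketSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HivePartitionBucketSketch")
      .enableHiveSupport()              // required to create Hive-format tables
      .getOrCreate()

    // Partition by load date and bucket by customer_id so that date filters
    // prune partitions and customer-level joins/sampling touch fewer files.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS orders_curated (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE)
      PARTITIONED BY (load_date STRING)
      CLUSTERED BY (customer_id) INTO 32 BUCKETS
      STORED AS ORC
    """)

    // A query that benefits from partition pruning on load_date.
    spark.sql("""
      SELECT customer_id, SUM(amount) AS total_amount
      FROM orders_curated
      WHERE load_date = '2021-06-01'
      GROUP BY customer_id
    """).show()

    spark.stop()
  }
}
```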

TECHNICAL SKILLS:

Big Data: Cloudera Distribution, HDFS, Yarn, Data Node, Name Node, Resource Manager, Node Manager, Mapreduce, PIG, SQOOP, Hbase, Hive, Flume, Cassandra, Spark, Storm, Scala, Impala

Operating System: UNIX AIX 5.3, OS/390 z/OS 1.6, Windows 95/98/NT/ME/2000/XP, UNIX, MS-DOS, Sun Solaris 5.8, Linux 8.x

Languages: Visual Basic 6.0/5.0, SQL, PL/SQL, Transact-SQL, and Python

Databases: Snowflake(cloud), Teradata, IBM DB2, Oracle, SQL Server, MySQL, NoSQL

Web Technologies: HTML, XML

Version Tools: GIT, CVS

Packages: SQL* PLUS, Toad 7.x, SQL Loader, Erwin 7.0

Tools: TOAD, SQL Developer, ANT, Log4J

Web Services: WSDL, SOAP.

ETL/Reporting: Ab Initio GDE 3.0, Co>Op 2.15/3.0.3, Informatica, Tableau

Web/App Server: UNIX server, Apache Tomcat

PROFESSIONAL EXPERIENCE:

Sr Data Engineer

Confidential, Portland, OR

Responsibilities:

  • Responsible for analyzing business requirements, estimating tasks, and preparing the mapping design documents for Confidential Point of Sale (POS) and Direct Sales (digital sale) across all GOEs.
  • Analyzed large and critical datasets using Cloudera, HDFS, MapReduce, Hive, Hive UDF, Pig, Sqoop and Spark.
  • Developed Spark applications using Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs and Spark on YARN.
  • Used Spark Streaming APIs to perform on-the-fly transformations and actions for building a common learner data model that receives data from Kafka in near real time and persists it to Cassandra (a streaming sketch follows this responsibilities list).
  • Consumed XML messages using Kafka and processed the XML files using Spark Streaming to capture UI updates.
  • Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files (a flattening sketch also follows this list).
  • Developed Dashboard reports on Tableau.
  • Experienced in writing live Real-time Processing and core jobs using Spark Streaming with Kafka as a data pipe-line system.
  • Worked and learned a great deal from AWS Cloud services like EC2, S3, EBS, RDS.
  • Migrated an existing on-premises application to AWS; used AWS services like EC2 and S3 for small data set processing and storage, and maintained the Hadoop cluster on AWS EMR.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
  • Gathered the business requirements from the Business Partners and Subject Matter Experts
  • Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra as per the business requirements.
  • Wrote multiple MapReduce programs for data extraction, transformation and aggregation from multiple file formats including XML, JSON, CSV and other compressed file formats.
  • Developed automated processes for flattening the upstream data from Cassandra, which is in JSON format, using Hive UDFs.
  • Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms
  • Developed PIG UDFs to provide Pig capabilities for manipulating the data according to Business Requirements and worked on developing custom PIG Loaders and Implemented various requirements using Pig scripts.
  • Experienced in loading and transforming large sets of structured, semi-structured and unstructured data.
  • Created POCs using Spark SQL and MLlib libraries.
  • Developed a Spark Streaming module for consumption of Avro messages from Kafka.
  • Experienced in Querying data using SparkSQL on top of Spark Engine, implementing Spark RDD’s in Scala.
  • Expertise in writing Scala code using Higher order functions for iterative algorithms in Spark for Performance considerations.
  • Experienced in managing and reviewing Hadoop log files
  • Worked with different File Formats like TEXTFILE, AVROFILE, ORC, and PARQUET for HIVE querying and processing.
  • Created and maintained Teradata tables, views, macros, triggers and stored procedures.
  • Expertise in Snowflake for creating and maintaining tables and views.
  • Monitored workload, job performance and capacity planning using Cloudera Distribution.
  • Worked on Data loading into Hive for Data Ingestion history and Data content summary.
  • Involved in developing Python scripts and using Ab Initio, Informatica and other ETL tools for extraction, transformation and loading of data into the data warehouse.
  • Created Impala tables and SFTP scripts and Shell scripts to import data into Hadoop.
  • Created Hive tables and involved in data loading and writing Hive UDFs. Developed Hive UDFs for rating aggregation
  • Provided ad-hoc queries and data metrics to the Business Users using Hive, Pig
  • Performed various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
  • Worked on importing and exporting data from Snowflake, Oracle and DB2 into HDFS and Hive using Sqoop for analysis, visualization and report generation.
  • Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS
  • Experienced with AWS and Azure services for smoothly managing applications in the cloud and creating or modifying instances.
  • Created data pipelines for different ingestion and aggregation events and loaded consumer response data from an AWS S3 bucket into Hive external tables in HDFS to serve as a feed for Tableau dashboards.
  • Used EMR (Elastic MapReduce) to perform big data operations in AWS.
  • Worked on Apache Spark, writing Python applications to parse and convert txt and xls files.
  • Developed Python scripts, UDF's using both Data frames/SQL and RDD/MapReduce in Spark for Data Aggregation, queries and writing data back into RDBMS through Sqoop.
  • Installed the application on AWS EC2 instances and configured the storage on S3 buckets.
  • Loaded data from different sources (databases and files) into Hive using the Talend tool.
  • Implemented Spark using Python/Scala and utilizing Spark Core, Spark Streaming and Spark SQL for faster processing of data instead of MapReduce in Java
  • Scheduled Oozie workflow engine to run multiple Hive and Pig jobs, which independently run with time and data availability
  • Scheduled Airflow DAGs to run multiple Hive and Pig jobs, which independently run with time and data availability
  • Worked on custom Pig Loaders and Storage classes to work with a variety of data formats such as JSON, Compressed CSV etc
  • Involved in running Hadoop Streaming jobs to process Terabytes of data
  • Used JIRA for bug tracking and CVS for version control.
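
Illustrative sketches for two items above (not actual project code). First, a hedged Scala outline of the Spark Streaming path described earlier in this list: events are read from Kafka and persisted to Cassandra. The topic, keyspace, table, message layout and connection settings are hypothetical, and it assumes the spark-streaming-kafka-0-10 and spark-cassandra-connector packages are on the classpath.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

// Hypothetical learner event; the connector maps the fields to table columns.
case class LearnerEvent(learnerId: String, eventType: String, ts: Long)

object KafkaToCassandraSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("KafkaToCassandraSketch")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // assumed host
    val ssc = new StreamingContext(conf, Seconds(10))        // 10s micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",              // assumed broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "learner-model",
      "auto.offset.reset"  -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("learner-events"), kafkaParams))

    // Parse pipe-delimited messages into the learner data model and persist
    // each micro-batch to a (hypothetical) Cassandra keyspace/table.
    stream.map(_.value.split('|'))
      .flatMap {
        case Array(id, evt, ts) => Seq(LearnerEvent(id, evt, ts.toLong))
        case _                  => Seq.empty[LearnerEvent]
      }
      .saveToCassandra("learning", "learner_events")

    ssc.start()
    ssc.awaitTermination()
  }
}
```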
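
Second, a minimal sketch of flattening nested JSON documents into a flat file with Spark DataFrames, as mentioned in the same list; the document shape, input path and output path are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object FlattenJsonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FlattenJsonSketch").getOrCreate()

    // Assumed document shape:
    // {"order_id":1,"customer":{"id":7,"name":"A"},"items":[{"sku":"X","qty":2}]}
    val raw = spark.read.json("s3a://example-bucket/landing/orders/")   // illustrative path

    val flat = raw
      .withColumn("item", explode(col("items")))          // one row per array element
      .select(
        col("order_id"),
        col("customer.id").alias("customer_id"),
        col("customer.name").alias("customer_name"),
        col("item.sku").alias("sku"),
        col("item.qty").alias("qty"))

    // Write the flattened rows out as a delimited flat file.
    flat.write.mode("overwrite").option("header", "true").csv("/tmp/orders_flat")
    spark.stop()
  }
}
```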

Environment: Hadoop, MapReduce, Hive, HDFS, Pig, Sqoop, Hortonworks, Flume, HBase, Oracle/SQL, DB2, Snowflake, Teradata, Tableau, Unix/Linux, JIRA, AWS

Big Data Engineer

Confidential, Seattle, WA

Responsibilities:

  • Primary responsibilities include building scalable distributed data solutions using Hadoop ecosystem.
  • Responsible for analyzing business requirements, estimating tasks, and preparing design documents for converting the existing Ab Initio and Teradata code into Hive/Spark SQL.
  • Developed the Spark SQL logic that mimics the Teradata ETL logic and pointed the output delta back to newly created Hive tables as well as the existing Teradata dimension, fact, and aggregate tables.
  • Imported data from Ab Initio LDR (load-ready) files into Spark RDDs and performed transformations and actions on them.
  • Experienced in designing and deploying Hadoop clusters and different big data analytic tools including Pig, Hive, Flume, HBase and Sqoop.
  • Loaded CDRs from relational databases using Sqoop and from other sources into the Hadoop cluster using Flume.
  • Implementing quality checks and transformations using Spark.
  • Developed simple and complex MapReduce programs in Hive, Pig and Python for Data Analysis on different data formats.
  • Performed data transformations by writing MapReduce and Pig scripts as per business requirements.
  • Implemented Map Reduce programs to handle semi/unstructured data like xml, json, Avro data files and sequence files for log files.
  • Developed various Python scripts to find vulnerabilities with SQL Queries by doing SQL injection, permission checks and analysis.
  • Experienced in Kerberos authentication to establish a more secure network communication on the cluster.
  • Analyzed substantial data sets by running Hive queries and Pig scripts.
  • Managed and reviewed Hadoop and HBase log files.
  • Experience in creating, dropping and altering tables at run time without blocking updates and queries, using Spark and Hive.
  • Experienced in writing Spark Applications in Scala and Python.
  • Used Spark SQL to handle structured data in Hive.
  • Imported semi-structured data from Avro files using Pig to make serialization faster
  • Processed the web server logs by developing Multi-hop flume agents by using Avro Sink and loaded into MongoDB for further analysis.
  • Experienced in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python (see the sketch after this list).
  • Experienced in connecting Avro sink ports directly to Spark Streaming for analysis of weblogs.
  • Involved in creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization.
  • Managed and scheduled jobs on a Hadoop cluster using UC4 (Confidential proprietary scheduling tool) workflows.
  • Continuous monitoring and managing the Hadoop cluster through Hortonworks (HDP) distribution.
  • Configured various views in Yarn Queue manager.
  • Involved in review of functional and non-functional requirements.
  • Indexed documents using Elastic search.
  • Responsible for using a Flume sink to remove the data from the Flume channel and deposit it in a NoSQL database such as MongoDB.
  • Involved in loading data from UNIX file system and FTP to HDFS.
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
  • Loaded JSON-Styled documents in NoSQL database like MongoDB and deployed the data in cloud service Amazon Redshift.
  • Responsible for developing data pipelines with Amazon AWS to extract data from weblogs and store it in Amazon EMR and Azure.
  • Used Zookeeper to provide coordination services to the cluster.
  • Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with reference tables and historical metrics.
  • Involved in migrating tables from RDBMS into Hive tables using SQOOP and later generated data visualizations using Tableau.
  • Experience in processing large volumes of data and in parallel execution of processes using Ab Initio functionality.
  • Designed and implemented Spark jobs to support distributed data processing.
  • Experience in optimizing Map Reduce Programs using combiners, partitioners and custom counters for delivering the best results.
  • Written Shell scripts to monitor the health check of Hadoop daemon services and respond accordingly to any warning or failure conditions.
  • Involved in Hadoop cluster tasks such as adding and removing nodes without any effect on running jobs and data.
  • Followed Agile methodology for the entire project.
  • Experienced in Extreme Programming, Test-Driven Development and Agile Scrum
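
A hedged Scala sketch of the Hive-to-Spark conversion pattern noted above: the same aggregation expressed once through Spark SQL against a Hive table and once as RDD transformations. The call_records table and its columns are hypothetical, not the actual CDR schema.

```scala
import org.apache.spark.sql.SparkSession

object HiveQueryAsRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveQueryAsRddSketch")
      .enableHiveSupport()
      .getOrCreate()

    // HiveQL version of the aggregation:
    //   SELECT caller_region, COUNT(*) FROM call_records GROUP BY caller_region;
    val viaSql = spark.sql(
      "SELECT caller_region, COUNT(*) AS calls FROM call_records GROUP BY caller_region")

    // The same logic as RDD transformations (reduceByKey instead of GROUP BY).
    val viaRdd = spark.table("call_records").rdd
      .map(row => (row.getAs[String]("caller_region"), 1L))
      .reduceByKey(_ + _)

    viaSql.show()
    viaRdd.take(10).foreach(println)
    spark.stop()
  }
}
```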

Environment: Hortonworks (HDP), Hadoop, Spark, Sqoop, Flume, Elasticsearch, AWS, EC2, S3, Pig, Hive, MySQL, Python, MapReduce, HDFS, Tableau, Ab Initio.

Big Data Engineer

Confidential

Responsibilities:

  • Developed new Spark SQL ETL logic in Big Data for the migration and availability of the facts and dimensions used for analytics.
  • Developed Spark SQL applications for the Big Data migration from Teradata to Hadoop, reducing memory utilization in Teradata analytics.
  • Gathered requirements and led the team in developing the Big Data environment and migrating the Spark ETL logic.
  • Involved in requirement gathering from the business analysts and participated in discussions with users and functional analysts on business logic implementation.
  • Responsible for end-to-end Spark SQL design and development to meet the requirements.
  • Advised the business on Spark SQL best practices while making sure the solution met the business needs.
  • Led and coordinated developers, testing and technical teams in offshore support on a daily basis to discuss challenges and outstanding issues.
  • Involved in preparing, distributing and collaborating on client-specific quality documentation for Big Data and Spark developments, along with regular monitoring to ensure modifications or enhancements were reflected in Confidential schedulers.
  • Migrated data from Teradata to Hadoop and prepared data using Hive tables.
  • Created partitioned and bucketed tables in Hive; mainly worked on HiveQL to categorize data of different subject areas for marketing, shipping, and selling.
  • Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
  • Accessed Hive tables using the Spark HiveContext (Spark SQL) and used Scala for interactive operations.
  • Developed the Spark SQL logic that mimics the Teradata ETL logic and pointed the output delta back to newly created Hive tables as well as the existing Teradata dimension, fact, and aggregate tables (a simplified sketch follows this list).
  • Made sure data matched between the Teradata and Spark SQL logic.
  • Created views on top of the Hive tables and provided them to customers for analytics.
  • Analyzed the Hadoop cluster and different big data analytic tools including Pig, HBase and Sqoop.
  • Worked with Linux systems and RDBMS database on a regular basis in order to ingest data using Sqoop.
  • Collected and aggregated large amounts of web log data from different sources such as webservers, mobile and network devices using Apache and stored the data into HDFS for analysis.
  • Strong knowledge on creating and monitoring cluster on Hortonworks Data platform.
  • Developed Unix shell scripts to load large number of files into HDFS from Linux File System
  • Developed Custom Input Formats in MapReduce jobs to handle custom file formats and to convert them into key-value pairs.
  • Involved in creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization.
  • Created the Hive external tables using Accumulo connector.
  • Designed and implemented HIVE queries and functions for evaluation, filtering, loading and storing of data.
  • Extensive experience in writing UNIX shell scripts and automation of the ETL processes using UNIX shell scripting.
  • Designed, developed, tested, implemented and supported data warehousing ETL using Ab Initio and Hadoop technologies.
  • Involved in performance tuning of the ETL process by addressing various performance issues at the extraction and transformation stages.
  • Worked with BI teams in generating the reports and designing ETL workflows on Tableau
  • Prepared the Technical Specification document for the ETL job development.
  • Involved in loading data from UNIX file system and FTP to HDFS
  • Used HIVE to do transformations, event joins and some pre-aggregations before storing the data onto HDFS.
  • Developed UDF's in java for enhancing functionalities of Pig and Hive scripts.
  • Experienced in working with spark eco system using Spark SQL and Scala queries on different formats like Text file, CSV file.
  • Implemented daily cron jobs that automate the parallel tasks of loading data into HDFS and pre-processing with Pig, using Oozie coordinator jobs.
  • Worked on the Ad hoc queries, Indexing, Replication, Load balancing, Aggregation in MongoDB.
  • Experience in managing MongoDB environment from availability, performance and scalability perspectives.
  • Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
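
A simplified, hypothetical sketch of the Spark SQL pattern described above: a daily delta is computed with Spark SQL and appended into a partitioned Hive fact table. The table, column and partition names and the assumed staging table are illustrative, not the actual Teradata logic being migrated.

```scala
import org.apache.spark.sql.SparkSession

object DailySalesDeltaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DailySalesDeltaSketch")
      .enableHiveSupport()
      .getOrCreate()

    // Target fact table, partitioned by sale_date (created once).
    spark.sql("""
      CREATE TABLE IF NOT EXISTS daily_sales_fact (
        store_id     BIGINT,
        total_amount DOUBLE,
        txn_count    BIGINT)
      PARTITIONED BY (sale_date STRING)
      STORED AS ORC
    """)

    // Allow dynamic partitioning so the delta lands in its own partition.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // Equivalent of a Teradata INSERT/SELECT: aggregate the day's delta from
    // an assumed staging table and append it to the fact table.
    spark.sql("""
      INSERT INTO TABLE daily_sales_fact PARTITION (sale_date)
      SELECT store_id,
             SUM(amount) AS total_amount,
             COUNT(*)    AS txn_count,
             sale_date
      FROM sales_delta_staging
      WHERE sale_date = '2019-03-01'
      GROUP BY store_id, sale_date
    """)

    spark.stop()
  }
}
```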

Environment: Ab Initio 3.0, Hadoop, HDFS, Hortonworks, Hive, Sqoop, Python, Unix, Shell Scripting, Teradata, Spark SQL.

ETL Project Lead/Ab Initio Consultant

Confidential

Responsibilities:

  • Prepared Functional requirement document (FRD) and Technical Specification Document (TSD).
  • Involved in data profiling activities and came up with generic test cases.
  • Developed different Ab Initio graphs and scripts to validate and integrate the data.
  • Analyzed data for various performance issues and to identify trends in the data.
  • Developed processes to set up the entire hierarchy of a PBM.
  • Reviewed the test case results with the team and reported the results to the business accordingly.
  • Generated test cases based on the requirement document for each source system loaded into the MIDE data warehouse.
  • Developed the code as per the business logic and executed Ab Initio graphs.
  • Developed a tag migration utility for code promotion from Dev to QA.
  • Experienced in developing UNIX shell wrapper scripts to run Ab Initio graphs and monitoring the job cycle of each application.
  • Practical experience with working on multiple environments like production, development, testing.
  • Well versed with various Ab Initio parallelism techniques and implemented Ab Initio Graphs using Data parallelism and Multi File System (MFS) techniques.
  • Converted user defined functions and complex business logic of an existing application process into Ab Initio graphs using Ab Initio components such as Reformat, Join, Transform, Sort, Partition to facilitate the subsequent loading process.
  • Responsible for deploying Ab Initio graphs and running them through the Co>Operating System's mp shell command language, and responsible for automating the ETL process through scheduling.
  • Worked on improving the performance of Ab Initio graphs by using various Ab Initio performance techniques such as lookups (instead of joins), in-memory joins and rollups to speed up various graphs.
  • Implemented phasing and checkpoint approach in ETL process to prevent data loss and to maintain uninterrupted data flow against process failures

Sr ETL Developer

Confidential

Responsibilities:

  • Involved in migrating historical as-built data from the Link Tracker Oracle database to Teradata (TD) using Ab Initio.
  • Implemented the historical purge process for Clickstream, Order Broker and Link Tracker to TD using Ab Initio.
  • Implemented the centralized graphs concept.
  • Extensively used Ab Initio components like Reformat, Rollup, Lookup, Join and Redefine Format, and also developed many subgraphs.
  • Created Ab Initio sandboxes at both the GDE level and the air command level, and scheduled interdependent jobs (Ab Initio deployed graphs) through a UNIX wrapper template.
  • Performed tuning of the Ab Initio graphs.
  • Created sandboxes and added parameters based on the requirements.
  • Involved in loading the transformed data files into TD staging tables through TD load utilities (FastLoad and MultiLoad scripts), and created TD macros for loading the data from staging to target tables.
  • Performed data validation on TD warehouse data as per a few standard test cases.
  • Led the module to load all PARTY RELSHIP tables; responsible for requirement gathering, creating specification and test case documents, designing and validating the ETL mapping, development through unit testing, validating the data populated in the database, providing UAT support, and resolving issues raised by the users and different groups.
  • Responsible as an E-R consultant for the ER (Extract-Replicate) GoldenGate tool, which is used to extract real-time data to the warehouse without hitting the database, pulling the data from Oracle archive logs since Oracle 10g supports the ASM (Automatic Storage Management) method.
  • Also involved in designing the Data Allegro post scripts to load the data from LRF files into the DA database.
