
Sr. Data Engineer Resume


RTP, NC

SUMMARY

  • 6+ years of IT experience in software analysis, design, development, testing and implementation using Data Engineering, Big Data Hadoop, NoSQL and Python technologies.
  • In-depth experience and good knowledge in using Hadoop ecosystem tools like MapReduce, HDFS, Pig, Hive, Kafka, YARN, Sqoop, Storm, Spark, Oozie, and Zookeeper.
  • Excellent understanding and extensive knowledge of Hadoop architecture and various ecosystem components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node and the MapReduce programming paradigm.
  • Architected BI applications as enterprise solutions for Supply Chain, Online, and Finance.
  • Good usage of Apache Hadoop along with the enterprise distributions of Cloudera and Hortonworks.
  • Good knowledge of the MapR distribution and Amazon EMR.
  • Good knowledge of data modeling, use case design and object-oriented concepts.
  • Experience with Django and Flask, high-level Python web frameworks.
  • Experienced in WAMP (Windows, Apache, MySQL, and Python) and LAMP (Linux, Apache, MySQL, and Python) architectures.
  • Good knowledge of the global supply chain organization and the industrial Health Systems footprint, processes and products.
  • Hands-on experience in designing and developing applications in Spark using Scala and PySpark, and in comparing the performance of Spark with Hive and SQL/Oracle.
  • Good experience in design, coding, debugging, reporting and data analysis on PostgreSQL and Cassandra, utilizing star schema and Snowflake modeling and using Python libraries to speed up development.
  • Strong knowledge of object-oriented design and programming (OOP) concepts, with experience applying them in Python.
  • Well versed in installing, configuring, supporting and managing Big Data workloads and the underlying infrastructure of a Hadoop cluster.
  • Have been working with AWS cloud services (VPC, EC2, S3, Redshift, Data Pipeline, EMR, DynamoDB, Lambda and SQS).
  • Good knowledge of Spark components like Spark SQL, MLlib, Spark Streaming and GraphX.
  • Extensively worked on Spark Streaming and Apache Kafka to fetch live stream data.
  • Experience in converting Hive/SQL queries into RDD transformations using Apache Spark, Scala and Python (a brief PySpark sketch follows this summary).
  • Good knowledge of using the Databricks platform, Cassandra, Cloudera Manager and the Hortonworks distribution to monitor and manage clusters.
  • Implemented dynamic partitions and buckets in Hive for efficient data access.
  • Experience in data processing tasks such as collecting, aggregating and moving data from various sources using Apache Flume and Kafka.
  • Experience in implementing data pipelines using Azure Data Factory.
  • Involved in integrating Hive queries into the Spark environment using Spark SQL.
  • Hands-on experience performing real-time analytics on big data using HBase and Cassandra in Kubernetes and Hadoop clusters.
  • Experience in using Flume to stream data into HDFS.
  • Good working experience using Sqoop to import data from RDBMS into HDFS and vice versa.
  • Experienced in R and Python for statistical computing, as well as MLlib (Spark), MATLAB, Excel, Minitab, SPSS, and SAS.
  • Extensive experience in loading and transforming large sets of structured, semi-structured and unstructured data.
  • Excellent knowledge of integrating Azure Data Factory V2/V1 with a variety of data sources and processing the data using pipelines, pipeline parameters, activities, activity parameters, and manual/window-based/event-based job scheduling.
  • Good knowledge in developing data pipeline using Flume, Sqoop, and Pig to extract the data from weblogs and store in HDFS.
  • Extensive knowledge in programming with Spark Resilient Distributed Datasets (RDDs).
  • Experienced in using Flume to transfer log data files to the Hadoop Distributed File System (HDFS).
  • Experienced with Akka for building high-performance, reliable distributed applications in Java and Scala.
  • Knowledge and experience in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
  • Good working knowledge of Amazon Web Service components like EC2, EMR, S3.
  • Knowledge in the configuration and management of Cloudera's Hadoop platform, along with CDH3 and CDH4 clusters.
  • Strong experience working with databases like Oracle 11g/10g/9i, DB2, SQL Server 2008 and MySQL, and proficiency in writing complex SQL queries.
  • Experience in using PL/SQL to write stored procedures, functions and triggers.
  • Excellent technical and analytical skills with a clear understanding of the design goals of ER modeling for OLTP and dimensional modeling for OLAP.
  • Experience working with batch processing and operational data sources, and migrating data from traditional databases to Hadoop and NoSQL databases.
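
As a brief illustration of the Hive-to-RDD conversion work noted above, the following PySpark sketch rewrites a simple aggregation query as RDD transformations. The table and column names (sales, region, amount, year) are hypothetical placeholders, not taken from any specific project.

    # Minimal PySpark sketch: a Hive aggregation rewritten as RDD transformations.
    # Hypothetical table/columns: sales(region STRING, amount DOUBLE, year INT).
    from operator import add
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-to-rdd")
             .enableHiveSupport()   # lets spark.table() read existing Hive tables
             .getOrCreate())

    # Original Hive query:
    #   SELECT region, SUM(amount) AS total_amount
    #   FROM sales
    #   WHERE year = 2020
    #   GROUP BY region;
    totals = (spark.table("sales").rdd
              .filter(lambda row: row["year"] == 2020)
              .map(lambda row: (row["region"], row["amount"]))
              .reduceByKey(add))

    for region, total_amount in totals.collect():
        print(region, total_amount)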

TECHNICAL SKILLS

Hadoop/Big Data: MapReduce, HDFS, Hive, Pig, HBase, Zookeeper, Cassandra, Sqoop, Oozie, Flume, Scala, Akka, Kafka, Lambda, Storm, MongoDB

Java/J2EE Technologies: JDBC, JavaScript, JSP, Servlets, jQuery

Web/Application servers: Apache Tomcat 6.0/7.0/8.0, JBoss

Web Technologies: HTML, DHTML, XML, XHTML, JavaScript, CSS, XSLT.

NoSQL Databases: Cassandra, Snowflake, MongoDB

Frameworks: MVC, Struts, Spring, Hibernate.

Operating Systems: UNIX, Ubuntu Linux, Windows, CentOS, Sun Solaris

Network protocols: TCP/IP fundamentals, LAN and WAN.

Languages: Java, J2EE, PL/SQL, Pig Latin, HQL, R, Python, XPath, Spark

Databases: Oracle 12c/11g/10g/9i, Microsoft Access, MS SQL

PROFESSIONAL EXPERIENCE

Confidential, RTP, NC

Sr. Data Engineer

Responsibilities:

  • As a Sr. Data Engineer, provided technical expertise on Hadoop technologies as they relate to the development of analytics.
  • Installed, configured and maintained the Hadoop cluster for application development and the Hadoop ecosystem components.
  • Facilitated the resolution of supplier-related technical issues within products, whether discovered in the field, in manufacturing, or at the supplier or within their supply chain.
  • Built Spark applications using PySpark and used the Python programming language for data engineering in the Spark framework.
  • Used various sources such as SQL Server, Oracle and SQL Azure to pull data into Power BI.
  • Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS.
  • Designed and developed Azure Data Factory (ADF) pipelines extensively for ingesting data from different relational and non-relational source systems to meet business functional requirements.
  • Worked with clients to better understand their reporting and dashboarding needs and presented solutions using a structured Agile project methodology approach.
  • Converted SAS scripts to Databricks.
  • Loaded and transformed large sets of structured, semi-structured and unstructured data using Hadoop/Big Data concepts.
  • Developed Python scripts to do file validations in Databricks and automated the process using ADF.
  • Responsible for managing data coming from different sources, with storage and processing in Hue covering all Hadoop ecosystem components.
  • Experienced in developing web services with the Python programming language.
  • Experienced in Microsoft Azure data storage, Azure Data Factory and Data Lake.
  • Used Hive queries to import data into Microsoft Azure cloud and analyzed the data using Hive scripts.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Developed Databricks ETL pipelines using notebooks, Spark DataFrames, Spark SQL, and Python scripting.
  • Created and maintained Technical documentation for launching Hadoop Clusters and for executing Hive queries and Pig Scripts.
  • Used predictive modeling with tools in SAS, SPSS, and Python.
  • Involved in loading data from UNIX file system to HDFS using Flume and HDFS API.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS (see the streaming sketch after this list).
  • Worked with the relevant Application and Technology Centers to implement the One GEA supply chain strategy.
  • Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS and PowerShell.
  • Developed Python scripts to automate the ETL process using Apache Airflow, as well as cron scripts on the Unix operating system.
  • Experience in integrating Oozie logs into a Kibana dashboard.
  • Used DynamoDB to store data for metrics and backend reports.
  • Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
  • Used the DataFrame API in Scala for working with distributed collections of data organized into named columns.
  • Created dashboards in Tableau and in Elasticsearch with Kibana.
  • Experienced with AWS batch processing of data sources using Apache Spark.
  • Developed predictive analytics using Apache Spark Scala APIs.
  • Involved in big data analysis using Pig and user-defined functions (UDFs).
  • Created Hive external tables, loaded data into the tables and queried the data using HQL.
  • Imported millions of structured records from relational databases using Sqoop for processing with Spark, and stored the data in HDFS in CSV format.
  • Comfortable implementing Spark jobs end to end with both the PySpark API and the Spark Scala API.
  • Developed Spark streaming application to pull data from cloud to Hive table.
  • Used Spark SQL to process large amounts of structured data.
  • Experience in AWS, implementing solutions using services like EC2, S3, RDS, Redshift and VPC.
  • Assigned names to each of the columns using the case class option in Scala.
  • Implemented object-oriented programming, the Java Collections API, SOA, design patterns, multithreading and network programming techniques.
  • Responsible for importing log files from various sources into HDFS using Flume.
  • Worked with tools such as Flume, Storm and Spark.
  • Expert in writing business analytics scripts using Hive SQL.
  • Implemented continuous integration and deployment (CI/CD) through Jenkins for Hadoop jobs.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
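
The Spark Streaming bullet above can be illustrated with a short sketch. Spark Structured Streaming is used here for brevity; the broker address, topic name and HDFS paths are hypothetical placeholders, and the job assumes the spark-sql-kafka connector is available at submit time.

    # Minimal sketch: consume a Kafka topic and persist raw events to HDFS.
    # Hypothetical broker, topic and paths; requires the spark-sql-kafka package.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical
              .option("subscribe", "web-events")                  # hypothetical topic
              .load()
              .selectExpr("CAST(key AS STRING) AS key",
                          "CAST(value AS STRING) AS value"))

    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/raw/web_events")
             .option("checkpointLocation", "hdfs:///checkpoints/web_events")
             .outputMode("append")
             .start())

    query.awaitTermination()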

Environment: Spark, Redshift, Snowflake, AWS, Cassandra, Python, HDFS, Hive, Pig, Sqoop, Scala, Shell scripting, Linux, Jenkins, Eclipse, Git, S3, SNS, SQS, Oozie, Talend, Agile Methodology.

Confidential, Irving, TX

Data Engineer

Responsibilities:

  • Worked as a Big Data Engineer on the team dealing with the firm's proprietary platform issues, providing data analysis for the team as well as developing enhancements.
  • Involved in working with large sets of big data dealing with various security logs.
  • All the data was loaded from our relational DBs to Hive using Sqoop. We were getting four flat files from different vendors, all in different formats, e.g. text, EDI and XML.
  • Developed MapReduce jobs for data cleaning and manipulation.
  • Involved in migration of data from existing RDBMS (Oracle and SQL Server) to Hadoop using Sqoop for processing.
  • Worked with Power BI reports and managed Power BI report subscriptions.
  • Implemented installation and configuration of a multi-node cluster in the cloud using Amazon Web Services (AWS) on EC2.
  • Performed File system management and monitoring on Hadoop log files.
  • Utilized Oozie workflows to run Pig and Hive jobs; extracted files from MongoDB through Sqoop, placed them in HDFS and processed them.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
  • Used Flume to collect, aggregate, and store web log data from different sources such as web servers, mobile and network devices, and pushed it to HDFS.
  • Involved in developing the Spark Streaming jobs by writing RDD's and developing data frame using Spark SQL as needed.
  • Wrote Hive join queries to fetch information from multiple tables and wrote multiple MapReduce jobs to collect output from Hive.
  • Developed Databricks ETL pipelines using notebooks, Spark DataFrames, Spark SQL and Python scripting.
  • Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database and SQL Data Warehouse environment.
  • Created Databricks notebooks using SQL and Python and automated the notebooks using jobs.
  • Experience in DWH/BI project implementation using Azure Data Factory and Databricks.
  • Used various HBase commands, generated different datasets as per requirements, and provided access to the data when required using grant and revoke.
  • Created Hive tables as per requirement as internal or external tables, intended for efficiency.
  • Implemented partitioning, dynamic partitions and buckets in Hive for better performance and for organizing data in a logical fashion (see the sketch after this list).
  • Wrote programs in Spark using Scala and Python for data quality checks.
  • Created tables in HBase to store variable data formats of PII data coming from different portfolios.
  • Worked on sequence files, RC files, map-side joins, bucketing and partitioning for Hive performance enhancement and storage improvement.
  • Used cloud computing on the multi-node cluster, deployed the Hadoop application with Azure Cosmos DB, and worked on Spark.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Involved in creating Hive tables, loading data into them, and writing Hive queries to analyze the data.
  • Involved in integrating HBase with Spark to import data and performed CRUD operations on HBase.
  • Developed MapReduce programs for the files generated by Hive query processing to generate key, value pairs and upload the data to NoSQL database HBase.
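
A sketch of the Hive dynamic-partitioning pattern referenced above, issued through Spark SQL with Hive support. The staging and target table names and columns are hypothetical, and bucketing is omitted for brevity.

    # Minimal sketch: load a Hive table with dynamic partitions via Spark SQL.
    # Hypothetical tables: web_logs_staging -> web_logs_part.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-dynamic-partitions")
             .enableHiveSupport()
             .getOrCreate())

    # Allow fully dynamic partition values on insert.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    spark.sql("""
        CREATE TABLE IF NOT EXISTS web_logs_part (
            user_id STRING,
            url     STRING
        )
        PARTITIONED BY (event_date STRING)
        STORED AS ORC
    """)

    # Each row lands in the partition named by its event_date value.
    spark.sql("""
        INSERT OVERWRITE TABLE web_logs_part PARTITION (event_date)
        SELECT user_id, url, event_date
        FROM web_logs_staging
    """)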

Environment: Hadoop, Agile, AWS, Cassandra, Microsoft Azure, Sqoop 1.4, HDFS, NoSQL, HBase, Hive, Spark.

Confidential, San Francisco, CA

Data Engineer

Responsibilities:

  • Worked with SMEs, conducted JAD sessions, and documented the requirements using UML and use case diagrams.
  • Installed and configured Hadoop, and was responsible for maintaining the cluster and managing and reviewing Hadoop log files.
  • As a Big Data implementation engineer, was responsible for developing, troubleshooting and implementing programs.
  • Installed Hadoop, Cassandra, MapReduce and HDFS, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
  • Built Hadoop solutions for big data problems using MR1 and MR2 on YARN.
  • Worked with Business Analyst to understand the user requirements, layout, and look of the interactive dashboard.
  • Implemented the Big Data solution using Hadoop, hive and Informatica to pull/load the data into the HDFS.
  • Installed and configured Hadoop ecosystem components like HBase, Flume, Pig and Sqoop.
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs/MapReduce in Spark for data aggregation, queries, and writing data back into the RDBMS through Sqoop (a PySpark sketch of this pattern follows this list).
  • Performed data analysis and data manipulation of source data from SQL Server and other data structures to support the business organization.
  • Responsible for building scalable distributed data solutions using Big Data technologies like Apache Hadoop, MapReduce, Shell Scripting, Hive.
  • Used Agile (SCRUM) methodologies for Software Development.
  • Wrote complex Hive queries to extract data from heterogeneous sources (Data Lake) and persist the data into HDFS.
  • Involved in all phases of data mining, data collection, data cleaning, developing models, validation and visualization.
  • Performed data extraction, data analysis and data manipulation using AWS, Spark and Hive, and prepared various production and ad-hoc reports to support cost optimization initiatives and strategies.
  • Responsible for data mapping and data mediation between the source data table and target data tables using MS Access and MS Excel.
  • Developed PL/SQL programming that included writing Views, Stored Procedures, Packages, Functions and Database Triggers.
  • Performed data analysis and data profiling on various source systems including Oracle, SQL Server and DB2.
  • Wrote complex SQL scripts and PL/SQL packages to extract data from various source tables of the data warehouse.
  • Involved in all phases of SDLC using Agile and participated in daily scrum meetings with cross teams.
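
The UDF and write-back work above is sketched below in PySpark (the bullet mentions Scala; Python is used here to keep the examples in one language). The input path, JDBC URL, table and credentials are hypothetical placeholders, and the JDBC driver would need to be on the Spark classpath.

    # Minimal sketch: a data-quality UDF plus an aggregate written back to an
    # RDBMS over JDBC. Paths, URL, table and credentials are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    spark = SparkSession.builder.appName("dq-aggregate").getOrCreate()

    @F.udf(returnType=BooleanType())
    def is_valid_email(value):
        # Simple rule: non-null and exactly one '@'.
        return value is not None and value.count("@") == 1

    customers = spark.read.parquet("hdfs:///data/customers")  # hypothetical path

    summary = (customers
               .withColumn("valid_email", is_valid_email(F.col("email")))
               .groupBy("country")
               .agg(F.count("*").alias("total_rows"),
                    F.sum(F.col("valid_email").cast("int")).alias("valid_emails")))

    # Write the aggregate back to the RDBMS (Sqoop export is another option).
    (summary.write
     .format("jdbc")
     .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  # hypothetical
     .option("dbtable", "DQ_SUMMARY")
     .option("user", "etl_user")
     .option("password", "changeme")
     .mode("overwrite")
     .save())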

Environment: Cloudera Manager (CDH5), HDFS, Sqoop, Cassandra, Spark, Pig, Hive, Oozie, Kafka, flume, Java, Git.

Confidential

Data Analyst

Responsibilities:

  • Extracted data from five operational databases containing almost two terabytes of data, loaded into the data warehouse and subsequently populated seven data marts
  • Created complex transformations, mappings, mapplets, reusable items, scheduled workflows based on the business logic and rules
  • Developed ETL job workflows with QC reporting and analysis frameworks
  • Developed Informatica mappings, Lookups, Reusable Components, Sessions, Work Flows etc. (on ETL side) as per the design documents/communication
  • Designed Metadata tables at source staging table to profile data and perform impact analysis
  • Performed query tuning and setting optimization on the Oracle database (rule and cost based)
  • Created Cardinalities, Contexts, Joins and Aliases for resolving loops and checked the data integrity
  • Debugged issues, fixed critical bugs and assisted in code deployments to QA and production
  • Coordinated with the external teams to assure the quality of master data and conduct UAT/integration testing
  • Implemented PowerExchange CDC for mainframes to load certain large data modules into the data warehouse and capture changed data
  • Designed and developed exception handling, data standardization procedures and quality assurance controls
  • Used Cognos for analysis and presentation layers
  • Developed Cognos 10 cubes using Framework Manager, Report Studio and Query Studio
  • Provided performance management and tuning
  • Developed in several BI reporting tool suites
  • Provided technical oversight to consultant partners

Environment: Informatica, Java/SOAP/Web Services, Oracle, DB2, SAS, Shell Scripting, TOAD, SQL Plus, Scheduler
