Data Engineer Resume

San Francisco, CA

PROFESSIONAL SUMMARY:

  • 7+ years of IT experience in requirements gathering, analysis, architecture, design, documentation and implementation of data warehousing solutions.
  • 5+ years of experience with Hadoop and related technologies such as MapReduce, Pig, Oozie, Hive, ZooKeeper, Sqoop, Scala, HBase, Hortonworks, Hue and Cloudera.
  • 2+ years of expertise in the design and development of Tableau visualization and dashboard solutions using Tableau Desktop, Tableau Server administration and Looker.
  • 4+ years of experience with Business Objects enterprise products (Web Intelligence, Desktop Intelligence, InfoView, Universe Designer, CMC, CCM and CMS).
  • Expertise in Informatica and DataStage administration, production deployments and liaising with IBM/Informatica.
  • Expertise in Data warehousing concepts, Dimensional Modeling, Data Modeling, OLAP and OLTP systems.
  • Experience developing applications using Java and J2EE technologies: Servlets, JSP, Java Web Services, JDBC, XML, Cascading, Spring and Hibernate.
  • Expertise in ETL methodology, supporting Extract, Transform and Load environments with Informatica PowerCenter 9.x/8.x (Designer, Repository Manager, Repository Server Administration Console, Workflow Manager, Workflow Monitor, Server Manager).
  • Ability to optimize Hadoop usage for maximum performance on Amazon Web Services, Rackspace and in-house clusters.
  • Experience using integrated development environments such as Eclipse, NetBeans, JDeveloper and MyEclipse.
  • Good experience in Teradata, HP Vertica 7.X, Netezza, Oracle 10g/9i/8i, MS SQL Server, Cassandra, Sybase SQL Server, DB2, MS Access and MS Excel.
  • Experience importing data in different formats from various RDBMS databases into HDFS and HBase, and exporting it back.
  • Experience working with SQL, T-SQL, PL/SQL scripts, views, indexes, stored procedures and other components of database applications.
  • Wrote Hive queries for data analysis and to prepare data for visualization.
  • Strong experience in Object-Oriented Design, Analysis, Development, Testing and Maintenance.
  • Experience using source code management tools such as Git, SVN and Perforce.
  • Excellent technical skills, a record of consistently delivering ahead of schedule, and strong interpersonal and communication skills.

TECHNICAL SKILLS:

Big Data/Hadoop Framework: HDFS, MapReduce v1/v2, YARN, Pig, Hive, Presto, Sqoop, Oozie, ZooKeeper, Flume, HBase, Kafka, Spark

Databases: HP Vertica, Teradata, Netezza, Cassandra, Oracle, Microsoft SQL Server, MySQL, NoSQL

Languages: Java/J2EE, Scala, Spring, Hibernate, Python, Bash, SQL, Pig Latin

BI Tools: Informatica PowerCenter 9.x, Business Objects XI, Tableau Desktop, QlikView, RStudio

Operating Systems: Windows, CentOS, Ubuntu

Development Tools: IntelliJ IDEA, Eclipse, NetBeans, Visual Studio

Development Methodologies: Six Sigma, Agile/Scrum, Waterfall

WORK EXPERIENCE:

Confidential, San Francisco, CA

Data Engineer

Responsibilities:

  • Loading and transforming large data sets in structured, semi-structured and unstructured formats.
  • Designing and developing the ETL process to ingest, transform and store data using Hive, Snowflake, Presto, Airflow and Python.
  • Performing transformations on the data by creating ETL tables on top of the staging tables, which speeds up data access for reporting.
  • Creating solutions to transform data from various sources and load it into platforms such as Hadoop to create a data lake.
  • Scheduling a workflow that transfers files from the different brands to our AWS S3 storage.
  • Developing scripts to check whether the files have been transferred to the S3 storage area and creating a workflow to load the data into the table.
  • Analysing and processing complex data sets using advanced querying, visualization and analytics tools (Hive, Presto, Snowflake, Spark SQL, PostgreSQL, Python).
  • Developing Hive scripts in HiveQL to de-normalize and aggregate the data.
  • Implementing partitioning, dynamic partitions and buckets in Hive to improve performance and organize data logically.
  • Developing interactive shell scripts for scheduling various data cleansing and data loading processes.
  • Using AWS services such as EC2 and S3 for large data sets.
  • Creating and maintaining transformations to summarize/aggregate and load data so users can consume it using various BI/analytics tools.
  • Using the Qubole UI, which provides different types of compute clusters.
  • Working on a Hive-to-Snowflake migration project.
  • Developing and maintaining standards for administration and operation including the scheduling, running, monitoring, logging, management of errors, recovery from failures, and validation of outputs.
  • Working directly with Revenue Analytics team members to understand user requirements.
  • Using a PostgreSQL metadata database to schedule the entire process: transferring files into S3, loading data into staging, performing the ETL operations and updating the ETL tables daily. The Revenue Analytics team uses these tables to build its dashboards in Looker.
  • Implemented Apache Airflow, a workflow management platform, with scheduled Python scripts to automate workflows.
  • Designed the scalable infrastructure required for optimal ETL of data using Airflow, to move data from a variety of data sources to the data warehouse.
  • Migrated jobs to Airflow for dynamic pipeline generation, higher scalability and better monitoring.
  • Developed and automated custom Airflow operators and DAGs to check that daily data has arrived in S3, translate it into Presto and Snowflake command files and execute them against the data warehouse.
  • Optimizing data movement from S3 to minimize latency.
  • Configuring lifecycle management in S3.
  • Involved in loading data from the Linux file system to S3 and in work on Hive User Defined Functions.

Environment: Hive, Snowflake, Airflow, Presto, PostgreSQL, AWS S3, Qubole UI, Qubole scheduler, Python, Cloudera, Pig, Workflow Java.

Confidential

Data Engineer

Responsibilities:

  • Performed Data analysis, Data Profiling and Requirement Analysis.
  • Analysed massive and highly complex Hive data sets, performing ad-hoc analysis and data manipulation.
  • Designed and developed custom data integration pipelines on Facebook’s big data stack, including Python, YAML, Hive, Vertica and Dataswarm.
  • Designed and developed a custom aggregation framework for reporting and analytics in Hive, Presto and Vertica.
  • Developed ETL mappings and workflows using Informatica and Dataswarm.
  • Developed Hive scripts to transfer data to and from HDFS.
  • Prepared Chronos workflows to schedule daily loads based on time or file arrivals.
  • Used Informatica to extract, transform & load data from SQL Server to Oracle databases.
  • Created dynamic BI reports and dashboards for production support using Excel, PowerPoint, Power BI, Tableau, My SQL Server and PHP.
  • Worked on complex information models, logical relationships and data structures from MySQL, Oracle and Hive/Presto.
  • Involved in the migration of ETL processes from Vertica to Presto.
  • Involved in developing Pig Scripts for change data capture and delta record processing between newly arrived data and already existing data in HDFS.
  • Wrote reports using Tableau Desktop to extract data for analysis using filters based on the business use case.
  • Created UNIX shell scripts to be used in conjunction with files.
  • Performed data accuracy, data analysis and data quality checks before and after loading the data.
  • Tuned performance in Oracle and optimized SQL, developing indexes and partitions as needed during the tuning process.
  • Reported status, including overall project development status.
  • Provided solutions to complex end-user requirements, such as automated notifications for data discrepancies in the reporting tables.
  • Analysed existing data warehouse objects to find the optimal way to relate new source objects.
  • Used Python data science libraries such as NumPy and pandas, and built a proof of concept (POC) on sentiment analysis.
  • Interacted with end users, business analysts and architects to collect and understand business requirements, documented them and translated them into technical/system solutions.
  • Involved in different phases of building the Data Marts like analysing business requirements, ETL process design, performance enhancement, go-live activities and maintenance.
  • Coordinated with the Business Analyst team on requirement gathering and the allocation process methodology, and designed the filters for processing the data.
  • Worked with business analysts and data modellers to define mapping reports and the design process for different sources and targets.
  • Used Tableau extracts to perform offline analysis.
  • Blended data from different data sources using the data blending feature in Tableau Desktop.
  • Extensive Tableau experience in an enterprise environment and Tableau administration experience, including technical support, troubleshooting, report design and monitoring of system usage.
  • Worked on Business Intelligence standardization to create database layers with user-friendly views in Vertica that can be used for development of various Tableau reports/ dashboards.

Environment: Dataswarm, Hive, Presto, Vertica, Python, MySQL, Oracle, ETL Methods, Informatica, Linux, HDFS, Tableau Desktop 9/10, Data visualization in D3, Microsoft Excel.

Confidential

Big-Data / Hadoop Developer

Responsibilities:

  • Worked on a 200+ node Hadoop cluster running CDH 5.4.
  • Worked with 110 TB of highly unstructured and semi-structured data (300+ TB with a replication factor of 3).
  • Wrote Pig scripts to transform raw data from several data sources into baseline data.
  • Developed Hive scripts for end user / analyst requirements for Ad-hoc analysis.
  • Very good understanding of Partitions, Bucketing concepts in Hive and designed both Managed and External tables in Hive for optimized performance.
  • Performed near real-time analysis on clickstream data using Kafka and Spark in a POC project for the Bloomingdale’s e-commerce division.
  • Created Interactive reporting dashboards by combining multiple views in Tableau Dashboard.
  • Designed and created various analytical reports and dashboards to help business unit to identify critical KPIs and facilitate decision making and strategic planning in the unit.
  • Developed UDFs in Java as needed for use in Pig and Hive queries.
  • Experience using SequenceFile, Avro and HAR file formats.
  • Extracted the data from Teradata into HDFS using Sqoop.
  • Excellent hands-on experience with Teradata utilities such as MLOAD, FASTLOAD, TPUMP, FASTEXPORT, BTEQ and ARCMAIN.
  • Created a Sqoop job with incremental load to populate Hive external tables.
  • Developed Oozie workflows for scheduling and orchestrating the ETL process.
  • Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
  • Good working knowledge of HBase.
  • Involved in gathering business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
  • Actively participated in code reviews and meetings and helped resolve technical issues.

Environment: Java 7, Eclipse, Oracle 10g, Tableau 9.X, Hadoop, MapReduce, Hive, HBase, Oozie, Linux, HDFS, CDH, SQL, Toad 9.6, Kafka, Spark and Scala.

Confidential

Business Intelligence Developer

Responsibilities:

  • Created Reports, Dashboards and Storyboards in Tableau 9.0 and validated the loads from OLTP systems.
  • Designed and developed dashboards for various business units like finance, marketing, operations and risk management using Tableau to analyse about five Terabytes of data each day.
  • Created Data Quality Dashboards and did Application performance analysis and monitoring using Tableau.
  • Created data extracts in Tableau by connecting to the view using Tableau MSSQL connector.
  • Extensively used data joining and blending and other advanced features in Tableau on various data sources like Hive Tables, MySQL Tables and Flat files.
  • Good experience with configuration, adding users, managing licenses and data connections, scheduling tasks, embedding views on Tableau Server.
  • Involved in troubleshooting and performance tuning of reports and in resolving issues within Tableau Server and reports.
  • Defined best practices for Tableau report development.
  • Monitored system objects such as huge files and unused indexes and took the necessary steps to improve the performance of applications and batch jobs.
  • Installed and configured Apache Hadoop on multi-node clusters using Cloudera Manager.
  • Set up and optimized standalone, pseudo-distributed and distributed clusters.
  • Built, tuned and maintained HiveQL and Pig scripts for user reporting.
  • Experienced in defining Oozie job flows.
  • Experienced in managing and reviewing Hadoop log files.
  • Developed and supported MapReduce Programs running on the cluster.
  • Involved in loading data from UNIX file system to HDFS.
  • Installed and configured Hive.
  • Involved in creating Hive tables, loading data, and writing Hive queries.
  • Developed shell scripts to automate routine DBA tasks (e.g. database refresh, backups, monitoring).
  • Tuned/Modified SQL for batch and online processes.

Environment: CDH Hadoop (HDFS) multi-node installation, Tableau 8.X/9.X, MapReduce, AWS, Hive, Flume, Java, JDK, Flat Files, PL/SQL, UNIX Shell Scripting.

Confidential

Data Warehouse Consultant

Responsibilities:

  • Responsible for designing and implementing ETL processes to load data from different sources, performing data mining and analysing users’ transactional data with visualization/reporting tools.
  • Installed and configured the SAP Integration Kit with Business Objects, integrated Crystal with SAP BW and built reports based on BW cubes, InfoSets and R/3 tables.
  • Designed Business Objects universes, trained end users and peers, and explained the functionality of Business Objects Designer, WebI and DeskI.
  • Worked on a project to collect logs from the physical machines and the OpenStack controller and integrate them into Hadoop HDFS using Flume.
  • Designed complex dashboards and reports by linking data from multiple data providers, using free-hand SQL and functionality such as Combined Queries.
  • Resolved Loops, Fan traps and Chasm traps with Aliases and Contexts.
  • Created measure objects, custom LOVs and hierarchies for easy user selection and drill-down.
  • Gathered and analysed requirements and prepared business rules for migration from Oracle to Informatica.
  • Developed complex mappings using Rank, Expression, Lookup, Update, Sequence Generator, Aggregator, Router and Stored Procedure transformations to implement complex logic.
  • Worked with Informatica PowerCenter Designer, Workflow Manager, Workflow Monitor and Repository Manager.
  • Developed and maintained ETL (Extract, Transform and Load) mappings to extract data from multiple source systems such as Oracle, SQL Server, Netezza, Ab Initio, flat files and Java, and load it into Teradata.
  • Loaded data from several flat file sources using Teradata utilities (TPT, BTEQ, MLOAD, FASTLOAD and FASTEXPORT).
  • Loaded mainframe files into the Teradata production region to perform value-added processing (VAPS) before providing the data to the vendor.
  • Developed Informatica Workflows and sessions associated with the mappings using Workflow Manager.
  • Deployed Sqoop server to perform imports from heterogeneous data sources to HDFS.
  • Identified parameters, lookups, procedure calls, SQL overrides and command-line utilities.
  • Conducted code reviews and fine-tuned and analysed mappings and loads.
  • Conducted detailed analysis to improve the load performance.
  • Designed load validation reports and analysis on reports.

Environment: Hadoop, Informatica, Business Objects XI R2, SQL Server 2000/2003, Oracle 10g/9i, DB2, Tomcat 4.0, Apache server 1.3, Windows Server 2003/XP.
