
Big Data Engineer Resume


Los Angeles, CA

SUMMARY

  • Over 8 years of experience in Data Engineering, Data Pipeline Design, Development and Implementation as a Data Engineer/Data Developer and Data Modeler, with expertise in Spark/Hadoop, Python/Scala and the AWS and Azure cloud computing platforms.
  • Data Engineer with experience implementing Big Data/Cloud Engineering, Snowflake, Data Warehouse, Data Modelling, Data Mart, Data Visualization, Reporting, Data Quality, Data Virtualization and Data Science solutions. Good understanding of architecting, designing and operationalizing large-scale data and analytics solutions on the Snowflake Cloud Data Warehouse.
  • Strong experience in writing scripts using the Python API, PySpark API and Spark API for analyzing data.
  • Hands-on experience with Spark Core, Spark SQL and Spark Streaming, and with creating and handling DataFrames in Spark with Scala.
  • Developed Hive scripts for end user/analyst requirements to perform ad hoc analysis; used EMR with Hive to handle lower-priority bulk ETL jobs.
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns. Expertise in OLTP/OLAP system study, analysis and E-R modelling, developing database schemas such as Star schema and Snowflake schema used in relational, dimensional and multidimensional modelling.
  • Experience in creating separate virtual warehouses with different size classes in Snowflake on AWS.
  • Hands-on experience in bulk loading and unloading data into Snowflake tables using the COPY command (see the bulk-load sketch after this list).
  • Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command line utilities, Dataproc and Stackdriver.
  • Experience with data transformations utilizing SnowSQL in Snowflake.
  • Skilled in System Analysis, E-R/Dimensional Data Modelling, Database Design and implementing RDBMS specific features.
  • Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data. Hands-on experience with Big Data tools like Hive, Pig, Impala, PySpark and Spark SQL.
  • Hands-on experience in implementing LDA and Naive Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, Neural Networks and Principal Component Analysis.
  • Extensive experience in Data Visualization including producing tables, graphs and listings using various procedures and tools such as Power BI.
  • Well versed with major Hadoop distributions, Cloudera and Hortonworks; experienced with the Eclipse and NetBeans IDEs.
  • Exposure to working in Agile methodologies. Designed and developed the data pipeline processes for various modules within AWS.
  • Designed ETL process using Informatica Designer to load the data from various source databases to target data warehouse in Vertica.
  • Experience in analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark, Spark SQL for Data Mining, Data Cleansing.
  • Extensive experience in Data Visualization including producing tables, graphs, listings using various procedures and tools such as Tableau.
  • Excellent work experience writing highly complex SQL/PL-SQL queries against major relational databases: MS Access, Oracle, MySQL, Teradata and MS SQL Server.
  • Good experience working on the AWS Big Data/Hadoop ecosystem in the implementation of a Data Lake.
  • Experience in AWS Cloud services such as EC2, S3, EBS, VPC, ELB, Route 53, CloudWatch, Security Groups, CloudTrail, IAM, CloudFront, Snowball, RDS and Glacier.
  • Experience reading continuous JSON data from different source systems through Kafka into Databricks Delta, processing the files using Spark Structured Streaming and PySpark, and creating the output files in Parquet format (see the streaming sketch after this list).
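
A minimal sketch of the Snowflake bulk-load pattern above, issued through the snowflake-connector-python library. The account, credentials, warehouse, stage, table and file-format settings are illustrative placeholders, not values from the original projects.

```python
# Hypothetical bulk load into a Snowflake table with the COPY command,
# executed via snowflake-connector-python. All identifiers are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",       # placeholder account identifier
    user="etl_user",            # placeholder user
    password="***",
    warehouse="LOAD_WH",        # a virtual warehouse sized for loading
    database="ANALYTICS",
    schema="STAGING",
)

copy_sql = """
    COPY INTO STAGING.ORDERS
    FROM @my_s3_stage/orders/                          -- external stage (placeholder)
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
    ON_ERROR = 'ABORT_STATEMENT'
"""

cur = conn.cursor()
try:
    cur.execute(copy_sql)
    for row in cur:             # one result row per loaded file
        print(row)
finally:
    cur.close()
    conn.close()
```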
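
And a minimal PySpark Structured Streaming sketch of the Kafka-to-Parquet flow in the last bullet. The broker, topic, schema and paths are assumptions for illustration; on Databricks the sink format would typically be delta rather than parquet.

```python
# Hypothetical streaming job: read JSON events from Kafka, parse them,
# and append them as Parquet files. All names and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-json-to-parquet").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "events")                      # placeholder topic
       .option("startingOffsets", "latest")
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

query = (parsed.writeStream
         .format("parquet")                                # or "delta" on Databricks
         .option("path", "/mnt/landing/events")            # placeholder landing path
         .option("checkpointLocation", "/mnt/checkpoints/events")
         .outputMode("append")
         .start())
```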

TECHNICAL SKILLS

Operating Systems: Linux (Ubuntu, CentOS), Windows, Mac OS

Hadoop Ecosystem: Hadoop, MapReduce, Yarn, HDFS, Pig, Oozie, Zookeeper

Big Data Ecosystem: Spark, Spark SQL, Spark Streaming, Spark MLlib, Hive, Impala, Hue, Airflow

Cloud Ecosystem: Azure, AWS, Snowflake cloud data warehouse

Data Ingestion: Sqoop, Flume, NiFi, Kafka

NOSQL Databases: HBase, Cassandra, MongoDB, CouchDB

Programming Languages: Python, C, C++, Scala, Core Java, J2EE

Scripting Languages: UNIX Shell, Python, R

Databases: Oracle 10g/11g/12c, PostgreSQL 9.3, MySQL, SQL-Server, Teradata, HANA

IDE: IntelliJ, Eclipse, Visual Studio, IDLE

Tools: SBT, PuTTY, WinSCP, Maven, Git, Jasper Reports, Jenkins, Tableau, Mahout, UC4, Pentaho Data Integration, Toad

Methodologies: SDLC, Agile, Scrum, Iterative Development, Waterfall Model

PROFESSIONAL EXPERIENCE

Confidential, Los Angeles, CA

Big Data Engineer

Responsibilities:

  • Developed Spark RDD transformations, actions, DataFrames, case classes and Datasets for the required input data and performed the data transformations using Spark Core.
  • Worked on Apache Spark, utilizing the Spark SQL and Spark Streaming components to support intraday and real-time data processing.
  • Experience in Snowflake administration and in managing the Snowflake system.
  • Created data pipelines for the Kafka cluster and processed the data using Spark Streaming; worked on streaming data to consume data from Kafka topics and load it into the landing area for near-real-time reporting.
  • Documented logical data integration (ETL) strategies for data flows between disparate source/target systems, bringing structured and unstructured data into a common data lake and the enterprise information repositories. Experience with various technology platforms, application architecture, design and delivery, including architecting large enterprise big data lake projects.
  • Enabled and configured Hadoop services such as HDFS, YARN, Hive, HBase, Kafka, Sqoop, Zeppelin Notebook and Spark/Spark2, and was involved in analyzing log data with Apache Spark to predict errors.
  • Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
  • Experience building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinating tasks among the team.
  • Managed the OpenShift cluster, including scaling the AWS app nodes up and down.
  • Virtualized servers using Docker for test and development environment needs, and automated configuration using Docker containers.
  • Good experience in Cloudera platform installation, administration and tuning.
  • Migrated an in-house database to the AWS Cloud and designed, built and deployed a multitude of applications utilizing the AWS stack (including S3, EC2, RDS, Redshift and Athena), focusing on high availability and auto-scaling.
  • Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for analysis, and configured Spark Streaming with Kafka Streams to get the information and then store it in HDFS.
  • Involved in designing the data warehouses and data lakes on regular (Oracle, SQL Server), high-performance (Netezza and Teradata) and big data (Hadoop: MongoDB, Hive, Cassandra and HBase) databases.
  • Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
  • Parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark, and created Hive DDL on Parquet and Avro data files residing in both HDFS and S3 buckets (see the PySpark sketch after this list).
  • Worked on Spark Streaming, which collects the data from Kafka in near real time, performs the necessary transformations and aggregations on the fly to build the common learner data model, and persists the data in Cassandra.
  • Created an AWS Glue job for archiving data from Redshift tables to S3 (online to cold storage) as per data retention requirements (see the Glue job sketch after this list).
  • Strong troubleshooting skills on an IAM platform (Ping), with the ability to adapt to comparable alternatives.
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing, analyzing, and training the classifier using MapReduce jobs, Pig jobs and Hive jobs.
  • Updated Python scripts to match training data with our database stored in AWS CloudSearch, so that we could assign each document a response label for further classification.
  • Created monitors, alarms, notifications and logs for Lambda functions, Glue jobs and EC2 hosts using CloudWatch, and used AWS Glue for data transformation, validation and cleansing.
  • Worked with cloud-based technologies such as Redshift, S3 and EC2, extracting data from Oracle Financials and the Redshift database, and created Glue jobs in AWS to load incremental data into the S3 staging and persistence areas.
  • Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
  • Ability to produce detailed documentation and process flows around IAM
  • Used the Agile Scrum methodology to build the different phases of Software development life cycle.
  • Improved the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs and YARN.
  • Scheduled Airflow DAGs to run multiple Hive and Pig jobs, which run independently based on time and data availability, and performed exploratory data analysis and data visualization using Python and Tableau.
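
A minimal PySpark sketch of the JSON-to-Parquet conversion and Hive DDL registration mentioned above. The bucket paths, columns and database/table names are illustrative assumptions.

```python
# Hypothetical job: parse semi-structured JSON, write Parquet to S3, and
# register a Hive external table over the Parquet location.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("json-to-parquet")
         .enableHiveSupport()          # allows issuing Hive DDL via spark.sql
         .getOrCreate())

# Spark infers the (possibly nested) schema from the JSON documents.
events = spark.read.json("s3a://raw-bucket/events/2021/")            # placeholder source

flat = events.select("event_id", "event_type", "payload.user_id", "event_ts")

flat.write.mode("append").parquet("s3a://curated-bucket/events/")    # placeholder target

# Hive DDL over the Parquet location so analysts can query it with HiveQL.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events (
        event_id   STRING,
        event_type STRING,
        user_id    STRING,
        event_ts   TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3a://curated-bucket/events/'
""")
```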
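
And a minimal AWS Glue (PySpark) job sketch for the Redshift-to-S3 archival step. The Glue catalog database, table name and target path are assumptions for illustration.

```python
# Hypothetical Glue job: archive a Redshift table to S3 as Parquet
# (online to cold storage). Catalog names and paths are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])

glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the Redshift table through its Glue Data Catalog entry.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="redshift_catalog_db",        # placeholder catalog database
    table_name="public_orders_history",    # placeholder catalog table
    redshift_tmp_dir=args["TempDir"],
)

# Write the archived rows to S3 in Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://archive-bucket/orders_history/"},
    format="parquet",
)

job.commit()
```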

Environment: Snowflake Web UI, Snow SQL, Hadoop MapR 5.2, Hive, Hue, Toad 12.9, Share point, Control-M, Tidal, ServiceNow, Teradata Studio, Oracle 12c, Tableau, Hadoop Yarn, Spark Core, Spark Streaming, Spark SQL, Spark MLlib

Confidential - Austin, Texas

Data Engineer

Responsibilities:

  • Gathered business requirements, defined and designed the data sourcing, and worked with the data warehouse architect on the development of logical data models.
  • Created sophisticated visualizations, calculated columns and custom expressions, and developed map charts, cross tables, bar charts, tree maps and complex reports involving property controls and custom expressions.
  • Investigated market sizing, competitive analysis and positioning for product feasibility. Worked on Business forecasting, segmentation analysis and Data mining.
  • Executed data analysis and data visualization on survey data using Tableau Desktop, and compared respondents' demographic data with univariate analysis using Python (Pandas, NumPy, Seaborn, scikit-learn and Matplotlib).
  • Worked with AWS Step Functions and prepared an orchestration of Lambdas.
  • Also performed the reverse: reading from semi-structured files such as XML and performing ETL to produce Parquet.
  • Wrote various data normalization jobs for new data ingested into Redshift
  • Created various complex SSIS/ETL packages to Extract, Transform and Load data
  • Predominantly used Python, AWS (Amazon Web Services) and MySQL, along with NoSQL (MongoDB) databases, to meet end requirements and build a scalable real-time system.
  • Advanced knowledge on Confidential Redshift and MPP database concepts.
  • Migrated the on-premises database structure to the Confidential Redshift data warehouse.
  • Migrated an existing on-premises application to AWS, used AWS services such as EC2 and S3 for processing and storing large datasets, worked with Elastic MapReduce and set up a Hadoop environment on AWS EC2 instances.
  • Utilized Power BI and SSRS to produce parameter-driven, matrix, sub-report, drill-down and drill-through reports and dashboards, integrated report hyperlink functionality to access external applications, and made dashboards available in web clients and mobile apps.
  • Designed Data Marts following Star schema and Snowflake schema methodology, using industry-leading data modelling tools such as ER/Studio.
  • Optimized Hive queries using best practices and the right parameters, with technologies such as Hadoop, YARN, Python and PySpark.
  • Experienced in writing UNIX shell scripts and hands-on experience scheduling shell scripts using Control-M.
  • Installed and configured Apache Airflow for workflow management and created workflows in Python for the cloud infrastructure; for on-premises workloads, used the UC4 scheduler for workflow automation and job scheduling.
  • Developed Python code for the tasks, dependencies, SLA watchers and time sensors of each job, for workflow management and automation with Airflow (see the DAG sketch after this list).
  • Performed Data cleaning and Preparation on XML files.
  • Performed Data Modelling, Database Design and Data Analysis with extensive use of ER/Studio.
  • Involved in performance tuning and unit testing for SQL and PL/SQL code and used several Oracle-supplied packages such as UTL_FILE and DBMS_JOB.
  • Developed scripts for loading data from various systems using UNIX Shell, Oracle, PLSQL, SQL*Loader.
  • Documented ER Diagrams, Logical and Physical models, business process diagrams and process flow diagrams.
  • Created reports in Oracle Discoverer by importing PL/SQL functions into the Admin Layer in order to meet sophisticated client requests.
  • Extensively used SQL, Transact SQL, and PL/SQL to write stored procedures, functions, packages and triggers
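
A minimal Airflow DAG sketch illustrating the scheduling, SLA watcher and time-sensor pattern described above. The DAG id, schedule, owner and Hive script paths are illustrative assumptions.

```python
# Hypothetical Airflow DAG: a time sensor gating two Hive load steps,
# with an SLA and retries applied through default_args.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.time_delta import TimeDeltaSensor

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=2),          # SLA watcher: flag tasks running past 2h
}

with DAG(
    dag_id="nightly_hive_load",         # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",      # run nightly at 02:00
    default_args=default_args,
    catchup=False,
) as dag:
    wait_for_window = TimeDeltaSensor(
        task_id="wait_for_window",
        delta=timedelta(minutes=30),    # hold until the load window opens
    )

    load_hive = BashOperator(
        task_id="load_staging",
        bash_command="hive -f /opt/etl/load_staging.hql",    # placeholder script
    )

    publish = BashOperator(
        task_id="publish_marts",
        bash_command="hive -f /opt/etl/publish_marts.hql",   # placeholder script
    )

    wait_for_window >> load_hive >> publish
```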

Environment: Hadoop, HDFS, Hive, Oozie, Sqoop, Kafka, Elastic Search, Shell Scripting, HBase, Tableau, Oracle, MySQL, Teradata, and AWS.

Confidential, Rockville

Data Engineer

Responsibilities:

  • Designed the ER diagrams, logical model (relationships, cardinality, attributes and candidate keys) and physical database (capacity planning, object creation and aggregation strategies) for Oracle and Teradata as per business requirements using Erwin.
  • Designed Power View and Power Pivot reports and designed and developed the Reports using SSRS.
  • Designed and built the dimensions and cubes with Star schema and Snowflake schema using SQL Server Analysis Services (SSAS).
  • Designed and created MDX queries to retrieve data from cubes using SSIS.
  • Created SSIS Packages using SSIS Designer for exporting heterogeneous data from OLE DB Source, Excel Spreadsheets to SQL Server.
  • Extensively worked with SQL, PL/SQL, SQL*Plus, SQL*Loader, query performance tuning, DDL scripts and database objects such as tables, views, indexes, synonyms and sequences.
  • Developed and supported the extraction, transformation and load process (ETL) for a Data
  • Created the Physical Data Model from the Logical Data Model using the Compare and Merge utility in ER/Studio and worked with the naming standards utility.
  • Involved in implementing the land process of loading the customer data set from various source systems into Informatica PowerCenter and MDM.
  • Worked with mapping parameters, variables and parameter files, and designed the ETL to create parameter files to make it dynamic.
  • Designed, developed and tested Extract, Transform, Load (ETL) applications with different types of sources.
  • Created files and tuned SQL queries in Hive utilizing Hue. Implemented MapReduce jobs in Hive by querying the available data.
  • Explored Spark to improve the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames and pair RDDs.
  • Experience with PySpark, using Spark libraries through Python scripting for data analysis.
  • Involved in converting HiveQL into Spark transformations using Spark RDDs and Scala programming (see the sketch after this list).
  • Created User Defined Functions (UDF), User Defined Aggregated (UDA) Functions in Pig and Hive.
  • Worked on building custom ETL workflows using Spark/Hive to perform data cleaning and mapping.
  • Implemented custom Kafka encoders for a custom input format to load data into Kafka partitions.
  • Supported the cluster and topics through Kafka Manager; CloudFormation scripting, security and resource automation.
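
The HiveQL-to-Spark conversion above was done in Scala; purely as an illustration, the PySpark sketch below rewrites a hypothetical HiveQL aggregation as equivalent DataFrame transformations. Table and column names are placeholders.

```python
# Hypothetical rewrite of a HiveQL aggregation as Spark DataFrame operations.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hiveql-to-spark")
         .enableHiveSupport()           # read and write Hive-managed tables
         .getOrCreate())

# HiveQL equivalent (placeholder query):
#   SELECT customer_id, SUM(amount) AS total_amount
#   FROM sales.orders
#   WHERE order_date >= '2020-01-01'
#   GROUP BY customer_id;
orders = spark.table("sales.orders")                                  # placeholder table

totals = (orders
          .filter(F.col("order_date") >= "2020-01-01")
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_amount")))

totals.write.mode("overwrite").saveAsTable("sales.customer_totals")   # placeholder target
```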

Environment: Hadoop, HDFS, Hive, Pig, HBase, Big Data, Oozie, Sqoop, Zookeeper, MapReduce, Cassandra, Scala, Linux, NoSQL, MySQL Workbench, Java, Eclipse, Oracle 10g, SQL.
