Big Data Engineer Resume
Plano, TX
SUMMARY
- Around 6 years of IT experience across a variety of industries, including hands-on experience with Big Data technologies such as Hadoop and Spark.
- Excellent analytical skills and experience in understanding business processes, functionality, and requirements and translating them into system requirement specifications, functional specifications, Software Requirement Specifications, and detailed test plans.
- Excellent interpersonal and communication skills; creative, research-minded, technically competent, and results-oriented with strong problem-solving and leadership skills.
- Experience in Hadoop development and administration; proficient programming knowledge of Hadoop and its ecosystem components: Hive, HDFS, Pig, Sqoop, HBase, Python, and Spark.
- Expertise in handling data sources like Oracle, DB2, MySQL, and SQL Server databases.
- Experience using Spark to improve the performance and optimization of existing algorithms in Hadoop, working with Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Experience in designing, architecting, and implementing cloud platform security; worked on cloud service layers such as IaaS, PaaS, and SaaS. Strong in data security and encryption services.
- Experience in designing, architecting, and implementing scalable cloud-based web applications using AWS and GCP.
- Set up GCP firewall rules to allow or deny traffic to and from VM instances based on specified configurations, and used GCP Cloud CDN (content delivery network) to deliver content from GCP cache locations, drastically improving user experience and latency.
- Experience in creating PySpark scripts and Spark Scala JARs using the IntelliJ IDE and executing them.
- Experience in developing web applications by using Python, Django, C++, XML, CSS, HTML, JavaScript and jQuery.
- Experienced in querying external Hive tables using Impala, running queries from BI tools directly against Hadoop, and developing Hive/Impala queries to prepare data for visualization.
- Experienced in working with various Python IDEs, including PyCharm, PyScripter, PyStudio, PyDev, IDLE, and Sublime Text.
- Experience in software development, analysis, and datacenter migration; managed databases and Azure data platform services, including Azure Data Lake Storage (ADLS) and Azure Data Factory (ADF) V2.
- Hands-on experience with unified data analytics on Databricks: the Databricks workspace user interface, managing Databricks notebooks, and Delta Lake with Python and Spark SQL.
- Expertise in using Azure Data Factory for complex migration work, creating data pipelines and data flows, and invoking Databricks activities to perform actions.
- Expertise in Amazon AWS services such as EMR and EC2, which provide fast and efficient processing of Big Data.
- Experience using Amazon AWS to spin up EMR clusters to process data stored in Amazon S3.
- Experience in working with Waterfall and Agile development methodologies.
- Converting requirement specifications and source system understanding into conceptual, logical, and physical data models and data flow diagrams (DFDs).
- Hands-on experience in writing MapReduce programs in Python and PySpark for high-volume data processing in Hadoop.
- Experience in developing data pipelines using Kafka, Spark, and Hive to ingest, transform, and analyze data.
- Experience with designing ETL transformations/data modeling and expertise in SQL, PL/SQL.
- Experience identifying data anomalies using statistical analysis and data mining techniques.
- Expertise in organizing the data layout in Hive using partitioning and bucketing (see the sketch at the end of this summary).
- Experience working with structured and unstructured data in various file formats such as Avro, XML, JSON, SequenceFile, ORC, and Parquet.
- Experience in job workflow scheduling and monitoring tools like Oozie.
- Experience with event-driven and scheduled AWS Custom Lambda (Python) functions to trigger various AWS resources.
- Experience in working with business intelligence and data warehouse software, including SSAS, Pentaho, Cognos, Amazon Redshift, and Azure Data Warehouse.
- Specializes in data analysis, data mapping, data conversion, data integration, business analysis, data quality analysis, functional design, technical design, and different aspects of data/information management.
- Good experience working with Azure Blob and Data Lake storage and loading data into Azure Synapse Analytics (SQL DW).
- Experience managing Azure Data Lakes (ADLS) and Data Lake Analytics and an understanding of how to integrate with other Azure Services.
- Proficiency in multiple databases like MongoDB, Cassandra, MySQL, ORACLE, and MS SQL Server.
- Experience in automated deployment of applications with AWS Lambda and Elastic Beanstalk.
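A minimal PySpark sketch of the Hive partitioning and bucketing layout mentioned above. This is illustrative only: the table names, column names, and bucket count are hypothetical placeholders, and it assumes a Spark session with a configured Hive metastore.

    # Hedged sketch: table/column names and the bucket count are hypothetical.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-partition-bucket-sketch")
        .enableHiveSupport()          # assumes a Hive metastore is configured
        .getOrCreate()
    )

    # Hypothetical staged data already registered in the metastore.
    raw_df = spark.table("staging_customer_events")

    # Persist as a managed table partitioned by date and bucketed by customer id:
    # partition pruning speeds up date-range scans, and bucketing reduces shuffle
    # for joins and aggregations keyed on customer_id.
    (
        raw_df.write
        .partitionBy("event_date")
        .bucketBy(32, "customer_id")
        .sortBy("customer_id")
        .format("parquet")
        .mode("overwrite")
        .saveAsTable("analytics.customer_events")
    )

    # Downstream queries can then rely on the layout, for example:
    spark.sql("""
        SELECT customer_id, COUNT(*) AS events
        FROM analytics.customer_events
        WHERE event_date = '2021-01-01'
        GROUP BY customer_id
    """).show()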
TECHNICAL SKILLS
Operating System: Windows, LINUX, UNIX
Languages: C, C++, HQL, PySpark, Scala, Python, PL/SQL, SQL, Unix/Linux shell scripting
Databases: Oracle, SQL Server, MySQL, MS Access
Bigdata Ecosystems: Hadoop, HBase, MapReduce, Sqoop, Hive, HDFS, Pig, Spark, Kafka, Oozie, Flume, Zookeeper, Airflow
Methodologies: Dimension Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling
Cloud: AWS (EC2, S3, CloudWatch, Lambda), Azure (Azure Data Lake)
ETL Tools: DataStage, Talend
Data Visualization: Tableau, Power BI
Data Formats: Parquet, Sequence, AVRO, ORC, CSV, JSON
Version Control & IDEs: Git, IntelliJ, Eclipse
PROFESSIONAL EXPERIENCE
Confidential, Plano, TX
Big Data Engineer
Responsibilities:
- Analyzed large and critical datasets using HDFS, HBase, MapReduce, Hive, Hive UDF, Pig, Sqoop, Airflow, Zookeeper and Spark.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/Big Data concepts.
- Developed Spark scripts and UDFs using both the Spark DSL and Spark SQL queries for data aggregation and querying, and wrote data back into RDBMS through Sqoop.
- Partnered with multiple application teams within the customer enterprise to provide guidance and patterns for building and deploying cloud infrastructure, both PaaS and IaaS.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL using Scala.
- Ingested data into HDFS using Sqoop and scheduled incremental loads to HDFS.
- Created ETL/Talend jobs, both design and code, to process data into target databases.
- Served as lead expert in the field of information security, communications security, cryptography, Public Key Infrastructure, and computer security.
- Used Hive to analyze data ingested into HBase via Hive-HBase integration and computed various metrics for reporting on the dashboard.
- Queried external Hive tables using Impala, connected BI tools to run queries directly against Hadoop, and developed Hive/Impala queries to prepare data for visualization.
- Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
- Set up GCP firewall rules to allow or deny traffic to and from VM instances based on specified configurations, and used GCP Cloud CDN (content delivery network) to deliver content from GCP cache locations, drastically improving user experience and latency.
- Worked with Hadoop infrastructure to store data in HDFS and used Spark/Hive SQL to migrate the underlying SQL codebase to Azure.
- Responsible for preparing interactive Data Visualization reports using Tableau Software from different sources.
- Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
- Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
- Created profiles in Puppet and pushed them to the associated servers in a Linux environment; worked extensively with UNIX and shell scripting.
- Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Involved in loading real-time data into NoSQL databases such as Cassandra.
- Performed data scrubbing and processing with Apache NiFi and used it for workflow automation and coordination.
- Worked on the Azure suite: Azure SQL Database, Azure Data Lake, Azure Data Factory, Azure SQL Data Warehouse, and Azure Analysis Services.
- Developed data pipelines using Flume, Sqoop, and Pig to extract data from weblogs and store it in HDFS.
- Developed Databricks ETL pipelines using notebooks, Spark DataFrames, Spark SQL, and Python scripting.
- Used Sqoop to import data into HDFS and Hive from Oracle database.
- Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming.
- Created Talend jobs to load data into various Oracle tables; utilized Oracle stored procedures and wrote Java code to capture global map variables and use them in the job.
- Worked on implementing a log producer in Scala that watches for application logs, transforms incremental logs, and sends them to a Kafka and Zookeeper based log collection platform.
- Created Apache Spark tasks in Python in a test environment for quicker data processing and used Spark SQL for querying.
- Generated metadata, created Talend jobs and mappings to load the data warehouse and data lake, and used Talend for Big Data integration with Spark and Hadoop.
- Optimized Hive queries to extract customer information from HDFS.
- Used PolyBase for ETL/ELT processes with Azure Data Warehouse to keep data in Blob Storage with almost no limitation on data volume.
- Handled load balancing of ETL processes and database performance tuning of ETL processing tools; loaded data from Teradata to HDFS using Teradata Hadoop connectors.
- Involved in various phases of development; analyzed and developed the system following the Agile Scrum methodology.
- Designed, set up, maintained, and administered Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.
- Used Zookeeper to provide coordination services to the cluster.
- Analyzed the partitioned and bucketed data using Hive and computed various metrics for reporting.
- Built Azure Data Warehouse Table Data sets for Power BI Reports.
- Imported data from sources like HDFS/HBase into Spark RDDs and developed MapReduce jobs to calculate the total data usage by commercial routers in different locations; developed MapReduce programs for data sorting in HDFS.
- Worked on BI reporting with AtScale OLAP for Big Data.
- Implemented Kafka for streaming data, filtered and processed the data, and used Zookeeper to provide coordination services to the cluster (see the streaming sketch following this section).
- Developed shell scripts for scheduling and automating the job flow.
- Developed a workflow using NiFi to automate the tasks of loading data into HDFS.
- Worked on developing Pig scripts for change data capture and delta record processing between newly arrived data and existing data in HDFS.
- Automated testing and deployment of code on Linux using CI/CD with GitLab.
Environment: Spark, YARN, Hive, Pig, Scala, Mahout, NiFi, Python, PySpark, Hadoop, Azure, DynamoDB, Kibana, NoSQL, Sqoop, MySQL.
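A minimal PySpark Structured Streaming sketch of the Kafka filtering and processing pattern referenced above. The broker list, topic name, message schema, and HDFS paths are hypothetical placeholders, and it assumes the spark-sql-kafka connector is available on the cluster.

    # Hedged sketch: brokers, topic, schema, and paths are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

    # Assumed JSON payload of each Kafka message.
    event_schema = StructType([
        StructField("device_id", StringType()),
        StructField("status", StringType()),
        StructField("usage_mb", DoubleType()),
    ])

    # Read the raw stream from Kafka (requires the spark-sql-kafka connector).
    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
        .option("subscribe", "router-events")
        .option("startingOffsets", "latest")
        .load()
    )

    # Parse the JSON value, keep only active devices, and project the needed fields.
    events = (
        raw.selectExpr("CAST(value AS STRING) AS json")
        .select(from_json(col("json"), event_schema).alias("e"))
        .select("e.*")
        .filter(col("status") == "ACTIVE")
    )

    # Land the filtered stream on HDFS as Parquet; the checkpoint directory lets
    # the query recover its Kafka offsets after a restart.
    query = (
        events.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/streaming/router_events")
        .option("checkpointLocation", "hdfs:///checkpoints/router_events")
        .outputMode("append")
        .start()
    )

    query.awaitTermination()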
Confidential, NYC, NY
Big Data Engineer
Responsibilities:
- Gathered functional requirements and determined the scope of the project by working with the business.
- Extensive work in the ELT process, consisting of data transformation, data sourcing, mapping, conversion, and loading using Big Data technologies.
- Developed efficient Spark programs in Python to perform batch processing on huge unstructured datasets.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Developed a Spark program using PySpark to handle streaming data and load it to Azure Event Hubs.
- Involved in creating Databricks ETL pipelines using notebooks, Spark DataFrames, Spark SQL, and Python scripting.
- Wrote Python, Spark, and PySpark scripts to build ETL pipelines that automate data ingestion and update data in the relevant databases and tables.
- Handled ingestion of data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and then loaded the data into HDFS.
- Implemented Kafka consumers for HDFS and Spark Streaming; performed data cleansing and analysis with appropriate tools.
- Worked with highly unstructured and semi-structured data and processed it based on the customer requirements.
- Involved in moving legacy data from RDBMS, mainframes, Teradata, and external source system data warehouses to the Hadoop data lake and migrating the data processing to the lake.
- Worked on Informatica PowerCenter tools: Designer, Repository Manager, and Workflow Manager.
- Created end-to-end solutions for ETL transformation jobs, which involved writing Informatica workflows and mappings.
- Created and worked Sqoop jobs with incremental load to populate Hive External tables and performed Spark transformation scripts using APIs like Spark Core and Spark SQL in Scala.
- Analyzed the data by writing Hive queries (HiveQL) and running Pig scripts (Pig Latin) to study customer behavior.
- Worked on reading multiple data formats such as Avro, Parquet, ORC, and JSON, as well as plain text (see the multi-format sketch after this section).
- Analyzed long-running jobs and applied query optimizations such as partitioning, bucketing, and map-side joins to gain performance.
- Exported the analyzed data to relational databases using Sqoop to further visualize and generate reports for the BI team.
- Worked on migration of data from On-prem SQL server to Azure Synapse Analytics (DW) & Azure SQL DB.
- Worked on Azure Blob and Data Lake storage and loaded data into Azure Synapse Analytics (SQL DW).
- Extensive usage of the Azure Portal, Azure PowerShell, storage accounts, certificates, and Azure data management.
- Designed and developed a Data Lake using Hadoop for processing raw and processed claims via Hive and Informatica.
- Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and SQL Data Warehouse environment.
- Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Created HBase tables and column families to store user event data and wrote automated HBase test cases for data quality checks using HBase command line tools.
- Developed shell scripts to automate processes and generalize code for reusability across several project modules, improving speed and reliability.
- Led efforts to understand content patterns and investigate data issues, quality, lineage, and decision-making business questions.
Environment: Hadoop, Sqoop, HDFS, Hive, MapReduce, Spark, Pig, Python, Shell, SQL, Zookeeper, Azure.
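A minimal PySpark sketch of the multi-format reads mentioned above. The landing-zone paths and column names are hypothetical placeholders, and the Avro read assumes the spark-avro package is on the classpath.

    # Hedged sketch: all paths and column names are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_date

    spark = SparkSession.builder.appName("multi-format-read-sketch").getOrCreate()

    # Read the same logical feed delivered in different formats.
    json_df = spark.read.json("hdfs:///landing/claims/json/")
    orc_df = spark.read.orc("hdfs:///landing/claims/orc/")
    avro_df = (
        spark.read.format("avro")              # assumes spark-avro is available
        .load("hdfs:///landing/claims/avro/")
    )

    # Align the inputs on a common set of columns before unioning them.
    common_cols = ["claim_id", "member_id", "claim_ts", "amount"]
    combined = (
        json_df.select(*common_cols)
        .unionByName(orc_df.select(*common_cols))
        .unionByName(avro_df.select(*common_cols))
    )

    # Standardize and write a partitioned Parquet layer for downstream loads.
    (
        combined
        .withColumn("claim_date", to_date(col("claim_ts")))
        .write
        .mode("overwrite")
        .partitionBy("claim_date")
        .parquet("hdfs:///curated/claims/parquet/")
    )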
Confidential
Big Data Engineer
Responsibilities:
- Worked on AWS Lambda to run code in response to events, such as changes to data in an Amazon S3 bucket or an Amazon DynamoDB table and HTTP requests through AWS API Gateway, and invoked the code using API calls made with the AWS SDKs (see the Lambda sketch at the end of this section).
- Created AWS Glue scripts using PySpark and the AWS Glue libraries in Python.
- Developed efficient Spark programs in Python to perform batch processing on huge unstructured datasets.
- Involved in POC development and unit testing using Spark and Scala.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames.
- Implemented Spark jobs using Scala and built queries based on certain business requirements.
- Involved in migrating MapReduce jobs into Spark jobs and used Spark SQL and Data Frames API to load structured and semi-structured data into Spark clusters.
- Developed Spark scripts using Python shell commands as per the requirements.
- Performed performance optimizations on Spark/Scala; diagnosed and resolved performance issues.
- Developed applications using Scala, Spark SQL, and MLlib along with Kafka and other tools as per requirements, then deployed them on the YARN cluster.
- Analyzed the SQL scripts and designed the solution for implementation using PySpark.
- Worked on data serialization formats, converting complex objects into sequences of bits using Avro, Parquet, JSON, and CSV formats.
- Worked on ELT scripts to pull data from DB2/Oracle databases into HDFS.
- Worked on distributed frameworks such as Apache Spark and Presto on Amazon EMR and Redshift, and interacted with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
- Built and supported several AWS multi-server environments using Amazon EC2, EMR, EBS, and Redshift, and deployed the Big Data Hadoop application on the AWS cloud.
- Developed entire frontend and backend modules using Python on Django Web Framework.
- Used Oozie operational services for batch processing and scheduling workflows.
- Wrote Python scripts to retrieve data from the Redshift database, alter it according to the requirements by constructing conditional functions, and store it in data frames.
- Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
- Worked extensively with importing metadata into Hive using Python and migrated existing tables and applications to work on the AWS cloud (S3).
- Implemented ETL pipelines on AWS EMR and used AWS EMR for custom Spark jobs with S3, SNS, and Lambda functions.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources like S3 (ORC/Parquet/text files) into AWS Redshift.
- Worked with different file formats like Avro, Parquet, and text files for Hive querying and processing.
Environment: Hadoop, Sqoop, Hive, Spark, Scala, Python, MapReduce, Pig, HBase, Oozie, Shell, Git, SQL, AWS
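A minimal Python sketch of the S3-triggered Lambda pattern described above. The audit table name and the event fields assumed here are hypothetical placeholders; IAM permissions, error handling, and retries are omitted for brevity.

    # Hedged sketch: the audit table and event fields assumed here are hypothetical.
    import json
    import urllib.parse

    import boto3

    s3 = boto3.client("s3")
    dynamodb = boto3.resource("dynamodb")
    audit_table = dynamodb.Table("s3-object-audit")   # hypothetical table name


    def lambda_handler(event, context):
        """Record basic metadata for every object that lands in the watched bucket."""
        records = event.get("Records", [])
        for record in records:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            # Fetch object metadata (size, content type) for the audit entry.
            head = s3.head_object(Bucket=bucket, Key=key)

            audit_table.put_item(
                Item={
                    "object_key": key,
                    "bucket": bucket,
                    "size_bytes": head["ContentLength"],
                    "content_type": head.get("ContentType", "unknown"),
                    "event_time": record["eventTime"],
                }
            )

        return {"statusCode": 200, "body": json.dumps({"processed": len(records)})}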